[{"title":"BTL-3: a rank-32 LoRA that turns Qwen3.6-27B into a tool-use agent","description":"Bad Theory Labs' BTL-3 isn't a new model — it's a frozen ~934 MB PEFT LoRA adapter (rank 32) RL-tuned on top of Qwen3.6-27B for agentic coding and structured tool use. The base capability is Qwen's; what BTL-3 adds is the loop behaviour — reason, call tools, inspect results, recover, and, distinctively, stop when no tool is needed (91.2% BFCL 'irrelevance'). Self-reported: 95.1% HumanEval, 88.5% BFCL v4 AST, 88.1% LiveCodeBench v6 — with an honest cliff to 26.4% on BigCodeBench-Hard. Ships an 8.39 GB single-file Compact build. Apache-2.0. A grounded read of a thin model card.","date":"2026-07-24","tags":["agents","tool-use","code-generation","fine-tuning","open-weights","explainer"],"draft":false,"featured":false,"interest":3,"helpful":3,"kind":"articles","slug":"btl-3","body":"Most \"new models\" are not new weights. **BTL-3**, from **Bad Theory Labs**, is a clean example: it\nis not a from-scratch 27B model but a **frozen rank-32 PEFT LoRA adapter** — about **934 MB** of\nweights — post-trained on top of a pinned revision of **Qwen3.6-27B**. Load the base, apply the\nadapter, and you get an agent tuned for coding, repository work, and structured tool use. The raw\ncapability is Qwen's; what BTL-3 contributes is **behaviour** — how the model runs an agent loop and,\nnotably, when it decides *not* to act.\n\nThat framing matters for reading the numbers honestly, so keep it in mind: this is a post-training\nresult on a strong open base, released under **Apache-2.0**, with the base checkpoint pinned to an\nexact revision for reproducible loading. 
The card labels this frozen release **\"RL-0013,\"** and the\nmaximum RL sequence length (65,536 tokens) tells you the adapter was shaped by reinforcement learning\nover long, multi-step trajectories — not just supervised fine-tuning on completions.\n\n## The loop it was tuned to run\n\nAn agent model earns its keep inside a loop, not on a single completion. BTL-3's stated job is to\n\"reason, act, inspect tool results, recover from failures, and stop when no action is required.\" That\nlast clause is the interesting one. A tool-happy model that always reaches for a function call is easy\nto train and annoying to deploy; the harder behaviour is the **route** decision — recognising when a\nquestion needs a tool and when it just needs an answer. Pick a scenario and watch which path the model\ntakes:\n\n<AgentLoop />\n\nThe four scenarios line up with the four things BFCL (the Berkeley Function-Calling Leaderboard) v4\nactually measures: a **single** call, **parallel** calls fired at once, recovery when a call **fails**,\nand **irrelevance** — correctly declining to call anything. BTL-3's headline highlight is that last\none: a self-reported **91.2%** on knowing when to stay its hand. The loop is the product here; the\nadapter's whole point is to make Qwen3.6-27B move through it reliably.\n\n## The tool-use profile\n\nBreak BFCL v4 down by category and the shape of the model shows. 
It is strongest on the\nstraightforward cases and gives ground exactly where you'd expect — when it has to compose *several*\ntools that each take *several* arguments:\n\n<BenchBars\n  title=\"BFCL v4 · category breakdown (self-reported %)\"\n  unit=\"%\"\n  max={100}\n  bars={[\n    { label: \"Multiple\", value: 95.5 },\n    { label: \"Simple\", value: 93.2 },\n    { label: \"Irrelevance\", value: 91.2, highlight: true },\n    { label: \"Parallel\", value: 87.0 },\n    { label: \"Parallel-multiple\", value: 70.0 },\n  ]}\n/>\n\nThe aggregate is **88.5% BFCL v4 AST** (1097/1240 on the full official set). The 70.0% on\nparallel-multiple is the honest soft spot: issuing several correct calls at once, each with the right\narguments, is where structured tool use is genuinely hard, and three in ten of those cases still slip. The\n**91.2% irrelevance** number is the one worth internalising — it is the difference between an agent you\ncan leave in a loop and one that invents work.\n\n## Coding, and where it falls off\n\nOn standard code-generation benchmarks in **thinking mode**, BTL-3 posts strong pass-rates — and then\ndrops sharply on the hardest composite tasks. That gap is the useful part of the picture, not a number\nto bury:\n\n<BenchBars\n  title=\"Coding pass-rate · self-reported, thinking mode (%)\"\n  unit=\"%\"\n  max={100}\n  bars={[\n    { label: \"HumanEval\", value: 95.12, highlight: true },\n    { label: \"LiveCodeBench v6\", value: 88.1, highlight: true },\n    { label: \"BigCodeBench-Hard\", value: 26.35 },\n  ]}\n/>\n\nHumanEval at **95.12%** (156/164) and LiveCodeBench v6 at **88.1%** (170/193) are the flattering\nfigures — well-scoped \"write this function\" problems. **BigCodeBench-Hard Instruct at 26.35%**\n(39/148) is the sobering one: strict pass@1 on tasks that chain many library calls into one correct\nprogram is a different sport, and here the model solves roughly one in four. 
(BTL reports a softer\n**59.25%** at the individual *test* level on the same suite — useful context, but a test-level score is\nnot a solved-task score, so read the strict 26.35% as the real one.) These are different benchmarks at\ndifferent difficulties, not a like-for-like ladder — the labels carry that.\n\n## The Compact edition\n\nAlongside the adapter, Bad Theory Labs ships **BTL-3 Compact**: the complete text model packed into a\nsingle **8.39 GB** native file — smaller than an 8B model stored in FP16, which works out to an\neffective **under 2.5 bits per parameter**. The claimed cost of that compression is measured on a\n\"fresh private 100-turn tool-contract gate\": Compact retained **83 of the 90 behaviours** the full\nmodel completed correctly, which BTL reports as **92.2% conditional tool-behaviour retention**.\n\nRead that metric for exactly what it is. It is a *private* gate that BTL defined and ran, conditioned\non cases the full model already passed — so it says \"Compact reproduces most of what the full model got\nright,\" not \"Compact loses only 8% overall.\" It's a reasonable internal check and a genuinely useful\nartifact (a 27B-class agent in 8.39 GB is easy to self-host), but it is not an independent quality\nmeasurement.\n\n## Running it\n\nBecause BTL-3 is a LoRA adapter, deployment is \"load Qwen3.6-27B at the pinned revision, then apply the\nadapter\" — a few lines with PEFT and Transformers, or `vllm serve` with `--enable-lora` and\n`--max-lora-rank 32` and the Qwen XML tool parser for structured calls. The architectural context\nwindow is **262,144 tokens** (inherited from Qwen3.6's hybrid attention), though the published\nbenchmarks were run at a **32,768-token** launch context. 
BTL recommends **thinking mode** for coding\nand reasoning, which is also the mode every headline score was measured in.\n\n<Callout type=\"warn\">\nThe model card itself is direct about this: **run generated code and tool calls in a sandbox**, and\nrequire **explicit confirmation before destructive, privileged, financial, or otherwise high-impact\nactions**. An agent that scores 88.5% on tool calls still gets more than one call in ten wrong — that\nresidual is exactly where an unsandboxed loop does damage.\n</Callout>\n\n## The honest read\n\n<Callout type=\"note\">\nEvery number here is **self-reported by Bad Theory Labs** — there is no independent evaluation yet. The\nunderlying **capability is Qwen3.6-27B's**; BTL-3 is a **post-training / RL result**, so credit the\nadapter for the *loop behaviour* (tool routing, irrelevance, recovery), not for raw reasoning power.\nScores were measured in **thinking mode** at a **32K context** on protocols BTL chose, the coding wins\nsit next to a **26.35% BigCodeBench-Hard** floor, and the Compact edition's **92.2% retention** is a\nprivate, conditional gate, not a public benchmark. The model card ships **no figures or diagrams** — the\nloop diagram above is my own reconstruction of the described behaviour. No training-data disclosure is\nprovided.\n</Callout>\n\n## The take\n\nBTL-3 is a modest, honest kind of release: take a strong open base, spend an RL budget teaching it to\nbehave inside an agent loop, and ship the ~934 MB of difference under Apache-2.0. The most interesting\nclaim isn't a coding score — it's the **91.2% irrelevance**, the tuned instinct to *not* call a tool.\nFor anyone assembling a private, self-hosted coding agent, that plus the 8.39 GB Compact build is a\nconcrete, deployable proposition. 
Just hold the framing straight: the intelligence is Qwen's, the\ndiscipline is BTL's, and until someone outside Bad Theory Labs runs the suite, every figure is a\nvendor's own.\n\n---\n\n*Source: the [BTL-3 model card](https://huggingface.co/badtheorylabs/BTL-3) and\n[BTL-3 Compact](https://huggingface.co/badtheorylabs/BTL-3-Compact) on Hugging Face, plus the\n[runtime source](https://github.com/Badtheorylabs/BTL-3). The card ships no figures, so the diagram\nhere is mine; all benchmark numbers are Bad Theory Labs' self-reported values.*\n","readingTimeMins":6,"url":"https://ai.thesatyajit.com/articles/btl-3","signal":{"interest":3,"helpful":3,"score":6,"level":2,"label":"Solid"}},{"title":"Token-level RL is a first-order approximation to the reward you actually want","description":"The Qwen team's formulation for RL with LLMs: token-level objectives like REINFORCE and GRPO are a first-order approximation to the true sequence-level reward, valid only when the training–inference discrepancy and policy staleness are both small. That one lens explains why importance-sampling correction, clipping, and Routing Replay for MoE models stabilize training — validated across hundreds of thousands of GPU hours on a 30B MoE. A first-principles walk through the formulation, the MoE routing problem, and the honest empirical recipe.","date":"2026-07-24","tags":["reinforcement-learning","llm","mixture-of-experts","post-training","explainer"],"draft":false,"cover":"/articles/first-order-rl/fig1.png","featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"first-order-rl","body":"RL for reasoning models rests on a mismatch nobody had really justified. The **reward**\nis assigned to a *whole response* — you sample a full chain of thought, check the final\nanswer, and hand back one scalar. But the **optimizer** — REINFORCE, GRPO, and the rest —\nworks one *token* at a time. 
We reward the sequence and update the tokens, and we mostly\njust trust that closing the loop this way improves the thing we scored.\n\nThe Qwen team's [Stabilizing Reinforcement Learning with LLMs](https://arxiv.org/abs/2512.01374)\n(Zheng et al., arXiv:2512.01374) takes that trust and makes it a theorem with fine print.\nTheir claim: the token-level objective is a **first-order approximation** to the true\nsequence-level reward — exact in the limit, and valid only when two specific gaps are\nsmall. The nice part is what falls out of it. Importance-sampling correction, clipping,\nand Routing Replay for Mixture-of-Experts models — a grab-bag of stabilization tricks that\neach arrived with its own justification — turn out to be the *same move*: keep the\napproximation valid. One lens, and the whole toolbox lines up behind it.\n\n## The objective you can't optimize\n\nWrite the thing we actually want to maximize — expected reward over responses the current\npolicy would generate:\n\n$$\nJ^{\\text{seq}}(\\theta) \\;=\\; \\mathbb{E}_{x\\sim\\mathcal{D},\\; y\\sim\\pi_\\theta(\\cdot|x)}\\big[R(x,y)\\big].\n$$\n\nThere's an immediate wrinkle: we don't sample $y$ from the policy we're training. Responses\ncome out of a fast **inference engine** (vLLM, SGLang) running policy $\\mu_{\\theta_{\\text{old}}}$,\nwhile gradients are taken in a **training engine** (Megatron, FSDP) holding $\\pi_\\theta$.\nThe standard fix is an importance-sampling reweight onto the rollout policy $\\mu$:\n\n$$\nJ^{\\text{seq}}(\\theta) \\;=\\; \\mathbb{E}_{x\\sim\\mathcal{D},\\; y\\sim\\mu_{\\theta_{\\text{old}}}(\\cdot|x)}\\!\\left[\\underbrace{\\frac{\\pi_\\theta(y|x)}{\\mu_{\\theta_{\\text{old}}}(y|x)}}_{\\text{sequence-level IS weight}} R(x,y)\\right].\n$$\n\nThis is correct and completely impractical. 
A sequence likelihood is a product of hundreds\nor thousands of per-token probabilities, so the ratio $\\pi_\\theta(y|x)/\\mu_{\\theta_{\\text{old}}}(y|x)$\nswings across an enormous dynamic range with brutal variance. Its gradient is technically\nright and numerically hopeless. Nobody trains on it directly.\n\n## The surrogate everyone actually uses\n\nSo instead we optimize the **token-level** objective — sum the per-token IS ratios instead\nof multiplying them:\n\n$$\nJ^{\\text{token}}(\\theta) \\;=\\; \\mathbb{E}_{x\\sim\\mathcal{D},\\; y\\sim\\mu_{\\theta_{\\text{old}}}(\\cdot|x)}\\!\\left[\\sum_{t=1}^{|y|}\\underbrace{\\frac{\\pi_\\theta(y_t|x,y_{<t})}{\\mu_{\\theta_{\\text{old}}}(y_t|x,y_{<t})}}_{\\text{token-level IS weight}} R(x,y)\\right].\n$$\n\nIts gradient is just REINFORCE with a per-token IS weight — stable, cheap, the workhorse of\nevery modern RL post-training run. The paper's key move is to show *why* it's allowed to\nstand in for the sequence objective. Write each token ratio as $\\tfrac{\\pi_\\theta(y_t|\\cdot)}{\\mu_{\\theta_{\\text{old}}}(y_t|\\cdot)} = 1+\\delta_t$\nwith $\\delta_t$ small. Then the true sequence ratio is a product, and a product of\nnear-ones is, to first order, one plus the sum:\n\n$$\n\\frac{\\pi_\\theta(y|x)}{\\mu_{\\theta_{\\text{old}}}(y|x)} \\;=\\; \\prod_{t=1}^{|y|}(1+\\delta_t) \\;=\\; 1 + \\sum_{t=1}^{|y|}\\delta_t \\;+\\; O(\\delta^2).\n$$\n\nDrop the $O(\\delta^2)$ terms and the gradient of the intractable sequence objective becomes\n*exactly* the gradient of the token surrogate. 
That's the whole theorem: **the token-level\nobjective is the linear part of the sequence objective.** They agree when the per-token\nratios hug 1, and they part ways as those ratios drift — because the neglected second-order\ncross terms $\\sum_{i<j}\\delta_i\\delta_j$ are precisely what the linear surrogate throws away.\nDrive the two gap sources and watch the true product pull away from the surrogate:\n\n<ApproxGap />\n\nThe intuition to keep: the surrogate isn't wrong, it's *truncated*. As long as the policy\nyou're optimizing stays close to the policy that generated the data, the truncation is\nnegligible and improving the cheap objective improves the real reward. Let them separate and\nthe surrogate starts optimizing something that isn't the reward anymore — which, in practice,\nis exactly what a training collapse looks like.\n\n## Two gaps, and the tricks that close them\n\n\"Keep $\\pi_\\theta$ close to $\\mu_{\\theta_{\\text{old}}}$\" sounds abstract until you factor the\ntoken ratio into its two honest sources:\n\n$$\n\\frac{\\pi_\\theta(y_t|\\cdot)}{\\mu_{\\theta_{\\text{old}}}(y_t|\\cdot)} \\;=\\; \\underbrace{\\frac{\\pi_{\\theta_{\\text{old}}}(y_t|\\cdot)}{\\mu_{\\theta_{\\text{old}}}(y_t|\\cdot)}}_{\\text{training–inference discrepancy}} \\;\\times\\; \\underbrace{\\frac{\\pi_\\theta(y_t|\\cdot)}{\\pi_{\\theta_{\\text{old}}}(y_t|\\cdot)}}_{\\text{policy staleness}}.\n$$\n\n- **Training–inference discrepancy** is numerical. The same weights produce slightly\n  different probabilities in the training and inference engines — different kernels,\n  and inference deliberately disables batch-invariant kernels for throughput, so even one\n  engine isn't self-consistent. This is the gap between $\\pi_{\\theta_{\\text{old}}}$ and\n  $\\mu_{\\theta_{\\text{old}}}$.\n- **Policy staleness** is procedural. 
To get more gradient updates out of each rollout, we split a big batch\n  of responses into mini-batches and take several gradient steps, so later mini-batches are\n  optimized by a $\\pi_\\theta$ that has already drifted from the $\\pi_{\\theta_{\\text{old}}}$\n  whose weights produced them. Asynchronous frameworks make it worse.\n\nNow the stabilization toolbox reads as one idea — shrink these two gaps so the first-order\napproximation holds:\n\n- The **IS weight itself** is not an optional variance trick; it *is* the first-order term.\n  Drop the training–inference correction and you're no longer approximating the sequence\n  objective at all.\n- **Clipping** (the PPO move) stops gradients on tokens whose ratio has run too far,\n  directly capping policy staleness.\n- **Routing Replay**, for MoE models, closes both — and it needs its own section, because\n  MoE breaks the story in a way dense models don't.\n\n## Why MoE breaks it, and how Routing Replay repairs it\n\nIn a Mixture-of-Experts model, the probability of a token depends on *which experts the\nrouter activated* for it. That turns the token ratio into a comparison over possibly\n*different active parameters*: the inference engine routes the token to expert set $e^\\mu$,\nthe training engine to $e^\\pi$, and when those sets disagree the ratio\n$\\pi_\\theta(y_t|\\cdot)/\\mu_{\\theta_{\\text{old}}}(y_t|\\cdot)$ stops measuring \"a small change\nin the policy\" and starts measuring \"two different subnetworks.\" The $\\delta_t$ are no longer\nsmall, and the first-order approximation collapses. Routing is entangled with *both* gaps —\nthe engines can route differently (discrepancy) and the router's choice can shift as weights\nupdate (staleness). 
Toggle the fix:\n\n<RoutingReplay />\n\n**Routing Replay** ([Zheng et al., 2025](https://arxiv.org/abs/2507.18071); Ma et al., 2025)\npins the routed experts during optimization so the token is scored over one fixed subnetwork\n— the model is optimized like a dense one, and the ratio means what it should again. The\npaper formalizes two flavors, differing only in *whose* routing you replay:\n\n| | replays | closes | first mini-batch |\n|---|---|---|---|\n| **R2** — Vanilla Routing Replay | the training engine's rollout experts $e^\\pi_{\\text{old}}$ | policy staleness | target policy **unaltered** |\n| **R3** — Rollout Routing Replay | the inference engine's experts $e^\\mu_{\\text{old}}$ | discrepancy **and** staleness | target policy altered |\n\nThere's no free lunch here, and the paper is careful to say so. Fixing the experts restores\nthe approximation but **biases the target policy** — you're now optimizing a model whose\nrouting is frozen to a past choice, not the routing it would pick itself. R2 leaves the first\nmini-batch's target policy untouched; R3 alters it from step one but kills more of the\ndiscrepancy. Which bias is worth paying turns out to depend on how off-policy you run — a\nquestion only experiments can settle.\n\n## MiniRL: the smallest honest baseline\n\nTo test the formulation instead of a pile of confounded tricks, the authors strip RL down to\n**MiniRL** — REINFORCE with the token-level IS weight, group-normalized advantages (subtract\nthe per-prompt mean reward), and PPO-style clipping. That's it. It's deliberately the minimal\nalgorithm whose gradient stays faithful to the surrogate the theory justifies, which makes it\nthe right probe: if the formulation is real, the things that preserve the approximation should\nbe the things that stabilize MiniRL.\n\nThe setup is a genuine stress test. 
A 30B MoE (cold-started from Qwen3-30B-A3B-Base), **FP8\ninference against BF16 training** — deliberately mismatched precisions to *inflate* the\ntraining–inference discrepancy — on 4,096 verifiable math problems, scored as average accuracy\nover 32 samples on HMMT25, AIME25, and AIME24. Hundreds of thousands of GPU hours, roughly\n5–6 GPU-hours per gradient step. They track not just reward but two diagnostics: policy\n**entropy** and the **training–inference KL divergence**, since a collapse announces itself as\na KL spike before the score falls.\n\n## What the experiments say\n\n**On-policy** (one gradient update per batch), the ablation lands exactly where the theory\npredicts:\n\n<Figure\n  src=\"/articles/first-order-rl/fig1.png\"\n  alt=\"Four-panel on-policy training curves over 1,200 gradient steps: benchmark score, training-inference KL divergence, training reward, and entropy. MiniRL climbs highest and steadiest; adding length normalization is slightly worse; removing the training-inference IS correction collapses within ~150 steps with a KL spike and an entropy crash.\"\n  caption=\"On-policy training (gbs = mbs = 1,024). MiniRL (blue) is the most stable; dropping the training–inference IS correction (green) collapses within ~150 steps as KL spikes and entropy craters; length normalization is stable but suboptimal (arXiv:2512.01374, Figure 1).\"\n/>\n\nThree reads, each a prediction of the formulation:\n\n- **MiniRL wins.** The plain first-order-faithful objective is the most stable and scores\n  highest.\n- **Removing the training–inference IS correction collapses training** almost immediately —\n  the green curve nose-dives, entropy crashes, KL explodes. 
The IS weight was never optional;\n  it's the approximation's load-bearing term.\n- **Length normalization is stable but worse.** Dividing the objective by response length is\n  common (GRPO and CISPO both do it), but it *invalidates* the first-order approximation — the\n  gradient no longer lines up with the true sequence objective — and the benchmark score pays\n  for it. Notably, **Routing Replay does not help on-policy**; with the gaps already small,\n  its bias is all cost and no benefit.\n\n<BenchBars\n  title=\"on-policy final benchmark score, avg of HMMT25/AIME25/AIME24 (read from Fig. 1, approximate)\"\n  unit=\"\"\n  bars={[\n    { label: \"MiniRL\", value: 0.77, highlight: true },\n    { label: \"MiniRL + R3\", value: 0.76 },\n    { label: \"+ length-norm\", value: 0.75 },\n    { label: \"− train-infer IS (collapsed)\", value: 0.60 },\n  ]}\n/>\n\n**Off-policy** (split the batch into $N$ mini-batches for $N$ updates), staleness enters and\nthe picture changes. Now clipping *and* Routing Replay both become necessary — drop either and\ntraining collapses early:\n\n<Figure\n  src=\"/articles/first-order-rl/fig2.png\"\n  alt=\"Four-panel off-policy training curves (global batch = 4x mini-batch) over 4,000 gradient steps. MiniRL without clipping and R2 without clipping both collapse early; MiniRL+R2 collapses around 2,500 steps with an entropy blow-up; MiniRL+R3 stays stable to 4,000 steps with a slowly rising KL.\"\n  caption=\"Off-policy training (gbs = 4 × mbs). Without clipping, runs collapse fast; even MiniRL+R2 destabilizes near ~2,500 steps (entropy blow-up, KL spike), while MiniRL+R3 (red) sustains stable training the longest (arXiv:2512.01374, Figure 3).\"\n/>\n\nThe nuance the paper draws out: at **small off-policiness** ($\\text{gbs}=2\\times\\text{mbs}$),\n**R2 beats R3** — R2's lighter bias wins when the approximation is only mildly stressed. 
At\n**larger off-policiness** ($4\\times$, $8\\times$), **R3 wins** — R2 can't hold training\ntogether and R3's stronger discrepancy-killing earns its bias back. The recipe isn't \"always\nuse X\"; it's \"match the replay to how off-policy you're willing to run.\"\n\n<Callout type=\"warn\">\nThe benchmark numbers in the bar chart above are **read off the training curves in Figure 1**,\nnot a reported results table — treat them as approximate. And the whole study is one task\n(verifiable math), one model family (Qwen MoE), and a deliberately harsh FP8-inference /\nBF16-training setup chosen to *amplify* the discrepancy. The mechanism is clean; how the exact\ncrossover points transfer to other rewards, modalities, and precision regimes is not something\none paper can settle.\n</Callout>\n\n## The result that reframes the field\n\nThe finding I keep coming back to isn't a trick — it's about what *matters*. Take one base\nmodel, cold-start it three different ways (distilling from Qwen3-Max-Thinking-Preview,\nDeepSeek-R1-0528, and gpt-oss-120b), then run the same stable recipe. They converge to the\n**same place**:\n\n<Figure\n  src=\"/articles/first-order-rl/fig3.png\"\n  alt=\"Two panels over ~600 gradient steps. Left: benchmark score (AIME25 and AIME24) for three cold-start initializations rising and converging to roughly 0.86. Right: response length for the three, drifting apart but with all improving.\"\n  caption=\"Three different cold-start initializations, one stable RL recipe (MiniRL+R2), converging to comparable final accuracy on AIME25 & AIME24 (arXiv:2512.01374, Figure 5).\"\n/>\n\n<BenchBars\n  title=\"final AIME25 & AIME24 accuracy by cold-start init — same recipe (read from Fig. 
5, approximate)\"\n  unit=\"\"\n  bars={[\n    { label: \"Qwen3-Max-Thinking\", value: 0.86, highlight: true },\n    { label: \"DeepSeek-R1-0528\", value: 0.86, highlight: true },\n    { label: \"gpt-oss-120b (high)\", value: 0.855, highlight: true },\n  ]}\n/>\n\nOnce training is stable, *how you started barely matters* — prolonged RL washes out the\ncold-start differences and even on-policy and off-policy runs reach comparable peaks. The\nimplication is pointed: the field spends enormous effort curating cold-start data, and this\nsays that effort is mostly erased by enough stable RL. The lever that actually moves the\nceiling is **stability**, not initialization.\n\n## The take\n\n- **One approximation, one toolbox.** Token-level RL is the first-order truncation of the\n  sequence objective. IS correction, clipping, and Routing Replay aren't three unrelated\n  patches — they're three ways to keep the truncation valid by shrinking the\n  training–inference discrepancy and policy staleness.\n- **The IS weight is structural, not cosmetic.** It's the linear term itself; removing it\n  doesn't add variance, it changes what you're optimizing, and training collapses on contact.\n- **MoE needs Routing Replay, and it's a real trade.** Pinning experts restores the ratio but\n  biases the target policy. Use R2 when you're near on-policy, R3 when you push off-policy —\n  the paper's clearest practical recipe.\n- **Stability is the scaling lever.** Different cold-starts, on-policy vs off-policy — once\n  stable, they land in the same place. The honest limits: one task, one model family, a\n  stress-test precision setup, and headline benchmark values read from curves rather than a\n  table. The formulation is the durable part; the exact numbers are a single, if very large,\n  data point.\n\n---\n\n*Built on [Stabilizing Reinforcement Learning with LLMs: Formulation and\nPractices](https://arxiv.org/abs/2512.01374) (Zheng et al., Qwen Team, Alibaba). 
Routing\nReplay's two flavors trace to [GSPO](https://arxiv.org/abs/2507.18071) (R2) and Ma et al.,\n2025 (R3, arXiv:2510.11370). Figures are the paper's; the interactive diagrams are mine.*\n","readingTimeMins":11,"url":"https://ai.thesatyajit.com/articles/first-order-rl","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"FLUX 3: when an image model decides to become a world model","description":"Black Forest Labs turned the image-only FLUX line into one multimodal flow-matching model that generates image, video (up to 20s with native audio), and robot action from a single backbone. A first-principles walk through flow matching and joint multimodal attention, the real architecture and Self-Flow figures, a sample clip, and BFL's own preliminary human-preference numbers — caveats kept in view.","date":"2026-07-24","tags":["diffusion","image-generation","video-generation","flow-matching","black-forest-labs","explainer"],"draft":false,"cover":"/articles/flux-3/fig1.png","featured":true,"interest":5,"helpful":4,"kind":"articles","slug":"flux-3","body":"Every FLUX before this one was an image model. FLUX.1 was a 12B rectified-flow text-to-image\ntransformer; FLUX.2 scaled the same recipe to a ~32B image-and-editing model. [FLUX 3](https://bfl.ai/blog/flux-3),\nannounced by [Black Forest Labs](https://bfl.ai) on 2026-07-23, is a different kind of object. It is a\nsingle **multimodal foundation model** that learns jointly from images, video, and audio in one\narchitecture — and then generates all of them, plus robot *action*, from the same backbone. 
The framing\nin the title is deliberate: \"Real World Models.\" BFL is no longer trying to make prettier pictures; it is\ntrying to build a model of how the world looks, moves, and sounds.\n\n<Figure\n  src=\"/articles/flux-3/fig1.png\"\n  alt=\"A dark collage titled FLUX 3, with thumbnails labelled image (a galloping horse), video (a breaking wave), audio (a market scene), and action (a lynx), arranged around the wordmark.\"\n  caption=\"FLUX 3 spans image, video+audio, and action from one model — the thesis in one frame (FLUX 3, official announcement).\"\n/>\n\nThe argument for one model is an information argument, and it is worth stating in BFL's own terms. No\nsingle modality is a complete description of reality — each is a lossy projection captured by a different\nsensor. Images fix spatial structure at one instant; video restores time and reveals physical dynamics;\naudio exposes causal links between mechanical events and the sounds they make; language ties all of it to\ngoals and instructions. Learn from one projection and you get a good model of that projection. Learn from\nall of them at once and their **mutual constraints** teach you more: the sound has to match the impact,\nthe motion has to obey the mass, the future has to follow from the past. The modalities stop being\nseparate problems and start being evidence about one underlying reality.\n\nThat is the pitch. Because FLUX 3 shipped as an Early Access *capabilities* announcement rather than a\ntech report — no parameter count, no full architecture, benchmarks that are BFL's own — the honest way to\ncover it is to explain the mechanism it is built on from first principles, show the evidence BFL did\npublish, and keep the caveats visible. So: two mechanisms first, then the numbers.\n\n## Mechanism 1: flow matching, and why few steps is the whole game\n\nFLUX has always been a **flow-matching** model, and FLUX 3 builds on an approach BFL calls\n[Self-Flow](https://bfl.ai/research/self-flow). 
Flow matching is the cleaner cousin of diffusion, and the\nidea is simple enough to hold in your head. You want to turn a sample of pure noise into a sample of data.\nSo you define a path between them — for training, literally the straight line\n$x_t = (1-t)\\,x_{\\text{noise}} + t\\,x_{\\text{data}}$ — and you train a network to predict the **velocity**\n$v_\\theta(x, t)$ that points along that path. Generation is then just solving an ordinary differential\nequation: start at noise, and take steps in the direction the velocity field tells you, until $t = 1$.\n\nThe subtlety is *how many steps*. Each step is a full forward pass of a huge transformer, so steps are the\ncost. And here the geometry of the path matters enormously. If the learned field is **straight** — a\nconstant velocity, which is what \"rectified\" flow aims for — then a first-order Euler integrator lands on\nthe data exactly, no matter how few steps you take. If the field is **curved**, few steps cut the corner\nand miss the target; the error only closes as you add steps (and cost). Drag the step count and flip the\nfield to feel it:\n\n<FlowSteps />\n\nThis is why \"straighten the transport paths\" has been the central obsession of the whole FLUX / rectified-flow\nlineage: a straighter field means a few-step sampler that still lands, which means a 20-second video is\nmerely expensive instead of impossible. The trajectory geometry above is the same one the\n[Mage-Flow](/articles/mage-flow) piece leans on for its 4-step Turbo — few-step generation is a property of\nthe *field*, not a trick bolted on afterward.\n\n## Mechanism 2: one sequence, joint attention across modalities\n\nThe second mechanism is how four modalities fit in one model. 
FLUX 3 sits in the **MMDiT** lineage — the\nmultimodal diffusion transformer that the [Mage-Flow explainer](/articles/mage-flow) walks through in\ndetail — and the load-bearing idea is that every modality is tokenized into a *single* sequence, and one\nattention operation runs over the whole thing. A query token is not confined to its own modality: an audio\ntoken can attend to the video frame that produced the sound; a video token can attend to the text that\ndescribes the scene. Pick the query's modality below and flip joint versus per-modality attention:\n\n<JointAttention />\n\nPer-modality attention gives you four models in a trenchcoat. **Joint** attention is what lets the mutual\nconstraints actually do their work — it is the mechanical form of \"the modalities are evidence about one\nreality.\" BFL's own diagram makes the shape concrete: per-modality encoders feed a single shared\nmultimodal transformer, matching per-modality decoders read it back out, and *action* is added as one more\nlane — explicitly marked extensible.\n\n<Figure\n  src=\"/articles/flux-3/fig2.png\"\n  alt=\"FLUX 3 architecture: image, video, audio and action inputs each pass through their own encoder into a single shared Multimodal Transformer, with a text encoder feeding in from the side; matching decoders produce image, video, audio and action outputs. The action lane is labelled extensible.\"\n  caption=\"One shared Multimodal Transformer with per-modality encoders/decoders and a text conditioner; the action lane is marked extensible (FLUX 3, official announcement).\"\n/>\n\nSelf-Flow is BFL's method for aligning generation and understanding inside that one backbone, and the one\nquantitative claim they put behind it is that unifying the two lifts *both* at once. 
Their ablation\nreports lower generation error (Fréchet distance) per modality against a flow-matching baseline, and —\nmore interestingly — faster and higher-climbing success on downstream robot-control tasks when the\nbackbone is finetuned:\n\n<Figure\n  src=\"/articles/flux-3/fig3.png\"\n  alt=\"Two-panel chart. Left: generation error (Fréchet distance) for video, image and audio, Self-Flow versus Flow Matching, each normalized to FM=100; Self-Flow is lower on all three (66.3 vs 72.9 video, 3.69 vs 4.04 image, 149.8 vs 153 audio). Right: robot-control success rate over training steps, Self-Flow climbing to 47% versus 35% for flow matching, marked 2x faster learning.\"\n  caption=\"Self-Flow vs. Flow Matching: lower generation error across all three modalities, and ~2× faster / higher robot-control success after finetuning (FLUX 3, official announcement). Vendor-reported, normalized to FM = 100.\"\n/>\n\nRead that right panel carefully, because it is the real bet: the same weights that generate video are a\n**dynamics-aware prior** for physical control. That is the bridge from \"content tool\" to \"world model,\"\nand it is the part most worth being skeptical of until there is a paper.\n\n## What it actually does: video, with sound\n\nThe headline capability in Early Access is video. FLUX 3 generates up to **20 seconds in a single pass**,\nand — the part competitors mostly don't have — every clip comes with **native audio**, generated jointly\nrather than dubbed on afterward. The capability list is broad: text-to-video, image-to-video (animate a\nstill or use images as visual references), video-to-video (carry a character from a reference clip into a\nnew scene), keyframe-to-video for controlled transitions, multilingual dialogue, and *agentic chaining*\nof clips into multi-shot sequences minutes long with consistent characters. 
Here is a representative shot\nfrom BFL's own reel — a single continuous take of a galloping horse, the kind of coherent physical motion\nthe \"world model\" framing is really about:\n\n<Video\n  src=\"/articles/flux-3/flux3-video\"\n  poster=\"/articles/flux-3/flux3-video-poster.jpg\"\n  alt=\"A dappled grey horse galloping across a green plain under a dark stormy sky, with motion blur and debris blowing through the air.\"\n  caption=\"A ~4s excerpt from BFL's FLUX 3 reel (muted/looped here; the source clips carry native audio). Text-to-video, 4K source (FLUX 3, official announcement).\"\n/>\n\n<Callout type=\"note\">\nThe clip is muted and trimmed to keep the page light; the point it carries is temporal coherence — the\nhorse's gait, mane, and the blown debris stay physically consistent across the shot. BFL's full reel runs\nimage, video, and audio together; native sound is one of the model's stronger claims.\n</Callout>\n\n## The numbers — and exactly what they are\n\nFor the preliminary evaluation, BFL generated 10-second text-to-video clips at 720p with audio and ran\npairwise human-preference comparisons against a spread of current video models. These are the results\nthey published — read as \"share of comparisons where a rater preferred FLUX 3 over the named model\":\n\n<BenchBars\n  title=\"FLUX 3 text-to-video — human-preference win rate vs. each model (%)\"\n  unit=\"%\"\n  max={100}\n  bars={[\n    { label: \"Luma Ray 3.2\", value: 93, highlight: true },\n    { label: \"Runway Gen-4.5\", value: 77, highlight: true },\n    { label: \"Grok Imagine Video\", value: 69 },\n    { label: \"Kling v3 Pro\", value: 60 },\n    { label: \"Happy Horse v1\", value: 59 },\n    { label: \"Happy Horse 1.1\", value: 57 },\n    { label: \"Seedance 2.0\", value: 52 },\n    { label: \"Gemini Omni Flash\", value: 52 },\n  ]}\n/>\n\n<Callout type=\"warn\">\nHold these loosely. 
**50% is a tie**, not a loss — so the 52% against Seedance 2.0 and Gemini Omni Flash\nis essentially even, while the 93% against Luma Ray 3.2 is a rout. The two highlighted bars (Runway,\nLuma) are the comparisons BFL leads with in its post; I highlighted *their* emphasis, not mine. Every\nnumber here is **vendor-reported**, from BFL's own harness, on short 720p clips, for a model BFL calls\n\"still in development\" — there is no independent third-party evaluation yet, and the Grok figure is quoted\nby BFL as \"up to 69%.\" This is a preview signal, not a settled ranking.\n</Callout>\n\n## Image and action\n\nImage generation and editing are coming a little behind video (Early Access \"in the following weeks\").\nBFL says even mid-training FLUX 3 is a clear step over earlier FLUX on complex-prompt handling and\nhigh-accuracy multilingual **text rendering** — the two things that have defined the modern image race.\nThe sample grid spans photographic, product, painterly, and graphic styles:\n\n<Figure\n  src=\"/articles/flux-3/fig4.jpg\"\n  alt=\"A grid of eight FLUX 3 image samples: pink smoke in a petri dish, a blue-framed chair with a red seat, molten lava pouring over rock, a painterly stool in a sunlit corner, a minimalist lighthouse at night, a tulip frozen in an ice block, a close-up of an octopus eye, and a black car on a worn asphalt lot.\"\n  caption=\"FLUX 3 image samples across photographic, product, painterly and graphic styles (FLUX 3, official announcement).\"\n/>\n\nThe most unusual branch is **action**. FLUX 3's world understanding is meant to extend to predicting what\nhappens next — and BFL takes two routes to it: native action prediction folded into the model, and using\nthe pretrained video backbone as a dynamics-aware foundation that specialized robot-control models finetune\nfrom with little task-specific data. 
The first partner is [mimic robotics](https://bfl.ai/blog/flux-3),\nwith whom BFL built **FLUX-mimic**, a video-action model for dexterous manipulation reportedly tested on\nproduction tasks at Audi. The bet, again, is that content creation and physical AI run on the *same*\nfoundation — the claim the right-hand panel of the Self-Flow chart is quietly staking out.\n\n## The variants, and what \"open\" means this time\n\nEverything ships from one underlying multimodal flow-matching model, rolled out in phases behind\nEarly Access gates for safety testing:\n\n| Model | Covers | Access | Status (Jul 2026) |\n|---|---|---|---|\n| **FLUX 3 Video** | video + audio, gen & edit | API + private weights | Early Access (now) |\n| **FLUX-mimic / FLUX 3 Action** | action prediction | research & commercial partners | rolling out (mimic robotics) |\n| **FLUX 3 Image** | image, gen & edit | API + private weights | \"in the following weeks\" |\n| **FLUX 3 Dev** | image + video + audio + action backbone | **open weights** | promised, not yet released |\n\nThat last row is the one to watch. FLUX.1 and FLUX.2 earned their standing partly because BFL shipped\nopen `dev` weights the community could actually run; FLUX 3 Dev promises the same for a *multimodal*\nbackbone — but as of the announcement it is a promise, and the near-term reality is API and private-weight\naccess. \"Open\" is on the roadmap, not on the table yet.\n\n## The take\n\nFLUX 3 is the most ambitious repositioning in the open-ish image world this year: from a best-in-class\nimage model to a single flow-matching backbone that treats image, video, audio, and action as one\nlearning problem. The mechanism is sound and well-motivated — joint attention over a unified token sequence\nis the honest way to let modalities constrain each other, and rectified flow is what makes generating 20\nseconds of it tractable. 
The evidence is thinner than the ambition: preliminary vendor evals on short\nclips, a Self-Flow ablation without a paper behind it, an open release that is still a promise, and the\nboldest claim — that a video generator is also a robot-control prior — resting on one chart. If it holds\nup, \"image model\" will look like a strangely narrow way to have described what BFL was building. Worth\nwatching the Dev weights and the tech report; until then, admire the direction and keep the caveats.\n\n---\n\n*Source: [FLUX 3 — Real World Models](https://bfl.ai/blog/flux-3) (Black Forest Labs, 2026-07-23) and\n[Self-Flow](https://bfl.ai/research/self-flow). Architecture, Self-Flow, sample, and benchmark figures are\nBFL's own, shown for commentary; all evaluations are vendor-reported and preliminary. The flow-matching\nand joint-attention interactives are mine. Related: [Mage-Flow](/articles/mage-flow) on the MMDiT backbone\nand few-step Turbo sampling.*\n","readingTimeMins":11,"url":"https://ai.thesatyajit.com/articles/flux-3","signal":{"interest":5,"helpful":4,"score":9,"level":5,"label":"Essential"}},{"title":"GEPA: optimize anything you can score and describe","description":"GEPA (Genetic-Pareto) optimizes a text system by reflecting on its execution traces in natural language and evolving a Pareto frontier of candidates — turning a handful of rollouts into a large gain where RL needs thousands. Its paper reports beating GRPO by up to 20% with 35x fewer rollouts. 
The 'optimize anything' interface generalizes this to any text artifact with a scoring function, and the new 'omni' release composes GEPA with agent-based optimizers into a meta-optimizer that, on Frontier-CS (10 problems, $20 each), tops every standalone optimizer.","date":"2026-07-24","tags":["optimization","prompt-optimization","evolutionary","agents","explainer"],"draft":false,"cover":"/articles/gepa-optimize-anything/fig1.png","featured":false,"interest":4,"helpful":4,"kind":"articles","slug":"gepa-optimize-anything","body":"Most ways to make an LLM system better are expensive. Reinforcement learning like GRPO needs\nthousands of rollouts to squeeze a scalar reward into the weights; hand-tuning a prompt needs a human in\nthe loop. **GEPA** (Genetic-Pareto) makes a different bet: language is a *richer* learning signal than a\nnumber. Instead of a policy gradient from a sparse reward, GEPA reads the system's own execution traces,\n**reflects on them in natural language** to work out what went wrong, and writes a better prompt — keeping\na **Pareto frontier** of candidates so specialist wins are never averaged away. The result, per its paper,\nis a prompt optimizer that turns *a few* rollouts into large gains.\n\nThat core has now been generalized twice. `optimize_anything` drops the \"prompt\" assumption entirely — it\noptimizes any text artifact you can **score and describe** — and the new **omni** release composes GEPA\nwith agent-based optimizers into a single meta-optimizer. This is a walk through the mechanism, then what\neach layer adds, honestly labelled.\n\n<Callout type=\"note\">\nTwo sources, two sets of numbers. The GEPA mechanism and the RL comparison are from the paper *GEPA:\nReflective Prompt Evolution Can Outperform Reinforcement Learning* (Agrawal et al., arXiv 2507.19457).\nThe `optimize_anything` / omni interface and the Frontier-CS results are from the GEPA team's July 2026\nblog post. 
Every number below is **author-reported** on the authors' own tasks and harness; I have not\nre-run them. The interactive diagrams are illustrations of the mechanism, not measured traces.\n</Callout>\n\n## The core: reflect, don't reward\n\nGEPA treats a prompt (or a whole multi-prompt system) as the thing to evolve. One turn of its loop is\nsmall and legible — sample a candidate, run it, read what happened, and let an LLM rewrite it:\n\n<ReflectiveLoop />\n\nThe move that makes this work is stage 4. A GRPO update collapses an entire trajectory into one scalar\nadvantage and nudges billions of weights; almost all of the information in *why* the run failed is thrown\naway. GEPA does the opposite: it hands a **reflective LLM call** the full trace — the reasoning, the tool\ncalls, the tool outputs — together with the evaluator's **textual feedback**, and asks it to diagnose the\nfailure and propose a fix in words. \"You mis-parsed the date format on the third call\" is a far denser\nlesson than \"reward = 0.2\", and it applies after a single rollout instead of thousands.\n\nBecause the feedback is language, the same loop works on any system with one or more LLM prompts, and the\ngains show up fast. Across six tasks the paper reports GEPA **outperforming GRPO by 6% on average and by up\nto 20%, while using up to 35x fewer rollouts** — and beating **MIPROv2**, the previous leading prompt\noptimizer, **by over 10%** (for example, +12% accuracy on AIME-2025). The claim isn't that reflection is\nmagic; it's that natural language is a higher-bandwidth channel than a reward scalar when your optimizer is\nitself a language model.\n\n<Callout type=\"tip\">\nGEPA comes out of the DSPy lineage — it ships as a DSPy optimizer, and the baseline it beats, MIPROv2, is\nDSPy's earlier prompt optimizer. 
If you already express your system as a DSPy program, GEPA is a\ndrop-in optimizer over its prompts; the omni post also references a \"DSPy full program evolution\" tutorial\nthat evolves the program itself, not just its instructions.\n</Callout>\n\n## Why a Pareto frontier, not the best-so-far\n\nThe \"Genetic\" half of the name is evolution — mutate, evaluate, keep the good ones. The subtle part is\n*which* ones you keep. A greedy search keeps the single best-scoring candidate and mutates that. GEPA keeps\nthe whole **Pareto frontier**: every candidate that is the best found so far on **at least one task\ninstance**. Toggle between the two policies and watch what greedy throws away:\n\n<ParetoFrontier />\n\nAveraging is lossy. A prompt that nails a hard sub-case but is middling overall looks worthless to a\ngreedy optimizer and gets discarded — along with the one trick it had figured out. By keeping the frontier\nand **sampling parents from it**, GEPA preserves those specialists so their lessons can be recombined later,\nand it keeps the search from collapsing into a single lineage that stalls in a local optimum. It's the same\ninstinct as maintaining a diverse population in a genetic algorithm, made concrete on per-instance scores.\n\n## Optimize anything: score + describe\n\nOnce you notice that nothing in the loop actually requires the artifact to be a *prompt*, the interface\ngeneralizes. `optimize_anything` asks for exactly two things: a **seed** text artifact, and an\n**evaluator** that returns a score and some feedback. 
Everything else — objective, background context — is\nplain English the reflective LLM reads.\n\n```python\nfrom gepa.optimize_anything import optimize_anything, OptimizeAnythingConfig\n\ndef evaluate(candidate: str) -> tuple[float, dict]:\n    score, feedback = run_judge(candidate)      # your metric + any text you want the LLM to see\n    return score, {\"Feedback\": feedback}\n\nseed = open(\"seed_solution.py\").read()\ntask = dict(\n    evaluator=evaluate,\n    objective=\"Maximize the score for this competitive-programming problem.\",\n    background=\"A sandboxed judge runs hidden tests and returns a 0-100 score.\",\n)\n\nresult = optimize_anything(seed, **task, config=OptimizeAnythingConfig(engine=\"gepa\"))\n```\n\nThe artifact can be a prompt, a code file, a system message, a config, a plan — anything expressible as\ntext and gradeable by a function. The feedback dictionary the evaluator returns is the whole trick: whatever diagnostic text you put\nin it (a failing test's stderr, a judge's rationale, a lint report) becomes the material the reflective LLM\nreasons over. Score tells the search *whether* a candidate is better; feedback tells it *why*, which is what\nmakes the next mutation informed instead of random.\n\n## Omni: no single optimizer wins, so compose them\n\nThe July 2026 post starts from an inconvenient observation: on hard, open-ended problems, **no single\noptimizer dominates**. GEPA's reflective mutation, an autonomous coding agent, and an agent-based proposer\neach win on different problems — and each one eventually **plateaus**. 
The `engine=` argument turns this\ninto a lever: the same `optimize_anything` task can be dispatched to any of three families.\n\n| engine | family | how it proposes |\n|---|---|---|\n| `gepa` | LLM-based optimizer | one reflective LLM call mutates a parent drawn from the Pareto frontier; the framework owns the loop |\n| `meta_harness` | agent-based | a coding-agent proposer mutates the candidate; the framework still owns the loop |\n| `autoresearch` | autonomous agent | a long-horizon agent session owns the *entire* loop — selection, proposal, and orchestration |\n\nAcross ten problems the winner is unpredictable — the post's per-problem tally is GEPA 3, AutoResearch 3,\nMeta-Harness 4 — so betting on one engine is betting wrong 60–70% of the time. Worse, when an engine\nplateaus, *seeding a different engine from the stuck candidate usually breaks through*: on one problem GEPA\nstalled at 54.4 after about \\$1.3 of budget and switching to AutoResearch lifted it to 62.7; on another,\nAutoResearch stalled at 50.0 and both other engines climbed from there to a perfect 100.\n\n<Figure\n  src=\"/articles/gepa-optimize-anything/fig3.png\"\n  alt=\"Two-panel bar chart for Frontier-CS. Left panel: average score across ten problems — GEPA 43.8, AutoResearch 55.4, Meta-Harness 50.9. Right panel: a per-problem breakdown where the winning optimizer changes from problem to problem, with a final win tally of GEPA 3, AutoResearch 3, Meta-Harness 4.\"\n  caption=\"No single optimizer dominates: averages are close and the per-problem winner keeps changing (GEPA, Optimize Anything Omni).\"\n/>\n\n**omni** turns that into a strategy. 
It splits a fixed budget in two phases: **explore**, running all three\nengines in parallel on a small slice (about \\$5 each) and keeping the best candidate; then **continue**,\nseeding a *fresh* optimizer instance with that winner and spending the rest (about \\$5) to push past the\nplateau — all capped at \\$20 total per problem.\n\n<Figure\n  src=\"/articles/gepa-optimize-anything/fig1.png\"\n  alt=\"The omni meta-optimizer as a flow. Phase 1 on the left: GEPA, AutoResearch, and Meta-Harness each run in parallel on a small slice of budget (each labeled five dollars), feeding a 'Pick Best' node. Phase 2 on the right: the best candidate seeds a fresh optimizer instance (fresh GEPA, fresh AutoResearch, or fresh Meta-Harness, each five dollars) that continues the search, producing the omni variants, each capped at twenty dollars total.\"\n  caption=\"omni: explore with all engines on a small slice, pick the best, then continue from it with a fresh optimizer (GEPA, Optimize Anything Omni).\"\n/>\n\nThe composition itself is exposed as a small kit of primitives — `optimize_best_of` (parallel, keep the\ntop), `optimize_sequential` (chain engines), `optimize_vote` (fair cross-engine comparison), and\n`optimize_adaptive_sequential` (auto-switch on plateau detection) — so omni is one policy you can write, not\na hardcoded pipeline.\n\n## Results on Frontier-CS\n\nThe benchmark is **Frontier-CS**: ten open-ended competitive-programming problems, a \\$20 budget each, using\nClaude Sonnet 4.6 with medium thinking. A single zero-shot LLM call averages **7.72**, so there is real\nheadroom. 
Under a matched \\$20 budget, omni tops every standalone optimizer:\n\n<BenchBars\n  title=\"Frontier-CS — mean score, $20/problem (higher is better)\"\n  unit=\"\"\n  bars={[\n    { label: \"zero-shot (1 call)\", value: 7.72 },\n    { label: \"GEPA\", value: 43.8 },\n    { label: \"Meta-Harness\", value: 50.9 },\n    { label: \"AutoResearch\", value: 55.4 },\n    { label: \"omni (best)\", value: 63.2, highlight: true },\n  ]}\n/>\n\nThe per-engine story is that omni lifts *every* base optimizer, not just the strongest — the biggest jump is\nGEPA's, which gains 18 points (about 41%) once it stops having to break through on its own:\n\n| optimizer | standalone | as omni | lift |\n|---|---|---|---|\n| GEPA | 43.8 | 61.8 | +18.0 (+41%) |\n| AutoResearch | 55.4 | 63.2 | +7.8 (+14%) |\n| Meta-Harness | 50.9 | 59.3 | +8.4 (+16%) |\n\nThe mechanism behind those lifts is the plateau-break — a stuck candidate handed to a different optimizer\nkeeps climbing:\n\n<Figure\n  src=\"/articles/gepa-optimize-anything/fig2.png\"\n  alt=\"Two trajectory panels of best score versus cumulative cost in dollars on Frontier-CS. Left, problem P0: GEPA climbs fast then plateaus at 54.4 after about one dollar thirty (solid then dashed line); switching to AutoResearch lifts it to 62.7, while switching to Meta-Harness stays flat at 54.4. Right, problem P85: AutoResearch plateaus at 50.0 almost immediately after about fifty cents; switching to either GEPA or Meta-Harness climbs all the way to a perfect 100.\"\n  caption=\"Seeding a fresh, different optimizer from a stuck candidate unblocks the plateau (GEPA, Optimize Anything Omni).\"\n/>\n\n<Callout type=\"warn\">\nRead these as a promising engineering result, not a settled benchmark. Frontier-CS is ten problems on the\nauthors' own harness with an LLM judge, and the scores are averages over a small set with real variance —\nthe per-problem winner already swings widely. 
omni also spends its budget on *three* engines plus a\ncontinue phase, so the fair comparison is the matched \\$20 cap, which the post does hold. The headline is\nnarrow and honest: under that budget, composing beat every single optimizer they tried — not that omni is\noptimal.\n</Callout>\n\n## The take\n\nGEPA is a clean idea executed in layers. The core is that **language is a denser training signal than a\nreward** when the optimizer is an LLM: reflect on the trace, keep a Pareto frontier so specialists survive,\nand a few rollouts go a long way — up to 20% over GRPO at up to 35x fewer rollouts, on the paper's tasks.\n`optimize_anything` strips away the \"prompt\" assumption and leaves a genuinely general interface: *any text\nartifact you can score and describe* is now optimizable, with the feedback string doing the heavy lifting.\nAnd omni's contribution is an honest one — since no single optimizer wins and each one plateaus, **explore\nacross engines, then continue from the best**, which on Frontier-CS beat every standalone optimizer at a\nmatched budget. The caveats are the usual ones for a fresh result: small benchmark, LLM judge, author-reported\nnumbers. But the shape is compelling, and because the whole thing is `pip install` and a scoring function,\nit's unusually easy for others to check on their own artifacts.\n\n---\n\n*Sources: [GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning](https://arxiv.org/abs/2507.19457)\n(Agrawal et al., 2025) for the mechanism and the RL/MIPROv2 comparisons, and the GEPA team's\n[optimize_anything goes omni](https://gepa-ai.github.io/gepa/blog/2026/07/22/optimize-anything-omni/) post\n(Tan, Agrawal, Lee, Zhang, Klein, Sen, Dimakis, Zaharia, 2026) for the interface and Frontier-CS results.\nGEPA is developed in the [open-source repo](https://github.com/gepa-ai/gepa) and integrates with\n[DSPy](https://dspy.ai). Figures are reproduced from the post for commentary; the interactive diagrams are\nmine. 
All benchmark numbers are author-reported.*\n","readingTimeMins":10,"url":"https://ai.thesatyajit.com/articles/gepa-optimize-anything","signal":{"interest":4,"helpful":4,"score":8,"level":4,"label":"High"}},{"title":"Lanyon: proving a PDE solver correct before you run it","description":"Lanyon is a neurosymbolic system that writes numerical PDE solvers and proves them against an axiomatization of IEEE-754 floating-point arithmetic — checking that the code matches the math before any simulation runs. On its own initial benchmark of simple linear PDE solvers (linear advection, Maxwell), Lanyon reports 20–250× fewer output tokens and seconds-not-minutes wall-clock versus Fable 5, Opus 4.8, GPT-5.6 Sol, GPT-5.5 and Kimi K3 — and says it catches the misformalizations (sorry/native_decide escape hatches, CFL=1 degenerate demos, real-number tactics on float code) those models commit. Self-reported, not independently replicated. The mechanism, and the skeptic's read.","date":"2026-07-24","tags":["neurosymbolic","theorem-proving","pde","scientific-computing","verification","explainer"],"draft":false,"featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"lanyon-neurosymbolic","body":"When a frontier model writes a numerical PDE solver, it does two hard things at once: it *derives* the\nscheme (the math) and it *implements* the scheme (the code), and nothing forces those two to agree. The\nproof — if there even is one — can quietly be about a different object than the code that runs. 
[Lanyon](https://lanyon.ai/research/linear-benchmarking/)\nis a **neurosymbolic** system built to close exactly that gap: it emits a solver and a machine-checkable\nproof from *one* domain-specific specification, and a symbolic engine type-checks the proof and compiles\nthe kernels **before any simulation runs**.\n\nLanyon has published the first in a promised series of benchmarking posts, comparing itself on **simple\nlinear PDE solvers** — one-dimensional linear advection and the Maxwell equations — against five frontier\nmodels: **Claude Fable 5, Claude Opus 4.8, GPT-5.6 Sol, GPT-5.5, and Kimi K3**. The claims are large:\n**20–250× fewer output tokens**, wall-clock measured in seconds rather than the frontier models' minutes,\nand that — unlike every model tested — Lanyon's proofs respect the IEEE-754 floating-point axioms.\n\nTwo things are true at once here, and this piece keeps both in view. The *idea* — proving numerical code\ncorrect against real floating-point semantics, before you trust its output — is genuinely important and\nunder-served. And the *numbers* are Lanyon's own, from an initial benchmark it designed and graded, on a\ndeliberately easy slice of the problem space, with no independent replication yet. Read the mechanism as\nplausible; read the multipliers as vendor claims.\n\n## The loop: propose, then prove before you run\n\nThe architecture is a tight loop between a neural **proposer** and a symbolic **engine**. The proposer\nwrites a candidate solver together with its formal spec in a domain-specific language (DSL); because the\nproof and the code are expanded from the *same* DSL specification, Lanyon's pitch is that the classic\nautoformalization failure — proving one thing while implementing another — is designed out rather than\ncaught after the fact. 
The engine then asks a question you can answer without a single time-step:\n**does the expanded proof type-check, and do the expanded kernels compile?** If not, the specification is\nwrong, and the error routes straight back to the proposer.\n\n<NeurosymbolicLoop />\n\nDrag the stage control and flip the candidate between *clean* and *has bug*. The point of the diagram is\nthe contrast, not the animation: the symbolic engine is a **gate** that rejects a bad solver before it\never produces a number, whereas a pure-LLM path writes solver code token-by-token straight to an\nunverified output. Lanyon frames this as a *tighter* reinforcement-learning loop than agentic\napproaches — it can verify a specification *before* execution, where a frontier agent \"can only verify\n*post hoc*,\" by running the code and eyeballing whether the plots look right.\n\n## Why a symbolic engine can be far cheaper\n\nThe token-efficiency claim is the one most worth understanding mechanistically, because it is the most\nplausible. An LLM that solves a structured math problem end-to-end pays for the *derivation* in tokens:\nit expands algebra, tracks indices, and reasons about stability step by step, in natural language and\nscratch work, and it re-does that reasoning on every run. A symbolic engine does the exact algebra once,\nin a representation built for it, and the neural component only has to *propose the specification* — the\nexpensive, exact, repeatable computation is offloaded to machinery that does it in closed form and\nverifies it, rather than re-deriving it token by token.\n\nFor well-posed, structured problems — and a linear PDE solver is about as structured as scientific\ncomputing gets — that division of labor is exactly where a hybrid should win. 
The honest flip side, which\nwe return to below, is that it is *also* exactly the regime where the symbolic half has the most to\nexploit; the argument gets weaker as problems get less clean.\n\n## Proving code correct under IEEE-754, not the reals\n\nThis is the genuinely interesting engineering, and it is subtler than \"we use a theorem prover.\"\n\nFloating-point arithmetic is not real arithmetic. The most important way it differs: addition is\ncommutative but **not associative** — $(a \\oplus b) \\oplus c \\neq a \\oplus (b \\oplus c)$ in general,\nbecause each $\\oplus$ rounds its result. Distributivity fails for the same reason. So a proof that a\nnumerical scheme is stable or conservative, if it is carried out with ordinary **real-number** algebra,\ncan be a proof about a program that *does not exist* — an idealized version of your kernel that never\nrounds. The code that actually runs obeys the weaker, messier IEEE-754 algebra.\n\nLanyon's answer is an internal, **Lisp-based** symbolic theorem-prover (not Lean) that builds on\nGorard and Hakim's 2025 work on formal verification of PDE solvers within finite-precision arithmetic.\nIts symbolic layer is constrained to properties that actually hold under IEEE-754 — commutativity yes,\nassociativity no. For this benchmark, where every system's output is audited as Lean proofs plus C code\nso the comparison is apples-to-apples, Lanyon's translation to Lean is deliberately careful: expressions\nare **parenthesized** so any algebraic manipulation stays consistent with the IEEE-754 axioms, and where\nthere is any ambiguity it reaches for `simp only` instead of `simp`, and restricts `ring_nf` and\n`field_simp` to cases where a commutative (semi)ring or field structure can be *safely* assumed. Those\nLean tactics quietly assume the real-number identities — associativity, distributivity — that floats\nbreak, so using them freely is how a \"proof\" drifts away from the code. 
Lanyon's claim is blunt: **none\nof the frontier models follow this more restrictive discipline.**\n\n## What the errors look like\n\nBecause Lanyon graded every run against a rubric, the failure taxonomy is concrete. The verdicts:\n\n| Verdict | Meaning |\n|---|---|\n| **Faithful** | The Lean proof matches the C formulas; the proofs are substantive |\n| **Partial** | Honest proofs, but limited to a subset / dead code / a degenerate regime |\n| **Misformalized** | Verification leans on escape hatches or invalid (vacuous, true-by-construction) theorems |\n| **Disconnected** | The Lean proof is about a different object than the C actually implements |\n\nThe specific behaviors Lanyon reports observing in the frontier runs, especially under *terse* prompts:\n\n- **Escape hatches.** `sorry`, `admit`, `native_decide`, vacuous hypotheses, and \"true-by-construction\"\n  theorems dressed up as substantive results.\n- **Degenerate demos.** Running one-dimensional advection only at `CFL = 1.0`, where the scheme collapses\n  to an *exact shift* and verification becomes trivial — in one case despite a terse prompt explicitly\n  asking for a more general solver.\n- **Algorithm substitution.** A GPT-5.6 Sol Maxwell run that used a different finite-volume method than\n  the one requested.\n- **Incomplete verification.** Leaving limiters, the two-dimensional extensions, and time-dependent\n  properties like stability unproven while presenting the result as verified.\n- **Timeouts.** Kimi K3 failed to finish two of three detailed Maxwell trials inside a two-hour window.\n\nThe through-line Lanyon draws is *misformalization*: \"the proof not matching the code is the precise\nfailure mode run to run of other agents, especially under ambiguous prompts.\" Notably, under the\n**detailed** prompts every model that finished was graded Faithful — the divergence shows up when the\nprompt is terse and the model is left to decide what \"verified\" means.\n\n## The numbers 
(vendor-reported)\n\n<Callout type=\"warn\">\nEverything below is Lanyon's own measurement, on a benchmark it authored and graded, over two simple\nlinear PDE problems with three trials each. Lanyon reports **20–100×** (linear advection) to **50–250×**\n(Maxwell) fewer output tokens than the frontier models, wall-clock in **seconds** versus their **minutes**,\nand that its cost stays roughly flat from 1D to 2D while the frontier models' rises 1.5–2×. To reduce\nself-bias each model reviewed every (anonymized) run — a reasonable control — but the benchmark is\nself-selected, the rubric is Lanyon's, and none of it has been independently replicated. Treat the\nmultipliers as claims, not facts.\n</Callout>\n\nThe output-token spread is the visual that carries the token-efficiency argument. These are the frontier\nmodels' reported output tokens on the **Maxwell** solver under the detailed prompt; Lanyon reports its own\nsolver \"takes seconds to generate\" with orders-of-magnitude fewer tokens (the 50–250× figure), and does\nnot publish its exact count in the post:\n\n<BenchBars\n  title=\"Maxwell solver — frontier output tokens (thousands), detailed prompt · vendor-reported\"\n  unit=\"k\"\n  bars={[\n    { label: \"Claude Opus 4.8\", value: 219 },\n    { label: \"Claude Fable 5\", value: 149 },\n    { label: \"GPT-5.5\", value: 35 },\n    { label: \"GPT-5.6 Sol\", value: 30 },\n  ]}\n/>\n\nUnder the **detailed** prompts, every model that finished produced a faithful proof — the differences are\nin cost and wall-clock, not correctness:\n\n**Linear advection — detailed prompt**\n\n| Model | Output tokens | Cost / trial | Wall-clock | Verdict |\n|---|---|---|---|---|\n| Claude Opus 4.8 | 153k | $7.03 ± 2.21 | 30.5 min | Faithful ×3 |\n| Claude Fable 5 | 101k | $7.39 ± 1.17 | 20.9 min | Faithful ×3 |\n| GPT-5.6 Sol | 23k | $2.56 ± 2.23 | 8.7 min | Faithful ×3 |\n| GPT-5.5 | 22k | $1.39 ± 0.29 | 5.4 min | Faithful ×3 |\n| Kimi K3 | not reported | $2.26 ± 0.38 | 37.9 min | 
Faithful ×3 |\n\n**Maxwell equations — detailed prompt**\n\n| Model | Output tokens | Cost / trial | Wall-clock | Verdict |\n|---|---|---|---|---|\n| Claude Opus 4.8 | 219k | $13.00 ± 5.16 | 49.0 min | Faithful ×3 |\n| Claude Fable 5 | 149k | $11.03 ± 0.63 | 30.9 min | Faithful ×3 |\n| GPT-5.6 Sol | 30k | $2.28 ± 0.79 | 12.3 min | Faithful ×3 |\n| GPT-5.5 | 35k | $2.63 ± 0.91 | 9.0 min | Faithful ×3 |\n| Kimi K3 | not reported | $5.43 ± 3.01 | 92.0 min | Faithful (1/3 finished; 2 DNF) |\n\nThe rubric bites under the **terse** prompts, where the model has to decide for itself what a \"verified\"\nsolver means. This degradation table is really the substance of Lanyon's correctness claim:\n\n| Model | Advection (terse) | Maxwell (terse) |\n|---|---|---|\n| Claude Fable 5 | Faithful ×2, Partial ×1 | Misformalized ×1, Partial ×2 |\n| Claude Opus 4.8 | Partial ×3 | Partial ×3 |\n| GPT-5.6 Sol | Faithful ×2, Partial ×1 | Faithful ×2, Partial ×1 |\n| GPT-5.5 | Partial ×2, Misformalized ×1 | Partial ×3 |\n| Kimi K3 | Misformalized ×2, Faithful ×1 | Partial ×3 |\n\n## The skeptic's read\n\nTake the mechanism seriously and the numbers skeptically.\n\n- **The domain is the easiest possible.** Simple *linear* PDEs with well-posed, structured solutions are\n  precisely where a symbolic engine has the most to exploit and an LLM has the least edge. Nonlinear,\n  stiff, shock-forming, or turbulent problems — where numerical analysis actually gets hard, limiters\n  matter, and closed-form structure evaporates — are exactly the regime this benchmark does not touch.\n  Lanyon calls this \"the first in a series\"; the interesting posts are the later ones.\n- **Self-selected and self-graded.** Lanyon chose the problems, wrote the rubric, and defined what\n  \"Faithful\" means. 
The cross-model anonymized review is a real mitigation against self-bias, but it does\n  not make the benchmark neutral, and a rubric that centers *formal verification discipline* is one Lanyon\n  is built to win by construction.\n- **The \"errors\" need replication.** The escape-hatch and degenerate-demo findings are specific and\n  falsifiable — which is good — but they are single-digit trial counts from one evaluator. \"Frontier\n  models game terse prompts\" is a claim that should be independently reproduced before it is repeated as\n  fact, not least because prompt phrasing is doing a lot of work here (every run that finished under the\n  detailed prompts was graded Faithful).\n- **Lanyon doesn't show its own homework.** It reports the frontier models' tokens and times but not its\n  own exact figures, so the headline multipliers are computed against a number (\"seconds,\" \"far fewer\n  tokens\") the reader can't inspect.\n\nNone of that undermines the core idea. Verifying that a numerical kernel satisfies its spec *under\nIEEE-754 semantics* — not under an idealized real-number fiction — is the right thing to want, and doing\nit *before* the simulation runs is a real advantage over \"run it and see if the plot looks physical.\" That\ninstinct is the same one behind verifier-gated systems like [Leanstral](/articles/leanstral-formal-proofs),\nwhich grade with a checker built to reject `sorry` and `native_decide` outright; Lanyon points the same\ndiscipline at floating-point numerical code.\n\n## The take\n\nLanyon's contribution, stripped of the multipliers, is a stance worth taking seriously: derive the solver\nand its proof from one specification, check the proof against the arithmetic the hardware actually uses,\nand reject the program before it ever runs if the two don't line up. 
That is a cleaner story than any\nsingle benchmark number, and it is the part that would still matter if the numbers were half as large.\n\nThe numbers themselves are early, self-reported, and drawn from the friendliest possible corner of\nscientific computing. Twenty-to-two-hundred-fifty-fold is the kind of figure that demands independent\nreplication and harder problems before it means anything durable — and the honest version of the excitement\nis not \"Lanyon is 250× better,\" it's \"a neurosymbolic system that offloads exact computation and\nfloat-faithful verification to a symbolic engine *should* be dramatically more efficient on structured\nmath, and here is the first, self-graded evidence that one is.\" Whether that holds when the PDEs stop being\nlinear is the whole question — and exactly what the promised follow-up posts have to answer.\n\n---\n\n*Source: Lanyon's [linear-PDE benchmarking write-up](https://lanyon.ai/research/linear-benchmarking/)\n(Lanyon, 2026), which builds on Gorard & Hakim (2025) on formal verification of PDE solvers in\nfinite-precision arithmetic. All benchmark numbers, cost/token figures, error findings, and speed/efficiency\nmultipliers are Lanyon's own self-reported results on a benchmark it designed and graded; they have not been\nindependently verified. The interactive diagram is mine.*\n","readingTimeMins":11,"url":"https://ai.thesatyajit.com/articles/lanyon-neurosymbolic","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"Ling-3.0-flash: a 124B open MoE that runs like a 5B and reaches for 1M tokens","description":"inclusionAI's Ling-3.0-flash is a 124B-parameter, ~5.1B-active open MoE that reaches a 1M-token context by interleaving Kimi Delta Attention (linear-time) with Gated MLA (full attention) at a 5:1 ratio, over a 512-expert / 8-active FFN. 
A first-principles tour of the hybrid-attention stack, the E512A8 + shared-expert MoE, and where the launch benchmarks land it against the 1T flagship and the frontier — plus the striking resemblance to Kimi K3.","date":"2026-07-24","tags":["llm","mixture-of-experts","linear-attention","hybrid-attention","open-weights","explainer"],"draft":false,"cover":"/articles/ling-3-0-flash/fig1.png","featured":true,"interest":5,"helpful":4,"kind":"articles","slug":"ling-3-0-flash","body":"[Ling-3.0-flash](https://huggingface.co/inclusionAI/Ling-3.0-flash), released 2026-07-23 by **inclusionAI**\n(Ant Group's Ling/Bailing MoE family), is a **124B-parameter** open Mixture-of-Experts that activates only\n**~5.1B parameters per token** — about **4%**. It ships with a **256K native context** that is designed to\nextend to **1M tokens**, a hybrid-reasoning (\"thinking\") mode, and a claim that lands harder than the spec sheet:\nwith roughly **1/8 the total** and **1/12 the active** parameters, it *matches or beats inclusionAI's own 1T\nflagship* on most of the benchmarks it was launched against.\n\nThe interesting part is not the sparsity ratio on its own — it's the **attention stack** that makes a cheap\nmillion-token context tractable. Ling-3.0-flash is built on **hybrid linear attention**: it interleaves **KDA\n(Kimi Delta Attention)** — a gated delta-rule linear-attention layer with constant-size memory — with **Gated\nMLA** full attention, at a **5:1 ratio**, and feeds each block into a **512-expert MoE** that fires just **8\nexperts** plus **1 shared expert** per token. This is a first-principles tour of each piece, why it matters, and\nwhere the launch numbers put it.\n\n<LingArchitecture />\n\nRead the center stack bottom-to-top: tokens are embedded, pass through **7 groups** of blocks — each group is\n**5×** (KDA + MoE) followed by **1×** (Gated MLA + MoE) — then a final norm, a 157k-vocab output projection, and\na multi-token-prediction head. 
The two right-hand panels expand the attention modules. Take the ideas one at a\ntime.\n\n## KDA: constant-size memory over a very long context\n\nOrdinary softmax attention keeps a **KV cache** that grows by one entry per token. At a 256K–1M context that cache\n*is* the cost: decoding becomes [memory-bound on a cache that scales with sequence length](/articles/how-llm-inference-works),\nand it only gets heavier as the context fills.\n\n**KDA** avoids that. It is a **gated delta-rule linear attention**: instead of a growing cache it keeps a\n**fixed-size recurrent state** $S_t$ that each token updates in place — it *erases* a little of the old state (a\nper-channel gated decay) and *writes* the new key/value association (the delta rule). A compact way to write the\nfamily is\n\n$$\nS_t = \\mathrm{Diag}(\\alpha_t)\\, S_{t-1} + \\beta_t\\, k_t v_t^{\\top}, \\qquad o_t = S_t^{\\top} q_t\n$$\n\nwhere $\\alpha_t$ is the gated decay (the erase), $\\beta_t\\, k_t v_t^{\\top}$ is the written delta, and $o_t$ reads\nthe state with the query. The state $S_t$ is a fixed $d \\times d$ matrix — its size does **not** depend on how many\ntokens came before. That is exactly the structure the right-hand KDA panel draws: queries and keys go through a\nshort conv and an L2 norm, values through a conv, and learned $\\alpha$/$\\beta$ gates (softplus $\\varphi$ and sigmoid\n$\\sigma$) control the decay and write, with a final sigmoid output gate.\n\nBecause the state is constant-size, KDA runs in **linear time and constant memory** in the sequence length — so\nfive of every six layers pay *no* growing-cache cost at all. Only the single Gated-MLA layer per group keeps a\nreal KV cache, so the model's total long-context memory grows at roughly **1/6 the slope** of an all-attention\nmodel. Drag the context length and watch the gap open up:\n\n<KvVsState />\n\nThis is the lever behind \"1M-token context\" being a design target rather than a marketing number. 
It is not free —\na linear-attention state is a **lossy summary**, not a perfect record — which is exactly why Ling keeps full\nattention in the mix.\n\n## Gated MLA: one full-attention layer per group for exact recall\n\nEvery group's sixth block is a **Gated MLA** layer — **Multi-head Latent Attention** with **RoPE** and a learned\nsigmoid gate. MLA compresses keys and values into a low-rank latent before attending, which shrinks the KV cache\nof the full-attention layers themselves; RoPE gives the positional structure that makes long-range *exact* recall\nwork. The left panel's callout says it plainly: the **1M-token context** rides on RoPE-equipped MLA, while KDA\ncarries the **linear time complexity**.\n\nThe division of labour is the whole point. KDA is cheap but forgetful; full attention is exact but expensive. A\n**5:1** interleave keeps one exact-recall layer in every group so the lossy linear layers have something precise to\nanchor to — you get most of full attention's fidelity at a fraction of its memory. This is the same bet Moonshot\nmade in [Kimi K3](/articles/kimi-k3), and the resemblance is not subtle (more on that below).\n\n## The MoE: 8 of 512, plus a shared expert, kept balanced\n\nLing-3.0-flash's feed-forward is a **fine-grained MoE**: **512 routed experts**, of which only **8** fire per\ntoken — an activation ratio of **1/64** — plus **1 shared expert** that runs on every token to carry the common,\nalways-useful computation. That extreme sparsity is what lets a 124B model spend only ~5.1B parameters per token:\nthe compute is that of a ~5B model while the *knowledge capacity* is that of a 124B one.\n\nAt 1/64 activation two problems that are mild in a denser MoE turn first-order. **Routing** has to be learned well\nor most of the capacity is wasted, and **load balance** matters even more — if a few experts hog the tokens, the\nrest never train and the effective model collapses to something far smaller than 124B. 
Ling's answer is\n**ALF-LB (adaptive load balancing)**: rather than a single brittle auxiliary-loss coefficient, it adapts the\nbalancing pressure per expert so utilisation stays even without destabilising training — the same *spirit* as K3's\naux-loss-free balancing, aimed at keeping all 512 experts alive.\n\nTwo more structural details worth stating:\n\n- **The first 2 blocks use a dense FFN instead of MoE.** Early layers do broad, low-level feature mixing where\n  routing buys little and can hurt stability, so Ling keeps them dense and only switches to sparse experts deeper\n  in the stack — a common, load-bearing choice at this sparsity.\n- **Multi-Token Prediction (MTP).** The training objective is next-token prediction **plus** an auxiliary\n  multi-token-prediction head (the node at the top of the stack). MTP densifies the learning signal per step and\n  doubles as a **self-speculative decoding** draft head at inference, which is part of how a \"flash\" model earns\n  the name.\n\nRounding out the sheet: a **157k-token vocabulary**, an **embedding dimension of 2,560**, and the 7-group hybrid\nstack above.\n\n## The Kimi K3 resemblance\n\nIf this all sounds familiar, it should. [Kimi K3](/articles/kimi-k3) is built on the same three bets — **KDA**\nfor constant-size long-context memory, **full attention interleaved** for exact recall, and an **extreme-but-stable\nsparse MoE** with a balancing scheme that avoids a brittle aux-loss knob. Ling-3.0-flash runs the same playbook at\na very different scale: **124B/5.1B** for a fast production model versus K3's **2.8T/~50B** frontier system, and\n**8-of-512** routing versus K3's 16-of-896. 
The convergence is the story — two independent open labs arriving at\nthe *same* architecture for efficient long-context reasoning strongly suggests this hybrid-linear + sparse-MoE\nrecipe is where open models are settling.\n\n## The benchmarks\n\nOn its launch suite, inclusionAI compares the **Ling-3.0-flash(RC3)-Thinking** build against a field of thinking\nmodels: its own 1T **Ring-2.6-1T**, **MiniMax-M2.7**, **Step-3.7-Flash-high**, **Deepseek-v4-flash-max**,\n**Nemotron-3-Super-120B**, **GPT-5.4-mini-high**, and **Claude-Sonnet-4.6-maxthink**. The full grid:\n\n<Figure\n  src=\"/articles/ling-3-0-flash/fig1.png\"\n  alt=\"A 12-panel grouped bar chart of Ling-3.0-flash(RC3)-Thinking against Ring-2.6-1T-expert, MiniMax-M2.7, Step-3.7-Flash-high, Deepseek-v4-flash-max, Nemotron-3-Super-120B, GPT-5.4-mini-high and Claude-Sonnet-4.6-maxthink across SWE-Bench Pro, SWE-Bench Multilingual, Terminal-Bench v2.1-AA, Tau3-banking-AA, MCP-Atlas, SkillsBench, WideSearch, BrowseComp, IFBench, SysBench, MRCR-128k and Multi-IF. Ling-3.0-flash is highlighted in blue and leads or ties on most agentic panels.\"\n  caption=\"Ling-3.0-flash(RC3)-Thinking vs a field of thinking models across coding, agentic, long-context and instruction-following benchmarks (inclusionAI, launch report).\"\n/>\n\nThe headline result is coding. 
On **SWE-Bench Pro** the 124B model edges the entire field — including the 1T\nsibling and every frontier opponent it was tested against:\n\n<BenchBars\n  title=\"SWE-Bench Pro (%)\"\n  bars={[\n    { label: \"Ling-3.0-flash\", value: 56.63, highlight: true },\n    { label: \"Step-3.7-Flash\", value: 56.3 },\n    { label: \"MiniMax-M2.7\", value: 56.2 },\n    { label: \"Ring-2.6-1T\", value: 53.9 },\n    { label: \"Deepseek-v4\", value: 52.6 },\n    { label: \"Claude-Sonnet-4.6\", value: 48.29 },\n    { label: \"GPT-5.4-mini\", value: 47.88 },\n    { label: \"Nemotron-3-120B\", value: 34.06 },\n  ]}\n/>\n\nLong-context and instruction-following are where the hybrid stack should pay off, and it does. On **MRCR-128k** it\nsits second, just behind Claude and narrowly ahead of the 1T Ring and Deepseek, while GPT-5.4-mini,\nNemotron, Step and MiniMax fall off a cliff:\n\n<BenchBars\n  title=\"MRCR-128k — long context (%)\"\n  bars={[\n    { label: \"Claude-Sonnet-4.6\", value: 92.46 },\n    { label: \"Ling-3.0-flash\", value: 90.78, highlight: true },\n    { label: \"Ring-2.6-1T\", value: 90.06 },\n    { label: \"Deepseek-v4\", value: 88.5 },\n    { label: \"GPT-5.4-mini\", value: 56.09 },\n    { label: \"Nemotron-3-120B\", value: 40.76 },\n    { label: \"Step-3.7-Flash\", value: 39.19 },\n    { label: \"MiniMax-M2.7\", value: 27.68 },\n  ]}\n/>\n\nOn **SysBench** (system-prompt adherence) it's in a near-tie at the top with Claude, Deepseek and GPT-5.4-mini:\n\n<BenchBars\n  title=\"SysBench (%)\"\n  bars={[\n    { label: \"Claude-Sonnet-4.6\", value: 94.85 },\n    { label: \"Deepseek-v4\", value: 93.86 },\n    { label: \"Ling-3.0-flash\", value: 93.63, highlight: true },\n    { label: \"GPT-5.4-mini\", value: 93.31 },\n    { label: \"Step-3.7-Flash\", value: 91.38 },\n    { label: \"Nemotron-3-120B\", value: 90.73 },\n    { label: \"Ring-2.6-1T\", value: 86.47 },\n    { label: \"MiniMax-M2.7\", value: 86.19 },\n  ]}\n/>\n\nIt is not a clean sweep. 
On **Terminal-Bench v2.1-AA** — long-horizon, tool-heavy agent work — Claude-Sonnet-4.6\nis well clear and Deepseek leads the open pack; Ling lands third, ahead of GPT-5.4-mini and its own 1T sibling but\nwell short of Claude:\n\n<BenchBars\n  title=\"Terminal-Bench v2.1-AA (%)\"\n  bars={[\n    { label: \"Claude-Sonnet-4.6\", value: 71.2 },\n    { label: \"Deepseek-v4\", value: 62 },\n    { label: \"Ling-3.0-flash\", value: 57, highlight: true },\n    { label: \"GPT-5.4-mini\", value: 55.81 },\n    { label: \"MiniMax-M2.7\", value: 55 },\n    { label: \"Ring-2.6-1T\", value: 43.1 },\n    { label: \"Step-3.7-Flash\", value: 39.3 },\n    { label: \"Nemotron-3-120B\", value: 39 },\n  ]}\n/>\n\nThe pattern is consistent with the architecture: Ling is strongest where **structured recall and instruction\nadherence** dominate (SWE-Bench Pro, MRCR-128k, SysBench, IFBench), and merely competitive on the longest-horizon\nagent loops where a top proprietary model still pulls ahead. For a 124B model activating ~5.1B parameters, being in\nthat conversation — and beating a 1T model at 1/12 the active compute — is the result.\n\n<Callout type=\"warning\">\n**Read these as vendor numbers.** (1) Every score above comes from **inclusionAI's own launch report** for the\n**RC3-Thinking** build, run against opponents at their listed settings (e.g. `Claude-Sonnet-4.6-maxthink`); treat\ncross-lab comparisons as directional, not audited. (2) The **1M-token context** is a stated design target extending\na **256K native** window — long-context quality at the far end is not established by MRCR-128k alone. (3)\nArchitecture details (the 5:1 KDA:MLA interleave, E512A8 + shared expert, ALF-LB, dense first-2-blocks, MTP) are\ndrawn from inclusionAI's release and community write-ups; exact per-layer counts may differ in the final tech\nreport. 
(4) The diagram is a **faithful recreation** of the launch architecture figure in our house style, not the\noriginal image.\n</Callout>\n\n## The take\n\nStrip away the \"beats a 1T model\" headline and what's genuinely useful about Ling-3.0-flash is a **coherent,\nreproducible recipe**: **KDA** buys a linear-time, constant-memory path to very long context; a **1-in-6 Gated MLA**\nlayer buys back the exact recall linear attention loses; an **8-of-512 + shared-expert** MoE with **ALF-LB** buys\n124B of capacity at ~5B of active compute and keeps all the experts trained; and **MTP** plus dense early blocks\nmake the whole thing converge and decode fast. That it is essentially the [Kimi K3](/articles/kimi-k3) architecture\nat 1/22 the size is the most telling part — the frontier recipe for efficient long-context reasoning is now open,\nand it runs on a single node.\n\n---\n\n*Sources: the [Ling-3.0-flash model card](https://huggingface.co/inclusionAI/Ling-3.0-flash) and inclusionAI's\nlaunch materials (architecture, hybrid KDA/MLA interleave, MoE configuration, benchmarks), the\n[Kilo announcement](https://blog.kilo.ai/p/announcing-ling-30-flash-free-on) (124B/5.1B, 256K→1M context), and\ninclusionAI's launch benchmark chart (reproduced above). Benchmark numbers are quoted from inclusionAI's own\nreport for the RC3-Thinking build. The architecture diagram is a house-style recreation of the launch figure; the\nKV-vs-state chart is illustrative (order-of-magnitude, to show the shape).*\n","readingTimeMins":10,"url":"https://ai.thesatyajit.com/articles/ling-3-0-flash","signal":{"interest":5,"helpful":4,"score":9,"level":5,"label":"Essential"}},{"title":"MAI-Image-2.5-Pro and MAI-Voice-2-Flash: Microsoft builds its own","description":"Microsoft AI put two in-house models into public preview — MAI-Image-2.5-Pro for text-to-image and MAI-Voice-2-Flash for fast speech — and quietly swapped them into Bing, PowerPoint, OneDrive and Dynamics 365. 
The story isn't a benchmark; it's the strategy: MAI now builds frontier image and voice models on its own data and serves them into Microsoft's surface at a fraction of the GPU cost of third-party models.","date":"2026-07-24","tags":["image-generation","tts","microsoft","multimodal","explainer"],"draft":false,"cover":"/articles/mai-image-2-5-voice-2/fig1.jpg","featured":false,"interest":3,"helpful":3,"kind":"articles","slug":"mai-image-2-5-voice-2","body":"For most of the last three years, \"Microsoft's AI\" mostly meant OpenAI's models wearing a Copilot badge.\n[This announcement](https://microsoft.ai/news/introducing-mai-image-2-5-pro-and-mai-voice-2-flash/) is\nthe other Microsoft — **MAI**, Mustafa Suleyman's Microsoft AI group — shipping two of its *own* frontier\nmodels into public preview: **MAI-Image-2.5-Pro**, a text-to-image model tuned for quality, and\n**MAI-Voice-2-Flash**, a speech model tuned for speed. Neither is a wrapper. Both are trained in-house,\nand both are already swapped into products you use.\n\nThe headline isn't a leaderboard score — it's a supply-chain move. Microsoft has spent years renting its\nimage and voice capability from third parties; MAI is now making that capability itself, on its own data,\nand serving it into Microsoft's product surface at a large discount. That's the frame worth reading these\ntwo releases through.\n\n## Two tracks, one strategy\n\nThe pair is deliberately split by objective. **Pro** is the quality lane — hero imagery, detailed edits,\nprecise in-image text, priced like a premium model. **Flash** is the throughput lane — fast, cheap speech\nfor high-volume voice, where responsiveness beats everything. What ties them together is where they land:\neach has been dropped into Microsoft products in place of an outside model, and each ships with a\nself-reported serving win.\n\n<DeployMap />\n\nRead the fan the way Microsoft wants you to: these aren't demos looking for a home. 
Bing Image Creator is\nnow **100% in-house** on MAI-Image-2.5; PowerPoint's image-to-image runs on it at a claimed **84% lower\nGPU cost than GPT-Image-2**; OneDrive made it the default editor. On the voice side, MAI-Voice-2-Flash\npowers Dynamics 365 Contact Center at a claimed **89% GPU-cost reduction** and feeds Azure Voice Live.\nThe numbers are Microsoft's own, but the direction is unambiguous — every one of these was previously a\nplace a third-party model would have run.\n\n## MAI-Image-2.5-Pro: quality, and its own data\n\nMAI-Image-2.5 is the model line; **Pro** is its high-fidelity tier, the one you reach for when the output\nis the deliverable rather than a thumbnail. When the base MAI-Image-2.5 model debuted on\n[LMArena](https://lmarena.ai) it landed at **No. 3 for text-to-image and No. 2 for image editing** — a\nnotch behind OpenAI's image models but, per third-party arena coverage, roughly level with Google's\nNano Banana 2. For a first fully in-house image model, that's a real result.\n\nThe sample reel leans hard on the two things generators historically fumble: **product photography** and\n**legible in-image text**. Brand lockups, packaging copy, poster typography — the kind of output where a\nsingle wrong glyph gives the game away.\n\n<Figure\n  src=\"/articles/mai-image-2-5-voice-2/fig1.jpg\"\n  alt=\"A collage of eight images generated by MAI-Image-2.5: a blue ORPHÉON perfume brand poster with rendered serif text, a yellow LEMONS juice carton product shot, a purple BATIZ handbag ad, a dog on a London zebra crossing, a person reading in a park, a 'Fun Birds' magazine mockup with a fluffy chicken, silver shoes on checkerboard tile, and a tiled bathroom interior.\"\n  caption=\"Sample generations shown with the release — product shots and rendered in-image text (ORPHÉON, LEMONS, BATIZ, Fun Birds) are the pitch (Microsoft AI, announcement).\"\n/>\n\nThat emphasis shows up in the self-reported Arena breakdown. 
Against the prior MAI-Image-2, Microsoft\nreports a **+75 overall Elo gain**, and the two categories that moved most were exactly the hard ones:\n\n<BenchBars\n  title=\"MAI-Image-2.5 — self-reported Arena Elo gain over MAI-Image-2, by category\"\n  bars={[\n    { label: \"Text rendering\", value: 107, highlight: true },\n    { label: \"Cartoon / anime\", value: 90 },\n    { label: \"Overall\", value: 75 },\n  ]}\n/>\n\nThese are *deltas versus the previous generation*, not absolute scores against rivals, and they're\nMicrosoft's own Arena tallies — read them as \"where the team pushed,\" not as a competitive ranking. The\none claim that is genuinely strategic rather than aesthetic sits in the fine print: MAI says the model is\ntrained on **\"clean, traceable, enterprise-grade data, without distillation from third-party models.\"**\nFor an enterprise buyer nervous about provenance and copyright, \"we didn't distill someone else's model\nand we can trace our data\" is a feature, not a footnote — and it's a pointed contrast to the murkier\nlineage of much of the field.\n\nPricing tells you which lane Pro is in: **$5 / 1M text-input tokens**, **$8 / 1M image-input tokens**,\nand **$106 / 1M image-output tokens** — priced as a premium generation model, not a commodity one.\n\n## MAI-Voice-2-Flash: the throughput lane\n\nThe voice release is smaller in ambition and clearer in purpose. **MAI-Voice-2-Flash** is a distilled,\nspeed-first sibling of MAI-Voice-2: Microsoft reports it is **2× faster** and **32% cheaper** while\nkeeping \"the natural prosody and high acoustic quality\" of the parent. It's priced at **$15 / 1M\ncharacters** — the kind of number that only matters at contact-center volume, which is exactly the\ntarget.\n\nThe MAI-Voice line has been a speed story from the start: its first model was pitched on generating a\nfull minute of audio in under a second on a single GPU. 
Flash extends that lineage in the direction that\nmatters for the deployment above — a call-center agent that has to respond *now*, thousands of\nconversations in parallel, where a half-second of latency is the difference between natural and robotic.\nPairing \"good enough prosody\" with \"cheap and instant\" is the entire product thesis, and it's why the\nDynamics 365 and Azure Voice Live integrations lead the voice half of the announcement rather than a\nquality benchmark.\n\n<Callout type=\"note\">\nMicrosoft frames this as a *family*, not a single model: a **Pro/quality** tier and a **Flash/speed**\ntier per modality, so a product team picks the point on the cost–quality curve it needs. That's the same\n\"pick your lane\" packaging the rest of the industry has converged on (Pro vs. Flash, Opus vs. Haiku) —\nMicrosoft is now doing it with models it owns end-to-end.\n</Callout>\n\n## Why in-house, and why now\n\nStrip away the model cards and the strategic logic is a spreadsheet. Every image or utterance Microsoft\ngenerates from a third-party API is marginal cost it doesn't control and margin it doesn't keep. Owning\nthe model turns that into an internal transfer — and the reported serving wins (**−84%** GPU cost in\nPowerPoint, **−89%** in Dynamics 365, **2.5× efficiency** with a **25%** P95-latency cut and a **26%**\nhigher save rate in OneDrive) are the payoff, measured across products that run at Microsoft scale. At\nthat volume, a double-digit-percent cost cut on a capability embedded in Office and Azure is a very large\nnumber.\n\nIt's also insurance. MAI already builds its own [text models](https://microsoft.ai) and voice models;\nadding a competitive image model means Microsoft can staff Copilot, Bing, Office and Azure from its own\nfrontier lab if it ever needs to — reducing dependence on any single outside provider. 
Two public-preview\nmodels are a small headline; \"Microsoft no longer *has* to rent its image and voice stack\" is the actual\none.\n\n<Callout type=\"warn\">\nKeep the caveats attached. Every number here is **vendor-reported**: the Arena deltas are Microsoft's own\ntallies, and the GPU-cost and efficiency figures are Microsoft's internal measurements against its own\nbaselines, not independently reproduced. There's **no technical report** — no architecture, parameter\ncount, or training detail was published, only capability claims and prices. Both models are in **public\npreview**, which means the quality bar and the pricing can still move.\n</Callout>\n\n## The take\n\nMAI-Image-2.5-Pro and MAI-Voice-2-Flash are not the most capable image and voice models in the world, and\nMicrosoft doesn't claim they are. What they are is *sufficient* — a top-three image model and a fast,\ncheap voice model, both good enough to swap into the real products where Microsoft used to pay someone\nelse. That's the whole move: not winning a leaderboard, but owning the supply chain and pocketing the\nGPU-cost delta at Office-and-Azure scale, on data Microsoft says it can trace. It pairs naturally with the\nresearch-lab counterpart from the same company, [Mage-Flow](/articles/mage-flow) — a 4B efficiency bet —\nand with [Qwen-Image-3.0](/articles/qwen-image-3), another vendor deciding its image model should be\n*useful* infrastructure rather than an art toy. The frontier that's being contested here isn't quality.\nIt's who owns the model behind the button.\n\n---\n\n*Source: [Introducing MAI-Image-2.5-Pro and MAI-Voice-2-Flash](https://microsoft.ai/news/introducing-mai-image-2-5-pro-and-mai-voice-2-flash/)\n(Microsoft AI, 2026-07). LMArena placements and the \"level with Nano Banana 2\" comparison are from the\nearlier [MAI-Image-2.5 launch](https://microsoft.ai/news/introducing-mai-image-2-5/) and third-party\narena coverage. 
All benchmark, cost, and efficiency numbers are Microsoft's own; the sample image is the\nannouncement's, shown for commentary. The interactive is mine.*\n","readingTimeMins":7,"url":"https://ai.thesatyajit.com/articles/mai-image-2-5-voice-2","signal":{"interest":3,"helpful":3,"score":6,"level":2,"label":"Solid"}},{"title":"SANA-Video 2.0: keeping video attention linear without losing the picture","description":"NVIDIA's SANA-Video 2.0 generates 720p, multi-second video on a single H100 — 13.06s for a 720p/5s clip, a claimed 120× over Wan 2.2-A14B. It gets there by refusing quadratic attention: a high-compression VAE shrinks the clip to a modest token count, then a hybrid backbone keeps three of every four layers linear and makes the fourth a full-softmax anchor, with Block Attention Residuals carrying the refreshed features across depth. A first-principles walk through why linear attention plus deep compression makes long, high-res video cheap — with the paper's real figures, a sample clip, and its own numbers kept honest.","date":"2026-07-24","tags":["diffusion","video-generation","linear-attention","efficient-inference","nvidia","explainer"],"draft":false,"cover":"/articles/sana-video2/fig1.png","featured":true,"interest":5,"helpful":4,"kind":"articles","slug":"sana-video2","body":"Video generation has a scaling problem that images mostly dodge. After a VAE compresses a clip, a single\n1080p video still spans *tens of thousands* of latent tokens — and a standard video diffusion transformer\nruns full 3D softmax attention over all of them, at every layer, at a cost that grows with the *square* of\nthe token count. Double the resolution or the duration and the attention bill doesn't double, it\nquadruples. That $O(N^2)$ wall is why most open video models cap out at a few seconds and lean on clusters.\n\n[SANA-Video 2.0](https://arxiv.org/abs/2607.21553), from NVIDIA, is an argument that you can walk around\nthe wall instead of paying to climb it. 
It generates high-quality video up to 720p **on a single GPU** —\nthe 5B model renders a 720p, 5-second clip in **13.06 seconds** on one H100, which the paper clocks at\nroughly **120× faster** than Wan 2.2-A14B — while claiming quality parity with much larger full-softmax\nsystems. It does this by being disciplined about where quadratic attention is actually worth its price.\n\n<Figure\n  src=\"/articles/sana-video2/fig1.png\"\n  alt=\"SANA-Video 2.0 teaser. Top: text-to-video sample frames unrolled over time — a woman on a sunset beach, a painter, an eagle catching a fish, latte art being poured, a robot arm, an ocean wave, a red panda, a rally car. Bottom left: a bar chart of one-H100 720p/5s generation latency in seconds, from Wan 2.2-A14B at 1556s down to SANA 5B + Sol-Engine at 13.06s, marked 120×. Bottom right: DiT-forward time versus clip duration, full-softmax curling steeply upward while SANA's stays low, marked 3.2× at 60s.\"\n  caption=\"SANA-Video 2.0 at a glance: text-to-video samples, one-H100 720p/5s latency (13.06s, VBench 84.30), and DiT-forward time that stays flat as clips lengthen (SANA Video2 / arXiv:2607.21553, Figure 1).\"\n/>\n\nThe organizing idea is the same \"spend cost where the signal is\" instinct that the\n[Mage-Flow explainer](/articles/mage-flow) applied to image tokenizers — but pointed at *attention* and\nstretched across *time*. To see why it works, start with the family it comes from.\n\n## The SANA lineage: efficiency as a house style\n\nSANA has always been the efficiency line. The original **SANA** image model made three bets that each cut a\ndifferent cost: a **deep-compression autoencoder** (DC-AE) that squeezes an image by 32× per side instead of\nthe usual 8×, so the transformer sees far fewer tokens; a **Linear Diffusion Transformer** that replaces\nquadratic self-attention with linear attention; and a **decoder-only text encoder** — a small Gemma LLM in\nplace of the heavy T5 — for conditioning. 
Together they let a laptop-class GPU generate 1K and even 4K\nimages. **SANA-Video 1.0** carried the recipe to video with a **2B, pure-linear** DiT and posted big\nspeedups at competitive quality.\n\nBut pure linear attention pays for its $O(N)$ speed with **expressiveness**. Linear attention compresses all\nof the past into a single fixed-size state matrix $S \\in \\mathbb{R}^{d \\times d}$; that state simply cannot\nencode every token-to-token interaction, and for video — where precise spatiotemporal correspondence and\nfine detail matter — the missing interactions show up as softness and drift. This is the exact tension the\n2.0 paper sets out to resolve, and it borrows its fix from a place you might not expect: recent large\nlanguage models. Qwen3-Next and Kimi-Linear keep a mostly-linear stack but insert a few **softmax anchors**\n— periodic full-attention layers that restore exact interactions at fixed depths — and route information\nacross depth with **attention residuals**, a combination Kimi K3 runs at trillion-parameter scale.\nSANA-Video 2.0 asks whether the same trick unlocks long, high-resolution video. It answers yes.\n\n## Two levers, and why they compound\n\nSANA's speed isn't one trick; it's two multiplicative ones. The first is upstream of the transformer\nentirely: a **high-compression VAE** (SANA-Video 2.0 uses LTX-VAE 2.3, with a stride of **8×32×32** —\n32× on each spatial axis, 8× in time) turns a clip into a modest sequence of latent tokens before the DiT\never runs. A 5-second 720p clip that is millions of pixels collapses to on the order of ten thousand tokens.\nThe second lever is the one this paper is about: given that token sequence, **how does attention cost scale\nas the clip grows?** Drag the length and flip the resolution:\n\n<AttentionScaling />\n\nThe point isn't the exact numbers — it's the *shape*. 
Because attention is quadratic in the token count, a\nfull-softmax DiT's cost curls sharply upward as clips lengthen and sharpen; the linear-dominated hybrid's\nstays low, so the gap *widens* precisely in the long, high-resolution regime where video is most expensive.\nThe paper's compiled profiling puts the DiT forward pass at **1.55× faster at 5 seconds rising to 3.2×\nfaster at 60 seconds** versus a matched full-softmax baseline at 720p, and 2.01× faster even at 1080p/121\nframes. Compression gives you few tokens; linear attention makes each token cheap; the two savings multiply.\n\n## The mechanism: 25% softmax, and residuals to carry it\n\nSo how much softmax do you actually need? SANA-Video 2.0's answer, established through reduced-resolution\nproxy studies, is **one layer in four**. Its backbone is a stack of **Hybrid Linear–Softmax Attention**\nlayers at a **3:1 ratio**: three gated-linear-attention layers — cheap $O(N)$ mixing — for every one\n**gated-softmax anchor** that restores the full-rank interactions the linear layers can't represent. Then a\nsecond mechanism, **Block Attention Residuals (AttnRes)**, groups the layers into blocks and routes each\n*completed* block's summary forward into later linear layers, so the anchors' refreshed representations\npropagate across depth instead of decaying — worth about a **12% lift in deep-layer effective rank** in the\npaper's probes. Flip the regime and toggle the residuals:\n\n<HybridStack />\n\nTwo design choices are worth flagging as honest engineering, not magic. First, SANA-Video 2.0 is trained\n**from scratch** as a hybrid — it is not a pretrained softmax model that was later \"linearized,\" a shortcut\nthat usually leaves quality on the table. Second, the 25% figure is a *measured* trade-off point, not a\nround number: fewer anchors and quality slips; more and you're paying for softmax you didn't need. 
The\npaper's architecture diagram lays out all three pieces — the hybrid layer, the 8-layer blocks, and the shared-query\nrouter that does the aggregation:\n\n<Figure\n  src=\"/articles/sana-video2/fig2.png\"\n  alt=\"SANA-Video 2.0 architecture. Left: one hybrid DiT layer, whose attention branch is either a 75% gated linear-attention path or a 25% gated softmax path, wrapped by AttnRes aggregation blocks around cross-attention and a SwiGLU feed-forward. Middle: the backbone as four eight-layer blocks stacked on a patch-embedding, each block emitting a completed-block feature. Right: the AttnRes module, where a shared-query depth router aggregates the input, all completed-block features, and the current block into the layer output.\"\n  caption=\"The hybrid DiT layer (left), the block-structured backbone with per-block summaries (middle), and the shared-query AttnRes router that aggregates completed-block features across depth (right) (SANA Video2 / arXiv:2607.21553, Figure 2).\"\n/>\n\nThe two models share this design at different sizes: the **5B** is a 32-layer, width-2,560 backbone; the\n**14B** is 40 layers at width-4,096 (14.25B parameters). Both operate on LTX-VAE 2.3 latents and draw text\nfeatures from **Gemma-2-2B-IT** — the decoder-only text encoder carried straight from the SANA lineage —\nthrough cross-attention at every layer.\n\n## What \"Video2\" adds: making it a real generator\n\nA cheap backbone is only half a video model; the other half is the training pipeline that teaches it motion\nand taste. SANA-Video 2.0 is trained with **flow matching** (the same few-step-friendly objective the\n[FLUX 3 explainer](/articles/flux-3) walks through for video), then sharpened in stages: a **Self-Flow**\ndistillation that compresses the sampler, **Direct Preference Optimization**, and an online\n**Reward-Feedback-Learning** RL loop. 
It generates 480p–720p at 81, 121, or 193 frames — multi-second\nclips, extendable to 8 seconds after fine-tuning — and, because the whole stack was built to be\nhardware-friendly, a final **Sol-Engine** pass (kernel fusion, caching, and sparse attention) squeezes out a\nfurther **3.58×** end-to-end, which is what brings the 5B pipeline to that 13.06s figure. Here is a\nrepresentative clip from the project page — a surreal \"world in a bottle\" bobbing on the ocean, the kind of\nshot whose value is in staying *coherent* across time:\n\n<Video\n  src=\"/articles/sana-video2/sana-demo\"\n  poster=\"/articles/sana-video2/sana-demo-poster.jpg\"\n  alt=\"A corked glass bottle floating on rolling ocean waves under a blue cloudy sky; inside the bottle sits a tiny island with a red church and cottage among pine trees, the whole miniature world lit warmly as the water moves around it.\"\n  caption=\"A ~5s excerpt from SANA-Video 2.0's text-to-video samples (muted/looped and re-encoded to keep the page light). The test is temporal coherence — the waves, reflections, and refraction through the glass stay consistent across the shot (SANA Video2 / project page).\"\n/>\n\n<Callout type=\"note\">\nThe clip is trimmed and recompressed from the project page's 8-second 720p sample to keep the page light;\nthe source reel runs at full resolution. What it's meant to show is stability over time — the failure mode\npure-linear video models fall into (drift, flicker, softening detail) is exactly what the softmax anchors\nare there to prevent.\n</Callout>\n\n## The numbers — and what they are\n\nThe headline is latency, and it is dramatic. Reading straight off the paper's one-H100, 720p/5s profile and\nexpressing each baseline as a multiple of SANA 5B's 13.06s, the field looks like this:\n\n<BenchBars\n  title=\"SANA-Video 2.0 (5B) — speedup vs. 
each model at 720p/5s, one H100 (×, higher = SANA faster)\"\n  unit=\"×\"\n  max={120}\n  bars={[\n    { label: \"Wan 2.2-A14B\", value: 119, highlight: true },\n    { label: \"Bernini-R (14B)\", value: 118, highlight: true },\n    { label: \"HunyuanVideo (13B)\", value: 60 },\n    { label: \"Wan 2.1 (1.3B)\", value: 31 },\n    { label: \"Lance (7.1B)\", value: 27 },\n    { label: \"LTX-2.3 (22B)\", value: 10 },\n    { label: \"Cosmos-3 (16B)\", value: 7.9 },\n    { label: \"SANA 14B (own)\", value: 5.3 },\n  ]}\n/>\n\nThe efficiency claim only means something if quality holds, and here the evidence is a **VBench** score of\n**84.30** for the 5B at 40 sampling steps — essentially level with the 14B's 84.23 and with the Wan 2.2\nquality point the paper marks on its chart — reached in a small fraction of the latency. The paper's own\nframing is the honest one: *match* full-softmax quality while keeping linear attention's long-sequence\nscaling.\n\n<Callout type=\"warn\">\nHold these the right way. The speedups above are **derived from the paper's own** latency table (Figure 1b)\n— a single-GPU H100 profile from NVIDIA's harness, on their chosen baselines (including in-house or renamed\nsystems like \"Bernini-R\" and \"Lance\"), with both sides compiled on their best kernels. The quality claim\nrests on **VBench**, one automated benchmark that correlates only loosely with human preference; there is no\nindependent third-party evaluation yet. And the strongest numbers stack two separate wins — the hybrid\n*architecture* (the 3.2× DiT-forward gap) **and** the Sol-Engine *systems* pass (a further 3.58×) — so the\n\"120×\" is an end-to-end pipeline figure, not the attention mechanism alone. 
It's a strong, well-instrumented\nresult; it is not a settled head-to-head ranking.\n</Callout>\n\n## Honest limitations\n\nThe ceiling is real: **720p** is the top resolution and clips are **seconds**, not minutes — this is not yet\na long-form or 1080p+ model, and the 720p/8s operating point comes from a small supervised fine-tuning stage\n($\\sim 10^4$ clips), so the highest-resolution, longest-duration quality is the least battle-tested part.\nThe 25% ratio is validated at reduced-resolution proxy scale and then trusted at full scale. The VAE that\ndoes so much of the compression work is a **licensed external component** (LTX-VAE 2.3), not SANA's own\nDC-AE — worth noting because it means the headline contribution here is squarely the *attention* design, not\nthe tokenizer. And as always with a fresh tech report, every number is the authors'. None of this undercuts\nthe core result; it just sizes it.\n\n## The take\n\nSANA-Video 2.0 is a clean, well-argued answer to the question that has quietly bounded open video\ngeneration: *do you have to pay quadratic attention to get softmax-quality video?* The answer is no — keep\nthree layers in four linear, spend softmax only at periodic anchors, carry the anchors' work forward with\nresiduals, and feed the whole thing from a high-compression VAE so the token count is modest to begin with.\nThe savings compound exactly where video is most expensive, which is why the gap grows with length and\nresolution rather than shrinking. It pairs naturally with [FLUX 3](/articles/flux-3), which bets on *scale*\nand joint multimodality to reach video, and with [Mage-Flow](/articles/mage-flow), which makes the same\nco-design argument for images: efficiency is an architecture problem, not only a compute one. 
Worth the\nusual caveats on vendor benchmarks and the 720p/seconds ceiling — but as a demonstration that linear\nattention can carry real video without visibly losing the picture, it's the most convincing one so far.\n\n---\n\n*Source: [SANA-Video 2.0: Hybrid Linear Attention with Attention Residuals for Efficient Video Generation](https://arxiv.org/abs/2607.21553)\n(Chen et al., NVIDIA, 2026), the [project page](https://nvlabs.github.io/Sana/Video2/), and the\n[SANA repository](https://github.com/NVlabs/Sana). The teaser and architecture figures and the sample clip\nare the authors', shown for commentary; all benchmarks are paper-reported. The attention-scaling and\nhybrid-stack interactives are mine. Related: [Mage-Flow](/articles/mage-flow) on efficient tokenizers and\n[FLUX 3](/articles/flux-3) on flow-matching video.*\n","readingTimeMins":11,"url":"https://ai.thesatyajit.com/articles/sana-video2","signal":{"interest":5,"helpful":4,"score":9,"level":5,"label":"Essential"}},{"title":"Scaling agentic RL: 365,000 environments behind one contract","description":"Prime Intellect integrated 23 agentic tasksets — software-engineering, terminal, and web-search — behind a single taskset API: ~365,000 tasks with prebuilt per-task sandbox images, grading material withheld until scoring, and gold-patch/no-op-validated re-uploads. 
A walk through why agentic RL is bottlenecked on verified, reproducible environments; the one-contract design (verifiers v1's taskset/harness/runtime split and the Harbor format); the catalog and its counts; the validation pipeline; and the honest failure modes — reward hacks and PR-test false negatives.","date":"2026-07-24","tags":["reinforcement-learning","agents","environments","infrastructure","prime-intellect","explainer"],"draft":false,"cover":"/articles/scaling-agentic-rl/cover.png","featured":true,"interest":5,"helpful":4,"kind":"articles","slug":"scaling-agentic-rl","body":"The reinforcement-learning recipe for agents is, by now, boring in the good way: put the agent in an environment, let it act, check whether it succeeded, and reward it for succeeding. The hard part was never the algorithm. It is the phrase \"check whether it succeeded.\" At scale you need *hundreds of thousands* of tasks that each come with a sandbox the agent can act in and a **grader that produces a clean, reproducible reward** — and the open-source ecosystem, for all its great datasets, does not ship that. Every SWE benchmark, every terminal corpus, every search eval invents its own harness, its own image conventions, its own grading scripts, and its own failure modes. They do not compose.\n\n[Prime Intellect's post](https://www.primeintellect.ai/blog/scaling-agentic-rl) (Daniel Auras and team, July 2026) is an engineering answer to exactly that: they took **23 agentic tasksets across three domains and put them behind one taskset API** — roughly **365,000 tasks** (~198,000 software-engineering across 20+ languages, ~28,600 terminal, ~137,600 search), each with a prebuilt sandbox image, each grader withheld until scoring, many re-uploaded only after gold-validation. This piece walks the thesis, the design, the catalog, and the caveats.\n\n<Callout type=\"warn\">\nThis is a **company engineering post**. 
Prime Intellect sells the sandbox/compute platform (Prime Sandboxes, `prime-rl`) that this catalog runs on, so read the framing as a product argument, not a neutral survey. The task **counts are their displayed, shipped figures**; several shrink after validation (the [validation table](#gold-validated-then-re-uploaded) below shows by how much). And there is **no independent benchmark of training quality** here — the post ships a training *config* (GLM-4.5-Air on `scaleswe_v1`, 6 H200 nodes, 2 days) but no accuracy numbers, so treat the value as *infrastructure*, not a SOTA result. What is genuinely useful is the design pattern, which stands on its own.\n</Callout>\n\n## The bottleneck is the environment, not the algorithm\n\nHere is the shape of the problem the post is solving. Each upstream taskset made reasonable choices for its *own* harness, and those choices don't compose: SWE-bench applies test patches inside a generated eval script; R2E-Gym bakes tests into the image and compares against expected outputs; every search benchmark invents its own judge. If you want to train **one** agent across all of them, you have to normalize those lifecycles without breaking each taskset's own scoring semantics — because the scoring semantics are the whole point. A reward you can't trust is worse than no reward.\n\nPrime Intellect frames it in one line: *\"A taskset row is only useful if it can produce a clean reward signal.\"* And a \"surprising fraction\" of open agentic data fails that precondition — broken images, network-dependent tests, expected outputs that drifted, and tasks that score as solved without touching the code at all. 
Scaling RL, in this telling, is far less about the loss function (see [Ring-Zero](/articles/ring-zero-trillion-scale-rl) and [frontier RL economics](/articles/frontier-rl-cheaper) for the loss/systems side) and far more about **manufacturing verified, reproducible environments in bulk**.\n\n## One environment, three layers\n\nThe enabling idea is [verifiers v1](https://www.primeintellect.ai/blog/verifiers-v1), which decomposes an environment into three independent layers:\n\n- **Taskset** — the data and its scoring logic (what problem, what counts as solved). *This post is the taskset layer.*\n- **Harness** — how the agent is driven (Codex, their own harness, or yours).\n- **Runtime** — where it executes (a Prime sandbox, local Docker, …).\n\nBecause those layers are independent, one command can run any taskset in any harness on any runtime:\n\n```bash\n# ScaleSWE, in the Codex harness, on Prime Sandboxes:\nuv run eval scaleswe-v1 --harness.id codex --harness.runtime.type prime -n 3\n```\n\nThe taskset-layer packaging format is called **Harbor**: SWE-bench Verified \"runs the Harbor Hub packaging against the official instance images,\" and Terminal-Bench 2 is wrapped \"through the same Harbor taskset, so the eval suite and the training corpora share one task format and one scoring contract.\" That last clause is the whole design in miniature — *the thing you evaluate on and the thing you train on speak the same format and the same scoring contract.*\n\n## A rollout, end to end\n\nConcretely, every taskset exposes the same handful of hooks, and a rollout walks them in the same order. The one beat worth internalizing is the **integrity** move: during a rollout the agent lives inside the *same* sandbox as the grading machinery, so \"anything readable in the container is fair game for a reward hack.\" So the grading material — the test patch, the expected outputs, the grader — is **withheld until scoring**, then restored only to compute the reward. 
Step through it:\n\n<RolloutPipeline />\n\nThe contract that makes this uniform is small — a typed data schema plus four hooks:\n\n```python\nimport verifiers.v1 as vf\n\nclass MyTaskData(vf.TaskData):\n    base_commit: str    # state the sandbox resets to\n    test_patch: str     # grading material — withheld until scoring\n    gold_patch: str     # reference fix — used only by `validate`\n\nclass MyTask(vf.Task[MyTaskData]):\n    async def setup(self, runtime): ...            # prepare the repo in the task's image\n    async def finalize(self, trace, runtime): ...  # capture the agent's diff into the trace\n\n    @vf.reward\n    async def solved(self, runtime) -> float:      # restore tests, apply test_patch,\n        ...                                        # run the taskset's own upstream grader\n\n    async def validate(self, runtime) -> bool:     # gold patch must score 1.0;\n        ...                                        # the no-op (setup-only) run must not\n```\n\nSome upstream authors deliberately ship the tests readable — R2E-Gym keeps its grading tests at `/r2e_tests`, Multi-SWE leaves grading scripts and `test.patch` under `/home`. That is fine for an attach-at-eval harness, but not for a *live RL sandbox* under optimization pressure, so Prime Intellect's integrations hide those artifacts and restore them only for scoring.\n\n## Four decisions that make tasksets compose\n\nThe integrations keep every taskset's original grading path — upstream log parsers, upstream report generation, upstream test commands — and normalize everything *around* it. Four decisions do the work:\n\n- **One API.** Every taskset loads from a typed config (dataset, split, filters), provisions a sandbox from the task's image, and scores with the taskset's own logic. 
Swapping splits or adding a `filter_fn` works the same everywhere.\n- **One image registry.** Task images live in Prime's own registry, co-located with the sandboxes — **~135,000 prebuilt open-source task images**, which they claim is the largest such catalog hosted by any sandbox provider. The point is operational: no Docker Hub rate limits \"when running a thousand concurrent rollouts,\" and reproducibility from a pinned per-task image rather than a build step that can drift.\n- **One integrity standard.** Grading material withheld until scoring, as above — the reward-hack defense.\n- **One validation bar.** Before a dataset earns a default slot, it runs gold-patch and no-op validation, and a cleaned version is re-uploaded with exclusions preserved. That is the next section.\n\n## The catalog: 23 tasksets, three domains\n\nPick a domain to see its tasksets and their (shipped) task counts. The three domains are lopsided — software engineering and search dominate the raw count; terminal is smaller but denser in verified benchmarks.\n\n<TasksetMap />\n\nThe domain split of the ~365,000 total:\n\n<BenchBars\n  title=\"Tasks by domain (~365,000 total)\"\n  unit=\"k\"\n  bars={[\n    { label: \"SWE · 20+ langs\", value: 198, highlight: true },\n    { label: \"Search\", value: 137.6 },\n    { label: \"Terminal\", value: 28.6 },\n  ]}\n/>\n\n### Software engineering\n\nReal repositories, real diffs; the reward is whether hidden tests pass after the agent's patch. 
Counts are the displayed shipped totals; parenthetical notes flag the gold-validated re-upload sizes where they differ.\n\n| Taskset | Tasks | What it is |\n|---|---|---|\n| SWE-bench Verified | 500 | Human-filtered GitHub issues in major Python repos; the canonical benchmark |\n| SWE-bench Multilingual | 300 | The canonical set across C, C++, Go, Java, JS/TS, PHP, Ruby, Rust |\n| SWE-bench Pro | 731 | Harder successor; large-scale diffs from license-friendly repos |\n| SWE-smith | 83,519 | Bugs *injected* into healthy repos, keeping the tests that catch them (8 languages) |\n| R2E-Gym | 4,578 | Executable envs from real commits with synthesized issues (4,522 gold-validated) |\n| Multi-SWE | 6,835 | Containerized RL + eval instances across 7 languages (2,232 in the validated RL set) |\n| SWE-rebench-V2 | 32,079 | Continuously mined fresh PRs, 20 languages, decontaminated by recency (6,275 verified) |\n| Scale-SWE | 17,202 | Python tasks with test patches applied just before eval (from 20,181 raw) |\n| SWE-Lego | 15,903 | SWE-bench-style training data at scale; tests applied only at scoring |\n| OpenSWE | 36,884 | Tasks paired with per-task eval scripts kept out of the sandbox until scoring |\n| Senior SWE-Bench | 50 | Investigation/design tasks from 12 production repos; pytest/vitest + optional LLM rubric |\n\n### Terminal\n\nGive the agent a shell and a goal; a hidden pytest grader checks the end state. 
Smaller in raw count, but this is where the community-standard evals live.\n\n| Taskset | Tasks | What it is |\n|---|---|---|\n| TMax | 14,600 | Terminal tasks, each pinned to a prebuilt image; all 14,600 boot-and-setup verified |\n| Terminal-Lego | ~13,800 | Docker-verified Terminal-Bench-style tasks built from real StackOverflow issues |\n| OpenThoughts-TBLite | 100 | High-signal 100-task terminal-agent benchmark; hidden grader |\n| Terminal-Bench 2 | 89 | Community-standard eval; 89 rigorously verified tasks |\n\n### Search\n\nThe search tasksets share one design decision: they are **harness-agnostic and tool-free**. The taskset ships questions and scoring *only* — the harness brings its own search tool (the Codex harness's built-in web search, Prime's search skill, or yours). The same tasks then train and evaluate any search-capable agent without the environment prescribing a retrieval pipeline.\n\n| Taskset | Tasks | What it is |\n|---|---|---|\n| PaperSearchQA | 59,907 | Biomedical deep-research QA (54,907 train + 5,000 test); judge-graded |\n| WideSeek | 44,632 | WideSearch-style table compilation; scored by item-level cell F1 |\n| S1-DeepResearch | ~15,000 | Multi-hop resolution questions with gold answers; judge-graded |\n| OpenSeeker | 11,677 | Web-research QA with the original judge prompt |\n| DeepDive | 3,250 | Hard multi-hop research (2,234 RL + 1,016 SFT); strict boxed-answer judge |\n| BrowseComp | 1,266 | OpenAI's browsing benchmark, in its Explanation/Exact-Answer/Confidence format |\n| REDSearcher | 1,000 | Long-horizon web-research questions |\n| BrowseComp-Plus | 830 | BrowseComp re-grounded in a fixed 100,195-doc corpus, with a controlled BM25 `search` tool |\n\nBrowseComp-Plus is the one exception to bring-your-own-search: because it serves the benchmark's own BM25 retriever over a fixed corpus, the retriever becomes a *controlled variable* and runs are reproducible — evidence recall is tracked alongside accuracy.\n\n## Gold-validated, then 
re-uploaded\n\nThis is the part that separates the catalog from a link farm. For each dataset, Prime Intellect ran the **gold patch through the full scoring path** in fresh sandboxes, **retried failures up to 10×** to separate flaky from deterministically broken, ran **independent second passes** to catch noisy rows, and ran **multiple no-edit passes** to drop tasks that score `1.0` with no fix at all. The two-sided precondition is simple: *gold patch applied → tests pass; no patch → tests fail.* Every dropped row is persisted in the re-upload so you can audit the exclusion.\n\nThe shrinkage is not cosmetic — for the noisiest sources, most of the raw rows do not survive:\n\n| Verified re-upload | Raw | Verified | What dropped |\n|---|---|---|---|\n| R2E-Gym-Subset-Verified | 4,578 | 4,522 | 56 network/timing-sensitive `aiohttp`/`tornado` tests |\n| SWE-Lego-Real-Data-Verified | 4,432 | 4,323 | flaky rows, via two independent passes |\n| Multi-SWE-RL-Verified | 4,703 | 2,232 | a no-edit filter caught tasks gradeable as solved with zero edits |\n| SWE-rebench-V2-Filtered-Verified | 32,079 | 6,275 | wholesale-broken images; inline GitHub issue/PR references scrubbed |\n| SWE-Bench-Verified-Quick | 500 | 468 | the slowest examples, for quick online-evals |\n\nTwo of these deserve a callout. **SWE-rebench-V2** goes from 32,079 to 6,275 — an 80% cut — and its design goal is worth stealing: it *\"continuously mines fresh GitHub PRs into tasks… naturally decontaminated by recency.\"* If your tasks are always newer than any model's training cutoff, benchmark contamination stops being a worry by construction. And **Multi-SWE**'s no-edit filter is the quiet hero: a task that grades as solved before the agent does anything is pure reward-hack fuel, and it takes a dedicated pass to find them. 
The same tooling ships publicly — `uv run validate <taskset-id>` is the model-free sibling of `eval`, running the gold check and the setup-only no-op check in independent runtimes.\n\n## Why this matters for RL at scale\n\nStrip the product framing and the reusable lesson is a data-engineering one. RL at scale does not fail on the gradient; it fails on **thousands of tiny reward bugs** — a flaky test, a drifted output, a container that won't boot, a task solvable without work — each of which quietly poisons the learning signal. The contribution here is treating environments as a *manufactured, versioned, validated artifact*: one task format, one scoring contract, prebuilt per-task images for reproducibility, grading hidden until scoring for integrity, and a gold/no-op validation gate before anything is trusted. That is the same discipline data teams already apply to training corpora, finally applied to the *reward* side — which, for agentic RL, is where the actual difficulty lives.\n\n## Where the reward signal still lies\n\nThe post is refreshingly candid that this is mitigation, not a solved problem. A reward signal can lie in two directions, and Prime Intellect names both.\n\n<Callout type=\"warn\">\n**False positives — reward hacks.** As long as grading runs *where the agent lives*, \"a policy under RL pressure will eventually find whatever seam is left\" — an editable test file, tamperable grading state, an artifact that leaks the answer. Withholding grading material raises the bar significantly but is not a guarantee; the structural fix they put on the roadmap is **grading in isolated sandboxes**, so the environment the agent can touch and the one that scores it are separate.\n\n**False negatives — correct-but-different fixes.** Validation cannot catch this one by construction. 
Tasks mined from merged PRs inherit that PR's tests, and those tests often assert *implementation details* rather than behavior — an exact error string, a private helper's name, a precise return shape. An agent that fixes the underlying issue a different but equally correct way still fails them, and the reward reads as a false negative. Gold-patch validation is blind to it (the original patch passes its own tests by definition). At RL scale these near-misses are noise that *punishes correct work*. Their mitigation — \"Agentic Judging\" — is announced but not yet detailed.\n</Callout>\n\n## The take\n\nThe headline number — 365,000 environments — is the least interesting thing here. The interesting thing is the **contract**: 23 datasets that each shipped their own harness, image conventions, and grader now load through one typed API, run on prebuilt per-task images, hide their grading material until scoring, and pass a gold/no-op validation gate before they are trusted — with the failed rows kept for audit. That is the unglamorous, correct answer to \"how do you get reproducible reward at scale,\" and it is exactly the layer that has been missing while everyone argued about losses. Take the counts as shipped figures and the training value as unbenchmarked; take the design pattern as the real deliverable. If agentic RL is bottlenecked on verified environments — and the evidence says it is — then a validated, versioned, one-contract catalog is a more load-bearing contribution than another clever objective.\n\n---\n\n*Built on Prime Intellect's [Scaling Agentic RL: 365,000+ Environments for SWE, Terminal, and Search](https://www.primeintellect.ai/blog/scaling-agentic-rl) (Daniel Auras and the Prime Intellect Team, July 2026), with the taskset details drawn from the post and the [research-environments](https://github.com/PrimeIntellect-ai/research-environments) and [verifiers](https://github.com/PrimeIntellect-ai/verifiers) repos it links. 
All task counts and validation figures are Prime Intellect's own reported numbers. The two interactive diagrams are my redrawings of the mechanism (the rollout pipeline and the taskset map), not reproductions of the post's charts; the hero image is the post's own cover art. There is no independent benchmark of training outcomes in the source, and I have not run one.*\n","readingTimeMins":13,"url":"https://ai.thesatyajit.com/articles/scaling-agentic-rl","signal":{"interest":5,"helpful":4,"score":9,"level":5,"label":"Essential"}},{"title":"Solar Open 2: Upstage's 250B-A15B hybrid-attention MoE","description":"Upstage's Solar Open 2 is a 250B-parameter (15B active) Mixture-of-Experts model built on a hybrid attention stack — twelve blocks of one softmax layer and three linear-attention (KDA) layers, no positional encoding, a 1M-token context, and open weights under the Upstage Solar License. A walk through why only 12 of 48 layers keep a KV cache, the selective-weight-transfer init from Solar Open 1 (not depth up-scaling), the ~12T-token training, and the full self-reported English and Korean benchmark suite.","date":"2026-07-24","tags":["llm","open-weights","upstage","moe","long-context"],"draft":false,"cover":"/articles/solar-open2-250b/fig1.png","featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"solar-open2-250b","body":"**Solar Open 2** is Upstage's newest open-weight model: a **250B-parameter Mixture-of-Experts** that activates **15B per token**, ships on Hugging Face, and serves a **1M-token context**. Upstage frames it as an *agentic specialist* — built for tool calling, multi-step reasoning, and document-heavy officework — and as a sovereign-AI play, strong in Korean and Japanese as well as English. 
What makes it worth a close read is not the parameter count but the **architecture**: instead of a conventional softmax-attention transformer, Solar Open 2 runs a **hybrid stack** that interleaves one softmax-attention layer with three linear-attention layers, and drops positional encoding entirely.\n\nI read the [model card](https://huggingface.co/upstage/Solar-Open2-250B) and the accompanying [technical report](https://huggingface.co/upstage/Solar-Open2-250B/blob/main/Solar_Open_2_Tech_Report.pdf) for this. Two framing notes before any benchmark chart: every number below is **self-reported** by Upstage on its own harness, and the comparison set (DeepSeek-V4-Flash, MiMo-V2.5, Command A+, and others) is Upstage's chosen field. I keep both caveats attached throughout.\n\n<Callout type=\"warn\">\nAll benchmark numbers here are **Upstage-reported** — its own eval harness, its own choice of comparison models and settings. Solar Open 2 tops only **a few** English rows (MMLU-Pro, LiveCodeBench v6, and APEX-Agents) against the strongest open model in its bracket: on most English benchmarks **DeepSeek-V4-Flash** (284B-A13B) leads and Solar Open 2 comes second. Read it as a **strong model for its 15B active size** and a genuinely interesting *architecture*, not a leaderboard winner. The daggered Korean rows (`Ko-AIME'25`, `KBank-MMLU`, `Ko-GDPval`) are Upstage **in-house** benchmarks; treat those as least comparable across vendors.\n</Callout>\n\n## The Solar name, and what this is not\n\nIf you know the **Solar** name, you probably know it for **depth up-scaling** — the 2023 trick behind Upstage's original SOLAR 10.7B, where you grow a model by duplicating and stacking layers from a smaller base rather than training a new shape from scratch. It's reasonable to expect the lineage to continue. It doesn't. 
**Solar Open 2's card describes no depth up-scaling.** What it describes instead is a **selective weight transfer**: the model is initialized from its predecessor, **Solar Open 1** (102B-A12B), but *\"only the 2.3% of weights that survive the architectural change are carried over, and everything else is randomly initialized.\"*\n\nThat 2.3% is the honest headline of the training story. The architectural change is drastic enough — a new hybrid attention stack, no positional encoding, an expert pool grown from 128 to 320 — that almost none of the old model's weights fit the new shape. What transfers cleanly are the parts the two generations share: the token embeddings and output layer (same 196,608-token tokenizer), and the fragments of attention and MoE that survive. Everything else starts from noise. So this is not a re-skin of Solar Open 1 and not a depth-up-scaled Solar 10.7B — it is a **mostly-fresh 250B model** that borrows a running start.\n\n## The hybrid attention stack\n\nHere is the distinctive move, and the reason the 1M context is affordable. Solar Open 2 has **48 layers**, arranged as **twelve identical blocks**, each block being **one softmax-attention layer followed by three linear-attention layers** — the pattern the card writes as `[Softmax, Linear×3] × 12`. So **12 of the 48 layers are softmax; 36 are linear.**\n\nThe reason this matters is memory. A **softmax** layer keeps a **KV cache** that grows linearly with the sequence — every past token's keys and values must be stored so the next token can attend to them. A **linear-attention** layer does not: it folds the entire past into a **fixed-size recurrent state**, so its memory is constant no matter how long the context runs. 
By making three of every four layers linear, Solar Open 2 keeps a growing KV cache on only **12 of 48 layers** — *\"holding long-context memory to roughly a quarter of an all-softmax model of the same shape,\"* per the card.\n\nToggle between the hybrid stack and an all-softmax baseline, and drag the context length to watch the KV-cache footprint each one carries:\n\n<HybridStack />\n\nThe KV-cache arithmetic is worth doing by hand, because it is the whole efficiency argument. Each softmax layer is **grouped-query attention** with **8 KV heads** and `head_dim` 128; storing K and V in fp16 costs, per token per softmax layer:\n\n$$\n2 \\times n_{kv} \\times d_{head} \\times b = 2 \\times 8 \\times 128 \\times 2 = 4096 \\text{ bytes}\n$$\n\nMultiply by the number of KV-bearing layers and the context length. At **1M tokens**:\n\n- **Solar Open 2 (12 softmax layers)** → about **48 GiB** of KV cache.\n- **An all-softmax stack (48 layers)** → about **192 GiB** — exactly 4× more.\n\nThat 4× is the margin that turns a 1M-token window from \"possible on a rack\" into \"fits alongside the weights.\" If linear attention is new to you, I built the mechanism up in [how transformers attention works](/articles/how-transformers-attention-works) and the sparse-attention variants in [MiniMax sparse attention](/articles/minimax-sparse-attention); the KV-cache side of the story is [how LLM inference works](/articles/how-llm-inference-works), and squeezing the cache further is [TurboQuant](/articles/turboquant-kv-cache).\n\nThree details make the linear layers actually work at this depth, and the card is specific about all three:\n\n- **NoPE — no positional encoding.** Because the linear layers *\"encode token order intrinsically in their recurrent state,\"* Upstage removes rotary encoding entirely. 
The upside the card claims: no RoPE extrapolation limit, so the trained window is not tied to a length distribution seen during training.\n- **KDA with negative eigenvalues.** The linear layers use **Kimi Delta Attention** (the Kimi Linear lineage — see [Kimi K3](/articles/kimi-k3)), but with `allow_neg_eigval=True`, widening the state-transition write strength to $\\beta = 2\\sigma(\\cdot) \\in (0, 2)$. Standard linear cores restrict eigenvalues to $[0,1]$ (decay or persist only); allowing the sign to flip restores the ability to *erase* and self-correct — the card ties this to genuine state-tracking (parity, modular counting).\n- **A sigmoid output gate on the softmax layers**, which the card says suppresses the \"attention sink\" pathology and improves long-context extrapolation.\n\nOne ordering detail separates it from its cousins: within each block the **softmax layer comes first** (`S-L-L-L`), unlike the linear-first ordering (`L-L-L-S`) of Kimi Linear and Qwen3.5. Upstage's own architecture figure lays the whole thing out — the 12× block on the left, and insets for the MoE, the GQA softmax layer, and the KDA linear layer, color-coded by which weights transferred from Solar Open 1:\n\n<Figure\n  src=\"/articles/solar-open2-250b/fig2.png\"\n  alt=\"Solar Open 2 architecture diagram. Left: the 48-layer stack as a block repeated 12 times, each block a softmax-attention layer plus MoE followed by a linear-attention layer plus MoE, fed by a token embedding with a NoPE label and topped by a linear output layer. Insets detail the Mixture-of-Experts block (320 routed experts plus one shared expert, a router, and a sum), the softmax attention layer (GQA with a scaled dot-product and an elementwise sigmoid gate), and the linear attention layer (KDA with L2-normed Q/K, convolutions, a gated delta rule, and a negative-eigenvalue term beta = 2 sigma of x). 
Modules are colored: blue for full weight transfer from Solar Open 1, green for partial transfer, yellow for random initialization.\"\n  caption=\"Solar Open 2 architecture: the [Softmax, Linear×3] × 12 stack, with MoE, GQA-sigmoid-gate softmax, and KDA linear-attention insets. Blue = transferred from Solar Open 1, green = partial, yellow = randomly initialized (Upstage, Solar Open 2 Technical Report, Figure 3).\"\n/>\n\n## Where the parameters live\n\nThe sparsity is the economic argument, so account for it. Solar Open 2 is **250B total, 15B active** — a **6% activation rate**. Each MoE block holds **321 experts: 320 routed plus 1 shared**; the router keeps the **top-8 routed** experts per token, and the shared expert always runs, so 9 experts fire per token. There are **no dense layers** — every block is MoE. The backbone it inherits from Solar Open 1 is **48 layers, hidden size 4096, head dim 128, 64 query / 8 KV heads**, and the **196,608-token** vocabulary.\n\nIf MoE routing is unfamiliar, I built the router, the top-k gate, and the sparsity argument from nothing in [Mixture of Experts, from scratch](/articles/mixture-of-experts-from-scratch) — the same machinery that lets a 250B model serve at roughly the cost of a 15B dense one.\n\n## Training\n\nUpstage is thinner on the data story than on the architecture, and I won't pad it. The concrete figures: **~12 trillion pre-training tokens**, on **NVIDIA B200** GPUs, for **2M GPU-hours**, initialized by the 2.3% selective transfer above. The technical report adds that the run *\"maximizes value per token over a globally deduplicated corpus\"* and trains on *\"purpose-built long-horizon agent scenarios\"* spanning conversational tool use, coding, and officework — the agentic focus is baked into the data, not just the eval suite.\n\nThe tokenizer is the other inherited advantage. 
Solar Open 2 reuses Solar Open 1's **Korean-efficient** byte-level BPE tokenizer unchanged; the report claims global-model tokenizers spend **1.2–1.9× more tokens** on the same Korean text. In long agent trajectories, where the working context accumulates over many turns, fewer tokens per unit of Korean text translates directly into lower inference cost and a longer effective window — a real, if narrow, edge.\n\n## The benchmarks\n\nUpstage's headline figure bundles six benchmarks across knowledge/reasoning and agentic/professional work, with Solar Open 2 (dark violet) against Solar Open 100B and the sub-320B open field:\n\n<Figure\n  src=\"/articles/solar-open2-250b/fig1.png\"\n  alt=\"Six grouped bar panels — MMLU-Pro, MATH (HMMT'26/AIME'26), LiveCodeBench v6, APEX-Agents, SWE-Bench Verified, MCP-Atlas — comparing Solar Open 2 (dark violet) and Solar Open 100B (light violet) against Command A+, Mistral Medium 3.5, MiMo-V2.5, and DeepSeek-V4-Flash (grey). Solar Open 2 leads MMLU-Pro (86.2), MATH (94.8), LiveCodeBench (92.4) narrowly, and APEX-Agents (16.6) by a wide margin; on SWE-Bench Verified (70.4) and MCP-Atlas (58.2) it trails MiMo-V2.5 and DeepSeek-V4-Flash.\"\n  caption=\"Upstage's headline suite. 
Solar Open 2 leads its bracket on MMLU-Pro, MATH, LiveCodeBench, and (by a wide margin) APEX-Agents; it trails on the harder agentic panels (Upstage, Solar Open 2 model card).\"\n/>\n\nThe clearest *win* is agentic tool-use in the **APEX-Agents** suite, where Solar Open 2 leads the next-best model by about three points (16.6 vs 13.4) and more than doubles Mistral Medium 3.5 and Command A+, a result consistent with the agent-scenario training:\n\n<BenchBars\n  title=\"APEX-Agents (%) — Upstage-reported\"\n  bars={[\n    { label: \"Solar Open 2\", value: 16.6, highlight: true },\n    { label: \"MiMo-V2.5\", value: 13.4 },\n    { label: \"DeepSeek-V4-Flash\", value: 13.2 },\n    { label: \"Mistral Medium 3.5\", value: 6.1 },\n    { label: \"Command A+\", value: 1.6 },\n  ]}\n/>\n\nOn coding it edges the field on **LiveCodeBench v6** (92.4 vs DeepSeek-V4-Flash's 92.3) but sits behind on the harder **SWE-Bench Verified** agentic coding task — the \"competitive, not leading\" pattern in miniature:\n\n<BenchBars\n  title=\"SWE-Bench Verified (%) — Upstage-reported\"\n  bars={[\n    { label: \"Solar Open 2\", value: 70.4, highlight: true },\n    { label: \"Mistral Medium 3.5\", value: 69.6 },\n    { label: \"MiMo-V2.5\", value: 73.0 },\n    { label: \"DeepSeek-V4-Flash\", value: 73.8 },\n  ]}\n/>\n\nThe full English suite, with the best value in each row bolded (Upstage's own marking):\n\n| Benchmark | Solar Open 2<br/>250B-A15B | Solar Open 100B<br/>102B-A12B | Command A+<br/>218B-A25B | Mistral Medium 3.5<br/>128B | MiMo-V2.5<br/>310B-A15B | DeepSeek-V4-Flash<br/>284B-A13B |\n|---|--:|--:|--:|--:|--:|--:|\n| MMLU-Pro | **86.2** | 80.4 | 79.0 | 81.2 | 84.6 | 85.9 |\n| GPQA-Diamond | 86.3 | 66.2 | 75.6 | 77.5 | 83.0 | **88.9** |\n| HLE (no tools) | 28.8 | 11.5 | 11.4 | 12.8 | 24.3 | **32.3** |\n| LiveCodeBench v6 | **92.4** | 56.5 | 86.1 | 84.9 | 89.1 | 92.3 |\n| ArtifactsBench | 55.9 | 43.4 | 42.8 | 49.8 | 59.3 | **61.0** |\n| HMMT 2026 | 93.9 | 68.9 | 73.5 | 62.9 | 61.4 | **94.7** |\n| AIME 2026 | 95.7 | 87.7 | 96.0 | 89.0 | 92.3 | **97.0** |\n| 
Multi-Challenge | 61.0 | 40.5 | 45.8 | 49.8 | 39.0 | **62.0** |\n| IFBench | 80.0 | 57.7 | 73.9 | 69.0 | 67.1 | **80.3** |\n| AA-LCR | 62.3 | 36.0 | 46.0 | 61.0 | 62.7 | **63.7** |\n| SWE-Bench Verified | 70.4 | 15.4 | 14.4 | 69.6 | 73.0 | **73.8** |\n| Terminal-Bench Hard | 28.3 | 2.3 | 25.0 | 33.3 | **41.7** | 34.1 |\n| APEX-Agents | **16.6** | 2.4 | 1.6 | 6.1 | 13.4 | 13.2 |\n| MCP-Atlas | 58.2 | 34.4 | 27.2 | 30.7 | **63.9** | 58.2 |\n| τ³ (banking) | 19.6 | 7.4 | 5.8 | 5.8 | 8.7 | **22.3** |\n| GDPval-AA v2 (ELO) | 1128 | – | 712 | 929 | 1145 | **1187** |\n\nThe shape is consistent: Solar Open 2 leads on **MMLU-Pro, LiveCodeBench, and APEX-Agents**, and is otherwise a close **second to DeepSeek-V4-Flash** — which, at 284B-A13B, is a comparable-scale sparse model. Against the smaller **Solar Open 100B**, the jump is large and uniform (SWE-Bench Verified 15.4 → 70.4, APEX-Agents 2.4 → 16.6), which is the more meaningful comparison since it isolates a generation of progress on one team's harness.\n\nWhere Solar Open 2 actually **leads** is Korean — unsurprising given the tokenizer and data focus. 
It tops **CLIcK, HAE-RAE, KBank-MMLU, KBL, and Ko-GDPval**, beating even the closed **GPT-5.4 mini** and **Claude Haiku 4.5** on several:\n\n<BenchBars\n  title=\"CLIcK — Korean cultural/commonsense (%) — Upstage-reported\"\n  bars={[\n    { label: \"Solar Open 2\", value: 90.7, highlight: true },\n    { label: \"GPT-5.4 mini\", value: 89.6 },\n    { label: \"DeepSeek-V4-Flash\", value: 89.2 },\n    { label: \"Claude Haiku 4.5\", value: 53.5 },\n  ]}\n/>\n\n| Benchmark | Solar Open 2 | Solar Open 100B | MiMo-V2.5 | DeepSeek-V4-Flash | Claude Haiku 4.5 | GPT-5.4 mini |\n|---|--:|--:|--:|--:|--:|--:|\n| KMMLU-Pro | 78.4 | 64.0 | 69.1 | **78.9** | 67.9 | 78.1 |\n| CLIcK | **90.7** | 78.9 | 78.4 | 89.2 | 53.5 | 89.6 |\n| HAE-RAE v1.1 | **73.8** | 73.3 | 61.7 | 73.1 | 38.5 | 69.4 |\n| Ko-AIME'25 † | 97.7 | 80.0 | 88.0 | **98.0** | 81.7 | 90.7 |\n| HRM8K | 92.2 | 87.6 | 90.7 | **93.4** | 90.6 | 91.3 |\n| KBank-MMLU † | **80.8** | 65.5 | 71.0 | 79.5 | 68.9 | 79.0 |\n| KBL | **75.5** | 65.5 | 69.8 | 72.8 | 69.9 | 75.3 |\n| KorMedMCQA | 93.0 | 84.4 | 87.7 | 94.1 | 87.0 | **94.2** |\n| Ko-GDPval † | **86.8** | 3.4 | 81.0 | 85.0 | 68.3 | 59.4 |\n\n*† Upstage in-house benchmarks — least comparable across vendors.* The report's boldest claim rides on the last row: on Ko-GDPval, a Korean officework-agent benchmark, it says Solar Open 2 *\"essentially matches DeepSeek-V4-Pro (1.6T) at less than a sixth of its size.\"* That is an in-house benchmark and a self-comparison, so weight it accordingly — but the direction (a Korean-specialized 250B beating much larger generalists on Korean agentic work) is plausible and repeated across the daggered rows.\n\n## Running it\n\nThe model has ~250B parameters, roughly 500 GB of weights in bf16, so this is multi-GPU territory: Upstage lists a **minimum of 4× H200** (141 GB each) and **recommends 8× H200**. 
The supported production path is **vLLM** (an Upstage fork), with expert-parallel MoE:\n\n```bash\nvllm serve upstage/Solar-Open2-250B \\\n  --served-model-name solar-open2-250b \\\n  --tensor-parallel-size 8 \\\n  --enable-expert-parallel \\\n  --moe-backend triton \\\n  --reasoning-parser solar_open2 \\\n  --tool-call-parser solar_open2 \\\n  --enable-auto-tool-choice\n```\n\nSolar Open 2 is a reasoning model with a two-position knob: `reasoning_effort=\"high\"` turns on chain-of-thought (a reasoning block capped at 131,072 tokens), and `reasoning_effort=\"none\"` answers directly. Because the reasoning trace counts against `max_tokens`, Upstage recommends leaving room — up to 256K for the full response in high-effort mode — and preserving prior reasoning traces across turns.\n\n```python\nfrom openai import OpenAI\n\nclient = OpenAI(api_key=\"EMPTY\", base_url=\"http://localhost:8000/v1\")\n\nresp = client.chat.completions.create(\n    model=\"solar-open2-250b\",\n    messages=[{\"role\": \"user\", \"content\": \"Prove that the square root of 2 is irrational.\"}],\n    reasoning_effort=\"high\",\n    temperature=1.0,\n    top_p=1.0,\n    max_tokens=131584,\n)\nprint(resp.choices[0].message.reasoning)  # reasoning returned separately\nprint(resp.choices[0].message.content)\n```\n\nTwo nice touches for agent builders. The same vLLM server exposes **both** an OpenAI-compatible `/v1` endpoint and an **Anthropic-compatible `/v1/messages`** endpoint, so **Claude Code** can point straight at it (`ANTHROPIC_BASE_URL=http://localhost:8000`) with no proxy, and MCP tools reach the model through the standard tool-calling interface. For smaller boxes, **NotaAI** publishes official quantized builds — INT4, NVFP4, and an INT4-GlobalPruned variant; the [NVFP4](/articles/nemotron-nvfp4) format is the same 4-bit floating layout NVIDIA pushes for Blackwell.\n\n## License: open weights, with a name tax\n\nSolar Open 2 is **open-weight, not fully open-source**. 
It ships under the **Upstage Solar License**, and the derivative terms are specific: any model you create, fine-tune, or distill from it must **prefix its name with \"Solar\"** (e.g. `Solar-MyModel-v1`), **prominently display \"Built with Solar\"** in public materials, and **include a copy of the license**. That is looser than a research-only license — commercial use and derivatives are allowed — but it is a **branded** license, not Apache-2.0. If you plan to build on it, the naming and attribution requirements are load-bearing, not boilerplate.\n\n## The take\n\nSolar Open 2's real interest is **architectural**, not positional. It is a clean, well-documented instance of the **hybrid linear/softmax** direction — three linear-attention layers for every softmax one, no positional encoding, KDA with negative eigenvalues — that makes a **1M-token context** affordable by keeping a KV cache on only a quarter of its layers. Paired with a 6%-activation MoE and a Korean-efficient tokenizer, that is a coherent systems story aimed squarely at long-horizon agents, and the honest, unusual init note (2.3% of weights transferred, the rest random) is a refreshing departure from the Solar brand's depth-up-scaling past.\n\nThe caveats are the standard open-weights ones, stated plainly. Every number is Upstage's own harness against a field it chose, and on that field Solar Open 2 is **a consistent second to DeepSeek-V4-Flash** on English work — leading its bracket on a few benchmarks (MMLU-Pro, LiveCodeBench, APEX-Agents) but not the pack. Its clearest edge is **Korean**, much of it measured on **in-house** benchmarks. And the license carries a name-and-attribution tax that Apache-2.0 models don't. For a team that wants an **open, agent-capable, genuinely long-context** model — especially one working in Korean — and can run 4–8× H200, Solar Open 2 earns a serious look. 
As for the title of strongest open model at 250B, that still belongs to the model it keeps finishing behind.\n\n---\n\n*Built from the [Solar Open 2 model card](https://huggingface.co/upstage/Solar-Open2-250B) and [technical report](https://huggingface.co/upstage/Solar-Open2-250B/blob/main/Solar_Open_2_Tech_Report.pdf) (250B-A15B, hybrid attention, 1M context, Upstage Solar License). All benchmark numbers are Upstage-reported; the two figures are reproduced from Upstage's model card and technical report for commentary. The interactive stack diagram is an illustration of the mechanism — the layer pattern, the KV-cache-bearing layers, and the fp16 KV-cache arithmetic use the published config; the linear-attention state memory is a small constant left out of the readout for clarity.*\n","readingTimeMins":15,"url":"https://ai.thesatyajit.com/articles/solar-open2-250b","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"Ternary15M: a language model where every weight is −1, 0, or +1","description":"A from-scratch 15M-parameter TinyStories model where all 42 linear layers are ternary — each weight is exactly −1, 0, or +1 (BitNet b1.58 lineage). That turns every dot product into signed add/subtract/skip with a single scale multiply per output channel. 
A walk through ternary quantization, why three-valued weights make matmuls multiply-free, quantization-aware training with the straight-through estimator and an absmean scale, the ~1.58-bits/weight footprint, and the honest result: hard-ternary inference costs only +0.01 val loss over the latent model (1.61 vs 1.60), shipped at 43 MB — trained for about $0.70 on one L40S.","date":"2026-07-24","tags":["quantization","ternary","bitnet","efficiency","from-scratch","explainer"],"draft":false,"featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"ternary15m","body":"Most quantization is an afterthought: train a model in FP16, then squeeze the weights down to 8 or 4 bits for\ndeployment and hope the accuracy survives. [Ternary15M](https://github.com/brianbell-x/ternary15M) does the opposite,\nand takes it to the extreme. It's a 15.19M-parameter, Llama-style language model where **every linear layer is\nternary** — each weight is one of exactly three values, `−1`, `0`, or `+1`, scaled by a single per-channel number.\nThe model is *trained that way from scratch*, not clipped down after the fact. It's a tiny, readable member of the\n[BitNet b1.58](https://arxiv.org/abs/2402.17764) family, and it's a clean place to see why three-valued weights are\nsuch an interesting bet.\n\nThe architecture is deliberately small and standard: dim 288, 6 layers, 6 query / 6 KV heads, a SwiGLU MLP with\nhidden size 768, a 32k vocab, and a 256-token context. All 42 of its attention and feed-forward linear layers\n(6 layers × 7 projections) are ternary; only the embedding table and the RMSNorm gains stay in full precision.\n\n## The whole trick is the sign\n\nHere's why anyone cares about `{−1, 0, +1}` specifically. A neural network is, underneath, a pile of dot products:\neach output is a weighted sum of its inputs. In full precision, every one of those weights is an arbitrary float, so\nevery term in the sum is a hardware **multiply**. 
Multiplies are the expensive part — they dominate the energy and the\nsilicon area of a matmul.\n\nNow make each weight a sign instead of a number. If the weight is `+1`, the term is just the input — **add** it. If\nit's `−1`, **subtract** the input. If it's `0`, the input contributes nothing — **skip** it. The multiplies vanish; the\ninner loop of the dot product becomes signed accumulation over the inputs. Toggle between the two below and watch the\nmultiply count collapse:\n\n<SignedAccumulate />\n\nThe single multiply that survives is the per-output-channel **scale** — one number `s` that rescales the whole\naccumulated sum back to a sensible magnitude. So a ternary matmul is: signed adds across the row, then one multiply at\nthe end. On general-purpose GPUs the win is mostly memory-bandwidth (you move far fewer weight bytes); on hardware\ndesigned for it, killing the multiplies is the point. Either way, the arithmetic is genuinely different from\n\"small floats.\"\n\n## 1.58 bits, and where the bytes actually go\n\nThree states carry log₂(3) ≈ **1.58 bits** of information each — that's where the \"b1.58\" name comes from, and why a\nternary weight is often quoted as costing ~1.58 bits versus 16 or 32 for a float. Ternary15M doesn't bit-pack that\ntightly; its exported checkpoint stores each ternary weight as an `int8` (one byte) plus one FP32 scale per output\nchannel, which the author notes compresses to roughly 2 bits with basic entropy coding. The latent training checkpoint\nis 182 MB; the deployed ternary model is **43 MB**.\n\nBut 43 MB is bigger than 1.58 bits × 15M would suggest, and the reason is worth sitting with:\n\n<Callout type=\"note\">\nAt 15M parameters, the model is mostly its embedding table. The tied vocab embedding is 32,000 × 288 ≈ **9.2M\nparameters** — over 60% of the model — and it stays FP32, so it alone is ~37 MB of the 43 MB file. The ternary trick\nonly compresses the ~6M weights in the 42 linear layers. 
The lesson generalizes: at small scale, quantizing the matmuls\nbuys you less than you'd hope because the un-quantized embedding dominates the footprint. Ternary pays off hardest on\n*deep* models, where the linear layers, not the vocabulary, are the bulk of the weights.\n</Callout>\n\n## Teaching a network to live with three values\n\nYou can't train ternary weights directly, because rounding to `{−1, 0, +1}` has a gradient of zero almost everywhere —\nnudging a latent weight from 0.31 to 0.32 doesn't change the rounded output, so ordinary backprop would see no signal\nand learn nothing. Quantization-aware training gets around this with two ideas working together.\n\nFirst, the **scale**. Each output channel's weights are ternarized around their own average magnitude — the mean of the\nabsolute weights in that row, `absmean(W)`. Dividing by that scale before rounding is what decides which weights round\nto `±1` and which collapse to `0`, and multiplying the scale back afterward keeps the layer's outputs at roughly the\nright size. It's computed per output channel, so every neuron sets its own threshold.\n\nSecond, the **straight-through estimator (STE)**. The forward pass uses the *ternarized* weights — so the network\nactually experiences quantization while it learns — but the backward pass pretends the rounding was the identity\nfunction and passes the gradient straight through to a full-precision **latent** copy of the weights. Those FP32 latents\nare what the optimizer updates; the ternary weights are re-derived from them every forward pass. 
In PyTorch the whole\nthing is a few lines:\n\n```python\ndef forward(self, x: torch.Tensor) -> torch.Tensor:\n    weight = self.weight                                   # FP32 latent weights\n    scale = weight.abs().mean(dim=1, keepdim=True)         # absmean per output channel\n    safe_scale = scale.clamp_min(torch.finfo(weight.dtype).eps)\n    qweight = torch.round(torch.clamp(weight / safe_scale, -1, 1)) * scale\n    weight_ste = weight + (qweight - weight).detach()      # STE: value = qweight, grad → weight\n    return F.linear(x, weight_ste)\n```\n\nThe `weight + (qweight - weight).detach()` line is the STE in one expression: numerically it equals `qweight` (the term\nyou subtract is detached from the graph), but its gradient with respect to `weight` is 1, so the optimizer trains the\nlatent weights as if the quantizer weren't there. At export time the latents are thrown away and the weights are frozen\nto `int8` in `{−1, 0, +1}` plus the FP32 scales:\n\n```python\ndef ternary_components(weight: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:\n    \"\"\"Return int8 {-1, 0, 1} weights and FP32 per-output scales.\"\"\"\n    scale = weight.detach().float().abs().mean(dim=1, keepdim=True)\n    safe_scale = scale.clamp_min(torch.finfo(torch.float32).eps)\n    qweight = torch.round(torch.clamp(weight.detach().float() / safe_scale, -1, 1))\n    return qweight.to(torch.int8), scale\n```\n\n## Does it work?\n\nThe honest, useful result is that at this scale the quantization is nearly free. 
Trained on\n[TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) (a synthetic corpus of simple children's stories)\nfor 655M tokens, the model's validation loss barely moves when you go from the latent weights to the hard-ternary ones:\n\n<BenchBars\n  title=\"TinyStories validation loss (lower is better; self-reported, single run)\"\n  unit=\"\"\n  bars={[\n    { label: \"final training\", value: 1.5895 },\n    { label: \"latent (STE eval)\", value: 1.5970 },\n    { label: \"hard ternary\", value: 1.6074, highlight: true },\n  ]}\n/>\n\nThe bars are almost identical on purpose — that *is* the finding. Freezing the network to pure ternary weights costs\nonly **+0.01 loss** over the latent model it was trained as. The quantization the model trained under is the same one it\nships with, so there's no distribution shift at export. And the deployed artifact is a fraction of the size:\n\n<BenchBars\n  title=\"Deployed model size\"\n  unit=\" MB\"\n  bars={[\n    { label: \"latent checkpoint\", value: 182 },\n    { label: \"ternary export\", value: 43, highlight: true },\n  ]}\n/>\n\nThe whole run cost about **$0.70** — ~50 minutes on a single L40S at ~200k tokens/second. That cheapness is a feature:\nit makes ternary QAT something you can actually reproduce and poke at, not a claim you take on faith.\n\n<Callout type=\"warn\">\nThese numbers are the author's own, from a single training run on one small synthetic dataset, and TinyStories loss is\nnot a general capability benchmark. Treat them as a clean proof-of-concept that ternary-from-scratch *converges* at this\nscale — not as evidence about how ternary trades off against full precision on a real, large model. BitNet's own papers\nargue the gap stays small up to billions of parameters, but that's a separate, much larger claim than this repo makes.\n</Callout>\n\n## What a 15M ternary model can and can't do\n\nIt's worth being blunt about the ceiling. 
TinyStories exists precisely so that tiny models can learn *something*\ncoherent: the vocabulary and grammar are simple, the stories are short, and 256 tokens of context is plenty. Within that\nbox, Ternary15M does the job — it generates grammatical, on-topic little stories, and it does so from weights that are\nalmost entirely signs. That's the point of the artifact.\n\nWhat it can't do is everything a real LM does. There's no world knowledge, no reasoning, no code, no long context, no\ninstruction following — 15M parameters and a children's-story corpus don't reach any of that, and ternary quantization\ndoesn't change the ceiling in either direction. The value here isn't the model's outputs; it's that the *training\nrecipe* — BitLinear layers, an absmean scale, an STE, and a from-scratch schedule — demonstrably works end to end and\nlands within a hundredth of a nat of its full-precision-latent self.\n\n## The take\n\nTernary15M is a good teaching artifact for a genuinely surprising idea: you can restrict every weight in a network to\none of three values and, if you *train* it that way rather than clipping after the fact, pay almost nothing in loss. The\nmechanism is clean — signs replace floats, so dot products become signed accumulation with one scale multiply per\nchannel; the STE lets gradients flow to a latent copy the quantizer hides; the absmean scale keeps magnitudes sane. The\nhonest caveats are that this is a 15M model on TinyStories with self-reported numbers, and that at this scale the FP32\nembedding table, not the ternary matmuls, dominates the file size. But as a from-scratch, $0.70, reproducible window\ninto how BitNet-style quantization actually works, it's about as legible as this idea gets.\n\n---\n\n*Source: the [Ternary15M repository](https://github.com/brianbell-x/ternary15M) (Brian Bell, MIT license) — its\n`README`, `MODEL_CARD.md`, `RESULTS.md`, and `ternary15m/model.py`. 
The lineage is\n[BitNet b1.58](https://arxiv.org/abs/2402.17764) (Ma et al.). Code snippets are from the repo; the interactive diagram\nis mine. The repo ships no figures, so there are none to reproduce here — the numbers above are quoted from its\n`RESULTS.md` and model card.*\n","readingTimeMins":8,"url":"https://ai.thesatyajit.com/articles/ternary15m","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"Antares: a 1B model that hunts vulnerabilities like a person","description":"Cisco's Foundation AI team trained sub-1B open-weight models to do one hard security job — localize where a known vulnerability lives in a codebase — as a terminal agent that greps, reads, backtracks, and submits a ranked list of files. Antares-1B scores 0.209 File F1 on a new 500-task benchmark, beating GLM-5.2 (753B) and Gemini 3 Pro at 15–172× lower cost, from a Granite 4.0 backbone that scores 0.000 untrained. The whole gain is training, not scale.","date":"2026-07-22","tags":["security","ai","open-weights","agents","explainer"],"draft":false,"cover":"/articles/antares/fig1.png","featured":false,"interest":4,"helpful":4,"kind":"articles","slug":"antares","body":"[Antares](https://blogs.cisco.com/ai/introducing-antares-the-most-efficient-open-weight-ai-models-for-vulnerability-localization)\nis a family of small security models from Cisco's Foundation AI team, built for one narrow, expensive\njob: **vulnerability localization** — given a vulnerability class and a codebase, pinpoint the source\nfiles that actually contain the flaw. Two are open-weight today under Apache 2.0 — **Antares-350M** and\n**Antares-1B** — with a 3B on the way, and all three are fine-tuned from **IBM Granite 4.0**. 
The claim\nthat makes them worth a look: they beat models tens to hundreds of times their size on this task, at a\nfraction of the cost, and they're small enough to run **locally** so proprietary code never leaves the\nbuilding.\n\n## The job: localize, don't fix\n\nAntares doesn't patch anything, doesn't explain *why* a file is vulnerable, and doesn't emit exploits.\nIt answers one question — *which files should a human look at first?* — and it does it the way an\nanalyst would: as a **terminal agent**. Given only a CWE identifier and its generic description (no\nadvisory text, no file hints), with the repo mounted read-only, it issues shell commands — `grep`,\n`find`, `cat` — reads the output, reasons, changes direction when a lead goes cold, and finally calls\n`submit_vulnerable_files` with a ranked list. The catch that makes it hard: a budget of **15 terminal\ncalls per task**. No vector database, no retrieval index — just exploration. Scrub a run:\n\n<SearchTrace />\n\nThat \"search, read, revise, backtrack\" loop is the thing Cisco's earlier research argued you can\n*train* into a small model — useful retrieval behavior from learned strategy, not from scale. Antares\nis the test of whether it transfers to security.\n\n## The numbers, and why the task is genuinely hard\n\nTo measure it they had to build a benchmark, because general code-search sets (SWE-Bench and the like)\ntest finding code relevant to a *dev task*, not localizing a *vulnerability* from a CWE description.\n**VLoc Bench** is 500 tasks across 290 real repositories, 6 package ecosystems, and 147 CWE categories\n(78% carry a real CVE); each repo is reconstructed at its pre-fix commit, and ground truth is the set\nof files the actual security fix touched. 
The metric is **File F1**: the harmonic mean of precision (the\nshare of submitted files that were right) and recall (the share of the right files that were found).\n\nRead the scores with the ceiling in mind: this is hard enough that **the best frontier model tops out\naround 0.22** (GPT-5.5 at 0.221). Against that, a 1B open model at **0.209** is the story:\n\n<Figure\n  src=\"/articles/antares/fig1.png\"\n  alt=\"Scatter of File F1 score against parameter count on a log axis. The Antares family (350M, 1B, 3B) sits at the top-left with high F1 at tiny size; a dozen much larger open and closed models spread across the middle and right at lower or comparable F1; the closed frontier (GPT-5.5) is top-right.\"\n  caption=\"File F1 vs parameters (log). The Antares family sits on the efficient frontier — tiny and high — while a dozen far larger models score lower (Antares, Figure 1).\"\n/>\n\n<BenchBars\n  title=\"File F1 on VLoc Bench — higher is better\"\n  unit=\"\"\n  bars={[\n    { label: \"Antares-3B (soon)\", value: 0.223, highlight: true },\n    { label: \"GPT-5.5 · frontier\", value: 0.221 },\n    { label: \"Antares-1B\", value: 0.209, highlight: true },\n    { label: \"GLM-5.2 · 753B\", value: 0.186 },\n    { label: \"Gemini 3 Pro · frontier\", value: 0.152 },\n    { label: \"Antares-350M\", value: 0.135, highlight: true },\n    { label: \"Qwen3.5-122B-A10B\", value: 0.091 },\n    { label: \"Llama-3.3-70B\", value: 0.012 },\n    { label: \"Granite 4.0 1B (untrained base)\", value: 0.0 },\n  ]}\n/>\n\nAntares-1B (0.209) clears **GLM-5.2 at 753B** (0.186) and **Gemini 3 Pro** (0.152), and the 3B\nessentially matches GPT-5.5. Meanwhile several giants flail — Llama-3.3-70B lands at 0.012, plain GPT-5\nat 0.048 — which tells you this isn't a capability that falls out of scale; it has to be trained in.\n\n## Training, not scale\n\nThe cleanest evidence is the backbone itself. 
The same **Granite 4.0 1B** weights, untrained for this\ntask, score a flat **0.000** — they can't navigate a repo and submit useful files at all. Everything\nAntares can do comes from a two-stage pipeline: **SFT** on cybersecurity reasoning, deep-research\ntraces, and terminal code-search trajectories, then **GRPO** — reinforcement learning over full\nmulti-turn agent trajectories with verifiable rewards for localization quality, valid submissions,\ntool-use compliance, and exploration behavior. Pick a size and watch the build-up:\n\n<StageLift />\n\nGRPO isn't cosmetic — it adds a real slice on top of SFT (+0.021 File F1 at 1B) by teaching the model\nto *verify and stop* rather than imitate a trajectory. And it stacks down the size ladder: even the\n**350M** GRPO model (0.135) beats open models hundreds of times its size, including\nQwen3.5-122B-A10B (0.091), though it still trails the 753B GLM-5.2 (0.186).\n\n## The economics: cheap enough to run on every commit\n\nAccuracy-per-parameter only matters if it turns into accuracy-per-dollar, and this is where small wins\noutright. The full 500-task sweep costs about **$0.71** for Antares-1B, versus **$12.50** for GLM-5.2\nand **$141** for GPT-5.5. Cisco's quoted gaps of 15.2× and 172× line up with the $0.82 top of the\nAntares family's cost range; against the 1B's $0.71 alone they work out to roughly 17.6× and 199×.\nAntares-1B finishes the sweep in **~13 minutes on a single H100** with 16 parallel workers.\n\n<Figure\n  src=\"/articles/antares/fig2.png\"\n  alt=\"Cost-per-evaluation (log, USD) against runtime in hours. 
The Antares family clusters at the bottom-left near $0.60–$0.82 and well under an hour; GLM-5.2 sits at $12.50 (15.2x more), and GPT-5.5 at $141 and ~4.7 hours (172x more).\"\n  caption=\"Estimated cost and runtime for a full benchmark sweep — Antares is 15.2× cheaper than the best open model and 172× cheaper than the frontier (Antares, Figure 2).\"\n/>\n\nThat's the unlock the researchers keep pointing at: as Stanford's Amin Saberi puts it, \"near-frontier\naccuracy on secure code reasoning at a fraction of the cost, fast enough to run on every commit.\" A\nmodel this size runs on-prem, so — in NUS professor Reza Shokri's framing — \"proprietary code never\nleaves the machine,\" which matters most for the universities, public-sector teams, and smaller shops\nthat were priced out of token-heavy frontier models. It ships with a CLI that sweeps a read-only repo\nsnapshot and returns candidates as human-readable, JSON, or **SARIF** for CI/CD triage.\n\n## Where it breaks\n\nCisco is refreshingly specific about the limits, and they follow directly from the design. The\n15-command budget means performance **degrades on large repos** (>10 MB) and on multi-file\nvulnerabilities needing 5+ files of context. It's strong on **grep-able** patterns (CWE-843 Type\nConfusion, CWE-1321 Prototype Pollution) and weak on ones that need real semantic understanding\n(CWE-732 Incorrect Permissions, CWE-667 Improper Locking, CWE-401 Memory Leak). It has an April 2025\nknowledge cutoff, and — by design — it tells you *which* files, never *why*. It's a first-pass triage\naid with a human in the loop, not a replacement for the security toolchain.\n\n## The take\n\nAntares is a clean demonstration of a claim that keeps getting more useful: for a **narrow, well-shaped\ntask**, the behavior that matters (search, verify, backtrack, know when to stop) can be trained into\na 1B model until it beats models 100–750× its size, cheaply enough to run always-on. 
It won't\ngeneralize; it's not supposed to. It's part of Cisco's broader push (alongside its Foundry Security\nSpec and CodeGuard efforts) to make AI security tooling something you can measure and deploy rather\nthan demo — and \"frontier-adjacent accuracy at $0.71 a run, on hardware you own\" is a genuinely\ndifferent offer than one more giant model behind an API.\n\n---\n\n*Source: [Introducing Antares](https://blogs.cisco.com/ai/introducing-antares-the-most-efficient-open-weight-ai-models-for-vulnerability-localization)\n(Cisco Foundation AI, 21 July 2026), the [Antares model cards](https://huggingface.co/collections/fdtn-ai/antares),\nand the [technical report](https://cisco-foundation-ai.github.io/antares/technical-report.pdf). Figures\nare Cisco's; the interactives are mine.*\n","readingTimeMins":6,"url":"https://ai.thesatyajit.com/articles/antares","signal":{"interest":4,"helpful":4,"score":8,"level":4,"label":"High"}},{"title":"Gigatoken: tokenizing at gigabytes per second","description":"Tokenization is the unglamorous first step of every LM data pipeline — and it's slower than you'd guess, even though HuggingFace and tiktoken already run multithreaded Rust. Gigatoken is a drop-in tokenizer that hits GB/s and runs up to ~1000× faster, not by out-threading them but by killing the regex pretokenization step with SIMD and caching every word it's already seen. A walk through why the regex was the bottleneck, the per-CPU numbers, and what 'tokenize the whole internet in 6.5 hours' actually means.","date":"2026-07-22","tags":["systems","tokenization","performance","open-source","explainer"],"draft":false,"cover":"/articles/gigatoken/fig1.png","featured":false,"interest":4,"helpful":4,"kind":"articles","slug":"gigatoken","body":"Tokenization is the step nobody thinks about. 
Before a language model sees a single token, terabytes\nof raw text have to be turned into integer IDs — and if you've ever run that job over a real corpus,\nyou know it's slower than it has any right to be. [Gigatoken](https://github.com/marcelroed/gigatoken)\n(Marcel Rød) is a drop-in tokenizer that does it at **gigabytes per second** — up to roughly **1000×\nfaster** than HuggingFace's `tokenizers`.\n\nThe detail that makes this interesting: it isn't a Python-vs-Rust story. HuggingFace `tokenizers` and\nOpenAI's `tiktoken` are *already* multithreaded Rust. Gigatoken beats them by a factor of a thousand\nanyway — which means the win is algorithmic, not a language swap.\n\n<Figure\n  src=\"/articles/gigatoken/fig1.png\"\n  alt=\"Bar chart of GPT-2 tokenizer throughput on a 12 GB file on an Apple M4 Max: gigatoken 8.27 GB/s, tiktoken 61.5 MB/s, HuggingFace tokenizers 6.2 MB/s — the gigatoken bar dwarfs the other two.\"\n  caption=\"GPT-2 on a 12 GB file, M4 Max: 8.27 GB/s vs 6.2 MB/s for HuggingFace — and the output is validated to match exactly (gigatoken README).\"\n/>\n\n## Why a regex was the bottleneck\n\nA BPE tokenizer does two jobs. First it **pretokenizes** — splits the text into word-ish chunks,\nalmost universally by running a big regular expression over every byte. Then it applies the **BPE\nmerges** within each chunk. Everyone pictures the merges as the expensive part; in practice the\n**regex pretokenization is the majority of the wall-clock**. Toggle it:\n\n<PretokenPipeline />\n\nGigatoken's two moves both target that. It replaces the regex with a hand-written **SIMD** scanner\nthat performs the exact same split at over 2 GB/s per thread, and it **caches pretoken→token\nmappings**, so any word it has already encoded becomes a lookup instead of a re-run of BPE. Real text\nis mostly repeated words, so the cache hits constantly. 
Add minimal Python round-trips and threads\nthat barely touch each other, and you get the numbers below.\n\n## The numbers, per CPU\n\nThis is throughput encoding an 11.9 GB slice of OpenWebText, with the speedup over HuggingFace beside\neach bar. Note the honest split baked into the colors:\n\n<ThroughputBars />\n\n**BPE tokenizers** — GPT-2, Llama, Qwen, DeepSeek, GPT-OSS, and friends — hit ~20+ GB/s on a big EPYC\nand clear three-digit speedups. **SentencePiece-based** ones (Gemma, Mistral, CodeLlama) are the\nweak spot the author flags openly: still faster, but ~10–20×, because Gigatoken hasn't optimized that\npath. And the hardware matters as much as the tokenizer: a 144-core EPYC does GPT-2 at 24.5 GB/s, an\nM4 Max laptop at 8.8 GB/s (its best speedups actually *exceed* 1000× because HF is slower there too),\nand a single 8-core desktop Ryzen still lands around 100×.\n\n## What \"gigabytes per second\" buys you\n\nNumbers this large stop meaning anything without a yardstick, so here's one: at the EPYC's rate you\ncould tokenize **all of Common Crawl — about 130 trillion tokens, effectively the whole public\ninternet — in just under 6.5 hours.** The same job on HuggingFace's tokenizer runs for the better part\nof a year. Drag the dataset size:\n\n<CommonCrawlClock />\n\nThat's the real point. Tokenization is pure overhead on the path to training — you pay it every time\nyou change a vocab, re-shuffle a corpus, or add data — and a tokenizer that runs at disk speed turns a\nmulti-day preprocessing job into a coffee break.\n\n## Using it\n\nTwo modes. **Compatibility mode** is the drop-in: wrap an existing tokenizer and it behaves like the\noriginal, output matched exactly (`gt.Tokenizer(hf_tokenizer).as_hf()` or `.as_tiktoken()`) — a bit of\nspeed traded for bit-for-bit parity. The **Gigatoken API** is the fast path, letting the Rust side\nread files directly and skip Python overhead entirely. 
You can benchmark any HuggingFace tokenizer\nagainst your own data without installing anything:\n\n```bash\nuvx --with tokenizers gigatoken bench 'openai-community/gpt2' owt_train.txt \\\n    --validate --doc-separator \"<|endoftext|>\"\n```\n\nIt's honest about the edges, too: SentencePiece is under-optimized, WordPiece isn't supported yet,\nWindows is untested (use WSL), and there's still ABI3 overhead the author expects to claw back another\n2× from. The README even carries an AI-use disclosure noting most of the code was hand-written, with\nAI help mainly for the user-facing API and porting SIMD strategies across AVX-512/AVX2/NEON.\n\n## The take\n\nGigatoken is a clean reminder that \"already optimized\" is not the same as \"optimal.\" A step everyone\nhad mentally checked off as solved — it's Rust, it's threaded, move on — was still leaving a **1000×**\non the table, because the actual hot loop (a regex nobody questioned) had never been rewritten for the\nhardware. It won't change what your model learns. It will change whether the tokenizer is ever the\nthing you're waiting on again.\n\n---\n\n*Source: the [Gigatoken README and benchmarks](https://github.com/marcelroed/gigatoken#benchmarks)\n(Marcel Rød, 2026). Throughput measured on OpenWebText across EPYC 9565, Apple M4 Max, and Ryzen\n9800X3D CPUs; the figure is the project's, the interactives are mine.*\n","readingTimeMins":4,"url":"https://ai.thesatyajit.com/articles/gigatoken","signal":{"interest":4,"helpful":4,"score":8,"level":4,"label":"High"}},{"title":"Mage-Flow: a 4B image model that bets on its tokenizer","description":"Open image generators keep getting bigger — Z-Image at 6B, Qwen-Image at 20B, FLUX.2 at 32B, Hunyuan-Image at 80B. Microsoft's Mage-Flow bets the other way: a compact 4B stack for text-to-image and instruction editing that stays competitive by co-designing the pieces. 
Mage-VAE cuts tokenizer MACs ~12×/22× at matched fidelity, a native-resolution MMDiT packs any aspect ratio into one model, and fused CUDA kernels give 2.5× faster training — so a 4-step Turbo renders a 1024² image in 0.59s on one A100, at ~18 GB. A walk through the co-design.","date":"2026-07-22","tags":["image-generation","diffusion","open-weights","efficiency","explainer"],"draft":false,"cover":"/articles/mage-flow/fig1.png","featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"mage-flow","body":"The open image-generation frontier has been scaling backbones hard: Z-Image at 6B, Qwen-Image at 20B,\nFLUX.2 at 32B, Hunyuan-Image-3.0 at 80B. [Mage-Flow](https://arxiv.org/abs/2607.19064), from Microsoft's\nMage team, bets the other direction — a **compact 4B stack** for text-to-image generation *and*\ninstruction-based editing that stays competitive with those much larger systems by **co-designing** its\nthree layers instead of just growing one. It's MIT-licensed and released as an open research baseline.\n\n<Figure\n  src=\"/articles/mage-flow/fig1.png\"\n  alt=\"A dense gallery of images generated and edited by Mage-Flow, spanning photorealistic portraits, illustrations, posters with rendered text, and edited scenes across many aspect ratios.\"\n  caption=\"A showcase of Mage-Flow generation and editing across styles and aspect ratios (Mage-Flow, model gallery).\"\n/>\n\nThe organizing idea is \"codec-aligned efficiency\" — *spend representation capacity where the signal\nis* — and it shows up as three co-designed components: a cheap tokenizer, a native-resolution backbone,\nand a fused-kernel training system.\n\n## The tokenizer that pays for everything\n\nThe VAE is the quiet tax on high-resolution diffusion: every image is encoded to latents and decoded\nback, and at 2K that cost dominates. 
**Mage-VAE** is a lightweight pixel-diffusion tokenizer distilled\nfrom the FLUX.2-VAE latent space, using one-step encode/decode with anchor-latent KL regularization. It\nmatches FLUX.2-VAE's reconstruction fidelity while doing far less work:\n\n<VaeEfficiency />\n\nMake the tokenizer an order of magnitude cheaper without losing quality, and every downstream stage —\ntraining and inference — inherits the saving. That's the lever the rest of the stack is built on.\n\n## A backbone that doesn't crop\n\nThe generator itself is a **Native-Resolution Multimodal Diffusion Transformer**. Text prompts are\nencoded by Qwen3-VL; images are turned into compact latents by Mage-VAE; and — the key move — images of\nany resolution and aspect ratio are flattened into **variable-length token sequences** and packed\ntogether with the text tokens in one batch. Per-sample 2D rotary embeddings and variable-length\nFlashAttention let the 4B MMDiT process those packed sequences while preserving each image's native\nspatial layout, so there are no fixed resolution buckets and no center-crops.\n\n<Figure\n  src=\"/articles/mage-flow/fig2.png\"\n  alt=\"Mage-Flow architecture diagram: a Qwen3-VL text encoder and a Mage-VAE encoder feed variable-length packed token sequences into a stack of Native-Resolution MMDiT blocks, then a Mage-VAE decoder; the right panel shows the MMDiT block with separate text and image streams joined by packed multi-head self-attention and 2D-RoPE.\"\n  caption=\"Text (Qwen3-VL) and image (Mage-VAE) tokens are packed and processed by the Native-Resolution MMDiT — modality-specific norms, joint self-attention, per-sample 2D-RoPE (Mage-Flow, Figure 5).\"\n/>\n\nThat's what makes one checkpoint span the whole range — 512² up to 2048², any ratio, out to an extreme\n4:1 panorama:\n\n<NativeResolution />\n\n## One backbone, three rungs\n\nOn top of that foundation Mage-Flow ships a family. 
A **Base** model trained with rectified flow\nmatching is aligned into the **RL** model with **Diffusion-NFT** (better prompt following, text\nrendering, aesthetics, editing fidelity), then distilled into a **4-step Turbo** with Decoupled-DMD and\nadversarial perceptual guidance. The same pattern produces the editing line. Watch the step count — and\nthe latency — fall:\n\n<VariantLadder />\n\nThe Turbo rung is the point: it turns a 30-step diffusion model into a **4-step** one, so at 1024² on a\nsingle A100, generation drops from 4.37s to **0.59s** and editing to **1.02s** — interactive speed from\na model you can actually fit.\n\n## The frontier that matters\n\nMage-Flow's case isn't that it tops any single benchmark — it's that it sits on a favorable\n**quality–speed–memory** frontier. On GenEval (generation) and GEdit-Bench-EN (editing) it's\ncompetitive with or ahead of much larger systems, while its peak GPU memory stays the lowest of the\nfield:\n\n<Figure\n  src=\"/articles/mage-flow/fig3.png\"\n  alt=\"Two scatter plots — GenEval vs inference time for text-to-image, and GEdit-EN vs inference time for editing — with marker area proportional to peak GPU memory. The Mage-Flow points sit toward the upper-left (high quality, low latency) with small markers (low memory).\"\n  caption=\"Quality vs inference time, marker area ∝ peak GPU memory. Mage-Flow sits high-and-left with the smallest markers (Mage-Flow, Figure 4).\"\n/>\n\nThe concrete number is memory: across generation and editing, Mage-Flow's peak GPU memory stays around\n**18–20 GB** — versus 58.8 GB for Qwen-Image, 65.5 GB for HiDream-I1, and a two-GPU 179.6 GB for\nFLUX.2-dev. ~18 GB is a single desktop-class card, which is the whole pitch: a strong generation-and-editing\nmodel that runs *locally*.\n\n## The take\n\nMage-Flow is a clean argument that **image-model efficiency is a co-design problem, not a scale\nproblem**. 
The headline speed (0.59s at 1024²) comes from the tokenizer being cheap, the backbone\navoiding resolution buckets, the kernels being fused, and the sampler being distilled — each layer\npulling its weight so a 4B model can stand next to 20–80B ones. It pairs naturally with the far larger\n[Qwen-Image-3.0](/articles/qwen-image-3): same task, opposite bet on where the capability should live.\nWorth the usual caveat — the weights are MIT but released for research use, and the benchmark framing is\nthe authors' own — but the frontier it draws is a genuinely useful one.\n\n---\n\n*Source: [Mage-Flow: An Efficient Native-Resolution Foundation Model for Image Generation and Editing](https://arxiv.org/abs/2607.19064)\n(Zhang et al., Microsoft, 2026), the [Mage repo](https://github.com/microsoft/Mage), and the\n[model collection](https://huggingface.co/collections/microsoft/mage). Figures are the paper's; the\ninteractives are mine.*\n","readingTimeMins":4,"url":"https://ai.thesatyajit.com/articles/mage-flow","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"Nanbeige4.2-3B: looping a small model up to a big one's depth","description":"A 3B-non-embedding open-weight agentic model that reports beating Qwen3.5-9B and Gemma4-12B on tool-use, code-agent, and most reasoning benchmarks. Now with the technical report: the lever is a Looped Transformer run twice — trained from scratch, not upcycled — plus a 28T-token pretrain, a STEM-to-agentic SFT curriculum, and multi-stage RL with outcome-and-process rewards. 
A walk through the loop, the parameter-efficiency story, and which numbers to trust.","date":"2026-07-22","updated":"2026-07-24","tags":["llm","agents","small-models","open-weights","explainer"],"draft":false,"cover":"/articles/nanbeige-4-2-3b/fig1.png","featured":false,"interest":4,"helpful":4,"kind":"articles","slug":"nanbeige-4-2-3b","body":"[Nanbeige4.2-3B](https://huggingface.co/Nanbeige/Nanbeige4.2-3B) is a compact agentic model — **3B\nnon-embedding parameters** (4B total), Apache-2.0, bilingual EN/ZH — from the Nanbeige LLM Lab at Boss\nZhipin. Its claim is the one every good small model makes: it performs \"well beyond its parameter\nscale,\" reporting wins over **Qwen3.5-9B** and **Gemma4-12B** across tool-use, office-agent, code-agent,\nand most reasoning benchmarks. The mechanism behind the claim is the interesting part — a **Looped\nTransformer** — and with the [technical report](https://huggingface.co/Nanbeige/Nanbeige4.2-3B/blob/main/Nanbeige42_report.pdf)\nnow out, the config is no longer a mystery: it's a two-pass loop, **pretrained from scratch on 28T\ntokens**, then shaped by a four-stage post-training pipeline. Here's the loop, the efficiency story, and\nwhich of the numbers to lean on.\n\n## The loop: depth without parameters\n\nA standard transformer buys reasoning depth by stacking more distinct layers, each with its own\nweights. A **looped** transformer buys it by running the *same* layers several times — so effective\ncompute depth is (physical layers × loops), while the parameter count stays at just the physical\nlayers. You pay for depth in FLOPs, not weights.\n\n<LoopedDepth />\n\nThree design choices in the report make this more than a slogan:\n\n- **Two passes, not more.** The report studies the loop count directly and lands on **2** as the sweet\n  spot: it keeps roughly **75% of a standard Transformer's token efficiency** while adding real\n  capacity. 
More passes buy almost nothing and make training slower and less stable — so the model\n  loops exactly twice.\n- **From scratch beats upcycling.** You *could* pretrain a normal transformer and then convert it into a\n  looped one (\"upcycling\"). Nanbeige compared both and found training the looped architecture from\n  scratch performs **significantly better** — the model needs to adapt its representations to repeated\n  layer reuse throughout pretraining, not have the loop bolted on afterward.\n- **They kept the full KV cache.** Looping twice normally doubles the attention compute, so they tried\n  sharing the KV cache across passes to halve it. It consistently underperformed, so they **kept the\n  full, non-sharing loop** — a deliberate choice to spend inference memory on quality.\n\nIf that recurrent-depth bet sounds familiar, it's the same one as [LOTUS](/articles/lotus-latent-reasoning),\nwhich loops a padded 3B Transformer to reason in its hidden states — and it's a cousin of the\narchitecture-over-scale thesis in [Motif 2.6B](/articles/motif-2-6b). You can explore the tradeoff in\nthe widget above; ×2 is what actually ships.\n\n## A stronger base to start from\n\nBefore any agent training, the looped base model already leads its weight class. Pretrained from scratch\non a 28T-token corpus (larger and cleaner than Nanbeige 4.1's, with up-weighted math, code, and\nsynthetic-QA data — and a first taste of agentic trajectories mixed in), **Nanbeige4.2-3B-Base** beats\nQwen3.5-4B-Base, Gemma4-E4B-Base, and its own predecessor on *every* reported base benchmark: GSM8K\n**92.7**, BBH **81.6**, MBPP **67.6**, SuperGPQA **35.2**, GPQA **53.3**. 
The knowledge gap is the\nclearest — on MMLU-Pro the 3B looped base outscores a 4B Qwen base by twelve points:\n\n<BenchBars\n  title=\"MMLU-Pro — base models (report, Table 1)\"\n  unit=\"%\"\n  bars={[\n    { label: \"Gemma4-E4B\", value: 37.6 },\n    { label: \"Nanbeige4-3B\", value: 47.6 },\n    { label: \"Qwen3.5-4B\", value: 51.8 },\n    { label: \"Nanbeige4.2-3B\", value: 63.8, highlight: true },\n  ]}\n/>\n\nThat head start — the loop plus the refined 28T-token mixture — is what the post-training then turns into\nagentic behavior.\n\n## The results, and how to read them\n\nHere's the headline chart from the report: the same benchmark suite against Gemma4 and Qwen3.5, with\nNanbeige4.2-3B in teal, essentially topping every agent and code panel and most reasoning ones.\n\n<Figure\n  src=\"/articles/nanbeige-4-2-3b/fig1.png\"\n  alt=\"A grid of grouped bar charts over Agent Tasks (MCP-atlas, PinchBench-v2, GDPval, ClawEval, OfficeQA Pro, SWE-bench Verified, SWE-bench Pro, Terminal Bench 2.0) and Reasoning Tasks (HLE, GPQA-Diamond, HMMT-Feb-2026, SciCode), comparing Gemma4-E4B, Gemma4-12B, Qwen3.5-4B, Qwen3.5-9B, and Nanbeige4.2-3B. Nanbeige is highest in almost every panel.\"\n  caption=\"Nanbeige4.2-3B (teal) against models 2–3× its non-embedding size across agent and reasoning tasks (report, Figure 1).\"\n/>\n\nThe cleaner way to see the efficiency argument is to put score against parameters directly. On the\npublic code and agent benchmarks the teal point sits up-and-to-the-left — smaller *and* higher:\n\n<ParamEfficiency />\n\nNow the honesty pass, because not all of these bars are the same kind of evidence:\n\n- **The comparable, public ones** are the strongest signal: **SWE-Bench Verified 63.6** (vs Qwen3.5-9B\n  53.1), **SWE-Bench Pro 46.9** (vs 33.8), **Terminal-Bench 2.0 44.1** (vs 29.2), **LiveCodeBench-V6\n  72.5**, **HMMT-Feb-2026 82.8**. A 3B model at 63.6 on SWE-Bench Verified is genuinely notable if it\n  holds up in third-party harnesses. 
The report evaluates everyone under standardized protocols\n  (Table 3), which is a stronger footing than a model card's loose bars.\n- **Treat the eye-catching ones with care.** GPQA-Diamond **87.4** for a 3B model is near-frontier and\n  surprising — it's self-reported in Think mode, the regime where small models gain the most and where\n  contamination is hardest to rule out. Several agent scores (GDPval, AgentIF-Oneday, OfficeQA-Pro) run\n  through Nanbeige's own agent stack, and **Recruit-Bench is Nanbeige's in-house benchmark** — all\n  reasonable to publish, none of it fully apples-to-apples.\n- **Where it doesn't lead:** Gemma4-12B still wins **SciCode** (38.2 vs 35.6), **IF-Bench** (73.5 vs\n  54.6), and **Recruit-Bench** (69.4 vs 63.3) — strict instruction-following and some scientific coding\n  aren't the strong suit. The report is upfront about this: best on five of six reasoning benchmarks,\n  not all of them.\n\n## Training: synthesize the environments, reward the process\n\nThe report's four-stage post-training pipeline is where the agent behavior is actually built.\n\n**1. SFT with a STEM-to-agentic curriculum.** Starting from the pretrained checkpoint, supervised\nfine-tuning runs in three stages that stretch the context window 64K → 128K → 256K while sliding the\nmix of target tokens from reasoning toward agentic interaction — think first, then act:\n\n<SftCurriculum />\n\nThe trajectories themselves come from large-scale environment *synthesis*: a repository-to-task pipeline\nfor software engineering (mine real repos, reconstruct a sandboxed container, keep only fail-to-pass\nverified tasks), a hybrid real-plus-simulated pipeline for tool use (live MCP servers, Python-reconstructed\nAPIs, and LLM-simulated virtual tools), and an artifact-centric pipeline for office cowork (reports,\nslides, spreadsheets). 
Crucially, the same task is solved by **multiple heterogeneous scaffolds** —\nClaude Code, OpenHands, SWE-agent, Codex-style drivers — so the model learns scaffold-*invariant* repair\nstrategies rather than the quirks of one harness. A **turn-level loss mask** keeps bad intermediate turns\nin context but out of the loss, so the model learns to recover from mistakes without being trained to\nrepeat them.\n\n**2. Two-stage RLHF for hybrid thinking.** A pointwise reward model cleans up the failure modes a small\nmodel is prone to — repetitive reasoning, cyclic reflection, delayed termination, malformed output. The\nreport's interesting finding is that this general-purpose RLHF **generalizes two ways**: *cross-task*\n(fixing repetition and formatting also lifts math, code, and agentic scores — many agent failures are\ngeneration loops, not reasoning errors) and *cross-mode* (behavior learned on Non-Think responses\ntransfers to Think mode). It's RLHF doing more than safety and style.\n\n**3. Length-controlled reasoning RL.** A difficulty-aware penalty discourages over-long reasoning on\nproblems the model already solves reliably, while leaving still-hard problems room to explore — cutting\ntokens without trading away correctness.\n\n**4. Agentic RL with action-centric rubrics.** Finally, outcome rewards are combined with **process\nrewards** — per-turn rubrics scoring tool-call accuracy and the information gained each step — for denser\ncredit assignment over long trajectories. For a model this small, the report finds it more stable to run\nagentic RL on *easier* tasks (short trajectories, higher pass@8) than on the hardest ones. Across the RL\npipeline, accuracy rises while output tokens fall (e.g. AA-LCR 50.0 → 58.7 with average length dropping\n19.5k → 6.7k tokens; PinchBench-V2 55.9 → 74.7).\n\n## Small enough to live on your laptop\n\nThe payoff is deployment. 
At 3B non-embedding params the model is meant to run **locally** — the card\nships recipes for vLLM, SGLang, `llama.cpp`/GGUF, and Ollama (including MLX on Apple silicon), with a\nconfigurable thinking mode (`enable_thinking`, `preserve_thinking`) and XML-format tool calls. Under the\n**OpenClaw** agent framework — evaluated with the *same* scaffold and tools for every model — Nanbeige\nreports beating both Qwen3.5-4B and 9B across all six daily, office, and deep-research benchmarks, with\nthe widest gaps on office workflows (GDPval 68.8 vs 38.0, AgentIF-Oneday 58.9 vs 32.1). The pitch is a\nprivate, on-device assistant that can still carry multi-step tool workflows.\n\n## The take\n\nNanbeige4.2-3B is another data point for a thesis this site keeps returning to: **architecture and\ntraining, not raw scale, are increasingly what a small model needs to punch up a weight class** — the\nsame lesson as [LOTUS](/articles/lotus-latent-reasoning), [Motif 2.6B](/articles/motif-2-6b), and the\nsub-1B security models in [Antares](/articles/antares). The looped-transformer bet is the genuinely\ninteresting bit, and now that the report shows the working — two passes, trained from scratch, full KV\ncache — it reads less like a marketing line and more like a set of measured tradeoffs. The public\ncode-agent numbers are strong enough to take seriously; just keep the in-house scaffolds and\nself-reported reasoning scores in the \"promising, pending third-party replication\" column — which is\nexactly where an open-weight release lets anyone go check.\n\n---\n\n*Source: the [Nanbeige4.2-3B technical report](https://huggingface.co/Nanbeige/Nanbeige4.2-3B/blob/main/Nanbeige42_report.pdf)\n(Nanbeige LLM Lab, 2026) and the [model card](https://huggingface.co/Nanbeige/Nanbeige4.2-3B).\nEvaluations are self-reported, largely in Think mode, some using in-house scaffolds and benchmarks. 
The\nperformance figure is Nanbeige's; the interactives are mine.*\n","readingTimeMins":8,"url":"https://ai.thesatyajit.com/articles/nanbeige-4-2-3b","signal":{"interest":4,"helpful":4,"score":8,"level":4,"label":"High"}},{"title":"NVIDIA Rubin: co-designing a GPU for the shape of agentic inference","description":"Agentic workloads aren't one prompt and one answer — they're sustained inference across many reasoning steps, which stresses long-context attention, MoE decode, KV-cache capacity, kernel handoffs, scale-up communication, and power all at once. NVIDIA's Rubin GPU claims up to 10× more agentic throughput per watt than Blackwell by attacking each of those bottlenecks with a specific feature: 2:4 sparse attention and 4× softmax, shared MoE descriptors, 2× K-dimension GEMMs, 288 GB of 22 TB/s HBM4, tile-level kernel triggering, NVLink counted writes, and rack-scale power smoothing. A map of the co-design — with the vendor-number caveat kept in view.","date":"2026-07-22","tags":["hardware","gpu","inference","systems","explainer"],"draft":false,"cover":"/articles/nvidia-rubin/fig1.png","featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"nvidia-rubin","body":"An agentic workload isn't a single prompt and response — it's sustained inference across many reasoning\nsteps that plan, call tools, verify, and revise over long contexts. That execution pattern stresses a\nGPU differently than a chatbot does: it wants low *per-step* latency, high decode throughput, efficient\nlong-context attention, large KV-cache capacity, and the ability to spread a model across tightly\ncoupled GPUs. 
[NVIDIA's Rubin GPU](https://developer.nvidia.com/blog/inside-nvidia-rubin-gpu-architecture-powering-the-era-of-agentic-ai/)\nis a bet that the right response is to co-design for exactly that pattern — and the headline claim is\n**up to 10× more agentic throughput per unit of energy than Blackwell.**\n\n<Figure\n  src=\"/articles/nvidia-rubin/fig1.png\"\n  alt=\"Pareto-frontier chart of agent throughput versus interactivity for Hopper, Blackwell, and Rubin systems; the Rubin NVL72 + Vera frontier sits far above the others, roughly 10× more agents and about 2× more tool calls.\"\n  caption=\"Pareto frontiers of throughput vs interactivity — a ~10× generational uplift on NVIDIA's internal 2T-MoE agentic workload (NVIDIA, Figure 1).\"\n/>\n\nThat \"10×\" is a vendor number on an internal 2-trillion-parameter MoE workload, so read it as a design\ntarget rather than an independent benchmark. What's more interesting than the single figure is *how*\nit's assembled — because it isn't one trick, it's a checklist of agentic bottlenecks each with its own\nanswer.\n\n## The chip\n\n<Figure\n  src=\"/articles/nvidia-rubin/fig2.png\"\n  alt=\"Annotated die diagram of the NVIDIA Rubin GPU: two compute dies joined by NV-HBI, graphics processor clusters, HBM4 controllers, a central L2 cache, NVLink, and PCIe Gen 6.\"\n  caption=\"The Rubin GPU: two reticle-limited dies unified over the NV-HBI inter-die link (NVIDIA, Figure 2).\"\n/>\n\nPhysically, Rubin is **two reticle-limited compute dies** fused into one package over a high-speed\ninter-die link (NV-HBI): **336 billion transistors, 224 SMs, 896 Tensor Cores**, a third-generation\nTransformer Engine that flexes precision across formats for **up to 50 petaflops of NVFP4**, up to\n**288 GB of HBM4 at 22 TB/s**, and NVLink 6 at **3,600 GB/s** of scale-up bandwidth. 
Those are the raw\nnumbers; the architecture is about turning them into *sustained* utilization.\n\n## One checklist, many bottlenecks\n\nHere's the spine of the whole design. Each agentic-inference bottleneck gets a specific Rubin feature —\nclick through them:\n\n<BottleneckMap />\n\nTwo are worth dwelling on. **Long-context attention** is where agentic runs spend their time, and Rubin\nattacks it from two sides at once: it compresses the intermediate attention scores into a structured\n**2:4 sparse** form so softmax and the second attention GEMM operate on fewer values, and it raises\nexponential throughput so softmax — which becomes the bottleneck once the matrix math speeds up — keeps\npace.\n\n<Figure\n  src=\"/articles/nvidia-rubin/fig3.png\"\n  alt=\"Diagram showing a dense activation matrix converted into a 2:4 sparse matrix plus metadata, applied to the attention and MLP activation stages of a transformer block.\"\n  caption=\"Activations compressed to 2:4 sparse (values + metadata), cutting work in softmax and the second attention GEMM without changing the block's interface (NVIDIA, Figure 5).\"\n/>\n\nThe other is **MoE decode**: as expert counts climb, just locating and moving expert weights becomes\nthe cost. Blackwell tracks one memory descriptor per expert; Rubin keeps a single shared descriptor and\noverrides the pointer and stride inline in the TMA instruction at runtime — less metadata bookkeeping,\nmore GPU time on actual matmuls.\n\n## Generation over generation\n\nThe concrete comparatives NVIDIA gives are memory bandwidth and softmax (exponential) throughput. On\nboth, the Blackwell-Ultra-to-Rubin step is the large one:\n\n<GenSpecs />\n\nMemory is the quiet star here. Decode — the token-by-token generation phase — is fundamentally\n**memory-subsystem bound**, and agentic workloads spend more of their runtime there (long contexts,\nbig KV caches, interactive generation). 
HBM4 doubles the interface width of HBM3e for **2.8× the\nbandwidth** of Blackwell, while 288 GB of capacity keeps trillion-parameter models and their KV state\nresident instead of spilling to slower memory. Capacity and bandwidth do different jobs — one holds the\ncontext, the other feeds the cores — and decode needs both.\n\n## The data center as one unit of compute\n\nThe last move is to stop thinking about a GPU and start thinking about the **AI factory** as a fixed\npower budget. Rubin's efficiency story is rack-scale: Intelligent Power Smoothing uses on-rack energy\nstorage to absorb the sharp power swings of AI workloads (about −10% average draw and −20% on 50 ms\npeaks), and **DSX MaxLPS** turns that reclaimed headroom into more GPUs — up to **40% more** in the\nsame megawatt envelope.\n\n<PowerFactory />\n\n<Figure\n  src=\"/articles/nvidia-rubin/fig5.png\"\n  alt=\"Grid comparison of AI-factory capacity: without DSX MaxLPS much of the power budget is stranded and unused; with DSX MaxLPS up to 40% more GPU slots fit in the same budget.\"\n  caption=\"For a fixed power budget, DSX MaxLPS recovers stranded capacity to provision up to 40% more GPUs (NVIDIA, Figure 10).\"\n/>\n\nAll of this lands in the Vera Rubin NVL72 rack — third-gen MGX, cable-free trays, 45°C liquid cooling,\nhot-swappable NVLink switch trays — designed so compute, networking, cooling, and power behave as one\nexecution domain.\n\n## The take\n\nRubin is a useful lens on where inference hardware is going: not chasing a bigger peak-FLOPs poster\nnumber, but **co-designing every layer around the execution pattern of agents** — sparse long-context\nattention, distributed MoE decode, tighter kernel handoffs, fused scale-up communication, and power\ntreated as the real budget. 
It's the opposite end of the spectrum from running a model on a single\n[DGX Spark](/articles/dgx-spark-batching), but it answers the same underlying question — *how do you keep\nexpensive compute actually busy?* — at rack scale. Keep the caveat in mind: the marquee\nnumbers (10× throughput/watt, 40% more GPUs) are NVIDIA's own, on NVIDIA's workloads, for a platform\nstill rolling out. The architecture is real and specific; the multipliers are the vendor's to prove.\n\n---\n\n*Source: [Inside NVIDIA Rubin GPU Architecture](https://developer.nvidia.com/blog/inside-nvidia-rubin-gpu-architecture-powering-the-era-of-agentic-ai/)\n(NVIDIA, 21 July 2026). Performance numbers are NVIDIA's, several on an internal 2T-MoE workload; the\nfigures are NVIDIA's, the interactives are mine.*\n","readingTimeMins":5,"url":"https://ai.thesatyajit.com/articles/nvidia-rubin","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"The harness is the generalizer: how a scaffold learns to solve longer, unseen tasks","description":"Transformers are unreliable at compositional generalization — recombining known pieces to solve a novel problem. Alex Zhang and Omar Khattab argue the fix doesn't have to live in the weights: a harness that keeps every LM call locally in-distribution (via context offloading and programmatic sub-calls) can be RL-trained on short tasks and then generalize to ones 8–32× longer, and even to entirely different domains — approaching a frontier model while the base Transformer flatlines.","date":"2026-07-21","tags":["agents","harness","llm","systems","explainer"],"draft":false,"cover":"/articles/harness-compositional-generalization/fig1.png","featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"harness-compositional-generalization","body":"Compositional generalization is the thing humans do without thinking: given a novel problem, break it\ninto familiar sub-problems, solve each, recombine. 
Transformers, famously, are *unreliable* at it — a\nmodel that aces 32k-token tasks does not simply keep working at 2M tokens, and one trained to classify\nJeopardy questions does not automatically transfer to spam detection. The usual answer is *scale*:\nmore data, more parameters, and the ragged edges smooth out. [Alex Zhang and Omar\nKhattab](https://alexzhang13.github.io/blog/2026/harness/) make a different argument, and it's a sharp\none: **the capacity for compositional generalization can live in the harness** — the program that wraps\nthe model — rather than in the weights.\n\nTheir one-line thesis: *\"the primary job of the harness should be to carry a higher-level inductive\nbias that can reduce unfamiliar and complex problems to compositions of simpler ones.\"* Scaling data\nstill matters most, they're careful to say — but *\"the machinery that we feed that data into and its\ninductive biases are what will determine the coefficients of that scaling.\"*\n\n## The trick: keep every call locally in-distribution\n\nHere is the mechanism that makes it work, and it's worth sitting with. A task can be wildly\nout-of-distribution *as a whole* — no model trained on 32k-token inputs has ever seen a 2M-token one —\nwhile every *individual* model call inside it stays comfortably in-distribution. Zhang calls this\nproperty **locally in-distribution (LID)**. Drag the length multiplier and watch what it buys you:\n\n<LidBand />\n\nA base Transformer reads the whole long task in one context window. Past the length regime it trained\non, that context is unfamiliar — the model degrades, the phenomenon people now call *context rot*. 
The\nharness that stays LID never puts the model in that position: it decomposes the task so each call sees\nonly a short, familiar slice, and accuracy holds as the task grows.\n\n<Figure\n  src=\"/articles/harness-compositional-generalization/fig1.png\"\n  alt=\"Left: a task prompt plus reasoning fans out into sub-queries and tool calls, each marked with an eye icon meaning an individual LM call sees only that in-distribution slice. Right: the same content laid out as one long flat sequence that an individual LM would have to read whole, labelled out-of-distribution / unseen.\"\n  caption=\"Locally in-distribution: decomposed, each call sees a short in-distribution slice (left); flattened into one sequence, the whole thing is unseen (right) (paper, Figure 3).\"\n/>\n\n## How the harness does it: RLM\n\nThe concrete harness they study is a **Recursive Language Model (RLM)**, and it earns LID with two\nmoves. The first is **context offloading**. Instead of appending each raw observation — a tool output,\na retrieved document, a sub-agent's answer — to the running context, the RLM stores it in a REPL\nvariable and passes only a tiny symbolic *handle*. The root LM's view stays a short, task-agnostic\nprefix; the bulk data sits in the environment, peeked at through small probes. Drag the step count and\nwatch the two context sizes diverge:\n\n<ContextOffload />\n\nThe second move is **programmatic sub-agent calling**. Sub-agents behave like functions: they run,\nand their output lands in a REPL variable rather than being spliced back into the caller's context.\nZhang stresses these are equal partners — *\"programmatic sub-calling is equally as important as context\noffloading\"* — because together they're what keep the root context from bloating step over step, which\nis exactly what would drag it out of distribution.\n\n<Figure\n  src=\"/articles/harness-compositional-generalization/fig2.png\"\n  alt=\"Two side-by-side comparisons. 
Left: with context offloading the root LM sees a short REPL-based prefix; without it, it sees a large raw context block. Right: with programmatic sub-calling, sub-agent outputs stay out of the root context; without it, every tool and sub-agent output is appended to what the root LM sees.\"\n  caption=\"The two mechanisms. Offloading keeps a short task-agnostic prefix; programmatic sub-calls keep sub-agent outputs out of the root context entirely — what the root LM sees stays small (paper, Figure 5).\"\n/>\n\nStandard agent patterns — ReAct, CodeAct, and by extension most coding agents — fail LID precisely\nbecause they append everything to a growing history. The RLM is the same idea run in reverse: the\nharness works to *keep the model's window small and familiar*, and lets the environment hold the state.\n\n<Callout type=\"note\">\nThis is a different claim from the two harness pieces already on this site. [Agent\nharnesses](/articles/agent-harness) is about *engineering the loop* — tools, context policy, the\nself-improving outer loop. [The harness effect](/articles/harness-effect) is about *token economics* —\nsame model, cheaper orchestration. This one is about *generalization*: the harness as an inductive bias\nyou can train, so the scaffold itself learns to solve tasks it never saw.\n</Callout>\n\n## It generalizes — and the base model doesn't\n\nThe payoff is measured, not asserted. Zhang RL-trains the RLM on **short** tasks and evaluates on long\nheld-out ones, across six long-context benchmarks. The training signal is short-task reward; the\ninteresting question is whether the eval reward on much longer tasks *tracks* it.\n\n<Figure\n  src=\"/articles/harness-compositional-generalization/fig3.png\"\n  alt=\"Six training-curve panels — MRCRv2, GraphWalks, LongBench-Pro, OOLONG, OOLONG-Pairs, Ada-LEval. 
In each, the RLM's held-out long-task eval reward climbs with training and approaches the dotted RLM(GPT-5.5) reference line, while the base Transformer with YaRN stays flat and low.\"\n  caption=\"Train short, evaluate 8–32× longer. The RLM's long-task eval reward (blue) rises with training and approaches frontier RLM(GPT-5.5); the base Transformer + YaRN (orange) flatlines (paper, Figure 6).\"\n/>\n\nIt does. Training only on short tasks — 150 steps on `Qwen3-30B-A3B-Instruct` — the RLM generalizes to\ntasks **8–32× longer**, with eval reward that *\"more closely matches the train reward on shorter\ntasks,\"* while the base Transformer's eval stays flat even as its train reward rises. On MRCRv2,\nGraphWalks, and OOLONG the trained 30B RLM approaches or exceeds a frontier `GPT-5.5` RLM. Zhang reports\nroughly **10× the eval lift for the same train lift** versus a vanilla Transformer.\n\n<BenchBars\n  title=\"Trained on short tasks, evaluated this much longer\"\n  unit=\"×\"\n  bars={[\n    { label: \"MRCRv2 (64k → 2M)\", value: 32, highlight: true },\n    { label: \"Ada-LEval (8k → 128k)\", value: 16 },\n    { label: \"GraphWalks (128k → 1M)\", value: 8 },\n    { label: \"LongBench-Pro (32k → 256k)\", value: 8 },\n    { label: \"OOLONG (32k → 256k)\", value: 8 },\n    { label: \"OOLONG-Pairs (8k → 32k)\", value: 4 },\n  ]}\n/>\n\nAnd it isn't only length. In a separate **strategy generalization** test, the RLM trained on one domain\ntransfers to a completely different one — Jeopardy-style TREC classification to spam/ham; essay\nsimilarity to *math-problem* similarity; Twitter stance detection to error-detection in chat logs.\nAgain the RLM's train reward tracks its eval reward across the domain gap, and again the base\nTransformer plateaus. The decomposition strategy the harness learns is the thing that transfers, not\nthe surface task.\n\n<Callout type=\"warning\">\nIt isn't free. 
The RLM runs **1.5–3× slower** than the base Transformer per sample — multiple LM calls\nper step, sub-call latency — and for a couple of benchmarks (MRCRv2) it needed a light *\"nudge to\ndecompose\"* to converge on a generalizing strategy rather than a brittle one. Zhang's read: at scale no\nsupervision should be necessary, but a hint buys sample efficiency.\n</Callout>\n\n## The take\n\nThe reflex in this field is to push every capability into the weights and let scale sort it out. This\nwork is a reminder that *where* an inductive bias lives is a design choice. A harness that holds each\ncall locally in-distribution turns \"solve a 2M-token task\" into \"solve a sequence of 32k-token tasks,\"\nand that reframing is learnable — you can RL-train it on cheap short tasks and watch it generalize to\nlong, unseen, even cross-domain ones. It fits the pattern the other harness pieces on this site keep\ncircling: the layer *around* the model is not glue. Here it's the part that generalizes.\n\n---\n\n*Source: [Language model harnesses are compositional generalizers](https://alexzhang13.github.io/blog/2026/harness/)\n(Alex L. Zhang, with Omar Khattab), July 2026. Figures are the post's; the two interactives are mine.*\n","readingTimeMins":7,"url":"https://ai.thesatyajit.com/articles/harness-compositional-generalization","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"Laguna S 2.1: an 8B-active model that won't give up","description":"Poolside's new agentic coding model is a 118B/8B-active MoE with a 1M-token context — small enough to run on a single DGX Spark, and the most capable coding model in its weight class by a wide margin. It gets there not by adding raw intelligence but by training behaviors: persistence, verification, and a willingness to backtrack. 
A walk through the weight-class story, the thinking-mode lever, three real trajectories, and the post-training that produced them — built in under nine weeks by the same Model Factory.","date":"2026-07-21","tags":["llm","agents","coding","open-weights","explainer"],"draft":false,"cover":"/articles/laguna-s-2-1/fig1.png","featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"laguna-s-2-1","body":"[Laguna S 2.1](https://poolside.ai/blog/introducing-laguna-s-2-1) is poolside's new agentic coding\nmodel: a **118B-parameter Mixture-of-Experts with 8B activated per token**, a **1M-token context** in\nboth thinking and no-thinking modes, and open weights on day one under OpenMDW-1.1. It went from the\nstart of training to launch in **under nine weeks**, and it is small enough to run on a single\n[NVIDIA DGX Spark](/articles/dgx-spark-batching). The claim poolside makes is precise and, unusually,\nfalsifiable — they released full evaluation trajectories for every trial: it is *the most capable\nagentic coding model in its weight class, by a wide margin.*\n\nThe interesting part is how they say they got there. Not by making the model smarter in the raw sense —\nby making it **behave** better: verify more, take less for granted, stop declaring victory early, keep\ngoing. It's the same family, and the same industrialized pipeline, as the models from\n[Laguna's Model Factory](/articles/laguna-model-factory) — S 2.1 is the third release in three months.\n\n## Punching above its weight class\n\nHere's the whole pitch in one plot: Terminal-Bench 2.1 score against model size, log axis. 
Toggle to\n**active params** and Laguna S 2.1 — at 8B active — sits above open models that activate five to seven\ntimes as many parameters, and a dozen points under a closed frontier (GPT-5.6, Claude Fable 5) whose\nmodels don't even disclose their size.\n\n<WeightClassScatter />\n\nRead that honestly, the way poolside does: on Terminal-Bench 2.1 its **70.2%** is well short of the\n88% ceiling that Kimi K3, GPT-5.6 Sol, and Claude Fable 5 share. Their own framing is that the top of\nthese benchmarks is saturating — \"as the frontier advances, top scores cluster in the 70–90% range and\nmodels that behave very differently end up no more than a few points apart.\" Where it actually leads is\n**SWE-Bench Multilingual**, edging the much larger field:\n\n<BenchBars\n  title=\"SWE-Bench Multilingual — resolved %\"\n  unit=\"%\"\n  bars={[\n    { label: \"Laguna S 2.1 (8B active)\", value: 78.5, highlight: true },\n    { label: \"Qwen 3.7 Max\", value: 78.3 },\n    { label: \"DeepSeek-V4-Pro Max (49B active)\", value: 76.2 },\n    { label: \"Tencent Hy3 (21B active)\", value: 75.8 },\n    { label: \"Nemotron 3 Ultra (55B active)\", value: 67.7 },\n  ]}\n/>\n\nPoolside is candid about the softer spots too — on **DeepSWE v1.1**, the least saturated of the set\n(frontier models spread from 54% to 73%, and some 1T-plus models score under 10%), S 2.1 lands at\n40.4%, mid-pack, and it reports that in its own harness rather than the leaderboard's. The takeaway\nisn't \"it wins\" — it's \"score-per-parameter,\" and on that axis nothing its size is close.\n\n## The second axis: behaviors, not intelligence\n\n> What we've done in this model is not necessarily add more intelligence, but improve the behaviors\n> that lead to a more capable model: more verification, less taking things for granted, not declaring\n> victory early, and being more persistent. — Pengming Wang, co-head of Applied Research\n\nThe concrete lever behind that is **test-time compute**. 
Of every model poolside has trained, S 2.1\nhas the largest gap between its no-thinking and max-thinking modes — its internal monologue is doing\nreal work, especially on the hard problems. Flip it:\n\n<ThinkingDelta />\n\nThere's no user-facing low/medium/high dial yet — it's off or max (default), with the model choosing\nits own budget. Poolside says it has watched coherent, productive reasoning run for **hours and\nhundreds of thousands of tokens**, which is also why the 1M-token context matters: long agentic\nsessions genuinely accumulate that much working state.\n\n## Seeing it work\n\nBenchmarks are a proxy; the trajectories are the evidence. Poolside published three unedited runs.\n\n**A browser engine from an empty folder.** Asked to build an HTML/CSS rendering engine in vanilla\nJavaScript, Laguna S 2.1 worked one 50-minute session, 181 steps, no human intervention — building the\nfull pipeline (HTML tokenizer → DOM → CSS parser with specificity → cascade → box-model layout →\ncanvas-2D renderer). The resourceful part: with no vision of its own, it needed a way to *check* its\noutput, so it ran **headless Chromium to read the canvas back and compared screenshots numerically**\nagainst a real browser.\n\n<Figure\n  src=\"/articles/laguna-s-2-1/fig1.png\"\n  alt=\"A gallery app: the model's own canvas rendering of HTML snippets on the left, beside the hosting browser's iframe rendering of the same markup on the right, with a header reporting pixel dimensions and the measured difference.\"\n  caption=\"The model's engine (left) beside the browser's own rendering of the same markup (right) — it built its own reference check (poolside, Laguna S 2.1 trajectory).\"\n/>\n\nThe verbatim prompt, if you want to reproduce it:\n\n```text\nyour job is it to build a simple browser engine (just html/css) in\njavascript to demonstrate the capabilities of poolsides new \"Laguna S\"\nmodel. the goal is to take render html snippets in a canvas like a real\nbrowser. 
to demonstrate it the engine, build a self-contained single\npage app that showcases a gallery of multiple html snippets and renders\nthem side by side (canvas with our render engine + iframe letting the\nhosting browser render it for real for comparison). support for most\ncommon layout and styling elements\n```\n\n**Optimizing poolside's own harness.** Pointed at the agent harness that trains and serves the models,\nin an automated loop S 2.1 made it **5.2% faster with ~70% lower memory allocation** — finding an\nO(n²) string-concatenation in token accumulation and swapping in buffers, then memoizing trajectory\nmaterializations and pre-allocating slices. When speedups got marginal, it *kept going*, switching its\nattention to allocations because those were still measurable. (Validated with Go's race detector and\n`go vet` gating — real gains, not hidden race conditions.)\n\n**Re-deriving Erdős problem #397, in Perl.** With no Python in the sandbox, it found Perl, did exact\nprime factorizations there, conjectured a family, and proved it — a closed-form infinite family of\neight-index solutions to a problem open for over 50 years. It's a **re-discovery** (GPT-5.2 Pro solved\nit in January 2026), and poolside says so plainly — but the construction is structurally different\n(eight indices growing linearly vs the known six-index family), so it's a fresh derivation, not recall.\n\n## What actually changed in training\n\nS 2.1 is a **scale-up of the Laguna XS family, on exactly the same pre-training data as XS 2.1** — the\nstep up was scale, training-code fixes, and small recipe tweaks, not new data. 
Almost everything that\nseparates it comes from **post-training**, in two stages: an SFT stage (partly synthetic) that\nbootstraps capability, then RL reserved for tasks the model can't already solve at a high pass rate.\nIt's also poolside's first model to run **RL in FP8 precision**.\n\nThe task corpus is the substance: **409k environments** — 83k terminal-focused, 168k standard\nsoftware-engineering — mostly grounded in real code history, the largest source reproducing **~38,000\nreal commits across ~17,000 repositories**, plus merged-PR reconstruction, injected-bug fixing, and a\nnew agentic step: given a repo, install every dependency and get its test suite running. And three\nchanges to the loop map straight onto the \"persistence\" story:\n\n- **More generous rollout budgets** — longer timeouts, more tokens per turn, more turns per task than\n  any earlier model (likely why it keeps going).\n- **A new sandbox** — background processes, selective network blocking to shrink the reward-hacking\n  surface, artifact caching.\n- **Multi-harness rollouts** — the same prompts rolled out across several agent scaffolds, so it\n  learns behaviors that transfer instead of overfitting to one harness.\n\nThat cadence — M.1 and XS.2 in April, XS 2.1 in July, S 2.1 weeks later — is exactly what the\n[Model Factory](/articles/laguna-model-factory) was built to enable: reproducible foundations, so each\nrelease inherits the last one's work automatically. 
Poolside is upfront about the rough edges shipped\nto move fast: some tool-schema slips in *third-party* harnesses (it leans on memory of its own tool\ninterface), invalid JSON in tools that expect array arguments, and occasional overthinking on\ncompetition math.\n\n## The take\n\nLaguna S 2.1 is the clearest example yet of a thesis worth taking seriously: **how a model works is a\nseparate axis from how smart it is, and it's trainable.** Persistence, verification, and backtracking\naren't emergent gifts of scale here — they're the product of longer rollouts, a better sandbox, and\nmulti-harness RL, poured into an 8B-active model that then holds its own against giants. It won't top\nthe leaderboards, and poolside doesn't pretend it does. But \"frontier behavior at a size you can run on\none desktop box\" is a more useful thing to ship than another point of benchmark score — and it's the\n[Model Factory](/articles/laguna-model-factory) that makes shipping it every few weeks look routine.\n\n---\n\n*Source: [Introducing Laguna S 2.1](https://poolside.ai/blog/introducing-laguna-s-2-1) (poolside,\n21 July 2026); benchmark figures as published there (pass@1 averaged over 3–4 attempts), with full\ntrajectories at trajectories.poolside.ai. The browser-engine screenshot is poolside's; the\nvisualizations are mine.*\n","readingTimeMins":7,"url":"https://ai.thesatyajit.com/articles/laguna-s-2-1","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"LongStraw: fitting million-token RL onto a fixed GPU budget","description":"Inference reaches million-token context; RL post-training is mostly stuck at 256K or below. LongStraw (MindLab + Fudan) closes that gap as a systems problem, not an accuracy one: capture the long prompt as detached resident state, replay one GRPO response at a time, and 2M-token — even 4.45M-token — steps fit on eight H20 GPUs. 
What it measures, the real numbers, and the honest fine print it is careful to print about what it does not claim.","date":"2026-07-21","tags":["rl","systems","long-context","llm","explainer"],"draft":false,"cover":"/articles/longstraw-2m-rl/fig1.png","featured":false,"interest":3,"helpful":3,"kind":"articles","slug":"longstraw-2m-rl","body":"There's a widening gap between how long a context a model can *use* and how long a context you can *train*\nit on with RL. Inference servers now run million-token windows; RL post-training, as publicly described,\nmostly sits at **256K tokens or below** and leans on length generalization at deployment. That gap\nmatters most for agents, whose observations, tool outputs, retrieved documents, and past decisions pile\nup into the history that conditions the next action. [LongStraw](https://github.com/MindLab-Research/longstraw)\n(MindLab and Fudan) sets out to close it — and it is unusually disciplined about framing the result as a\n*systems feasibility* claim, not an accuracy one.\n\n<Callout type=\"note\">\nRead this as a systems paper. LongStraw shows you **can execute** million-token GRPO steps under a fixed\nGPU budget — the memory transaction completes, on real hardware, with finite values. It does **not**\nclaim these runs improve reasoning accuracy, and it is careful to say so repeatedly. I'll keep that\ndistinction front and center, because it's the whole integrity of the work.\n</Callout>\n\n## Why GRPO is the hard case\n\nInference has an easy memory story: prefill the prompt, keep the state you need to decode, throw the\nforward graph away. GRPO can't. It samples **G** responses for one prompt and updates the policy from\ntheir *relative* rewards — so old-policy scoring, reference scoring, and every policy response all depend\non the *same* long history, and policy learning must also retain or reconstruct what backward needs.\nQuadratic attention plus long-lived backward state make GPU memory the wall. 
The paper's framing: the\npractical limit is **state lifetime, replay, and distributed ownership — not the attention kernel alone.**\n\n## The move: change the graph boundary\n\nLongStraw's core data structure is **resident state**: evaluate the shared prompt with *no autograd* and\nkeep only the minimal, model-native state later tokens need — recurrent state, KV pages, latent pages —\nnot the full prompt activation graph. Formally it stores z̄ₚ = stopgrad(zₚ(θ)). Its core algorithm is\n**response replay**: restore that boundary, score the old and reference branches graph-free, rebuild\n*one* policy response under autograd, backpropagate, and pop back to the boundary.\n\n<Figure\n  src=\"/articles/longstraw-2m-rl/fig1.png\"\n  alt=\"Two-panel diagram. (a) Conventional full-sequence autograd: each group member runs a policy forward with autograd that retains prompt and suffix activations, so the live graph spans the whole prompt P plus the response R_i, then backward and one step. (b) Captured prompt state with serial response replay: a no-grad prompt capture produces a read-only state, old and reference scores are computed pre-step, then responses are replayed serially, each forward-backward freeing its graph before the next, so the live autograd graph spans only R_i.\"\n  caption=\"Changing the graph boundary, not the objective. Conventional autograd keeps prompt-dependent activations in every member graph (top); LongStraw captures a read-only prompt state and replays one response graph at a time (bottom). It computes the direct response gradient and, by its own note, does not recover the gradient through the captured prompt state (paper, Figure 2).\"\n/>\n\nThat last point is the honest heart of it. A full-sequence gradient has two terms — the direct response\ngradient and the prompt-side term that flows back through zₚ(θ). LongStraw keeps the first and drops the\nsecond. 
In the paper's own words, printed right on the figure: *\"conditional-response gradient only …\nfull-sequence gradient parity is not claimed.\"* The measured objective is **response-only execution**.\n\n## Serial replay makes G a schedule, not a memory multiplier\n\nBecause responses are replayed one at a time, the live autograd graph is bounded by a single response —\nso the group size **G** becomes a scheduling/time dimension instead of a memory multiplier. This is the\nlever that makes million-token steps fit. Drag G and watch the measured peak barely move:\n\n<GroupSerialization />\n\nOn eight H20 GPUs, Qwen3.6-27B completes an exact-attention response-only GRPO step at **2,097,152\npositions** for both G=2 and G=8 — and going from G=2 to G=8 adds only **0.208 GB** of peak allocated\nmemory per rank (97.503 → 97.711 GB). Holding all G response graphs, by contrast, would scale activation\nmemory with G and overrun the budget. That's the trade the design makes: G costs wall-clock (271 s per\nmember), not memory.\n\n## Two incompatible architectures, same transaction\n\nThe design is instantiated for two structurally different models, which is most of the engineering:\n\n- **Qwen3.6-27B** (8× H20, CP8) — 48 Gated-DeltaNet layers + 16 full-attention layers, dense FFNs. It\n  keeps compact recurrent state for GDN and right-sized CP8-sharded KV pages for full attention, composes\n  rank-local softmax statistics into the exact global attention output, replays response blocks in\n  reverse, and allocates K/V gradient pages only when a response backward touches them. NF4 QLoRA,\n  116.7M trainable parameters.\n- **GLM-5.2** (32× H20, CP32/EP32) — 78 MLA/DSA attention layers (21 index + 57 IndexShare) and a\n  3-dense/75-MoE feed-forward stack routing each token to 8 of 256 experts (+1 shared). 
It holds\n  CPU-resident MLA latent pages and DSA index-key pages, reconstructs the sparse selection across\n  context-parallel owners, and replays the real Megatron router, EP32 all-to-all, and expert compute\n  under whole-layer checkpointing. (This is the same [GLM 5.2](/articles/glm-5-2) whose IndexShare makes\n  a million-token context cheap.)\n\nQwen solves *preserve and replay dense/recurrent history*; GLM extends it to *dynamic sparse selection,\nrouted experts, and cross-rank communication*.\n\n## The numbers, and the ceiling\n\nThe headline is how far the training context moves — from the usual quarter-million to millions of\npositions on fixed hardware:\n\n<BenchBars\n  title=\"RL post-training context reached — millions of positions\"\n  unit=\"M\"\n  bars={[\n    { label: \"Qwen prefix-reuse, 8× H20 (4.45M)\", value: 4.456, highlight: true },\n    { label: \"Qwen 2M / GLM 2M step\", value: 2.097 },\n    { label: \"typical RL post-training\", value: 0.256 },\n  ]}\n/>\n\nAt **4,456,448 positions**, one captured prefix supports **eight consecutive G=8 optimizer cycles** — 64\nresponse replays — at **83.894 GB per rank**, comfortably under the H20's 150.755 GB. On 32 H20 GPUs, GLM\ncompletes a deterministic 2M execution across all 78 layers with two full backward passes. The memory\nplot shows how the operating points sit against the ceiling:\n\n<Figure\n  src=\"/articles/longstraw-2m-rl/fig2.png\"\n  alt=\"Scatter plot of peak memory per rank in GB against context positions in millions, with a dashed line marking the H20 ceiling at 150.755 GB. A prefix-only pass at 2.1M sits near 59 GB, a conditional-response run near 97 GB, the replay / 8-step pass at 4.45M near 83 GB, and train-block proxy passes climb toward the ceiling near 4.5M where a proxy variant OOMs.\"\n  caption=\"Peak memory per rank vs. context positions, Qwen on eight H20s. 
The 8-step reuse run at 4.45M sits at ~83 GB while train-block proxies climb into the ceiling near 4.5M — the operating point is set by state lifetime and replay, not a single kernel (paper, Figure 10).\"\n/>\n\nNote what the plot says and what it doesn't: 2M is *\"an achieved operating point rather than a measured capacity\nceiling.\"* They ran it; they don't claim it's the max.\n\n## The refresh knob\n\nCapturing a million-token prompt is the dominant cost, so LongStraw reuses one capture across several\noptimizer steps. But each update moves the parameters, and the cached prompt state goes stale. A 1M\nfresh-prefix oracle measures exactly how stale — and turns \"how long can I reuse a prefix\" into a number:\n\n<PrefixReuseDrift />\n\nReuse is nearly free for a step or two (loss drifts ~0.04–0.12%) but clearly is not by step four. So the\nrefresh interval becomes a measured control, not an assumption.\n\n## The fine print (which is the point)\n\nThis is where LongStraw earns its credibility. It defines **four levels of validation** and states plainly\nwhere each path lands:\n\n<Callout type=\"warning\">\n**What is not claimed.** The runs establish *response-only execution*, not full-sequence gradient parity.\nQwen's global attention merge uses a BF16 numerator; its distributed **optimizer finalization is\nincomplete** — the prototype all-reduces dQ but leaves page-owner K/V gradients rank-local, so the eight\nAdamW instances are locally applied, not replica-equivalent. The GLM 2M run predates restored gradient\nfinalization, so its stronger global-update claim is *unestablished*. And the 2M workloads are\n**synthetic**: β=0, unclipped surrogate, old and reference scores coincide at step one, rewards and\nadvantages are synthetic. 
The real online sampling→reward→train loop (vLLM-DAPO-MATH) is validated only\nin **short-context, archived external runs** — *\"not a 2M online rollout, repeated policy learning, or\nfull-sequence gradient parity,\"* and no evidence of long-context policy improvement.\n</Callout>\n\nSpelling that out is not a weakness of the paper — it's the substance. A lesser report would have shown a\n2M run and let you assume it means a better model. LongStraw shows a 2M run and tells you exactly which\nnarrow, well-defined thing completed.\n\n## The take\n\nThe useful reframing here is that **long-context RL is a state-lifetime and ownership problem, not an\nattention-kernel problem.** Once you treat the long prompt as detached resident state and replay\nresponses serially, the memory transaction — not the kernel — is what sets how far you can train, and\nmillion-token steps fit on inventory you already have. It's a real feasibility milestone, and it sits\nnext to the other systems-first takes on RL cost like\n[frontier RL is cheaper than you think](/articles/frontier-rl-cheaper). What's left — and the paper is the\nfirst to say it — is closing the distance from \"the step executes\" to \"the gradient is exact and the\npolicy actually improves at 2M tokens.\" That's the next paper, honestly labeled.\n\n---\n\n*Source: [LongStraw: Long-Context RL Beyond 2M Tokens under a Fixed GPU Budget](https://arxiv.org/abs/2607.14952)\n(Zhou et al., MindLab & Fudan University), July 2026. Figures are the paper's; the two interactives are\nmine, built on its measured numbers.*\n","readingTimeMins":8,"url":"https://ai.thesatyajit.com/articles/longstraw-2m-rl","signal":{"interest":3,"helpful":3,"score":6,"level":2,"label":"Solid"}},{"title":"Qwen-Image-3.0: chasing “useful” instead of “good-looking”","description":"Qwen's third-generation image model reframes the goal from pretty pictures to deployable ones. 
Its one-word pitch is “Real”: 4.5k-token prompts that lay out nine infographics in a single pass, text legible down to 10px, nested UIs inside UIs, and rendering across 12 languages with real world knowledge. A walk through what it claims — with the interactives to feel the levers, the official examples, and the gaps independent testers found.","date":"2026-07-21","tags":["image-generation","multimodal","diffusion","qwen","explainer"],"draft":false,"cover":"/articles/qwen-image-3/cover.jpg","featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"qwen-image-3","body":"Every image model since the first Stable Diffusion has been optimised, implicitly, for one thing:\nmaking a picture you'd want to look at. [Qwen-Image-3.0](https://qwen.ai/blog?id=qwen-image-3.0) —\nthe third generation of Qwen's image line — says the flex out loud and then walks away from it. Its\nwhole pitch is one Chinese word, **实 (\"Real\")**: not *prettier*, but *useful* enough to put in a\nproduction pipeline. Where 1.0's keyword was \"Precision\" and 2.0's was a five-word mouthful\n(\"Precision, Variety, Completeness, Beauty, and Authenticity\"), 3.0 collapses to a single claim and\nsplits it three ways — **Rich Content, Authentic Details, Deep Knowledge**.\n\nTwo things to set expectations first. This was a **capabilities announcement**, not a tech report:\nQwen published no architecture, parameter count, or benchmark table — only examples. And unlike the\nopen-weight 1.0 and 2.0 (Apache-2.0 releases the community could run), 3.0 arrived **closed**,\nreachable through [Qwen Chat](https://chat.qwen.ai/?inputFeature=t2i) and Alibaba's API rather than a\nweights download. So this is a piece about *what it claims to do* and how to reason about it — with\nthe model's own examples, the levers behind them, and an honest look at where it cracks.\n\n## Rich Content: how much fits in one image\n\nThe demo that opens the announcement looks like a single tidy math slide. 
It isn't — it's one cell of\na **3×3 grid**, generated in a single pass, where every cell is a different dense infographic:\nprojectile motion, the Sylow theorems, a parasitology explainer, a bank internal-control diagram, a\ncell-DNA comparison, and more. Nine unrelated technical posters, each with its own Chinese and English\ntext, formulas, and charts, laid out without bleeding into one another.\n\n<Figure\n  src=\"/articles/qwen-image-3/fig1.jpg\"\n  alt=\"A 3-by-3 grid of nine distinct, dense infographics — each a different technical subject with its own text, formulas, diagrams and cartoon figures — generated as one image.\"\n  caption=\"Nine complex infographics in one pass. Fully specifying the grid takes a ~3.7k-token prompt (Qwen-Image-3.0, official blog).\"\n/>\n\nThe thing actually being demonstrated isn't the grid — it's the **prompt budget**. Describing that\ngrid precisely takes about 3.7k tokens, and 3.0 raises the instruction ceiling to **4.5k tokens**,\nseveral times what 2.0 would reliably follow. 
\"How much you can draw\" turns out to be gated by \"how\nlong an instruction the model will still honour.\" Drag it:\n\n<PromptBudget />\n\nThere's a second axis to \"rich\": not just width, but **depth** — interfaces nested inside interfaces.\nOne instruction renders, outer to inner, a VSCode window holding a Qwen Chat window holding a WeChat\nthread holding a pour-over-coffee poster, each preserving its own authentic chrome.\n\n<NestedUI />\n\n<Figure\n  src=\"/articles/qwen-image-3/fig3.jpg\"\n  alt=\"A picture-in-picture-in-picture image: a VSCode editor containing a Qwen Chat interface containing a WeChat conversation containing a coffee poster, each rendered in its own authentic UI style.\"\n  caption=\"Logical nesting, not collage — four distinct UI grammars held at once, each inside the last (Qwen-Image-3.0, official blog).\"\n/>\n\n## Authentic Details: how finely it draws\n\nIf Rich Content is about *how much*, Authentic Details is about *how fine*. The headline number here\nis **10px**: text small enough that most generators turn it into a texture that merely looks like\nwriting, which 3.0 claims to keep genuinely legible. 
Legible small text is the single hardest thing in\nimage generation — it's where the difference between \"renders language\" and \"renders squiggles\" lives.\n\n<TinyText />\n\nThe stress test is an academic paper: a full page of algebraic-geometry derivations with superscripts,\nsubscripts, curly braces, fraction bars, and multi-line aligned equations — the kind of layout where a\nsingle wrong glyph is obvious.\n\n<Figure\n  src=\"/articles/qwen-image-3/fig2.jpg\"\n  alt=\"A generated full page of an academic mathematics paper with multi-line LaTeX-style formula derivations, section headings and dense body text, legible at small size.\"\n  caption=\"A generated page of an algebraic-geometry paper — LaTeX-style typesetting held together down to small type (Qwen-Image-3.0, official blog).\"\n/>\n\nThe same precision shows up in **editing**, not just generation. Given a damaged traditional\nink-wash painting, the model restores the missing regions — matching brushwork, ink gradients, and\nfeather texture, removing mould spots — while leaving the original composition intact.\n\n<Figure\n  src=\"/articles/qwen-image-3/fig4a.jpg\"\n  alt=\"A traditional Chinese ink-wash painting of eagles in combat, visibly damaged with mould spots and missing areas.\"\n  caption=\"Before: a damaged eagle-combat ink painting (Qwen-Image-3.0, official blog).\"\n/>\n\n<Figure\n  src=\"/articles/qwen-image-3/fig4b.jpg\"\n  alt=\"The same ink-wash eagle painting after editing: damage and mould removed, missing regions filled in with brushwork consistent with the original style.\"\n  caption=\"After: restoration with brushwork consistent with the original, damage removed (Qwen-Image-3.0, official blog).\"\n/>\n\n## Deep Knowledge: how broadly it draws\n\nThe third axis is coverage. 
Qwen claims native rendering across **12 languages** (Japanese, Korean and\nSpanish are shown), 100-plus artistic styles, and a spread of real UI grammars — web pages, games,\nlivestream overlays — backed by enough world knowledge to build things like a scientific figure from a\nphoto. Given an insect photograph, the model keeps the subject and adds taxonomic labels, morphological\nannotations, a magnified detail inset, and a scale bar: a publication-ready research figure.\n\n<Figure\n  src=\"/articles/qwen-image-3/fig5.jpg\"\n  alt=\"A dense, illustrated knowledge infographic about whale sharks, combining labelled illustrations with large amounts of small body text across multiple regions.\"\n  caption=\"A whale-shark knowledge infographic — illustration plus a lot of small, accurate body text (Qwen-Image-3.0, official blog).\"\n/>\n\nIt can even reach *outside* itself: the model connects to the web to pull current facts — the\nannouncement generates a weather-forecast card for a specific city and date — and composes with known\nfigures, e.g. staging Qi Baishi and Van Gogh co-hosting a livestream. This is the \"productivity tool,\nnot toy\" thesis in one line: newspapers, storyboards, exam papers, and UI mockups are the target, not\nwallpaper.\n\n## The catch\n\nHere's where the honesty matters. Every image above is Qwen's own curated demo, and independent\ntesters have been blunter than the blog: reports describe it as roughly a **notch below** the best\nproprietary generators (GPT-Image-class models, Nano Banana Pro), with **real gaps** once you leave the\nreel — Korean text with typos, charts that break, and data-plotting tasks (a GDP chart) with points\nplaced in the wrong spots. Legible-at-10px and accurate-at-10px are different claims, and factual\nlayout — where the *numbers* have to be right, not just crisp — is exactly where it slips.\n\nNone of that erases the direction, which is the interesting part. 
Optimising an image model for\n*deployable* output — long controllable prompts, small legible text, nested real UIs, factual layout —\nis a more useful target than one more bump in aesthetic quality, even when the first release doesn't\nfully hit it. The trade for it is openness: 1.0 and 2.0 were weights you could run; 3.0 is an API you\ncall.\n\n## The take\n\nQwen-Image-3.0 is best read as a **repositioning**, not a benchmark win. \"Real\" is a good target —\nimage generation is far more valuable as a layout-and-typography engine for documents and interfaces\nthan as an art toy — and the three levers it leans on (a 4.5k-token instruction ceiling, a 10px text\nfloor, and world-knowledge grounding) are the right ones for that job. Just hold the demos and the\nindependent tests in the same hand: the ceiling is real, the floor is real, and the accuracy at that\nfloor is still catching up. It pairs naturally with the site's piece on\n[Qwen Audio 3.0 TTS](/articles/qwen-audio-3-tts) — the same \"3.0, make it deployable\" push, one\nmodality over.\n\n---\n\n*Source: [Qwen-Image-3.0: Rich Content, Authentic Details, Deep Knowledge](https://qwen.ai/blog?id=qwen-image-3.0)\n(Qwen team, 2026-07-16). Architecture, parameters and benchmarks were not published with the release;\navailability and independent-testing notes via secondary coverage. All images are Qwen's official\nexamples, shown for commentary.*\n","readingTimeMins":6,"url":"https://ai.thesatyajit.com/articles/qwen-image-3","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"EAGLE-3: making the draft model scale, and a from-scratch build","description":"Speculative decoding makes an LLM emit several tokens per forward pass by having a cheap draft model propose and the big model verify — losslessly. 
EAGLE-3 fixes the thing that stopped the draft model from improving with more data: it drops feature prediction for direct token prediction (Training-Time Test) and fuses low/mid/high features, unlocking a scaling law and up to 6.5x speedup. Plus a walk through tiny-speculators, a from-scratch EAGLE-3 trainer on Qwen3-8B.","date":"2026-07-20","tags":["inference","speculative-decoding","llm","systems","explainer"],"draft":false,"cover":"/articles/eagle-3-speculative-decoding/fig1.png","featured":false,"interest":4,"helpful":4,"kind":"articles","slug":"eagle-3-speculative-decoding","body":"Autoregressive decoding is the tax every LLM pays: one forward pass, one token, and each pass is\nmemory-bound — you stream all the weights through the chip to produce a single token. **Speculative\ndecoding** is the trick that beats it. A small, cheap **draft** model guesses the next few tokens; the\nbig **target** model verifies all of them in one parallel forward pass and keeps the longest prefix it\nagrees with. The output is provably identical to normal sampling — you just get several tokens per\nexpensive pass instead of one. [EAGLE-3](https://arxiv.org/abs/2503.01840) (Li et al.) is the current\npeak of the EAGLE family, and its contribution is subtle: it makes the *draft model itself* keep\ngetting better with more training data. Up to **6.5× faster**, losslessly.\n\n## Draft and verify\n\nThe whole game is **acceptance**: how often the target agrees with the draft's guesses. A token only\ncounts if every token before it was accepted too, so its odds decay geometrically — which is why\ndeeper drafts hit diminishing returns and the real lever is raising the acceptance rate. 
Drag it:\n\n<DraftVerify />\n\nThe mean **accepted length τ** — tokens produced per target forward pass — is essentially the speedup.\nVerification is exact: the target only ever emits a token it would have sampled itself (rejected\ntokens are resampled from the corrected distribution), so nothing about quality changes. Speculative\ndecoding is a pure latency win with no accuracy cost.\n\n## What EAGLE-3 changes\n\nEAGLE's draft model was clever: instead of predicting tokens directly, it autoregressively predicted\nthe target's **top-layer feature** and reused the target's own LM head. But that came with a\nfeature-prediction loss that **constrained the draft** — and the authors found that scaling up its\ntraining data barely helped. EAGLE-3 makes two changes.\n\n<Figure\n  src=\"/articles/eagle-3-speculative-decoding/fig1.png\"\n  alt=\"Three-panel schematic. Top: EAGLE predicts the target's feature f then a token. Middle: direct token prediction. Bottom: Training-Time Test, where the draft model consumes its own previous unconstrained outputs a across simulated steps.\"\n  caption=\"Training-Time Test (bottom): the draft is trained on its own multi-step outputs, not just one-step feature targets — so what it sees at inference matches what it saw in training (paper, Figure 3).\"\n/>\n\nFirst, it **drops feature prediction for direct token prediction**, and trains the draft the way it\nwill actually run — a technique they call **Training-Time Test (TTT)**. 
The problem it solves is\nconcrete: EAGLE's first drafted token got accepted often, but its *second* collapsed, because at step\ntwo the draft feeds on its own step-one output, which drifts away from the features it was trained on.\nTTT folds that multi-step rollout into training so the draft learns to consume its own predictions.\n\nSecond, freed from the feature constraint, the draft fuses the target's **low, middle, and high**\nfeatures instead of only the top layer — concatenated and projected down — giving it a richer basis\nfor predicting tokens two and three ahead.\n\n<Figure\n  src=\"/articles/eagle-3-speculative-decoding/fig2.png\"\n  alt=\"Diagram of the EAGLE-3 draft pipeline: low/mid/high target features are fused, combined with the embedding of the sampled token, passed through a single decoder layer, and sampled; subsequent steps substitute the draft's own outputs for unavailable features.\"\n  caption=\"The draft pipeline: fuse (low, mid, high) features + the sampled token's embedding, run one decoder layer, sample; later steps substitute the draft's own outputs for features it can't yet see (paper, Figure 5).\"\n/>\n\n## A scaling law for inference acceleration\n\nHere's the payoff, and it's the reason the paper exists. With the feature constraint removed, the\ndraft model's speedup **keeps climbing as you give it more training data** — a relationship never seen\nfor EAGLE, which flatlines. EAGLE-3 was trained on roughly **8× more data** than EAGLE, and the curve\nis still going up.\n\n<Figure\n  src=\"/articles/eagle-3-speculative-decoding/fig4.png\"\n  alt=\"Two-panel plot of speedup versus training-data scale relative to ShareGPT. EAGLE-3's curve rises steadily; EAGLE's plateaus.\"\n  caption=\"The scaling law: EAGLE-3's speedup rises with draft-training data; EAGLE's plateaus. 
Inference acceleration now scales with data, like everything else (paper, Figure 1).\"\n/>\n\nThat reframes draft-model training as a data-scaling problem instead of a fixed architectural trick —\nthe same lesson the rest of the field keeps relearning.\n\n## The numbers\n\nOn Vicuna-13B (greedy), EAGLE-3 averages **5.51× speedup** across five tasks, peaking at **6.47×** on\nHumanEval with an accepted length of **7.54** — a clear step over EAGLE-2, and multiples over Medusa\nand vanilla speculative sampling:\n\n<BenchBars\n  title=\"Mean speedup vs vanilla decoding — Vicuna-13B, temp 0\"\n  unit=\"×\"\n  bars={[\n    { label: \"EAGLE-3\", value: 5.51, highlight: true },\n    { label: \"EAGLE-2\", value: 4.22 },\n    { label: \"EAGLE\", value: 3.05 },\n    { label: \"Hydra\", value: 2.80 },\n    { label: \"Medusa\", value: 2.12 },\n    { label: \"std. spec. sampling\", value: 1.92 },\n  ]}\n/>\n\n<Figure\n  src=\"/articles/eagle-3-speculative-decoding/fig3.png\"\n  alt=\"Bar chart of speedup ratios for many methods across chat and reasoning models, with EAGLE-3 the tallest.\"\n  caption=\"Speedup across models and methods; EAGLE-3 leads on chat and reasoning targets alike (paper, Figure 2).\"\n/>\n\nThe part that matters for real serving is that the win survives **batching**. Most speculative methods\ndegrade past batch ~16 (the extra draft compute stops paying off when the GPU is already busy); in\nSGLang, EAGLE-3 still delivers **+38% throughput at batch 64**, and at batch 1 it hits **373 tok/s vs\n158 for plain SGLang** — a 2.36× serving speedup. 
It's also compatible with EAGLE-2's dynamic draft\ntree, so the two stack.\n\n## From scratch: tiny-speculators\n\nIf you want to see the machinery without the framework, [`tiny-speculators`](https://github.com/junuxyz/tiny-speculators)\n(junuxyz) is a from-scratch EAGLE-3 **trainer**, verifier = Qwen3-8B, in five clear stages: prepare\nShareGPT data → extract the verifier's early/mid/late/final hidden states via vLLM → train the draft\nwith the **3-step TTT rollout** (using PyTorch **FlexAttention** for the non-causal TTT mask) → export\nto vLLM's speculators format. The draft is a **single decoder layer** that fuses three hidden states\n(projected `3H → H`), concatenates the sampled token's embedding, and predicts the next token.\n\nIt's honest about being a small educational build. Its 60k-sample checkpoint on HumanEval reaches a\n**27.4% draft-token acceptance rate** and a **mean accepted length of 1.82**, cutting single-request\np50 latency from **2.61s to 1.78s** — while openly noting that at high concurrency it stayed *below*\nplain serving. That gap between a from-scratch draft and the paper's 6.5× is exactly the value of the\nscaling law: acceptance is a data problem, and EAGLE-3's whole point is that more of it keeps helping.\n\n## The take\n\nSpeculative decoding was already the standard latency trick; the interesting move in EAGLE-3 is\nturning the *draft model* into something that scales. Drop the constraint that stopped it learning\n(feature prediction), train it on its own multi-step outputs (Training-Time Test), give it more of the\ntarget's internal features to look at, and its acceptance — and therefore the speedup — rises with\ndata instead of plateauing. 
It pairs naturally with the site's pieces on\n[multi-token prediction](/articles/multi-token-prediction) and\n[DeepSeek's DSpark](/articles/deepseek-dspark): the same idea — predict more per step, verify exactly —\nattacked from three directions.\n\n---\n\n*Source: [EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test](https://arxiv.org/abs/2503.01840)\n(Li, Wei, Zhang, Zhang) and the [tiny-speculators](https://github.com/junuxyz/tiny-speculators) repo.\nFigures are the paper's; the interactive is mine.*\n","readingTimeMins":6,"url":"https://ai.thesatyajit.com/articles/eagle-3-speculative-decoding","signal":{"interest":4,"helpful":4,"score":8,"level":4,"label":"High"}},{"title":"Motif 2.6B: differential attention and PolyNorm, trained at scale","description":"Motif 2.6B is a small model that makes two architecture bets you almost never see in a shipped model — differential attention and a learned polynomial activation (PolyNorm) — and trains them on 2.5T tokens with a data-mixing schedule that swings the corpus from broad web text into math, code, and reasoning. The payoff: a 2.6B model that matches 7–8B models on HumanEval and MATH. A walk through the two mechanisms, the data schedule, WSD + checkpoint averaging, and the honest benchmark picture.","date":"2026-07-20","tags":["llm","architecture","pretraining","attention","explainer"],"draft":false,"cover":"/articles/motif-2-6b/fig1.png","featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"motif-2-6b","body":"Most \"small strong model\" reports are the same model everyone else ships — a dense pre-norm\nTransformer with RoPE, GQA, and SwiGLU — trained on better data. Motif Technologies'\n[Motif 2.6B](https://huggingface.co/Motif-Technologies/Motif-2.6B)\n([arXiv 2508.09148](https://arxiv.org/abs/2508.09148)) is not that. 
It makes two architecture bets\nthat mostly live in papers, not products — **differential attention** and a learned polynomial\nactivation called **PolyNorm** — and it's the first model I've seen train *both* at real scale\n(2.5T tokens). The result is a 2.6B model that goes toe-to-toe with 7–8B models on code and math.\nTwo mechanisms, a data schedule with a plan, and a couple of training tricks make it work.\n\n## The block\n\nMotif is a 32-layer, hidden-size-2048 dense Transformer — 16 attention heads, no GQA (16 KV heads),\na 219,520-token vocabulary, RoPE with θ = 500,000. Standard pre-norm skeleton. What's swapped in are\nthe two coloured boxes: the attention sublayer is Differential Attention, and the feed-forward\nsublayer's nonlinearity is PolyNorm.\n\n<Figure\n  src=\"/articles/motif-2-6b/fig1.png\"\n  alt=\"Motif 2.6B architecture: a pre-norm block with a Differential Attention sublayer that projects Q1, Q2, K1, K2, V and computes [softmax(Q1 K1ᵀ) − λ softmax(Q2 K2ᵀ)]V followed by GroupNorm, and a feed-forward sublayer whose PolyNorm activation combines normalized X, X², and X³.\"\n  caption=\"The two swaps: Differential Attention (left, five projections and a subtracted second softmax map) and the PolyNorm feed-forward, which composes normalized X, X², X³ (paper, Figure 1).\"\n/>\n\nThese weren't picked by taste. Motif ran controlled ablations at 0.6B, 1.8B, and 4.6B under a fixed\n3e20-FLOP budget, testing QK-Norm, Cross-Layer Attention, and Normalized GPT (nGPT) alongside these\ntwo — and differential attention plus the polynomial activation are what survived.\n\n### Differential attention\n\nOrdinary attention leaks. A softmax over the whole context puts non-trivial mass on tokens that have\nnothing to do with the query — the more filler in the window, the more the signal gets diluted.\nDifferential attention (Ye et al.) 
fixes this by computing **two** attention maps per head and\nreturning their difference: `attn = [softmax(Q₁K₁ᵀ) − λ·softmax(Q₂K₂ᵀ)]V`. The second map is trained\nto model the common-mode noise, so subtracting a λ-scaled copy of it cancels the leak. Drag λ:\n\n<DiffAttention />\n\nIt's the same move as a differential pair in analog circuits, or noise-cancelling headphones:\nmeasure the noise separately, subtract it, keep the signal. Sparser attention shows up downstream as\nbetter long-context retrieval, less hallucination, and cleaner in-context learning — and a GroupNorm\nafter the subtraction keeps the (now possibly negative) scores stable.\n\n### PolyNorm\n\nThe other swap is the activation. Instead of committing the whole network to one fixed curve, PolyNorm\nmakes the nonlinearity **learned**: a degree-3 polynomial over normalized powers of the input,\n`PolyNorm(x) = a₁·n(x) + a₂·n(x²) + a₃·n(x³)`. Each layer learns its own aᵢ, so it can bend toward\nnear-linear, saturating, or S-shaped as needed, and pick up higher-order interactions a single\nactivation can't. Mold it:\n\n<PolyNorm />\n\nThe per-power normalization is the load-bearing detail — it's what stops the x³ term from exploding\nthe activation scale, which is exactly why cubic activations don't normally survive contact with a\nreal training run. Motif's is capped at degree 3 on purpose.\n\n## A schedule for the data\n\nHere's the training idea I like most. Motif runs a **data-mixing scheduler** — a schedule for the\n*corpus*, conceived like a learning-rate schedule. 
The 2.5T-token dataset is partitioned into eight\ndomain groups whose sampling ratios move **linearly** from a start mix to an end mix across training.\nDrag the training-progress handle:\n\n<DataSchedule />\n\nEarly training is broad and web-heavy, teaching general language; the final, best-behaved tokens\npour into Korean, code, math, and reasoning — the dense skills the model will actually be graded on.\nThe base corpus is aggregated and filtered from **DCLM, TxT360, FineWeb2, and FineMath**, plus an\nin-house Korean corpus (there wasn't a good open one).\n\nTwo more training details are worth stealing. The LR follows **WSD** (warmup-stable-decay): a peak of\n`5e-4` held flat for the first 2T tokens, then annealed to 25% of peak over the final 0.5T. And\nthroughout, Motif does **checkpoint averaging** — every 8B tokens it takes a simple moving average of\nthe six most recent checkpoints and feeds the averaged weights straight back into the training loop,\na cheap, continuous smoothing that costs nothing at inference. A stage-2 anneal (~500B tokens) then\nstretches RoPE from θ = 10,000 to 500,000 (ABF) and extends context 4K → 16K for the long-context\nvariant in the last 80B tokens.\n\n## The finetuning stack\n\nThe post-training is where the reasoning gets sharpened. SFT is small — under 15B tokens, ~5M samples\n— but heavily engineered.\n\n<Figure\n  src=\"/articles/motif-2-6b/fig2.png\"\n  alt=\"Motif dataset and post-training pipeline: a dataset stage (deduplication, length filtering, Exam-CoT QA, rejection-sampling synthesis, EvolKit, dataset fusion) feeding SFT dataset mixtures, then base models go through large-scale supervised fine-tuning and coarse-grained then fine-grained DPO.\"\n  caption=\"The data and post-training pipeline: synthesized and fused SFT mixtures, then SFT, then coarse-to-fine DPO (paper, Figure 2).\"\n/>\n\nA few of the moves are unusually specific. 
**Exam-CoT QA** synthesizes ~5M standardized-exam\nmultiple-choice items with step-by-step rationales. **EvolKit** (Auto Evol-Instruct, with Qwen3-8B)\nrewrites existing SFT samples into harder ones. And **dataset fusion** compresses several samples into\none cohesive conversation — they found Qwen3-8B just concatenated the inputs, so they used GPT-4o to\nactually fuse them, packing more knowledge per token. Rejection sampling against a reward model prunes\nthe weak generations. Then alignment is two-stage DPO — coarse-grained (Tulu 3 preference mixtures)\nthen fine-grained (MagpieLM + LMSys arena data).\n\n## Punching above its weight\n\nThe scoreboard is the point of all of it. On code and math, a 2.6B model lands where 7–8B models do —\nand sometimes past them:\n\n<BenchBars\n  title=\"HumanEval (0-shot, pass@1)\"\n  unit=\"\"\n  bars={[\n    { label: \"Llama 3 8B\", value: 72.6 },\n    { label: \"Motif 2.6B\", value: 68.3, highlight: true },\n    { label: \"Mistral 7B\", value: 30.5 },\n    { label: \"Gemma 2 2B\", value: 20.1 },\n  ]}\n/>\n\n<BenchBars\n  title=\"MATH (4-shot, maj@4)\"\n  unit=\"\"\n  bars={[\n    { label: \"Motif 2.6B\", value: 40.2, highlight: true },\n    { label: \"Gemma 2 2B\", value: 16.0 },\n    { label: \"Mistral 7B\", value: 13.1 },\n  ]}\n/>\n\nOn HumanEval it's within about four points of Llama 3 8B (68.3 vs 72.6) and more than doubles\nMistral 7B; on MATH it clears Mistral 7B by 3×. GSM8K tells the same story — 75.7 (8-shot, maj@8)\nagainst Mistral's 52.2. The honest caveat is knowledge: on MMLU the smaller model can't fake breadth.\n\n<BenchBars\n  title=\"MMLU (5-shot)\"\n  unit=\"\"\n  bars={[\n    { label: \"Llama 3 8B\", value: 69.4 },\n    { label: \"Mistral 7B\", value: 60.1 },\n    { label: \"Motif 2.6B\", value: 58.0, highlight: true },\n    { label: \"Gemma 2 2B\", value: 52.2 },\n  ]}\n/>\n\nMMLU rewards parameters you simply don't have at 2.6B, so Motif trails the 7–8B models there while\nstill beating Gemma 2 2B. 
That's the shape of the whole result: reasoning and code you can *train in*\nwith the right architecture and data schedule; raw knowledge still scales with size.\n\n## The take\n\nMotif 2.6B is a bet that the small-model recipe isn't finished — that there's still room to change the\narchitecture, not just the data. Differential attention buys cleaner attention, PolyNorm buys a learned\nnonlinearity, the data scheduler front-loads language and back-loads skill, and WSD plus checkpoint\naveraging smooth the ride. None of it is exotic in isolation; the report's contribution is showing the\ncombination survives 2.5T tokens and lands a 2.6B model in 7–8B territory on the things you can teach.\nThe pieces that \"only work in papers\" turn out to work in a model — which is the most interesting kind\nof result.\n\n---\n\n*Source: the [Motif 2.6B technical report](https://arxiv.org/abs/2508.09148) (Motif Technologies) and\nits [model card](https://huggingface.co/Motif-Technologies/Motif-2.6B). Figures are the paper's;\nthe interactive diagrams are mine. Differential Attention is from Ye et al.; the polynomial-activation\nidea predates PolyNorm's use here.*\n","readingTimeMins":6,"url":"https://ai.thesatyajit.com/articles/motif-2-6b","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"Audex: audio, speech, and text through one decoder — without the text tax","description":"NVIDIA's Nemotron-Labs-Audex-30B-A3B bolts full audio intelligence — ASR, translation, audio understanding, TTS, audio generation, speech-to-speech — onto a strong text MoE LLM, and the surprise is what doesn't happen: the text scores barely move. One decoder, one extended vocabulary, audio in as continuous embeddings and out as discrete tokens. 
A walk through the architecture, the no-regression result, and the training recipe that buys it.","date":"2026-07-20","tags":["audio","llm","multimodal","speech","explainer"],"draft":false,"cover":"/articles/nemotron-audex/fig1.png","featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"nemotron-audex","body":"There's a tax you pay when you make a text LLM multimodal. Bolt on a speech encoder, fine-tune for\naudio tasks, and the model's text scores — reasoning, knowledge, instruction-following — tend to\nsag. The audio ability arrives; some of the intelligence leaves. NVIDIA's\n[Audex](https://huggingface.co/nvidia/Nemotron-Labs-Audex-30B-A3B) (Nemotron-Labs-Audex-30B-A3B,\n[arXiv 2607.05196](https://arxiv.org/abs/2607.05196)) is a unified audio-text LLM whose whole point\nis that it *doesn't* pay that tax. It does ASR, speech translation, audio understanding,\ntext-to-speech, text-to-audio, and direct speech-to-speech — and keeps the frontier text scores of\nthe model it was built on.\n\nThe design is almost aggressively simple, which is the interesting part.\n\n## One decoder, one vocabulary\n\nMost audio LLMs treat audio as a side channel: an encoder produces features, an adapter head or a\nseparate module consumes or emits them. Audex refuses the split. It is a **single Transformer\ndecoder** built on **Nemotron-Cascade-2-30B-A3B** — a hybrid Mamba–Transformer mixture-of-experts,\n30B total parameters with ~3B active, a 1M-token context. Audio *inputs* are turned into continuous\nembeddings by an audio encoder plus MLP adapters and dropped straight into the **text embedding\nspace**. 
Audio *outputs* are discrete tokens drawn from an **extended vocabulary** (reported at\n205,312 entries) that the model predicts in exactly the same autoregressive stream as text.\n\n<Figure\n  src=\"/articles/nemotron-audex/fig1.png\"\n  alt=\"Audex architecture: speech or general audio enters an audio encoder and MLP adapters into the Nemotron-Cascade-2-30B-A3B backbone alongside text tokens; the backbone emits text tokens directly, speech tokens into a speech decoder, and audio tokens into an audio decoder.\"\n  caption=\"Audex reads audio as continuous embeddings projected into the text space, and writes text, speech, and general-audio tokens from one extended vocabulary; the discrete speech/audio tokens are detokenized by dedicated decoders (paper, Figure 1).\"\n/>\n\nThe upshot is that \"task\" is not an architectural mode — it's just which token types show up in the\nstream. Transcription is audio-in, text-out. TTS is text-in, speech-tokens-out. A spoken reply is a\nsingle sequence that switches from text to speech tokens partway through. Flip between them:\n\n<UnifiedDecoder />\n\nAt the far end, the discrete tokens hit dedicated detokenizers: a **speech decoder** (XCodec2, with\na causal variant for streaming) reconstructs the waveform for TTS and speech-to-speech, and an\n**audio decoder** (XCodec1 with an enhancement VAE) handles general text-to-audio. Because the\nwhole thing is \"just an LLM emitting tokens,\" it stays compatible with standard LLM training and\ninference infrastructure — no bespoke serving path for the audio head.\n\n## The text tax, measured\n\nHere is the claim that matters, and it's a claim you can only make with a table. Scored on\n**text-only** benchmarks against other recent audio LLMs, Audex doesn't just avoid regressing — it\nsits at or near the top of nearly every one:\n\n<TextTax />\n\nRead the paper's own table and the pattern is stark. 
Audex 30B-A3B posts **AIME 2025 91.2**,\n**MMLU-Pro 78.9**, **GPQA-Diamond 74.9**, **ArenaHard v2 81.6**, **IFBench 77.8**,\n**LiveCodeBench v6 85.3**, and a **1M-token context** with **99.4 / 83.4** on 256K/1M\nneedle-in-a-haystack — while audio models like Voxtral and MiMo-Audio show the tax plainly and even\na strong omni peer trails on reasoning and long context.\n\n<Figure\n  src=\"/articles/nemotron-audex/fig2.png\"\n  alt=\"Table of text-benchmark results comparing Step-Audio R1.1 33B, Voxtral Small-24B, MiMo-Audio 7B, Qwen3-Omni 30B-A3B Thinking, Qwen3.5-Omni Flash 35B-A3B, and Audex 30B-A3B and 2B across reasoning, knowledge, alignment, long-context, and agentic benchmarks.\"\n  caption=\"Audex retains frontier text intelligence — reasoning, knowledge, alignment, 1M-token long context, and agentic tool use — where audio-tuned peers regress (paper, Table 5).\"\n/>\n\nThat \"marginal or no regression\" line in the abstract is the whole thesis. The audio ability is\nadditive, not a trade.\n\n## What it does with the audio half\n\nKeeping the text brain would be a hollow win if the audio were weak. It isn't. 
On the **OpenASR**\nleaderboard (English), Audex 30B-A3B averages **6.82 WER** across eight test sets — 1.34 on\nLibriSpeech clean, a table-best **1.76 on SPGI** — competitive with Whisper-large-v3 and the omni\nmodels while being a single unified system:\n\n<Figure\n  src=\"/articles/nemotron-audex/fig3.png\"\n  alt=\"WER results on the OpenASR leaderboard across LibriSpeech clean/other, AMI, Earnings22, GigaSpeech, SPGI, TED-LIUM, and VoxPopuli, comparing Whisper, Canary, Qwen-Omni variants, and Audex 30B-A3B and 2B.\"\n  caption=\"ASR word-error-rate on the OpenASR leaderboard; Audex is competitive with dedicated ASR models while also doing translation, understanding, TTS, and generation (paper, Table 8).\"\n/>\n\nAround that sit speech translation, audio question-answering, text-to-speech and text-to-audio\ngeneration, and — the one that closes the loop — **speech-to-speech**: spoken input to spoken\noutput in one model, no cascaded ASR→LLM→TTS pipeline with its latency and error stacking.\n\n## How you buy no-regression\n\nThe recipe is where the tax actually gets dodged. Audex is trained on **157.4B audio tokens and\n320.5B text tokens** — note the text is still the majority — through multi-stage supervised\nfine-tuning, then a **text-only** Cascade RL pass plus multi-domain on-policy distillation.\n\n<Figure\n  src=\"/articles/nemotron-audex/fig4.png\"\n  alt=\"Audex training pipeline: two SFT curricula (multi-stage adding one capability at a time, versus a consolidated single-stage), followed by Cascade-2-style RL with MOPD to produce the final Audex model.\"\n  caption=\"Two SFT curricula — capability-at-a-time vs consolidated single-stage — followed by text-domain RL and on-policy distillation, the step that keeps the text intelligence intact (paper, Figure 3).\"\n/>\n\nTwo details do the work. 
First, the SFT is studied as **two curricula** — a multi-stage one that\nadds capabilities one at a time (text SFT → audio warmup → audio-gen → audio-gen + understanding)\nand a consolidated single-stage one that mixes everything at once. Second, and more important, the\nreinforcement-learning stage that follows is applied in the **text domain**, the same Cascade-2 RL\nthe backbone already knew. Audio is learned as *additional* token vocabulary on a preserved base,\nand the final polish happens where the text intelligence lives — so it's reinforced, not eroded.\n\n## The take\n\nAudex's bet is that you don't need a clever fusion architecture to add audio to an LLM — you need to\nrefuse to treat audio as special. Encode it into the text embedding space on the way in, emit it as\nextra vocabulary on the way out, keep the majority of your training tokens textual, and do your RL\nwhere the reasoning is. The reward is a model that hears, speaks, and generates audio while still\nscoring 91 on AIME and holding a million-token context. The \"unified\" in unified audio-text LLM\nturns out to mean *boring on purpose* — and that's the compliment.\n\n---\n\n*Source: [Unified Audio Intelligence Without Regressing on Text Intelligence](https://arxiv.org/abs/2607.05196)\n(Zhifeng Kong et al., NVIDIA) and the [model card](https://huggingface.co/nvidia/Nemotron-Labs-Audex-30B-A3B).\nFigures are the paper's; the interactive diagrams are mine.*\n","readingTimeMins":5,"url":"https://ai.thesatyajit.com/articles/nemotron-audex","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"Qwen Audio 3.0 TTS: an instructable LM-plus-flow-matching speech stack","description":"Alibaba's Qwen-Audio-3.0-TTS pairs an autoregressive LM with a flow-matching decoder in the CosyVoice lineage, fronted by a 12.5 Hz supervised tokenizer that keeps the token stream short. 
It adds free-style instruction control, 86 inline non-verbal tags, 16 languages plus 20 Chinese dialect regions, one-pass long-form to three minutes, and 48 kHz super-resolution — and it tops the Artificial Analysis TTS leaderboard. A walk through the stack, the frame-rate trick, and the control surface.","date":"2026-07-20","tags":["audio","tts","speech","generative","explainer"],"draft":false,"cover":"/articles/qwen-audio-3-tts/fig1.png","featured":false,"interest":3,"helpful":3,"kind":"articles","slug":"qwen-audio-3-tts","body":"Modern text-to-speech has mostly converged on a shape: a language model predicts discrete speech\ntokens from text, and a generative decoder turns those tokens back into a waveform.\n[Qwen-Audio-3.0-TTS](https://funaudiollm.github.io/qwen-audio-3.0-tts/) — Alibaba's latest, in the\nCosyVoice lineage — runs that shape hard and adds the things that make a TTS model actually usable:\ninstruction control, inline non-verbal events, 16 languages and 20 Chinese dialect regions,\none-pass long-form synthesis, and 48 kHz output. It currently sits at **#1 on the Artificial\nAnalysis Text-to-Speech leaderboard**. Here's how it's built.\n\n## The stack\n\nTwo models do the heavy lifting. An **autoregressive LM** predicts a sequence of discrete speech\ntokens from the text (and, for zero-shot cloning, a reference clip). A **flow-matching decoder**\nturns those tokens into a mel-spectrogram. A **vocoder** reconstructs and super-resolves the\nwaveform to 48 kHz. Fronting all of it is a **12.5 Hz supervised speech tokenizer** — the piece\nthat quietly sets the model's latency. Click through the stages:\n\n<TtsStack />\n\nSplitting content (the LM) from voice-and-prosody (the flow-matching decoder) is what makes the\nmodel *instructable*: you can change *what* is said and *how* it's said through different parts of\nthe system. 
It's also why the model is robust to a bad reference — a noisy or reverberant prompt\nstill conditions the LM, and there's no explicit denoising step to break.\n\n## Why 12.5 Hz\n\nThe tokenizer's frame rate is the single most consequential number in an autoregressive TTS system,\nbecause the LM emits one token per step and the steps are serial. Fewer tokens per second of audio\nmeans fewer decode steps means less latency. Most neural speech codecs sit at 25–75 Hz;\nQwen's supervised tokenizer runs at **12.5 Hz**. Drag it and watch the step count move:\n\n<FrameRate />\n\nThe word doing the work is **supervised**. A raw reconstruction codec at 12.5 Hz would throw away\ntoo much to sound good; a *supervised* tokenizer is trained to keep exactly the content and speaker\ninformation the LM needs, so it can afford the low frame rate. Short token stream, fast decode,\nintact voice — that's the trade the tokenizer is engineered to win.\n\n## Say what, and how\n\nControllability in most TTS models means \"pick a preset voice.\" Qwen splits it into three\nindependent knobs. A **free-style natural-language instruction** sets role, emotion, speaking style,\nrate, timbre, and accent for the whole utterance. **86 fine-grained inline tags** drop non-verbal\nevents — laughter, breathing, coughing, sighing — at the word level, inside the text. And the text\nitself is the content. Switch the instruction and the delivery re-colors without a word changing:\n\n<InlineControl />\n\nOn top of that: **16 languages and 20 Chinese dialect regions** (seven languages new this version),\n**one-pass long-form synthesis up to three minutes**, a reproducible speaker fine-tuning protocol,\nand vocoder **super-resolution to 48 kHz**. 
It also handles hard text-normalization cases and\ndegraded reference speech without a separate cleanup stage.\n\n## The receipts\n\nThe project page leads with two radar charts across the **CV3-Eval** multilingual set — one for\ncontent consistency (word-error rate) and one for speaker similarity — against MiniMax-Speech,\nElevenLabs v3, VoxCPM2, DotsTTS, and the Qwen3-TTS base. The shape tells the story:\nQwen-Audio-3.0-TTS holds a large, even envelope across all ~20 language axes, where the lighter\nbaselines collapse on the long-tail languages.\n\n<Figure\n  src=\"/articles/qwen-audio-3-tts/fig1.png\"\n  alt=\"Radar chart of content-consistency word-error-rate across roughly twenty CV3-Eval languages, comparing MiniMax-Speech, ElevenLabs v3, DotsTTS, VoxCPM2, the Qwen3-TTS base, and Qwen-Audio-3.0-TTS, with Qwen-Audio-3.0-TTS forming a large even envelope.\"\n  caption=\"Content consistency (WER, lower is better — outer ring is better) across CV3-Eval languages; Qwen-Audio-3.0-TTS stays strong on the long-tail languages where lighter models fall in (project page).\"\n/>\n\n<Figure\n  src=\"/articles/qwen-audio-3-tts/fig2.png\"\n  alt=\"Radar chart of speaker similarity across roughly twenty CV3-Eval languages for the same set of models, with Qwen-Audio-3.0-TTS maintaining high similarity across the board.\"\n  caption=\"Speaker similarity across the same languages — how faithfully a zero-shot clone matches the reference voice (project page).\"\n/>\n\nThe paper reports state-of-the-art results across SEED-TTS-Eval, CV3-Eval, instruction-following,\nlong-form, and acoustic-robustness suites; the leaderboard #1 is the headline. 
(Exact WER/SIM\nfigures live in those radar charts rather than a table on the page — the shape is the claim.)\n\n## The training, briefly\n\nThe two-model split has a matching two-track training recipe — **five progressive stages**: the LM\nand flow-matching decoder are **pretrained independently**, then **jointly trained** with a\nhigh-quality-data annealing phase, then the LM gets a **reinforcement-learning** pass, and the\ndecoder gets its own **robustness** stage and then its own **RL** stage. The robustness stage is\nwhat lets the flow-matching decoder cope with degraded prompts; the separate RL passes are what\nsharpen intelligibility and speaker fidelity without the two objectives fighting.\n\n## The take\n\nQwen-Audio-3.0-TTS isn't a new paradigm — it's the LM-plus-flow-matching recipe executed with taste.\nThe 12.5 Hz supervised tokenizer keeps it fast, the content/voice split keeps it controllable, the\ninline tags and instructions make it expressive, and the multilingual coverage is broad and even\nrather than English-plus-a-long-tail. The interesting lesson is how much of \"good TTS\" is now about\nthe surfaces you expose — frame rate, instruction grammar, tag vocabulary — rather than the core\ngenerative trick, which the field has largely settled.\n\n---\n\n*Source: the [Qwen-Audio-3.0-TTS project page](https://funaudiollm.github.io/qwen-audio-3.0-tts/)\n(Alibaba / FunAudioLLM). 
The radar figures are theirs; the interactive diagrams are mine.*\n","readingTimeMins":4,"url":"https://ai.thesatyajit.com/articles/qwen-audio-3-tts","signal":{"interest":3,"helpful":3,"score":6,"level":2,"label":"Solid"}},{"title":"The harness effect: orchestration, not the model, sets your agent token bill","description":"A controlled swap that holds the model constant and changes only the orchestration layer — the harness — cuts blended cost per task 41%, wall-clock 44%, and tokens 38% at quality parity, and shows efficiency is model-invariant while quality gains scale almost perfectly with baseline strength (r = 0.99).","date":"2026-07-18","tags":["agents","orchestration","token-economics","llm","explainer"],"draft":false,"cover":"/articles/harness-effect/fig1.png","featured":false,"interest":4,"helpful":4,"kind":"articles","slug":"harness-effect","body":"The line I keep hearing is that per-token prices fall every quarter, so agent costs will\nsort themselves out. The invoices say otherwise. This paper — [*The Harness Effect*](https://arxiv.org/abs/2607.06906)\nfrom a Writer AI team (Muayad Sayed Ali et al., corresponding author Waseem AlShikh) — runs\nthe clean experiment I wanted someone to run: hold the model fixed, change only the\norchestration layer around it, and measure the bill. The result is that the orchestration\nlayer — the *harness* — moves cost per task more than switching between the cheapest and\nmost expensive model on the menu does.\n\nOne caveat up front, because it is load-bearing: the harness under test is Writer's own, and\nWriter ran the evaluation. I read the numbers as directional, not as a neutral benchmark. What\nmakes it worth reading anyway is the mechanism — the paper formalizes *why* orchestration sets\ntoken economics, and the formalization is provider-agnostic.\n\n## Token maxing\n\nThe paper names the failure mode first. 
**Token maxing** is buying capability with tokens:\nlonger reasoning traces, more agent turns, wider tool payloads, larger replayed contexts — so\nthat tokens per task grow faster than task value. Falling per-token prices mask the pattern\nwithout fixing it. Total spend rises anyway.\n\nThe bill for one agentic task is a sum over its `k` turns:\n\n$$\nC = \\sum_{i=1}^{k}\\left(p_{\\text{in}}\\,T^{\\text{in}}_{i} + p_{\\text{out}}\\,T^{\\text{out}}_{i}\\right)\n$$\n\nwhere $p_{\\text{in}}, p_{\\text{out}}$ are the input/output prices per token and $T^{\\text{in}}_i, T^{\\text{out}}_i$\nthe tokens at turn $i$. The input side is the part the orchestration layer builds:\n\n$$\nT^{\\text{in}}_{i} = \\underbrace{S_i}_{\\text{system}} + \\underbrace{H_i}_{\\text{history}} + \\underbrace{G_i}_{\\text{tool schemas}} + \\underbrace{R_i}_{\\text{retrieval}} + \\underbrace{U_i}_{\\text{user turn}}\n$$\n\nHere is the trap. If every turn replays the full transcript, the history term $H_i$ grows with\n$i$, so the cumulative input over a task grows as $O(k^2)$. A harness that compacts and caches\nhistory keeps it near $O(k)$. The gap between those two curves is spend that buys no quality —\nand because the per-token price is falling the whole time, the total keeps climbing quietly.\nDrag the horizon and watch it happen:\n\n<TokenMaxing />\n\n<Figure\n  src=\"/articles/harness-effect/fig3.png\"\n  alt=\"Line chart of cumulative input tokens against agent turns k. A dark 'naive replay' curve grows quadratically as O(k squared); a blue 'harness-managed context' curve grows linearly as O(k). The shaded region between them is labelled the token maxing region.\"\n  caption=\"Where token maxing comes from: full-history replay grows as O(k²), harness-managed context as O(k); the shaded gap is spend that buys no quality (Sayed Ali et al., 2026, Figure 1).\"\n/>\n\n## The bill, and the one price that actually moves\n\nThe lever the harness pulls hardest is **prompt caching**. 
Providers serve a previously seen\nprompt prefix from cache at roughly a tenth of the base input rate. If a fraction $h$ of input\ntokens are cache reads billed at multiplier $\\kappa$, the effective input price is\n\n$$\np^{\\text{eff}}_{\\text{in}} = p_{\\text{in}}\\left(1 - h\\,(1-\\kappa)\\right), \\qquad \\kappa \\approx 0.1\n$$\n\nso a harness that keeps $h$ near 1 pays about a tenth of list price on the dominant input term.\nThe point the paper makes well: $h$ is not a model property and not a provider favor. It is a\nfunction of how byte-stable your prompt prefix is across turns — which is set entirely by the\norchestration layer. On an identical-prefix call the harness served **99.9% of prompt tokens as\ncache reads** (7,876 of 7,886). That is the whole game: shape the prompt so the expensive term\nis almost always a cache hit.\n\n## The controlled swap\n\nThe experiment is deliberately boring, which is why it is convincing. Twenty-two locked\nevaluation tasks. Six foundation models — Claude Sonnet 4.6, Gemini 3.1, Gemini Flash 3.5,\nQwen 3.6, GLM 5.1, Palmyra X6. Each model runs the tasks twice: once under a frozen conventional\nproduction loop, once under the Writer Agent Harness. Nothing else changes — same tasks, same\njudges, same price table. Only the orchestration layer swaps. Flip it:\n\n<ControlledSwap />\n\nBlended across all six models and 22 tasks, replacing the loop with the harness cuts cost per\ntask 41% (`$0.21` → `$0.12`), median wall-clock 44% (48s → 27s), and tokens per task 38%\n(14.2k → 8.8k) — with task-completion quality at parity (0.78 → 0.81, directional at this\nsample size).\n\n<Figure\n  src=\"/articles/harness-effect/fig1.png\"\n  alt=\"Three grouped bar charts comparing a baseline production loop against the Writer harness on cost per task, wall-clock per task, and tokens per task. 
Cost falls from $0.21 to $0.12 (minus 41%), wall-clock from 48s to 27s (minus 44%), tokens from 14.2k to 8.8k (minus 38%).\"\n  caption=\"Blended efficiency across six models and 22 tasks, models held constant: cost per task −41%, median wall-clock −44%, tokens per task −38% (Sayed Ali et al., 2026, Figure 3).\"\n/>\n\nTwo derived numbers make the parity concrete. Quality per dollar rises 82%. And throughput —\ntask-completions per million tokens — rises by roughly two-thirds, from 54.9 to 92.0:\n\n<BenchBars\n  title=\"task-completions per million tokens\"\n  unit=\"\"\n  bars={[\n    { label: \"Writer harness\", value: 92.0, highlight: true },\n    { label: \"production loop\", value: 54.9 },\n  ]}\n/>\n\n## Everyone gets cheaper\n\nThe efficiency win is not a quirk of one model. Under the swap, **every** model's cost and\nlatency fall — cost by 33% to 61%, latency by 33% to 55%. The effect is a property of the\norchestration layer, not of any model.\n\n<Figure\n  src=\"/articles/harness-effect/fig4.png\"\n  alt=\"Two grouped bar charts, one for cost per task and one for median wall-clock, each with six model pairs (Sonnet 4.6, Gemini 3.1, Flash 3.5, Qwen 3.6, GLM 5.1, Palmyra X6). Every model's harness bar is shorter than its baseline bar, with cost reductions labelled from minus 32 percent to minus 61 percent.\"\n  caption=\"Per-model efficiency under the orchestration swap — every model gets cheaper and faster; the effect belongs to the harness, not the model (Sayed Ali et al., 2026, Figure 4).\"\n/>\n\n<BenchBars\n  title=\"cost cut from the harness, per model (%)\"\n  unit=\"%\"\n  bars={[\n    { label: \"Flash 3.5\", value: 61, highlight: true },\n    { label: \"Palmyra X6\", value: 52 },\n    { label: \"GLM 5.1\", value: 48 },\n    { label: \"Qwen 3.6\", value: 44 },\n    { label: \"Sonnet 4.6\", value: 38 },\n    { label: \"Gemini 3.1\", value: 32 },\n  ]}\n/>\n\n## Harness leverage\n\nNow the finding that made me want to write this up. 
Efficiency is model-invariant, but the\n**quality** gain from the same orchestration upgrade is not — it scales almost perfectly with a\nmodel's baseline strength. Plot each model's mean quality gain against its baseline capability\nand the points fall on a line: $r = 0.99$ over $n = 6$. The paper calls it **harness leverage**.\nStronger models extract more from the same harness. Scrub the models:\n\n<HarnessLeverage />\n\n<Figure\n  src=\"/articles/harness-effect/fig2.png\"\n  alt=\"Scatter plot of quality gain from the harness on the y-axis against mean baseline capability on the x-axis, for six models. The points rise almost linearly along a dashed fit line: Qwen 3.6 is slightly negative, Flash 3.5 near zero, GLM 5.1 and Gemini 3.1 positive, Sonnet 4.6 and Palmyra X6 highest at plus 0.073 and plus 0.079.\"\n  caption=\"Harness leverage: mean quality gain vs baseline strength. Stronger models gain more from the same orchestration upgrade — r = 0.99, n = 6 (Sayed Ali et al., 2026, Figure 6).\"\n/>\n\nThe honest edge of this: across 48 capability×model cells, 30 improve, 11 are flat, and **7\nregress — all of them in the three smaller models**, concentrated in orchestration-heavy\ncapabilities (tool use over MCP, playbooks, presentations). Qwen 3.6 comes out net negative on\nquality (−0.031). It is still 44% cheaper. So the harness is a strict efficiency win everywhere,\nand a quality win that grows with the model you point it at.\n\n## Six mechanisms behind the effect\n\nThe paper decomposes the harness into six mechanism families. None of them is exotic — they are\nthe unglamorous orchestration glue, which is exactly why they are easy to leave on the table:\n\n1. **Cache-shape discipline — the two-zone prompt.** A byte-stable prefix (tool-schema catalog,\n   stable system prompt, append-only transcript) carries the provider's cache breakpoints;\n   everything volatile is confined to a tail that is rebuilt each turn and structurally excluded\n   from caching. 
This is what pushes $h$ toward 1 in the effective-price equation.\n2. **Structured, incremental, cache-aware compaction.** Shrink history without breaking the\n   cache prefix — compact the middle, keep the front byte-identical.\n3. **Context offload.** Tool outputs land in a store the model can reference, not in the prompt.\n   Tokens the model never pays to re-read.\n4. **Zero-token waiting; durability as economics.** Durable execution so a pause, retry, or\n   long-running tool call does not replay the whole context to resume.\n5. **Failure-spend governance.** Cap what a failing or looping run can burn before it is stopped.\n   Most runaway bills are failures, not successes.\n6. **A model-agnostic floor.** The five above set an efficiency floor under *any* model — which\n   is what makes the savings a property of the layer, not the checkpoint.\n\n## How other harnesses compare\n\nThe paper also scores six widely used agent systems on the same axes — vendor-integrated\nclients, orchestration libraries, multi-agent conversation frameworks, and open personal\nharnesses — from public documentation rather than head-to-head runs. The pattern is that most\nframeworks implement some mechanisms and leave the rest \"to the application to build and budget.\"\nCache-shape discipline and failure-spend governance are the two most often missing, and they are\ntwo of the biggest levers. Treat that table as a design-time source study, not a measurement.\n\n## What it is worth at fleet scale\n\nThe reason this matters past a single task: the per-task delta multiplies by volume, and by every\nmodel you run. Apply the blended cost gap to monthly task volume and at **one million agent tasks\nper month the harness is worth about `$90k`/month over the baseline — `$1.08M`/year** — and the\ngap widens linearly with volume. An organization does not run one model; it runs a fleet, present\nand future. 
The harness is the one component whose efficiency multiplies across all of them.\n\n<Callout type=\"warn\">\n**Read the caveats.** (1) The sample is small — **22 tasks, 6 models**. The quality deltas are\ndirectional at this size; the paper says so and calls its statistical posture \"suggestive.\"\n(2) It is the **vendor's own harness, evaluated by the vendor** (Writer), against a \"frozen\nconventional production loop\" the vendor defined — a reasonable baseline, but not a neutral one.\n(3) It is a **single workload**. The mechanisms generalize in principle; the exact 41% / 44% /\n38% numbers are this task set, these price tables, these six models. (4) The `$0.21` → `$0.12`\nand fleet-scale figures ride on current provider cache pricing ($\\kappa \\approx 0.1$); change the\nprice table and the arithmetic moves.\n</Callout>\n\n## The take\n\nStrip the framing and the useful claim is narrow and testable: for agentic workloads, the\norchestration layer is a first-class cost object, and most of the cost is in prompt shape, not\nmodel choice. The effective-input-price equation is the part I will actually use — it says the\nexpensive input term is a cache hit if and only if your prompt prefix is byte-stable across\nturns, and that is an engineering property you control. Efficiency came out model-invariant\n(every model 33–61% cheaper); quality came out capability-dependent (r = 0.99 with baseline\nstrength). I would want an independent harness and a second workload before trusting the exact\npercentages. But the direction matches what I see in production: the token bill is set less by\nwhich model you picked and more by how you assemble the context you hand it.\n\n---\n\n*Source: \"The Harness Effect: How Orchestration Design Sets the Token Economics of Enterprise\nAgentic AI\" (Muayad Sayed Ali et al., Writer AI, 2026) —\n[arXiv 2607.06906](https://arxiv.org/abs/2607.06906). Figures 1, 3, 4, and 6 are reproduced from\nthe paper for commentary. 
Benchmark numbers are quoted as reported; the interactive diagrams\nillustrate the mechanisms and use the paper's headline values.*\n","readingTimeMins":10,"url":"https://ai.thesatyajit.com/articles/harness-effect","signal":{"interest":4,"helpful":4,"score":8,"level":4,"label":"High"}},{"title":"Diffusing blame: credit assignment under Dale's principle","description":"A biologically plausible network splits every layer into separate excitatory and inhibitory streams — obeying Dale's principle, which real neurons never break — and learns by diffusing the output error straight to every hidden unit instead of transporting transposed weights back through the stack; this is a first-principles walk through why backprop can't run in a brain, how Error Diffusion with modulo routing gets around it, and the 96.7% MNIST / 61.7% CIFAR-10 numbers that follow.","date":"2026-07-17","tags":["deep-learning","credit-assignment","biologically-plausible","backpropagation","theory","explainer"],"draft":false,"cover":"/articles/diffusing-blame/fig1.png","featured":false,"interest":4,"helpful":2,"kind":"articles","slug":"diffusing-blame","body":"**Diffusing Blame** asks a sharp question: can a network learn useful representations while obeying the one rule real brains never break? That rule is **Dale's principle** — a neuron is either excitatory or inhibitory, and *every* synapse it makes carries that one sign. Backpropagation quietly violates a stack of constraints like this. The paper builds a network that respects them, trains it with a rule called **Error Diffusion**, and asks how far it gets. The answer: **96.7% on MNIST**, a **61.7% baseline on CIFAR-10**, and reinforcement-learning agents that hold their own against a backprop-free baseline — all while enforcing Dale's principle strictly. 
This is a walk through why that is hard, what the rule actually does, and the numbers, honestly labelled.\n\n<Callout type=\"note\">\nEverything below is from Yamada, Grillotti, Charakorn, Risi, Ha and Lange, *Diffusing Blame: Task-Dependent Credit Assignment in Biologically Plausible Dual-Stream Networks* (2026). Every accuracy, return, and ablation delta is **author-reported** on their own runs; I have not reproduced them. The two interactive widgets are my illustrations of the mechanism, not measured traces.\n</Callout>\n\n## Why backprop can't run in a brain\n\nBackpropagation is the reason deep nets learn, and it is also the reason nobody thinks the brain runs it. Look at the backward pass. To update layer $\\ell$, backprop needs the error signal $\\delta_\\ell$, and it computes it by pulling the layer above's error back through the transposed forward weights:\n\n$$\n\\delta_\\ell = \\big(W_{\\ell+1}^{\\top}\\,\\delta_{\\ell+1}\\big) \\odot \\phi'(z_\\ell).\n$$\n\nRead that literally as a circuit and three problems fall out.\n\n- **Weight transport.** The backward pass uses $W_{\\ell+1}^{\\top}$ — the *same* forward weights, transposed. A synapse would have to read the value of the forward synapse it feeds and reuse it, exactly, on the way back. There is no known biological mechanism for a synapse to know its partner's weight.\n- **Sign symmetry.** Because it is the same weight, the feedback carries the same *sign* as the forward connection. Forward and backward paths are locked together.\n- **A separate error channel.** The $\\delta$'s have to travel back through a network that is distinct from the forward one, layer by layer, without disturbing the forward activations.\n\n**Dale's principle** makes all of this worse. In cortex a neuron's outgoing synapses are uniformly excitatory or uniformly inhibitory — the sign belongs to the neuron, not the synapse. 
A weight in a standard net has no such loyalty: one unit can push one target up and pull another down in the same step. Toggle between the two regimes and click a source neuron to see its fan-out:\n\n<DaleNetwork />\n\nEnforcing Dale's principle means the sign is frozen per neuron, so a net has to split into separate excitatory (E) and inhibitory (I) populations and coordinate them. That changes credit assignment at the root: you can no longer flip a weight's sign to fix an error, and the tidy $W^{\\top}$ feedback path is off the table anyway. Prior biologically plausible rules — feedback alignment and its kin — dodge weight transport by sending the error back through a *fixed random* matrix $B$ instead of $W^{\\top}$. That helps, but it trades one implausible object (the transpose) for another (a dedicated random feedback matrix), and historically these rules stall out past MNIST. The paper wants neither the transpose nor the random matrix.\n\n## Dale's principle, in the weights\n\nThe architecture is dual-stream. Each layer carries a positive activation vector $\\mathbf{p}$ and a negative one $\\mathbf{n}$, and there are **four** weight matrices between consecutive layers — the within-stream pair $W_{pp}, W_{nn}$ and the cross-stream pair $W_{np}, W_{pn}$:\n\n$$\n\\mathbf{p}_i = \\phi_i\\!\\big(\\mathbf{p}_{i-1} W_{pp} - \\mathbf{n}_{i-1} W_{np} + \\mathbf{b}_p\\big), \\qquad\n\\mathbf{n}_i = \\phi_i\\!\\big(\\mathbf{n}_{i-1} W_{nn} - \\mathbf{p}_{i-1} W_{pn} + \\mathbf{b}_n\\big).\n$$\n\nThe trick is in the signs. **Every learnable weight is constrained non-negative**, $W_{\\bullet\\bullet} \\ge 0$ element-wise, and the minus signs in front of the cross-stream terms are *hardcoded into the wiring*, not learned. So the E stream always excites and the I stream always inhibits, structurally — Dale's principle holds by construction, and gradient descent can never sneak a sign flip past it. 
A readout that needs a signed output just subtracts the streams: $\\hat{y} = y^{+} - y^{-}$.\n\nThis is a real cost. A standard dense layer is one matrix; this is four non-negative matrices with a fixed sign pattern, and the optimizer has to move the whole coordinated E/I system in lockstep. The question the paper answers is whether a learning rule can drive that system without any of backprop's illegal moves.\n\n## Diffusing the error instead of transporting it\n\nError Diffusion's answer: don't route the error *back through the layers* at all. Route it *directly to every layer*. Take the output error $S$ (shape $B \\times C$ for a batch of $B$ over $C$ classes) and broadcast it to the hidden units through a fixed routing matrix $M$, then form each layer's local update from presynaptic activity and the postsynaptic nonlinearity's derivative:\n\n$$\nR = S\\,M^{\\top}, \\qquad U_p = \\phi'(Z_p) \\odot R, \\qquad \\Delta W_{pp} \\propto A_p^{\\top}\\,U_p.\n$$\n\n$R$ is the routed error drive, $U_p$ scales it by the local activation slope $\\phi'$, and the weight change is an outer product of presynaptic activations $A_p$ with $U_p$. No $W^{\\top}$ appears anywhere. No random feedback matrix appears either — the routing $M$ is a fixed, structured broadcast, not a learned or random projection.\n\nThe original Error Diffusion was defined for binary classification. To go past that, the paper adds **modulo error routing**: hidden unit $i$ is assigned to output channel\n\n$$\nr(i) = i \\bmod C,\n$$\n\nand learns from that channel's error $s_{r(i)}$. It is coarse — several hidden units share a channel, and unit $C$ wraps back to channel 0 — but it is deterministic, transport-free, and enough to spread class-specific blame across a wide hidden layer. Step through the forward pass, the diffusion, and the update, and flip between backprop and Error Diffusion to see the paths diverge:\n\n<ErrorDiffusion />\n\nThe contrast is the whole point. 
Backprop's blame crawls back one layer at a time, each hop paying the $W^{\\top}$ transport tax. Error Diffusion drops the error onto every hidden unit at once and lets each layer compute a local update. That is what makes it plausible — and also what makes it approximate, since a modulo-routed broadcast is a much blunter credit signal than the exact gradient.\n\n<Figure\n  src=\"/articles/diffusing-blame/fig1.png\"\n  alt=\"Three-panel overview. Left: the dual-stream excitatory/inhibitory architecture, with separate positive and negative streams and four non-negative weight matrices per layer, enforcing Dale's principle structurally. Center: the Error Diffusion update broadcasting the output error directly to all hidden layers, without transposed weights or random feedback matrices. Right: the shared architecture applied to classification, with layer-specific sigmoid widths, batch-centered class error, and asymmetric initialization, and to reinforcement learning via PPO integration.\"\n  caption=\"The dual-stream Error Diffusion framework: structural E/I streams (left), direct error broadcast without weight transport (center), and the shared backbone specialized to classification and RL (right) (Yamada et al., 2026, Figure 1).\"\n/>\n\n## Three fixes that turn it into a learner\n\nError Diffusion out of the box does not learn much — the seed configuration lands at **50.4% on MNIST** and **11.6% on CIFAR-10** (barely above chance on ten classes). Three domain-specific fixes close most of the gap.\n\n**Layer-specific sigmoid widths.** The activation is a temperature-controlled sigmoid,\n\n$$\n\\phi_i(z) = \\frac{1}{1 + e^{-2z/\\alpha_i}},\n$$\n\nwith a per-layer width $\\alpha_i$. Why it matters: the update is scaled by $\\phi'$, and a standard sigmoid's derivative is tiny once units saturate. The paper measures a **25x attenuation** of the surrogate gradient from the output down to the first hidden layer, so the early layers barely move. 
Widening the sigmoid (larger $\\alpha$) keeps the derivative alive deeper in the stack. Their CIFAR-10 setup uses $\\alpha = 3.0$ for convolutional layers and $\\alpha = 6.0$ for fully connected ones; MNIST uses $\\alpha = 6.0$ throughout.\n\n**Batch-centered class error.** Instead of feeding raw one-vs-all errors, the class error is centered across the batch,\n\n$$\n\\tilde{E}_{b,c} = E_{b,c} - \\frac{1}{B}\\sum_{b'} E_{b',c},\n$$\n\nso every class's error signal is zero-mean over the mini-batch. This removes a constant per-class bias that would otherwise push all units in a channel the same way regardless of the input.\n\n**Asymmetric E/I initialization.** The excitatory matrices $W_{pp}, W_{nn}$ are scaled up by $1.5\\times$ at init and the inhibitory $W_{np}, W_{pn}$ scaled down by $0.5\\times$, a starting excitation-to-inhibition ratio of roughly **3:1**. That gives the network net-positive drive to begin with, and the paper shows the ratio relaxes toward a biological-like balance as training proceeds.\n\n## What it scores\n\nOn the standard benchmarks, the constrained network learns — not to backprop's level, but well past chance, and well past unconstrained biologically plausible baselines that stall on MNIST. 
Direct Feedback Alignment (DFA), the backprop-free baseline that still uses a random feedback matrix, sits a few points ahead as the reference ceiling:\n\n<BenchBars\n  title=\"MNIST test accuracy (%) — author-reported\"\n  unit=\"%\"\n  bars={[\n    { label: \"Error Diffusion (ours)\", value: 96.7, highlight: true },\n    { label: \"DFA (baseline)\", value: 97.6 },\n    { label: \"seed ED (no fixes)\", value: 50.4 },\n  ]}\n/>\n\n<BenchBars\n  title=\"CIFAR-10 test accuracy (%) — author-reported\"\n  unit=\"%\"\n  bars={[\n    { label: \"Error Diffusion (ours)\", value: 61.7, highlight: true },\n    { label: \"DFA (baseline)\", value: 69.1 },\n    { label: \"seed ED (no fixes)\", value: 11.6 },\n  ]}\n/>\n\n<Figure\n  src=\"/articles/diffusing-blame/fig2.png\"\n  alt=\"Two bar-chart panels. Left, MNIST: classification accuracy across six configuration variants, all clustered high near 96 to 97 percent except the seed variant, which collapses when layer-specific widths are removed. Right, CIFAR-10: accuracy across the same six variants, with the batch-centered class error variant collapsing when that component is removed. Error bars show plus or minus one standard deviation over five seeds.\"\n  caption=\"Accuracy across six ablation variants on MNIST (left) and CIFAR-10 (right), ±1 std over 5 seeds. The importance hierarchy flips between tasks (Yamada et al., 2026, Figure 2).\"\n/>\n\nThe most interesting result is not the headline accuracy — it is what the ablations reveal. Remove each fix and measure the accuracy drop, and the ranking **reverses between the two datasets**:\n\n| removed component | MNIST Δ (pp) | CIFAR-10 Δ (pp) |\n|---|---|---|\n| layer-specific sigmoid widths | **−71.4** | −15.1 |\n| batch-centered class error | −0.3 | **−47.9** |\n| asymmetric initialization | +0.0 | −5.5 |\n\nOn MNIST the whole model lives or dies by the sigmoid widths — pull them and accuracy craters by 71 points, while the batch-centering does almost nothing. 
On CIFAR-10 it is the exact opposite: batch-centered class error is load-bearing (−47.9), and the widths matter far less. Same architecture, same rule — but the credit-assignment bottleneck is *task-dependent*, and a single-benchmark evaluation would have hidden that entirely. That is the paper's sharpest point: which fix carries the model is a property of the task, not the method.\n\n## Into RL: ED-PPO\n\nClassification is the easy setting — the error signal is a clean label. To test the rule where credit assignment is genuinely hard, the paper drops Error Diffusion into **PPO**, replacing the backprop gradient through the hidden layers of both the policy and value networks (the PPO objective still supplies the output-level error). Policy errors route to hidden units by action channel; value errors broadcast to all units. On Brax continuous control, ED-PPO is competitive with DFA and, on HalfCheetah, clears backprop:\n\n<BenchBars\n  title=\"Brax HalfCheetah — episode return, higher is better (author-reported)\"\n  unit=\"\"\n  bars={[\n    { label: \"ED-PPO (ours)\", value: 5494, highlight: true },\n    { label: \"DFA-PPO\", value: 5581 },\n    { label: \"BP-PPO\", value: 3520 },\n  ]}\n/>\n\nThat HalfCheetah result — ED-PPO at 5494±691, essentially matching DFA-PPO's 5581±359 and beating backprop's 3520±485 — is the strongest single number in the paper, but it does not generalize cleanly. On Humanoid, ED-PPO (6670±2592) trails backprop (8478); on the open-ended exploration task **Craftax**, ED-PPO edges out DFA-PPO (19.8±1.5 return) but sits below BP-PPO. The honest read is \"competitive with the backprop-free baseline, still short of backprop on the hardest tasks\" — which is exactly what the abstract claims, and worth stating plainly rather than cherry-picking HalfCheetah.\n\n<Callout type=\"warn\">\nKeep the scale in view. These are small networks on MNIST, CIFAR-10, Brax and Craftax — not a scaling result, and not close to state of the art. 
Error Diffusion trails on both classification tasks here (96.7% vs DFA's 97.6% on MNIST, 61.7% vs DFA's 69.1% on CIFAR-10, with standard backprop higher still), and backprop beats ED on the harder RL environments. The contribution is not a better optimizer; it is a demonstration that representation learning is *possible at all* under strict Dale's principle, without weight transport or random feedback matrices, plus the finding that the binding constraint shifts with the task. Read it as biology-motivated evidence, not a drop-in replacement for backprop.\n</Callout>\n\n## The take\n\nThe idea is clean and the framing is honest. Real neural circuits obey constraints backprop ignores: a synapse can't read its partner's weight (no weight transport), and a neuron can't flip signs synapse by synapse (Dale's principle). Build a network that respects both, and credit assignment stops looking like a transpose and starts looking like a broadcast: Error Diffusion drops the output error straight onto every hidden layer, routes it by a modulo rule, and updates each layer locally. Three fixes (wider per-layer sigmoids to fight a 25x gradient decay, batch-centered class error, and a 3:1 excitation-to-inhibition initialization) are what turn a 50%-on-MNIST seed into a 96.7% learner and a 61.7% CIFAR-10 baseline. None of that is state of the art, and the paper doesn't pretend otherwise. What it earns is a real claim: you can learn representations under the brain's actual wiring rules, the gap to the backprop-free DFA baseline is 0.9 points on MNIST and 7.4 on CIFAR-10 rather than a chasm, and, the part I'll remember, *which* trick matters most depends on the task, a bottleneck you only see if you test on more than one benchmark.\n\n---\n\n*Built on Y. Yamada, L. Grillotti, R. Charakorn, S. Risi, D. Ha and R. T. Lange, [Diffusing Blame: Task-Dependent Credit Assignment in Biologically Plausible Dual-Stream Networks](https://arxiv.org/abs/2606.31700) (arXiv 2606.31700, 2026). Figures 1 and 2 are reproduced from the paper for commentary. 
The `DaleNetwork` and `ErrorDiffusion` widgets are my own illustrations of the mechanism, not measured traces; all accuracies, returns, and ablation deltas are author-reported and I have not independently reproduced them.*\n","readingTimeMins":12,"url":"https://ai.thesatyajit.com/articles/diffusing-blame","signal":{"interest":4,"helpful":2,"score":6,"level":2,"label":"Solid"}},{"title":"Intern-S2: a 397B model that reads the raw page","description":"InternLM's Intern-S2-Preview-397B is a multimodal scientific foundation model that trades blows with the closed frontier on general tasks and beats it by multiples on specialized science — how its raw-page vision pretraining, dynamic tokenizer, and multi-domain RL get there, with the benchmarks.","date":"2026-07-17","tags":["llm","multimodal","scientific-ai","mixture-of-experts","explainer"],"draft":false,"cover":"/articles/intern-s2/fig1.png","featured":false,"interest":3,"helpful":3,"kind":"articles","slug":"intern-s2","body":"Intern-S2-Preview-397B, from InternLM (Shanghai AI Lab), is a 397-billion-parameter\nmultimodal foundation model built for one thing: science. Not \"science\" as a benchmark\ncategory bolted onto a general chatbot — science as the training objective, down to how\nthe model tokenizes a molecule and how it reads a figure off a paper.\n\nThe headline is a shape, not a single number. On general knowledge, math, and agentic\ncoding, Intern-S2 sits at **frontier parity** — a 397B model trading blows with much\nlarger closed systems, usually a hair behind. On specialized science — multi-omics,\nmolecular reasoning, material generation, protein-binder design — it **leads every\nfrontier model**, often by 4× or more. 
That gap is the whole story, and it's the payoff\nof three specific design choices.\n\n## The lineage\n\nIntern-S2 is the third step in a line, and each step is worth naming because Intern-S2\ninherits all of it:\n\n- **Intern-S1** — a 235B mixture-of-experts on a Qwen3 backbone plus a 6B InternViT\n  vision encoder, continuously pretrained on 5T tokens, more than half of it scientific.\n- **Intern-S1-Pro** — scaled to a trillion-parameter MoE with 512 experts and 8 active\n  per token, added Fourier Position Encoding (FoPE) and explicit time-series modelling.\n- **Intern-S2-Preview-397B** — the most capable of the family. (There is also a\n  lightweight **Intern-S2-Preview-35B**, continued-pretrained from Qwen3.5 with a\n  shared-weight multi-token-prediction head, a KL loss, and chain-of-thought compression.)\n\nThe MoE math from S1-Pro is the standard sparsity trade. With $k$ of $N$ experts firing\nper token,\n\n$$\n\\theta_{\\text{active}} \\;=\\; \\frac{k}{N}\\,\\theta_{\\text{expert}} \\;+\\; \\theta_{\\text{shared}},\n\\qquad k = 8,\\; N = 512,\n$$\n\nso a trillion-parameter model pays for only ~22B activated parameters per token. FoPE and\ntime-series modelling let it ingest sequences from $10^0$ to $10^6$ points — the kind of\nrange a seismograph or a mass spectrometer actually produces.\n\n## Three ideas that matter\n\nStrip away the scale and Intern-S2 is three ideas working together:\n\n1. A **vision-language pretraining paradigm** that learns directly from raw pages of\n   scientific literature — no OCR-and-parse preprocessing step in front of the model.\n2. A **dynamic tokenizer** that natively represents molecular formulas, protein\n   sequences, and seismic signals as meaningful units rather than subword debris.\n3. 
Large-scale **multi-task reinforcement learning** across more than 20 scientific\n   domains, trained jointly, which also happens to sharpen general reasoning.\n\nTake them in order.\n\n## Reading the raw page\n\nA conventional document pipeline flattens a page to a string before the model ever sees\nit: OCR recovers the words, a layout parser guesses the reading order, and the figures\nand equations are dropped on the floor. The text model then learns from a transcript that\nhas already thrown away the thing you care about — how *this* curve relates to *that*\ncaption and *that* variable.\n\nIntern-S2 skips the transcript. It \"learns directly from raw pages of scientific\nliterature, jointly modelling symbolic semantics and visual relationships in a shared\nrepresentation space without intermediate parsing.\" The vision encoder maps text, figures,\nand equations into one representation, and both a symbolic-semantics head and a\nvisual-relations head read off that same space.\n\n<VisionPretrain />\n\nThe consequence is that a plot and the sentence that references it are learnable as a\nsingle object, not two disconnected streams. For scientific literature — where the\nargument often *lives* in the figure — that is the difference between a model that reads\nthe paper and one that reads a description of the paper.\n\n## The dynamic tokenizer\n\nA tokenizer is a vocabulary learned on a corpus, and a standard BPE vocabulary is learned\non natural-language text. Hand it a SMILES string or a protein sequence and it splits\nwhere its merge statistics say to split — which has nothing to do with where the *meaning*\nis. The aromatic ring in aspirin gets smeared across three tokens; a run of amino acids\ngets merged into a chunk that no longer corresponds to any residue.\n\n<DynamicTokenizer />\n\nIntern-S2's dynamic tokenizer emits scientifically-meaningful units directly: an atom, a\nbond, a residue, a waveform sample each become a token the model can address. 
This is not\ncosmetic. If a residue's identity is spread across a token boundary, the model can't attend\nto that residue cleanly, because the representation is fighting the tokenizer. Native tokenization\nis what lets Intern-S2 treat a molecular formula, a protein, or a time series as a\nfirst-class input instead of a string that happens to look like one.\n\n## Multi-task RL across 20+ domains\n\nThe last piece is post-training. Intern-S2 runs large-scale reinforcement learning across\nmore than 20 scientific domains **jointly**, rather than fine-tuning a separate model per\ntask. On InternLM's account, training the domains together is also what lifts its\ngeneral-reasoning scores to near-frontier levels as a side effect: the same optimization that\nteaches it multi-omics and material chemistry also rewards careful, multi-step reasoning that\ntransfers.\n\nIt deploys on the usual high-throughput stacks (**LMDeploy, vLLM, and SGLang**) with a\n256K-token context for text reasoning and 64K tokens for multimodal input. It's genuinely\nstrong at generative science: biomolecular interaction design and material-structure\ngeneration, not just question answering.\n\n## The benchmarks\n\nHere is the shape, in one chart. Flip between the two task families and watch Intern-S2's\ndot move from *just behind* the best competitor to *far ahead* of it.\n\n<ScienceGap />\n\n### General tasks: frontier parity\n\nOn general benchmarks Intern-S2 rarely wins outright, but it rarely loses by much, which\nis the remarkable part for a 397B model standing next to the largest closed systems. 
It\nposts MMLU-Pro 89.75 (Gemini-3.1-Pro leads at 91.00), HMMT-2026 91.57 (GPT-5.5 at 97.06),\nMMMU-Pro 80.46, and SWE-Bench-Multilingual 81.67 — effectively tied with GLM-5.2's 82.00.\n\n<Figure\n  src=\"/articles/intern-s2/fig1.png\"\n  alt=\"Benchmark table of general tasks comparing Intern-S2 against Qwen3.5-397B-A17B, DeepSeek-V4-pro, Kimi-K2.7-Code, GLM-5.2, GPT-5.5, Gemini-3.1-Pro, and Claude-Opus-4.8 across MMLU-Pro, SimpleQA-Verified, AdvancedIF, HMMT-2026, MMMU-Pro, ChartQAPro, SkillsBench, TerminalBench, SWE-Bench-Pro, and SWE-Bench-Multilingual.\"\n  caption=\"General-task benchmarks: Intern-S2 at frontier parity with the largest closed and open models (Intern-S2-Preview-397B model card, 2026).\"\n/>\n\nOn factual recall it clearly clears the open field even where it trails the closed leader —\nSimpleQA-Verified is a good example:\n\n<BenchBars\n  title=\"SimpleQA-Verified (%)\"\n  unit=\"\"\n  bars={[\n    { label: \"Gemini-3.1-Pro\", value: 75.6 },\n    { label: \"Intern-S2\", value: 69.9, highlight: true },\n    { label: \"GPT-5.5\", value: 64.3 },\n    { label: \"Qwen3.5-397B\", value: 54.8 },\n    { label: \"DeepSeek-V4-pro\", value: 46.6 },\n  ]}\n/>\n\nThe honest read: it trails the very top closed models on the hardest coding and knowledge\nbenches — TerminalBench 2.1 67.42 vs Claude-Opus-4.8's 84.60, SWE-Bench-Pro 61.56 vs\n69.20. Parity, not conquest.\n\n### Scientific tasks: dominance\n\nNow the inversion. On specialized science the gaps stop being fractions of a point and\nstart being multiples. 
Biology-Instructions (multi-omics) is the clearest case: Intern-S2\nscores 56.92 where the next-best frontier model manages 13.87, and most models land between\n4 and 10.\n\n<Figure\n  src=\"/articles/intern-s2/fig2.png\"\n  alt=\"Benchmark table of scientific tasks comparing Intern-S2 against Qwen3.5-397B-A17B, DeepSeek-V4-pro, Kimi-K2.7-Code, GLM-5.2, GPT-5.5, Gemini-3.1-Pro, and Claude-Opus-4.8 across Biology-Instructions, Mol-Instructions, MolecularIQ, SciReasoner, TOMG-Bench, MP20, ProteinBinder-9, XLRS-Bench, MicroVQA, SFE, ObsCrisis-Bench, SciCode, and SGI-Bench, with Intern-S2 far ahead on most rows.\"\n  caption=\"Scientific-task benchmarks: Intern-S2 leads every frontier model on most rows, frequently by 4× or more (Intern-S2-Preview-397B model card, 2026).\"\n/>\n\n<BenchBars\n  title=\"Biology-Instructions · multi-omics (%)\"\n  unit=\"\"\n  bars={[\n    { label: \"Intern-S2\", value: 56.92, highlight: true },\n    { label: \"Gemini-3.1-Pro\", value: 13.87 },\n  ]}\n/>\n\nMaterial-structure generation tells the same story: MP20 67.88 against a next-best of\n16.75, with most models between 1.5 and 16. Molecular reasoning (Mol-Instructions 52.37 vs\nGPT-5.5's 40.49), remote sensing (XLRS-Bench 51.97), microscopy VQA (MicroVQA 68.81), and\nbiomolecular interaction design (ProteinBinder-9 4.36 vs a best competitor near 2.4) all\nland the same way. Roughly:\n\n$$\n\\frac{56.92}{13.87} \\approx 4.1\\times, \\qquad \\frac{67.88}{16.75} \\approx 4.05\\times.\n$$\n\nIt is not a clean sweep, and that's worth saying: on MolecularIQ, GPT-5.5 still leads\n(76.41 vs Intern-S2's 61.49). But across the science suite as a whole, a 397B model beats\nGPT-5.5, Gemini-3.1-Pro, and Claude-Opus-4.8 — the payoff of the raw-page pretraining, the\ndynamic tokenizer, and multi-domain RL compounding.\n\n<Callout type=\"warn\">\nThe caveats are real. This is a **Preview**, not a final release. 
On the hardest general\ncoding and knowledge benchmarks it still trails the top closed models. Several of the most\nlopsided scientific wins — MP20, ProteinBinder-9 — are **internal benchmarks**, so treat\nthe exact multiples as InternLM's own measurement until third parties reproduce them. And\nat 397B it is heavy to self-host: frontier-scale hardware, not a workstation.\n</Callout>\n\n## What I make of it\n\n- **The specialization is the product.** Most \"science\" models are general chatbots with\n  a domain fine-tune. Intern-S2 pushes science into the tokenizer and the pretraining\n  objective, and the benchmark gaps show the difference that makes — 4× is not a\n  prompt-engineering delta.\n- **Parity at 397B is the quiet achievement.** Matching Gemini-3.1-Pro and Opus-4.8 on\n  general tasks with a fraction of the (public) scale, while dominating science, is a\n  stronger statement than any single scientific score.\n- **Trust the shape, verify the numbers.** The parity-vs-dominance pattern is convincing\n  and mechanistically motivated. The internal-benchmark wins want independent replication\n  before I'd quote the exact multiples as settled — but even halved, the lead holds.\n\n---\n\n*Sources: the [Intern-S2-Preview-397B model card](https://huggingface.co/internlm/Intern-S2-Preview-397B)\nand the [Intern-S1 project](https://github.com/InternLM/Intern-S1) (InternLM / Shanghai AI\nLab). Benchmark numbers are quoted as reported on the model card; several scientific\nbenchmarks are internal.*\n","readingTimeMins":8,"url":"https://ai.thesatyajit.com/articles/intern-s2","signal":{"interest":3,"helpful":3,"score":6,"level":2,"label":"Solid"}},{"title":"Kimi K3: a 2.8T open model that turns compute into intelligence 2.5× better","description":"Moonshot's Kimi K3 is a 2.8T-parameter, ~50B-active open MoE with a 1M-token context. 
A first-principles walk through what actually makes it new — Kimi Delta Attention, Attention Residuals, and Stable LatentMoE routing 16 of 896 experts with quantile balancing — the ~2.5× scaling-efficiency gain over K2, what it would take to train, and where it lands against the frontier.","date":"2026-07-17","tags":["llm","mixture-of-experts","linear-attention","kimi","scaling","explainer"],"draft":false,"cover":"/articles/kimi-k3/fig1.png","featured":true,"interest":5,"helpful":4,"kind":"articles","slug":"kimi-k3","body":"Moonshot's [Kimi K3](https://www.kimi.com/blog/kimi-k3) is the largest open model anyone has shipped:\n**2.8 trillion** parameters, about **50B active** per token, a **1-million-token** context, multimodal, released\n2026-07-16 as open-weight. On Moonshot's own suite it beats every other model it was tested against on coding and\nagentic work, and trails only the two strongest proprietary systems — Claude Fable 5 and GPT-5.6 Sol. It is positioned,\nnot unfairly, as Opus-4.8-class capability at Sonnet-5-class pricing.\n\nThe interesting part is not the parameter count. It is *how the parameters are spent*. K3 is built on two attention\nchanges — **Kimi Delta Attention (KDA)** and **Attention Residuals (AttnRes)** — that rework how information flows across\nsequence length and across depth, and it scales up MoE sparsity hard: it activates **16 of 896 experts** per token,\ninside a **Stable LatentMoE** framework. Together with refined training and data recipes, those structural changes yield\nroughly a **2.5× improvement in overall scaling efficiency** over K2 — the model converts compute into intelligence more\neffectively. 
This piece is a first-principles tour of each change, why it is new, and what a model like this actually\ncosts to build.\n\n<K3Architecture />\n\nRead the block bottom-to-top: the hidden state passes through the attention sublayer (Gated MLA + KDA), Attention\nResiduals reach back to earlier depths, and Stable LatentMoE routes the token to 16 of 896 experts before the block\nemits its output. Four ideas, each doing a specific job. Take them one at a time.\n\n## Kimi Delta Attention: constant-size memory over a million tokens\n\nOrdinary softmax attention keeps a **KV cache** that grows by one entry per token. At a 1M-token context that cache is\nthe whole ballgame: decoding is [memory-bound on a cache that scales with sequence length](/articles/how-llm-inference-works),\nand it only gets heavier as the context fills.\n\nKDA is a **gated delta-rule linear attention**. Instead of a growing cache it keeps a **fixed-size recurrent state**\n$S_t$ that each token updates in place: it *erases* a little of the old state (a gated decay) and *writes* the new\nkey/value association (the delta rule). A compact way to write the family is\n\n$$\nS_t = g_t \\odot S_{t-1} + \\beta_t\\, k_t v_t^{\\top}, \\qquad o_t = S_t^{\\top} q_t\n$$\n\nwhere $g_t$ is the per-channel gated decay (the erase), $\\beta_t\\, k_t v_t^{\\top}$ is the written delta, and $o_t$ reads\nthe state with the query $q_t$; the transpose matches the $k_t v_t^{\\top}$ write, so each stored value is weighted by how\nwell its key overlaps the query. The state $S_t$ is a fixed $d \\times d$ matrix — its size does not depend on how many\ntokens came before. Scrub the recurrence and watch the state stay constant-size while a softmax cache piles up:\n\n<KimiDeltaAttention />\n\nThat constant-size state is what makes a genuine 1M context tractable, and it is why Moonshot reports **up to 6.3× faster\ndecoding** in million-token contexts. It is not free — a linear-attention state is a lossy summary, not a perfect record,\nso K3 interleaves KDA with full-attention layers (via Gated MLA) to keep exact recall where it matters.\n
KDA also breaks\nthe assumptions of conventional prefix caching, so Moonshot contributed a KDA implementation to the vLLM community to make\nserving practical.\n\n## Attention Residuals: selective retrieval across depth\n\nThe second change is about *depth*, not length. A plain residual stream forces every layer to add the same accumulated\nstate from the layer just below it, so representations from far-earlier depths get smeared together as they climb the\nstack. **Attention Residuals** let a layer instead *selectively retrieve* representations from specific earlier depths — a\nlearned read across depth rather than a uniform accumulation. Toggle the two modes and scrub the current layer:\n\n<AttnRes />\n\nThe payoff Moonshot reports is concrete: about **25% higher training efficiency at under 2% additional cost**. That ratio\nis the tell — a cheap structural change that improves gradient flow and lets the stack go deeper without the usual\ndegradation, which is exactly the kind of lever that compounds into the headline 2.5× scaling number.\n\nAlongside these, the attention sublayer uses **Gated MLA** — Multi-head Latent Attention with a learnable gate that\ncontrols activation and sharpens attention selectivity — and the MLP nonlinearity is a **Sigmoid Tanh Unit (SiTU)**,\nchosen for stable activation dynamics at 2.8T scale. Small pieces, but at this size \"stable\" is load-bearing.\n\n## Stable LatentMoE: 16 of 896, and why that is hard\n\nHere is the aggressive part. K3's feed-forward is a mixture of experts with **896 experts**, of which only **16** fire for\nany given token — an activation ratio of **1.8%**. That sparsity is what lets a 2.8T-parameter model activate only ~50B\nparameters per token, so the compute per token is that of a ~50B model while the *knowledge capacity* is that of a 2.8T\none. The experts are **latent** — they operate in a learned latent space rather than through hand-tuned token-to-expert\nheuristics. 
Scrub a few tokens and watch the selected 16 change:\n\n<LatentMoE />\n\nAt 1.8% activation, two problems that are mild in a denser MoE become first-order. **Routing:** which 16 experts you pick\nper token has to be learned well, or the model wastes most of its capacity. **Load balance:** if a few experts hog the\ntokens, the rest never train, and the effective model collapses to something far smaller than 2.8T. Standard MoEs fight\nthis with an **auxiliary load-balancing loss** and a sensitive balance coefficient — one more hyperparameter that, tuned\nwrong, either destabilizes training or lets experts collapse.\n\n### Quantile balancing: no auxiliary loss, no knob\n\nK3's answer is **Quantile Balancing**: derive each expert's allocation directly from the **quantiles of its router\nscores**. Set a target quantile — keep the top $q$ fraction of tokens by score — and every expert serves the same\nfraction of tokens *by construction*, with no auxiliary loss and no balance coefficient to tune. Drag the quantile and\nflip to the aux-loss regime to see the imbalance it removes:\n\n<QuantileBalancing />\n\nThe systems half of the story matters just as much. K3 uses a **fully balanced expert-parallel training method with static\nshapes and no host synchronization**. That is not a throwaway detail: variable expert loads normally produce variable\ntensor shapes, which force recompilation and host-side synchronization that stalls a large cluster. 
Quantile balancing\ngives every expert the same load, so the shapes are static, so the expert-parallel pipeline runs without host sync — the\ndifference between 16-of-896 routing being a nice idea and being trainable at 2.8T.\n\nWith all four pieces on the table, here is the paper's module-level picture redrawn: the **Stable LatentMoE** and **KDA**\nblocks in full detail on the left, and on the right the **Block Attention Residuals** backbone — where each module's\noutput flows through an `α` gate that can read *every* earlier block and the embedding, not just the layer below it.\n\n<KimiK3Architecture />\n\n## Turning compute into intelligence\n\nStack it up — KDA's cheap long-context memory, AttnRes's cheap depth, LatentMoE's extreme-but-stable sparsity, plus\nrefined training and data recipes — and Moonshot's headline is a **~2.5× improvement in overall scaling efficiency** over\nK2. Concretely: K3 reaches the same capability at roughly **1/2.5 the training compute**. Drag the capability marker:\n\n<ScalingEfficiency />\n\nThat is the number that actually matters. \"2.8 trillion parameters\" is a spec-sheet figure; \"2.5× more capability per\nFLOP\" is an engineering result. It is what lets an open lab, working around compute limits, ship a frontier-adjacent model\nwithout a frontier-sized compute bill.\n\n## What it would take to train it\n\nSo what does building a 2.8T-A50B model actually cost? The sparsity helps here too: training compute for an MoE scales\nwith the **active** parameters, not the total, so K3's per-token training FLOPs are those of a ~50B model. The standard\nestimate is\n\n$$\nC \\approx 6 \\, N_{\\text{active}} \\, D\n$$\n\nwith $N_{\\text{active}} \\approx 50\\text{B}$ and $D$ the number of training tokens. Moonshot has not published K3's exact\ntoken budget; for reference, K2 was trained on **15.5T tokens** with the Muon optimizer. 
Plug in a frontier-scale token\nbudget and pick a cluster — the estimate is millions of GPU-hours and weeks of wall-clock:\n\n<TrainingCost />\n\nThree things make that estimate *achievable* rather than merely large:\n\n- **Per-Head Muon.** K3 extends the Muon optimizer to optimize each attention head independently — more adaptive updates\n  at scale, on top of Muon's already better compute-efficiency than AdamW for this regime.\n- **MXFP4 / MXFP8 quantization-aware training.** From the SFT stage onward, K3 trains with **MXFP4 weights and MXFP8\n  activations**. The model is trained to be low-precision-native, which is why the full 2.8T weights fit in roughly **1.4 TB**\n  and why it is servable at all without a quality cliff.\n- **Static-shape expert parallelism.** As above — quantile balancing plus static shapes and no host synchronization is\n  what keeps a few-thousand-accelerator cluster busy instead of stalling on dynamic routing.\n\nThe memory reality is worth stating plainly: 2.8T parameters even at 4-bit is ~1.4 TB just for weights, before optimizer\nstates and activations, so training and serving both demand tensor-, pipeline-, and expert-parallelism across many\naccelerators. This is a systems achievement as much as a modeling one.\n\n## The benchmarks\n\nOn coding, K3 is a clear #2-or-#3 behind Fable 5 and GPT-5.6 Sol, and ahead of everything else open or closed that\nMoonshot tested — with a few outright wins.\n\n<Figure\n  src=\"/articles/kimi-k3/fig1.png\"\n  alt=\"Kimi K3 coding benchmarks. Six grouped bar charts — DeepSWE, FrontierSWE, Kimi Code Bench 2.0, Terminal Bench 2.1, Program Bench, SWE Marathon — comparing Kimi K3 against Fable 5, GPT-5.6 Sol, GPT-5.5, Opus-4.8 and GLM-5.2, all at maximum thinking effort. 
Kimi K3 is highlighted and lands first or second in most panels.\"\n  caption=\"Kimi K3 coding benchmarks vs Fable 5, GPT-5.6 Sol, GPT-5.5, Opus-4.8 and GLM-5.2 — all maxed on thinking effort (Moonshot AI, 2026).\"\n/>\n\nOn FrontierSWE it sits second, close behind Fable 5 and well ahead of the rest:\n\n<BenchBars\n  title=\"FrontierSWE (%)\"\n  unit=\"\"\n  bars={[\n    { label: \"Fable 5\", value: 86.6 },\n    { label: \"Kimi K3\", value: 81.2, highlight: true },\n    { label: \"GPT-5.6 Sol\", value: 71.3 },\n    { label: \"GLM-5.2\", value: 67.3 },\n    { label: \"Opus-4.8\", value: 66.7 },\n    { label: \"GPT-5.5\", value: 64.9 },\n  ]}\n/>\n\nOn Terminal Bench 2.1 it is effectively tied for first, and on the long-horizon SWE Marathon and Program Bench it is\nfirst outright:\n\n<BenchBars\n  title=\"Terminal Bench 2.1 (%)\"\n  unit=\"\"\n  bars={[\n    { label: \"GPT-5.6 Sol\", value: 88.8 },\n    { label: \"Kimi K3\", value: 88.3, highlight: true },\n    { label: \"Opus-4.8\", value: 84.6 },\n    { label: \"Fable 5\", value: 84.6 },\n    { label: \"GPT-5.5\", value: 83.4 },\n    { label: \"GLM-5.2\", value: 82.7 },\n  ]}\n/>\n\n<BenchBars\n  title=\"SWE Marathon — long-horizon (%)\"\n  unit=\"\"\n  bars={[\n    { label: \"Kimi K3\", value: 42.0, highlight: true },\n    { label: \"Opus-4.8\", value: 40.0 },\n    { label: \"GPT-5.6 Sol\", value: 39.0 },\n    { label: \"Fable 5\", value: 35.0 },\n    { label: \"GPT-5.5\", value: 14.0 },\n    { label: \"GLM-5.2\", value: 13.0 },\n  ]}\n/>\n\nThe agentic and visual picture is similar — competitive across the board, and #1 on browsing:\n\n<Figure\n  src=\"/articles/kimi-k3/fig2.png\"\n  alt=\"Kimi K3 general and visual agent benchmarks. Bar charts for GDPval-AA v2 Elo, AA-Briefcase Elo, Automation Bench, JobBench, SpreadsheetBench 2, BrowseComp, CharXiv and Zerobench, comparing Kimi K3 against Fable 5, GPT-5.6 Sol, GPT-5.5, Opus-4.8 and GLM-5.2. 
Kimi K3 leads on BrowseComp, Automation Bench and SpreadsheetBench 2.\"\n  caption=\"Kimi K3 general + visual agent benchmarks — GDPval, AA-Briefcase, Automation Bench, JobBench, SpreadsheetBench 2, BrowseComp, CharXiv, Zerobench (Moonshot AI, 2026).\"\n/>\n\n<BenchBars\n  title=\"BrowseComp (%)\"\n  unit=\"\"\n  bars={[\n    { label: \"Kimi K3\", value: 91.2, highlight: true },\n    { label: \"GPT-5.6 Sol\", value: 90.4 },\n    { label: \"Fable 5\", value: 88.0 },\n    { label: \"GPT-5.5\", value: 84.4 },\n    { label: \"Opus-4.8\", value: 84.3 },\n  ]}\n/>\n\nThe pattern is consistent: K3 wins where the task is long-horizon and tool-heavy (SWE Marathon, Program Bench,\nBrowseComp, Automation Bench, SpreadsheetBench 2), and comes second to Fable 5 or GPT-5.6 Sol on the single-shot,\nknowledge-dense ones (GDPval and AA-Briefcase Elo, DeepSWE). For an open model, being in that conversation at all is the\nstory.\n\n## What it costs to serve\n\nThe sparsity that makes K3 cheap to train makes it cheap to run. API pricing is **$0.30 / MTok** on cache-hit input,\n**$3.00 / MTok** on cache-miss input, and **$15.00 / MTok** output — and Moonshot reports cache-hit rates **above 90%** in\ncoding workloads, so the effective input price is closer to the cheap number than the expensive one. MXFP4 weights keep\nthe footprint at ~1.4 TB, and Moonshot recommends supernode configurations of **64 or more accelerators**, with a\nMiniTriton CUDA-core roofline shown on NVIDIA L20. The KDA state and static-shape routing are what make that serving\nprofile hold up at 1M context.\n\n<Callout type=\"warn\">\n**Read the caveats.** (1) K3 still **trails Fable 5 and GPT-5.6 Sol** on overall capability and on user-experience polish —\nMoonshot says so directly, and flags sensitivity to thinking-history preservation and over-proactiveness in ambiguous\nsituations. 
(2) The benchmarks are **Moonshot's own suite**, run with opponents' potential fallbacks (Fable 5) and\ncyberguards (GPT-5.6 Sol) noted on the charts; treat cross-lab numbers as directional. (3) At release the weights were\n**promised (by 2026-07-27) but not yet downloadable**, so \"open\" was a commitment, not yet a download. (4) The training\ncost above is a **first-principles estimate**, not a disclosed figure — Moonshot has not published K3's token budget,\ncluster, or dollar cost.\n</Callout>\n\n## The take\n\nStrip away the size record and what is genuinely new in K3 is a coherent set of efficiency bets: **KDA** buys a real 1M\ncontext with constant-size memory; **AttnRes** buys depth almost for free; **Stable LatentMoE** with **Quantile Balancing**\nbuys 2.8T of capacity at ~50B of active compute *and* makes that extreme sparsity trainable without an aux-loss knob or\nhost-sync stalls; **Per-Head Muon** and **MXFP4/MXFP8 QAT** make the whole thing converge and fit. The sum is the number\nthat matters — **~2.5× more capability per FLOP than K2** — delivered in the open at 2.8T. It does not top the frontier,\nand it does not pretend to. What it proves is that the gap between open and closed is now measured in scaling *efficiency*,\nnot in whether an open lab can build at frontier scale at all.\n\n---\n\n*Sources: the [Kimi K3 tech blog](https://www.kimi.com/blog/kimi-k3) (architecture, Stable LatentMoE, Quantile Balancing,\nKDA, AttnRes, benchmarks, pricing) and Moonshot's reported figures. Benchmark numbers are quoted from Moonshot's charts;\nthe training-cost figures are a first-principles estimate from $C \\approx 6\\,N_{\\text{active}}\\,D$ with clearly labeled\nassumptions, using K2's 15.5T-token budget as a reference. 
Interactive diagrams illustrate the mechanisms; the routing,\nloop, and cost visuals are illustrative.*\n","readingTimeMins":12,"url":"https://ai.thesatyajit.com/articles/kimi-k3","signal":{"interest":5,"helpful":4,"score":9,"level":5,"label":"Essential"}},{"title":"LIMSSR: scoring actions when modalities go missing at training time","description":"Most missing-modality methods assume you trained on complete data and only lose a stream at test time. LIMSSR (ICML 2026 spotlight) tackles the realistic, harder case — modalities missing during training too — by reframing action-quality assessment as LLM-driven sequence-to-score reasoning, imputing missing modalities in prompt space instead of reconstructing them, and gating away hallucinated guesses.","date":"2026-07-17","tags":["multimodal","missing-modality","llm","action-quality-assessment","explainer"],"draft":false,"cover":"/articles/limssr/fig1.png","featured":false,"interest":3,"helpful":2,"kind":"articles","slug":"limssr","body":"**Action Quality Assessment (AQA)** is the task of watching a video of an action — a\ndive, a figure-skating program, a gymnastics routine — and predicting a *numeric quality\nscore* a judge would give it. The strong systems are multimodal: they read RGB frames,\noptical flow, and sometimes audio, because form, motion, and rhythm all carry signal.\n\nWhich is fine until a modality goes missing. And in the real world, modalities go missing\nnot just at test time but *during training* — a dataset where some clips never had audio,\nor flow was never computed. That is the case **LIMSSR** (Xu, Wu, Ke, and Peng; ICML 2026\nspotlight) is built for, and it is meaningfully harder than the usual setup.\n\n## Why training-time missingness is the hard case\n\nMost missing-modality work makes a quiet assumption: you trained on complete data, learned\nwhat a \"normal\" audio or flow stream looks like, and only lose one at inference. 
Then you\ncan reconstruct the missing stream, or distill a complete-modality teacher into an\nincomplete-modality student.\n\nTake the missingness back into training and both tricks weaken. You cannot reconstruct a\ndistribution you never fully saw, and there is no clean complete-modality teacher to distill\nfrom if the training set itself is full of holes. The score is a scalar, so a wrong\nimputation does not announce itself — it just quietly biases the number. To make \"how much\ndoes missingness hurt\" precise, AQA is graded by **Spearman's rank correlation** $\\rho$\nbetween predicted and ground-truth scores: 1.0 is a perfect ordering of performances, 0 is\nchance. It is a ranking metric, so it punishes exactly the systematic bias a bad guess\nintroduces.\n\n<Figure\n  src=\"/articles/limssr/fig2.png\"\n  alt=\"Three paradigms for incomplete-multimodal learning. (a) Reconstruction-based methods use a generative model to rebuild the missing modality before a downstream head. (b) Distillation/prior-based methods train a complete-modality teacher and distill priors into an incomplete-modality student. (c) LIMSSR tokenizes the available modality plus a text prompt naming the missing ones and feeds them to a LoRA-tuned LLM as a sequence-to-score problem.\"\n  caption=\"Two older answers — reconstruct the missing modality (a), or distill a complete-modality teacher (b) — and LIMSSR's: reframe the whole thing as LLM-driven sequence-to-score reasoning (c) (Xu et al., 2026, Figure 1).\"\n/>\n\n## Impute in prompt space, don't reconstruct\n\nLIMSSR's first move is to stop reconstructing. Each modality gets a frozen feature\nextractor and a small projection; a missing modality is not zero-filled but replaced with a\n**special token**, and the model is handed a **text prompt that names which modalities are\npresent and which are gone**. 
The LLM is then asked to *infer the missing modality's latent\nrole from the context that remains* — role inference, not feature reconstruction.\n\nThat distinction is the whole point. A zero-filled slot drags a fixed-slot fusion model\ntoward a wrong answer; a described-absence lets a language model reason about what the gap\nmeans. Drop a modality on each side and watch the gap open:\n\n<MissingModality />\n\nThe audio-only case is the tell. A naive fusion model, fed zeros for the two missing\nstreams, collapses to $\\rho = 0.177$ — barely above chance. LIMSSR, told in words that\nvideo and flow are gone and asked to reason from audio alone, holds $\\rho = 0.687$ on the\nsame split.\n\n## The pipeline\n\nEnd to end: per-modality features → specific projection → **prompt-guided context-aware\nmodality imputation** (special tokens + the modal-condition prompt) → an **LLM-driven\nmultidimensional representation fusion** that packs everything into fusion tokens → a frozen\n**large language model with LoRA** doing the sequence-to-score reasoning → a mask-aware\naggregation head that emits the score.\n\n<Figure\n  src=\"/articles/limssr/fig1.png\"\n  alt=\"The LIMSSR architecture. 
Frozen specific feature extractors turn RGB, flow, and audio into features; a modality-missing condition and specific projection produce incomplete multimodal features; prompt-guided context-aware modality imputation and LLM-driven multidimensional representation fusion build fusion tokens; a frozen LLM with LoRA processes them under a modal-condition prompt; and a Mask-Aware Dual-Path Aggregation head combines cross-modal pattern recovery with uncertainty-calibrated reasoning to output the quality score.\"\n  caption=\"The full pipeline: frozen extractors, prompt-space imputation, a LoRA-tuned LLM doing the reasoning, and the Mask-Aware Dual-Path Aggregation head (Xu et al., 2026, Figure 2).\"\n/>\n\n## Mask-aware dual-path aggregation: don't trust a lucky guess\n\nReasoning about a missing modality invites a failure mode: the model confidently\nhallucinates the part it cannot see. LIMSSR's aggregation head is built to suppress exactly\nthat. It runs **two paths** off the same missingness mask $m$:\n\n- **Cross-modal pattern recovery** — cross-attention, gated weighting, and a *learnable\n  confidence*. Strong when modalities are present, shaky when they are not.\n- **Uncertainty-calibrated reasoning** — role-aware weighting and mask-aware refinement that\n  explicitly discounts low-confidence dimensions, so it degrades gracefully.\n\nA learnable-confidence gate blends the two. As modalities drop, the recovery path has little\nleft to cross-attend over, its confidence falls, and the gate shifts weight onto the\ncalibrated path — the one that already distrusts what it cannot verify. Toggle the modalities\nand watch the gate swing:\n\n<DualPathAggregation />\n\nThat shift is the anti-hallucination mechanism. 
On FS1000 the full gate cuts mean-squared\nerror from **18.18** (simple fusion) to **14.08**, at $\\rho = 0.789$.\n\n## Results\n\nAcross every available/missing combination, LIMSSR's predicted scores track the diagonal —\nthe ground truth — more tightly than the multimodal-expert baselines (MoMKE, MCMoE), and it\nholds up in the settings where they fall apart:\n\n<Figure\n  src=\"/articles/limssr/fig3.png\"\n  alt=\"Scatter plots of predicted versus true action-quality score for seven available/missing modality combinations, comparing MoMKE, MCMoE, and LIMSSR. Points for LIMSSR cluster along the diagonal in every panel, including the hardest single-modality cases, while the baselines spread further from it.\"\n  caption=\"Predicted vs true score across all seven modality-availability settings — LIMSSR (right of each triplet) stays near the diagonal where MoMKE and MCMoE drift (Xu et al., 2026, Figure 5).\"\n/>\n\nThe starkest number is the hardest split — audio only, the two visual streams gone:\n\n<BenchBars\n  title=\"FS1000 Spearman ρ — audio only (both visual streams missing)\"\n  unit=\"\"\n  max={1}\n  bars={[\n    { label: \"LIMSSR\", value: 0.687, highlight: true },\n    { label: \"naive fusion\", value: 0.177 },\n  ]}\n/>\n\nWith every modality present the margin is smaller but still real ($\\rho$ 0.891 vs 0.819),\nwhich is the shape you want: a method that helps most exactly where the problem is hardest,\nand does no harm when the data is complete.\n\n<Callout type=\"warn\">\nKeep the scope in view. (1) This is **AQA**, a narrow regression task on relatively small\ndatasets (FS1000 and friends), not a general multimodal benchmark — the gains are real but\ndomain-specific. (2) The interactive pipelines here are **illustrative**; the $\\rho$ and MSE\nvalues are the paper's reported FS1000 numbers, but the gate dynamics I animate are a\nsimplification. 
(3) Bolting a LoRA-tuned LLM onto a scoring head **adds parameters and\nlatency** versus a lightweight fusion model — you are paying for the reasoning that buys the\nrobustness.\n</Callout>\n\n## The take\n\nThe reframing is the idea worth keeping. Missing-modality learning has mostly been treated as\na *reconstruction* problem — rebuild the pixels or features you lost. LIMSSR treats it as a\n*reasoning* problem: describe the absence in language, let an LLM infer what the missing\nstream would have contributed, and gate the answer by how much you can trust it. That it\nworks under training-time missingness — the case reconstruction and distillation both\nstruggle with — and lifts audio-only $\\rho$ from 0.177 to 0.687 is a good argument that, for\nmessy real-world multimodal data, telling the model what it is missing beats trying to fake\nwhat it lost.\n\n---\n\n*Source: \"LIMSSR: LLM-Driven Sequence-to-Score Reasoning under Training-Time Incomplete\nMultimodal Observations\" (Huangbiao Xu, Huanqi Wu, Xiao Ke, Yuxin Peng), ICML 2026 spotlight.\nNumbers ($\\rho$, MSE) are the paper's reported FS1000 results; the interactive diagrams\nillustrate the mechanism.*\n","readingTimeMins":6,"url":"https://ai.thesatyajit.com/articles/limssr","signal":{"interest":3,"helpful":2,"score":5,"level":1,"label":"Niche"}},{"title":"Monolith 1.0: a 1.6T open MoE built for reasoning","description":"Basalt Labs' Monolith 1.0 is a 1.57T-parameter open Mixture-of-Experts with 49.5B active per token — top-2 routing over 128 experts plus one shared, a two-stage YaRN stretch to a 1M-token context, and an SFT/DPO/RLVR reasoning-RL recipe. 
What is actually new, what it costs to run, and how much of its own benchmark story to trust.","date":"2026-07-17","tags":["llm","mixture-of-experts","reasoning","long-context","explainer"],"draft":false,"featured":false,"interest":3,"helpful":3,"kind":"articles","slug":"monolith-1-0","body":"Basalt Labs' [Monolith 1.0](https://huggingface.co/basaltlabsai/monolith-1.0) is a\n**1.572-trillion-parameter** Mixture-of-Experts with **49.5B active per token**, released as\nopen weights under an MIT license — weights, tokenizer, and eval harness, commercial use\nallowed. It is a decoder-only, reasoning-focused, Chinese-English model, and Basalt is blunt\nabout what it is for: \"a 1.6T open Mixture-of-Experts foundation model for reasoning at scale.\"\nThe [tech report](https://basaltlabs.org/monolith) lays out the recipe.\n\nThe headline number is the total parameter count, but the number that governs everything else is\nthe **active** one. Monolith spends 1.57T parameters of capacity but pays for only 49.5B of\ncompute per token — a **32x sparsity** ratio. This piece walks through the parts that make that work: the\nMoE routing, the two-stage context extension to a million tokens, the training recipe, and the\ndecode trick that makes a model this large servable. Full prose and math carry each idea; the\ninteractive diagrams are there to build intuition.\n\n## The spec sheet\n\n- **1.572T** total parameters, **49.5B** active per token — a **32x** sparsity ratio.\n- **80 layers**, model dimension **8,192**.\n- **Grouped-query attention**: 64 query heads, 8 KV groups, head dim 128.\n- **128 routed experts** per layer (SwiGLU, intermediate dim 6,144) **+ 1 shared expert**; **top-2** routing.\n- **RoPE base 5e6**, two-stage YaRN; context length **1,048,576** tokens.\n- Byte-level BPE tokenizer, **151,936** vocab.\n\nTwo design choices do most of the work: how the experts are routed, and how the context is\nstretched.\n
Take them one at a time.\n\n## Routing: top-2 of 128, plus one that never sleeps\n\nMonolith's feed-forward block is a mixture of experts. Each layer holds **128 routed experts** and\n**one shared expert**. For every token, a router scores the 128 and keeps the **top-2**; the shared\nexpert is always on. So **3 of 129** experts fire per token. Scrub the token index and watch the\nrouted pair swing while the shared expert stays lit:\n\n<MoeRouting />\n\nThe always-on shared expert is the part worth dwelling on. A pure top-$k$ router has to relearn the\ncommon, every-token computation inside many different experts, which wastes capacity. Splitting off\none shared expert lets the routed experts specialize while the shared one carries the baseline work —\na pattern that has become standard in large MoEs because it stabilizes routing at high sparsity.\n\nThe sparsity is what makes the size affordable. Active parameters are the shared expert plus the\ntop-2 routed, so the per-token compute is that of a roughly 50B model while the knowledge capacity\nis that of a 1.57T one:\n\n$$\n\\text{sparsity} = \\frac{N_{\\text{total}}}{N_{\\text{active}}} = \\frac{1.572 \\times 10^{12}}{4.95 \\times 10^{10}} \\approx 32\\times\n$$\n\nThat 32x is the whole bet: you get the memory footprint of a trillion-scale model but the FLOPs of a\nmid-size one, provided you can route well and keep every expert busy.\n\n## Attention: grouped-query, so the 1M cache fits\n\nBefore the context trick, one attention detail matters for it. Monolith uses **grouped-query\nattention** — 64 query heads sharing just **8 KV groups**. 
The KV cache stores keys and values per\ngroup, not per query head, so it is **8x smaller** than full multi-head attention at the same width.\nAt a million-token context the KV cache is the dominant memory cost of decoding, so an 8x reduction\nthere is the difference between a 1M window being a spec and being something you can actually hold in\nmemory.\n\n## Long context: two YaRN stages to a million tokens\n\nMonolith pretrains at a cheap **4,096-token** window, then extends to **1,048,576** tokens — a\n**256x** stretch — using **YaRN** in two stages on a RoPE base of **5e6**. YaRN rescales the rotary\nposition frequencies so positions far beyond the training length stay in distribution instead of\naliasing into nonsense. Doing it in two 16x steps rather than one 256x leap keeps long-range\nattention coherent. Step through the stages:\n\n<YarnContext />\n\nThe extension factor is exactly\n\n$$\n\\frac{1048576}{4096} = 256,\n$$\n\nand a two-stage split puts the midpoint at the geometric mean, $\\sqrt{256} = 16$, so each stage is a\n16x reach: $4096 \\to 65536 \\to 1048576$. (The 65,536 midpoint is my illustration of a\nclean two-stage split; Basalt reports two YaRN stages without pinning the intermediate length.) The\nreason to stage it is numerical: RoPE extrapolation degrades faster than linearly with the extension\nfactor, so two moderate stretches with a re-anchor in between hold up where one aggressive stretch\nsmears the attention over distant tokens.\n\n## Training: 60T tokens, and a FLOP budget that checks out\n\nMonolith is trained on **60T tokens** (a multilingual mixture) in **BF16 mixed precision with an\nFP32 optimizer state**. Basalt reports a compute budget of about **1.8e25 FLOPs**. That number is\nnot arbitrary — the standard estimate for transformer training compute is\n\n$$\nC \\approx 6 \\, N_{\\text{active}} \\, D,\n$$\n\nwith $N_{\\text{active}}$ the active parameters and $D$ the token count. 
MoE training compute scales\nwith the **active** parameters, not the total, because only the active experts run per token. Plug in\n$N_{\\text{active}} = 49.5\\text{B}$ and $D = 60\\text{T}$:\n\n$$\n6 \\times (4.95 \\times 10^{10}) \\times (6.0 \\times 10^{13}) \\approx 1.78 \\times 10^{25}\\ \\text{FLOPs},\n$$\n\nwhich lands on the reported 1.8e25. The sparsity pays off twice: at 32x it means a 1.57T model trains\nat the per-token cost of a ~50B one.\n\nPost-training is a three-stage reasoning pipeline: **SFT**, then **DPO**, then **RLVR** — reinforcement\nlearning with verifiable rewards. RLVR is the reasoning-specific piece: instead of a learned reward\nmodel, the reward comes from checking whether the answer is actually correct (a math result that\nverifies, code that passes tests), which is a cleaner signal for training long chains of thought and\nharder to reward-hack than a preference model.\n\n## Serving it: self-speculative decoding\n\nA 1.57T-parameter model is memory-bound at decode time, so Monolith ships **self-speculative\ndecoding**: the model drafts several tokens cheaply, then verifies them all in one forward pass,\nkeeping the longest correct prefix and correcting the first miss. Scrub the phase and flip the\ndomain:\n\n<SelfSpeculative />\n\nBecause the verify pass reproduces the base model exactly, this is **lossless** — the output is token\nfor token what greedy decoding would have produced, just fewer expensive passes to get there. Code is\nmore predictable than prose, so more drafts survive verification: Basalt reports **~2.1x** faster\ndecoding on natural language and **~2.7x** on code.\n\nEven so, \"open weights\" here still means rack-scale hardware. Basalt targets **one GB300 NVL72 rack\n(72 GPUs)** at FP8, or a **CloudMatrix-384** at native BF16. You can download the weights; running\nthem is another matter.\n\n## The benchmarks\n\nHere is where honesty has to lead. 
On Basalt's own harness, at maximum thinking effort, Monolith\nposts numbers that are not just ahead of the field but near the ceiling of the tests themselves.\nCompared against GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, and Kimi K2.6:\n\n<BenchBars\n  title=\"Humanity's Last Exam (%)\"\n  unit=\"\"\n  bars={[\n    { label: \"Monolith 1.0\", value: 99.4, highlight: true },\n    { label: \"GPT-5.4\", value: 61.2 },\n    { label: \"Claude Opus 4.6\", value: 58.9 },\n    { label: \"Gemini 3.1 Pro\", value: 55.4 },\n    { label: \"Kimi K2.6\", value: 44.1 },\n  ]}\n/>\n\n<BenchBars\n  title=\"AIME 2025 (%)\"\n  unit=\"\"\n  bars={[\n    { label: \"Monolith 1.0\", value: 100.0, highlight: true },\n    { label: \"GPT-5.4\", value: 96.7 },\n    { label: \"Claude Opus 4.6\", value: 94.3 },\n    { label: \"Gemini 3.1 Pro\", value: 93.3 },\n    { label: \"Kimi K2.6\", value: 90.0 },\n  ]}\n/>\n\n<BenchBars\n  title=\"GPQA Diamond (%)\"\n  unit=\"\"\n  bars={[\n    { label: \"Monolith 1.0\", value: 95.9, highlight: true },\n    { label: \"GPT-5.4\", value: 89.4 },\n    { label: \"Claude Opus 4.6\", value: 88.1 },\n    { label: \"Gemini 3.1 Pro\", value: 86.7 },\n    { label: \"Kimi K2.6\", value: 79.8 },\n  ]}\n/>\n\n<BenchBars\n  title=\"MMLU-Pro (%)\"\n  unit=\"\"\n  bars={[\n    { label: \"Monolith 1.0\", value: 96.2, highlight: true },\n    { label: \"GPT-5.4\", value: 88.0 },\n    { label: \"Claude Opus 4.6\", value: 87.3 },\n    { label: \"Gemini 3.1 Pro\", value: 86.1 },\n    { label: \"Kimi K2.6\", value: 82.4 },\n  ]}\n/>\n\nRead those bars, then read the next box before you form an opinion.\n\n<Callout type=\"warn\">\n**These are single-lab, self-reported, own-harness numbers — treat them as directional, not settled.**\nA **99.4** on Humanity's Last Exam and a **100.0** on AIME 2025 are effectively saturation: the model\nis not beating the field by a few points, it is sitting at the top of the scale, and on Humanity's Last\nExam every other model on the chart trails by roughly 38 to 55 
points. Numbers that clean from the lab that built the\nmodel and ran the eval are exactly the ones to be skeptical of. Own-harness, maximum-effort results\nselect for the configuration that flatters the model; they say nothing about a neutral setup, a\ncontaminated-test check, or a different prompt. The thing to wait for is **independent third-party\nevaluation** on held-out sets. And the practical caveat does not go away with better scores: \"open\nweights\" for a 1.57T model still means a **GB300 NVL72 rack** to run it, so open here is a licensing\nfact, not an accessibility one.\n</Callout>\n\n## What I make of it\n\n- **The engineering is coherent and legible.** 32x sparsity via top-2-of-128 plus a shared expert,\n  an 8x-smaller KV cache from grouped-query attention, a staged YaRN reach to 1M tokens, and a FLOP\n  budget that checks out against $6 N_{\\text{active}} D$ — none of it is exotic, all of it is the\n  right lever for a trillion-scale reasoning model. The self-speculative decode is a real, lossless\n  serving win.\n- **The MIT license is the genuinely useful part.** Weights, tokenizer, and eval harness, commercial\n  use allowed — that is more open than most models at this scale, and it means the benchmark claims\n  can, in principle, be checked by anyone with the hardware.\n- **The benchmarks are the part to hold at arm's length.** Saturated, self-reported, own-harness\n  scores are a marketing artifact until someone independent reproduces them. I would love to be\n  wrong; I would rather wait for the third-party numbers than quote 99.4 as if it were settled.\n\nThe bet Monolith makes is that a trillion-scale open MoE, routed and staged carefully, can be a\nfrontier reasoning model in the open. The architecture is a credible version of that bet. 
Whether it\nactually reasons at 99.4-on-HLE levels is a question its own harness cannot answer.\n\n---\n\n*Sources: the [Monolith 1.0 model card](https://huggingface.co/basaltlabsai/monolith-1.0) and the\n[Basalt Labs tech report](https://basaltlabs.org/monolith) (architecture, training, deployment,\nbenchmarks). Benchmark numbers are quoted as reported by Basalt on their own harness at maximum\nthinking effort. The training-compute figure is checked against $C \\approx 6\\,N_{\\text{active}}\\,D$\nwith the reported 49.5B active parameters and 60T tokens. The interactive diagrams illustrate the\nmechanisms; the routing, context, and decode visuals are illustrative, and the 65,536-token YaRN\nmidpoint is my own clean two-stage split, not a disclosed intermediate length.*\n","readingTimeMins":9,"url":"https://ai.thesatyajit.com/articles/monolith-1-0","signal":{"interest":3,"helpful":3,"score":6,"level":2,"label":"Solid"}},{"title":"VideoChat3: a 4B video model that watches longer for less","description":"VideoChat3 is a fully open 4B video MLLM built around two efficiency moves — an inflated 3D ViT that hits 16x spatiotemporal compression, and adaptive per-frame resolution. How both work, the benchmark deltas over Qwen3-VL-4B and Molmo2-4B, and the 44.4s → 20.4s latency win on 2048-frame video.","date":"2026-07-17","tags":["multimodal","video-understanding","vision-language","efficiency","explainer"],"draft":false,"cover":"/articles/videochat3/fig1.png","featured":false,"interest":3,"helpful":3,"kind":"articles","slug":"videochat3","body":"VideoChat3, from Nanjing University, Shanghai AI Lab, NTU, and Peking University\n([arXiv 2607.14935](https://arxiv.org/abs/2607.14935),\n[HF](https://huggingface.co/papers/2607.14935)), is a **4B-parameter video\nmultimodal LLM** with one clear thesis: you should not have to pick between a model\nthat generalizes across video types, a model that is cheap to run on long clips, and\na model you can actually reproduce. 
Most video MLLMs give you one of the three. This\none aims for all three, and ships the training data to prove the last.\n\nThe interesting part is not the leaderboard — it is *how* a 4B model stays coherent\nover 2048 frames without the token count exploding. Two mechanisms carry that: an\ninflated 3D tokenizer, and adaptive frame resolution. I'll walk both, then the\nnumbers.\n\n## The problem: tokens grow with time\n\nA video is frames, and every frame is a few hundred vision tokens. Feed a clip in\nframe-by-frame and the sequence length grows linearly with duration — a few thousand\ntokens for a short clip, tens of thousands for a long one. That is what makes long\nvideo expensive: the LLM pays quadratic attention over a token budget set by the\ntokenizer, and the tokenizer, if it treats each frame independently, has no reason to\nbe frugal.\n\nTwo failure modes fall out of that. Compute blows up on long clips. And a 2D image\ntokenizer, bolted on frame-by-frame, never models *motion* — it sees a stack of\nstills. VideoChat3 attacks both at the tokenizer.\n\n## I3D-ViT: inflating a 2D tokenizer into 3D\n\nThe first move is the **Inflated 3D Vision Transformer (I3D-ViT)**. Start from a 2D\nimage encoder — a plain patch-and-attend ViT — and *inflate* it into a spatiotemporal\none instead of training a video encoder from scratch. The inflation is three steps:\n\n1. **Chunk the frames.** Group `T = 4` consecutive frames into a chunk.\n2. **Attend within the chunk.** Run self-attention across the chunk's tokens, so\n   spatial and temporal structure are modelled together — motion, locally.\n3. 
**Pool the chunk down.** A temporal pooling (×4) plus a 2×2 spatial merge collapse\n   each chunk to a compact, motion-aware token slot.\n\nTemporal ÷4 times spatial ÷4 is a **16× spatiotemporal compression** — the token\nbudget grows with the frame count, but 16× slower than the naive per-frame path.\nDrag the frame count and watch where the tokens go:\n\n<I3dVit />\n\nThe reason this works without wrecking accuracy: the compression happens *after* the\nchunk has been attended, so motion is already encoded into the surviving tokens. You\nare not throwing frames away — you are summarising each 4-frame window into one slot\nthat remembers what moved. Native resolution and aspect ratio are preserved through\nabsolute temporal embeddings, so the model still knows *when* each token happened.\n\n<Figure\n  src=\"/articles/videochat3/fig1.png\"\n  alt=\"I3D-ViT pipeline: native-resolution frames are patchified, given spatial and temporal positional embeddings, passed through variable-length self-attention within 4-frame chunks, then chunk-wise temporal pooling and a 2x2 pixel-shuffle merge feed compact video tokens into the LLM.\"\n  caption=\"I3D-ViT inflates a 2D tokenizer: patchify → spatial + temporal position embeddings → variable-length self-attention inside frame chunks → temporal pooling + 2×2 merge → compact tokens into the LLM (Li et al., 2026, Figure 2).\"\n/>\n\n## Adaptive frame resolution: spend pixels where the evidence is\n\nCompression handles the *count* of tokens. The second move handles the *cost per\nframe*. In a streaming setting most of a video is uneventful — a static room, a\nheld shot, dead air. Processing every frame at high resolution spends the same budget\non the boring frames as on the one that answers the question.\n\nSo VideoChat3 conditions the per-frame resolution on state. Routine moments are\nperceived under a low **224²-pixel** quota. 
When a *Standby* cue fires — the signal\nthat an answer is about to appear — the following window is enlarged to a **448²**\nquota, roughly 4× the tokens, to catch the detail. Click frames to promote them and\nwatch the budget move:\n\n<AdaptiveResolution />\n\nThe framing is a three-state stream: **Silence** (nothing to report, low-res),\n**Standby** (something is coming, stay ready), **Response** (answer now, high-res).\nThe budget follows the state instead of the clock. On a stream that is mostly\nsilence, that is most of the frames spent at a quarter of the token cost.\n\n<Figure\n  src=\"/articles/videochat3/fig3.png\"\n  alt=\"Streaming timeline: clips 0 through N+2 are processed low-res during silence, clips N+3 and N+4 jump to high-res as a Response window opens, then clip N+5 drops back to low-res.\"\n  caption=\"Adaptive perception on a live stream: low-res while Silence holds, high-res for the Response window, back to low-res after — the token quota tracks the state, not the frame index (Li et al., 2026, Figure 3).\"\n/>\n\n## The benchmarks\n\nThe headline: at **4B parameters**, VideoChat3 beats comparable open models\n(Qwen3-VL-4B, Molmo2-4B) across temporal perception, long video, reasoning, temporal\ngrounding, and online tasks — the paper's cross-benchmark sweep is one figure:\n\n<Figure\n  src=\"/articles/videochat3/fig2.png\"\n  alt=\"Grouped bar chart comparing Molmo2-4B, Qwen3-VL-4B, and VideoChat3 across MotionBench, TempCompass, VideoMME, LVBench, MMVU, VideoMME-v2, Charades, ActivityNet, QVHighlights, OVOBench, and StreamingBench; VideoChat3 leads on all, with large margins on the temporal-grounding benchmarks.\"\n  caption=\"VideoChat3 vs Qwen3-VL-4B and Molmo2-4B across eleven benchmarks — leading everywhere, with the widest gaps on temporal grounding (Li et al., 2026, Figure 1).\"\n/>\n\nWhere it separates most is **temporal grounding** — answering *when* something happens,\nnot just *what*. 
Over Qwen3-VL-4B the gain is **+9.7** on Charades and reaches **+20.6** on\nVUE-TR V2 (in the TimeLens suite). Charades makes the point:\n\n<BenchBars\n  title=\"Temporal grounding — Charades (mIoU)\"\n  unit=\"\"\n  bars={[\n    { label: \"VideoChat3-4B\", value: 56.1, highlight: true },\n    { label: \"Qwen3-VL-4B\", value: 46.4 },\n    { label: \"Molmo2-4B\", value: 33.3 },\n  ]}\n/>\n\nVideoChat3 also leads on ActivityNet (54.8 / 48.2 / 39.8) and QVHighlights\n(67.1 / 58.7 / 58.7, where the two baselines tie). It is not just grounding, though — the general video and\nreasoning benchmarks land ahead too, if by smaller margins:\n\n<BenchBars\n  title=\"Video reasoning — MMVU (%)\"\n  unit=\"\"\n  bars={[\n    { label: \"VideoChat3-4B\", value: 56.4, highlight: true },\n    { label: \"Molmo2-4B\", value: 51.2 },\n    { label: \"Qwen3-VL-4B\", value: 50.5 },\n  ]}\n/>\n\nVideoMME 70.1 (vs 69.3 / 69.6), LVBench 56.7 (vs 56.2 / 53.9), TempCompass 75.6 (vs\n70.8 / 72.8), StreamingBench 83.0 (vs 80.2). The deltas on the general suites\nrun from under one point to about six; the grounding deltas are larger, and they are where the tokenizer design shows up.\n\n## The efficiency payoff\n\nThe point of 16× compression is latency, and it compounds with length. Same\nhardware, same clip, VideoChat3 vs Qwen3-VL, end-to-end inference:\n\n| Frames | Qwen3-VL | VideoChat3 |\n|---|---|---|\n| 512 | 3.84s | 3.60s |\n| 1024 | 12.25s | 8.10s |\n| 2048 | 44.45s | **20.41s** |\n\nAt 512 frames the gap is small — the tokenizer overhead is a rounding error. At 2048\nframes it is **2.2×**: `44.4s → 20.4s`. The compression buys you the frames the\ngrounding benchmarks reward, at a latency that stays usable as the clip grows.\n\n## Fully open: the data, not just the weights\n\nThe \"fully open\" claim is the part I'd flag to anyone who has tried to reproduce a\nvideo MLLM. The bottleneck is never the architecture — it is the ~3M-sample\ninstruction mix nobody publishes. 
VideoChat3 releases three:\n\n- **VideoChat3-Academic2M** — 2.27M caption/QA instances from six academic sources,\n  with evidence-grounded annotation enhancement.\n- **VideoChat3-LV116K** — 116.2K long-form samples, mean durations 156s to ~1.3K\n  seconds.\n- **VideoChat3-OL617K** — 617,183 streaming/online instances across 40 shards.\n\nTrained through a four-stage curriculum: tokenizer pretraining → video-language\nalignment → general instruction tuning → long/streaming tuning. Weights and data\nboth out, so the recipe is checkable end to end.\n\n<Callout type=\"warn\">\nRead the comparison for what it is: a **4B-vs-4B** result. VideoChat3 beats *comparable\nopen* models at its size — it is not claiming to beat frontier closed video systems or\nmuch larger open ones, and the general-suite margins (VideoMME +0.5 to +0.8) are\ninside the range where mix and eval harness matter. The token math here is\nillustrative (I use ~64 spatial tokens/frame to keep the diagrams honest about\n*ratios*, not absolute counts); the 16× compression, the 224²/448² quotas, and the\nlatency numbers are the paper's. Adaptive resolution has a real failure mode too: set\nthe Standby threshold too tight and a fast event is only ever seen in 224².\n</Callout>\n\n## What I make of it\n\n- **The tokenizer is the whole story.** I3D-ViT is a clean idea — inflate a 2D\n  encoder, attend within short chunks, pool 16×. It is *why* a 4B model can watch 2048\n  frames in 20 seconds, and *why* the temporal-grounding gaps are as large as they are.\n  Motion survives the compression; that is the trick.\n- **Adaptive resolution is the right shape for streaming.** Spend the budget on the\n  evidence, not the clock. 
It maps cleanly onto a Silence/Standby/Response state\n  machine, and the savings are largest exactly where video is cheapest to skimp — the\n  dead air.\n- **Open data is the contribution that outlasts the benchmarks.** Numbers age; a\n  released 3M-sample video instruction mix is something the rest of the field can build\n  on. For a model whose pitch is \"generalist *and* reproducible,\" shipping\n  Academic2M + LV116K + OL617K is the part that makes the claim real.\n\n---\n\n*Source: \"VideoChat3: Fully Open Video MLLM for Efficient and Generalist Video\nUnderstanding,\" Li, Zhu, Zeng, Dong, Wu, et al.\n([arXiv 2607.14935](https://arxiv.org/abs/2607.14935)). Benchmark values read from the\npaper's reported figures and tables; numbers quoted as reported. Interactive diagrams\nare my own illustration of the mechanism — token counts are illustrative, ratios are\nthe paper's.*\n","readingTimeMins":8,"url":"https://ai.thesatyajit.com/articles/videochat3","signal":{"interest":3,"helpful":3,"score":6,"level":2,"label":"Solid"}},{"title":"ZUNA 1.1: a channel-agnostic EEG foundation model","description":"Zyphra's ZUNA 1.1 is a 380M-parameter encoder–decoder diffusion autoencoder that reconstructs, denoises, and upsamples EEG across any electrode layout — a 4-channel headband to a 256-channel cap — by treating each electrode as a token at a physical (x, y, z, t) coordinate instead of a fixed montage slot. 
How the 4D-RoPE channel-agnostic trick and the rectified-flow decoder work, and where the reconstruction actually holds up.","date":"2026-07-17","tags":["eeg","foundation-model","diffusion","signal-processing","neuroscience","explainer"],"draft":false,"cover":"/articles/zuna-1-1/fig1.png","featured":false,"interest":4,"helpful":2,"kind":"articles","slug":"zuna-1-1","body":"ZUNA 1.1, from Zyphra, is an EEG foundation model: a 380M-parameter transformer\nencoder–decoder diffusion autoencoder that reconstructs missing channels, denoises\ncorrupted ones, and upsamples sparse montages to electrode positions that were never\nrecorded. It is Apache-2.0, runs on a consumer GPU or a plain CPU, and — the part I\nfind interesting — it is **channel-agnostic**: the same weights read a 4-electrode\nconsumer headband or a 256-channel research cap, because it treats every electrode as\na token at a physical coordinate, not a fixed slot in a montage.\n\nThere is no arXiv paper; this is a release across the [Zyphra blog](https://www.zyphra.com/our-work/zuna1.1),\nthe [Hugging Face model card](https://huggingface.co/Zyphra/ZUNA1.1), and the\n[GitHub repo](https://github.com/Zyphra/zuna) (`pip install zuna`, plus a browser-based\nCloud EEG Playground). It is a point release on ZUNA1 — same 380M parameters, a bigger\nand cleaner training corpus, and a handful of design changes that matter more than the\nversion bump suggests.\n\n## The problem: no two EEG setups agree\n\nEEG is a mess to model across datasets, and the mess is structural, not incidental.\nA clinical 10-20 montage has 19 electrodes; a sleep lab might run 6; a research cap\nruns 64, 128, or 256; a consumer headband runs 4. The electrodes sit at different\nscalp locations, sample at different rates, and are filtered differently. Worse, within\na single recording, channels die, drift, or go noisy for part of a session and recover\nlater. 
Most models paper over this by fixing a channel list and re-referencing every\nrecording onto it — which throws away recordings that do not fit and cannot exploit\nthe extra electrodes when they exist.\n\nZUNA's bet is to stop treating a recording as a fixed-width tensor and start treating\nit as a **set of tokens, each tagged with where and when it was measured**. Once the\nmodel keys on physical position instead of channel index, the montage stops mattering.\n\n## Channel-agnostic: an electrode is just a coordinate\n\nThe mechanism is a 4D rotary positional encoding. Each 0.125-second segment of one\nelectrode becomes a token, and its position is the tuple $(x, y, z, t)$: the\nelectrode's 3D coordinate on the scalp plus a coarse time index. Attention applies\nrotary phases over all four axes, so \"nearby\" means nearby in space *and* time — the\nmodel learns that Cz and C3 covary because they are physically close, not because they\nhappen to be channels 10 and 9 in some file.\n\nThat single choice is what buys channel-agnosticism. There is no learned per-channel\nembedding to run out of, so an arbitrary subset or superset of electrodes is just a\ndifferent set of coordinates. Drop a region and the decoder fills those coordinates\nfrom the electrodes it still has; ask for coordinates that were never recorded and it\npredicts them the same way. Switch montages below and watch the same model span a\nheadband and a dense cap:\n\n<ChannelAgnostic />\n\n<Figure\n  src=\"/articles/zuna-1-1/fig1.png\"\n  alt=\"ZUNA 1.1 architecture: a 16-layer transformer encoder with RMS-norm blocks and 4D-RoPE self-attention feeds a latent into a 16-layer decoder whose self- and cross-attention blocks are conditioned by adaptive-RMS norm; the encoder takes clean and noisy EEG, the decoder emits the reconstructed signal.\"\n  caption=\"The encoder–decoder. 
A 16-layer encoder maps clean context to a latent; a 16-layer decoder cross-attends to it and denoises, with 4D-RoPE over (x, y, z, t) on every attention block and adaptive-RMS norm carrying the latent into the decoder (Zyphra, ZUNA 1.1, 2026).\"\n/>\n\n## Inside the model: a diffusion autoencoder\n\nThe architecture is two transformer stacks. The **encoder** reads the clean context\nand compresses it to a latent. The **decoder** cross-attends to that latent and\nproduces the signal at the requested coordinates. The latent is injected into every\ndecoder layer through **adaptive-RMS norm** — the latent sets the per-layer scale of\nthe normalization, which is a cheap, stable way to condition a deep stack on a global\nsummary (the same conditioning trick diffusion image models use for the timestep).\n\nThe decoder is trained with a **rectified-flow** objective rather than a plain\nregression loss, and this is the right call for reconstruction. Filling a missing\nchannel is genuinely uncertain — many signals are consistent with the surrounding\nscalp — so a model trained to minimize mean-squared error returns the blurred average\nof all of them. A generative decoder instead returns a *sample* from the plausible\nset. Rectified flow makes that sampling cheap: it learns a straight-line transport from\na noise draw to the data. With noise $x_0$ and target signal $x_1$, the interpolant and\nits velocity are\n\n$$\nx_\\tau = (1-\\tau)\\,x_0 + \\tau\\,x_1, \\qquad \\frac{dx_\\tau}{d\\tau} = x_1 - x_0,\n$$\n\nand the decoder regresses a velocity field $v_\\theta(x_\\tau, \\tau)$ onto that constant\nvelocity $x_1 - x_0$. At inference you draw $x_0$ and integrate the field from\n$\\tau=0$ to $\\tau=1$. Because the target path is a straight line, few integration steps\nget you most of the way — which is why this runs on a CPU. 
Drag the scrubber:\n\n<DiffusionAutoencoder />\n\nThe same decoder does all three jobs — reconstruct a missing channel, denoise a noisy\none, upsample to a new position — and only the conditioning changes. That is the payoff\nof framing everything as \"predict the signal at these coordinates given those.\"\n\n## Training: corruption on purpose\n\nThe corpus roughly doubled over ZUNA1, from about 2M to **3.5M channel-hours** of\npublic EEG. Two things about how it was prepared are worth noting. First, quality is\nscored **per channel, per second**, so a channel that is clean for most of a session\nand noisy for a stretch is used where it is good instead of being dropped whole.\nSecond, each recording is kept in two filter variants — a bandpass at 0.1–45 Hz and a\nminimally processed version (0.01 Hz high-pass plus a notch for line noise) — so the\nmodel sees both heavily and lightly filtered signal. Inputs are variable length, 0.5 to\n30 seconds, bucketed into four bins so short clips are not wasted padding a long window.\n\nThe interesting part is that the model is trained to reconstruct under four distinct\ncorruption patterns, not one. 
This is the whole reason it generalizes to messy\nreal-world recordings:\n\n<Figure\n  src=\"/articles/zuna-1-1/fig2.png\"\n  alt=\"Four EEG channel-dropout schemes shown as multi-channel traces with masked regions highlighted: whole-channel (entire rows removed), full-time (vertical time slices across all channels), channel-time (rectangular space-time blocks in some channels), and random-uniform (scattered short segments).\"\n  caption=\"The four dropout schemes the decoder is trained to invert — whole-channel (dead electrodes), full-time (dropouts across all channels), channel-time (localized space-time gaps), and random-uniform (scattered artifacts) (Zyphra, ZUNA 1.1, 2026).\"\n/>\n\nWhole-channel dropout teaches it to rebuild a dead electrode from its neighbors.\nFull-time dropout — a gap across every channel at once — teaches temporal inpainting.\nChannel-time dropout is the realistic case: a cluster of electrodes goes bad for a\nwindow. Random-uniform scatter mimics muscle artifacts and momentary failures. Because\ntraining mixes all four, the model handles almost arbitrary space-time masks at\ninference, which is exactly what `reconstruct_fif()` exposes — it auto-detects MNE bad\nchannels and `BAD_` annotations and repairs them.\n\n## Results: reconstruction as channels drop\n\nThe metric is normalized mean-squared error between the held-out true signal and the\nreconstruction,\n\n$$\n\\mathrm{NMSE} = \\frac{\\lVert \\hat{x} - x \\rVert_2^2}{\\lVert x \\rVert_2^2},\n$$\n\nwhere 1.0 is the trivial \"predict zero\" baseline and lower is better. The baseline to\nbeat is MNE's spherical-spline interpolation, the classical way to rebuild a missing\nelectrode from a smooth fit over the others. 
Zyphra publishes the comparison as\nfigures, not tables, so the numbers below are read off the plots and are approximate.\n\n<Figure\n  src=\"/articles/zuna-1-1/fig3.png\"\n  alt=\"Four line plots (ANPHY-Sleep, Berlin BCI, BCI2000, AAD) of reconstruction NMSE versus channel dropout rate from 0.2 to 0.9. ZUNA1.1 and ZUNA1 curves stay low and close together while the spherical-spline curve rises sharply at high dropout, exceeding 2.5 NMSE on Berlin BCI.\"\n  caption=\"Reconstruction NMSE as the fraction of dropped channels grows, across four datasets. Both learned models stay flat; classical spline interpolation blows up once most channels are missing (Zyphra, ZUNA 1.1, 2026).\"\n/>\n\nAt 20% channel dropout everything is close — NMSE around 0.4 to 0.6 across the four\ndatasets, because with most electrodes present even a spline does fine. Push to 90%\ndropout and the spline blows up (Berlin BCI reaches roughly 2.7, i.e. worse than\npredicting silence) while ZUNA1.1 and ZUNA1 hold near 1.0 to 1.5. That widening gap is\nthe headline: the learned prior degrades gracefully as information disappears; the\nclassical interpolant does not.\n\nZUNA1.1 versus ZUNA1 is, honestly, a wash — ZUNA1.1 is a touch better on ANPHY-Sleep\nand BCI2000 and marginally behind on Berlin BCI and AAD. That matches Zyphra's own\nclaim: better or essentially equal NMSE at the same 380M parameters, with the real\ngains going to stability and the broader input regime rather than raw accuracy.\n\nThe more realistic test deletes a whole brain region and rebuilds it from the other\nseven:\n\n<Figure\n  src=\"/articles/zuna-1-1/fig4.png\"\n  alt=\"Grouped bar chart of average reconstruction NMSE by brain region (frontal, temporal, central, parietal, occipital, left and right) with error bars, comparing ZUNA1.1, ZUNA1, and spherical-spline. 
ZUNA1.1 and ZUNA1 bars are similar and low; spherical-spline is much higher for frontal and temporal regions.\"\n  caption=\"Region-occlusion reconstruction: delete an entire region, rebuild it from the rest. The two learned models are close and both far below spline in frontal and temporal cortex; the gap narrows over parietal, where a smooth interpolant is already a decent model (Zyphra, ZUNA 1.1, 2026).\"\n/>\n\nHere the two learned models track each other closely and both crush the spline in\nfrontal and temporal regions (spline around 0.8 to 1.0 NMSE, ZUNA around 0.35 to 0.6).\nThe three converge only over parietal cortex, where the field is smooth enough that a\nspline is already a reasonable prior. Central electrodes are the easiest — every model\ndoes well — because they are surrounded by neighbors on all sides.\n\n## Where it breaks\n\n<Callout type=\"warn\">\nA reconstruction is a generative prior, not a measurement. The model fills a missing\nchannel with signal that is plausible *given the rest of the scalp* — which is\nprecisely wrong when the thing you care about is a focal event that only the missing\nelectrode would have seen. For a sleep-staging or BCI pipeline that leans on spatial\nredundancy, that is fine. For reading a possible focal spike off a dead electrode, a\nlow NMSE can hide a confidently hallucinated normal trace. Denoising and upsampling\ncarry the same caveat: the output is the model's best guess at a signal that is\n*consistent*, not the signal that was actually there.\n</Callout>\n\nTwo more honest limits. The evaluation is four datasets and F32 weights; generalization\npast those recording conditions is asserted, not shown. 
And rectified-flow decoding is\niterative — CPU inference works, and it is cheap because the transport path is straight,\nbut latency still scales with how many sampling steps you take, so \"runs on a CPU\" and\n\"real-time\" are not the same claim.\n\n## The take\n\n- **The reframing that pays off is spatial.** Making position, not channel index, the\n  thing the model keys on is the whole idea, and 4D RoPE over $(x, y, z, t)$ is a clean\n  way to do it. It is the same move that made vision transformers resolution-flexible,\n  applied to the scalp — and it is what lets one set of weights span a headband and a\n  256-channel cap and interpolate to electrodes it never saw.\n- **The diffusion-autoencoder choice fits the problem.** Reconstruction is genuinely\n  uncertain, so a generative decoder that samples a plausible signal is more honest\n  than a regressor that returns the blurred mean. Rectified flow keeps that sampling\n  cheap enough to run without a GPU.\n- **It is built to be used, not admired** — Apache-2.0, `pip install zuna`, a browser\n  playground, and an MNE-friendly `reconstruct_fif()` entry point. The win over\n  classical interpolation is decisive; the win over ZUNA1 is a tie, and Zyphra says so.\n  Both are worth saying out loud.\n\n---\n\n*Sources: the [ZUNA 1.1 release](https://www.zyphra.com/our-work/zuna1.1), the\n[Hugging Face model card](https://huggingface.co/Zyphra/ZUNA1.1), and the\n[GitHub repo](https://github.com/Zyphra/zuna). Figures are from Zyphra's release; NMSE\nvalues are read off the published plots and are approximate, since exact tables were\nnot released. 
Released 2026-07-16 under Apache-2.0.*\n","readingTimeMins":10,"url":"https://ai.thesatyajit.com/articles/zuna-1-1","signal":{"interest":4,"helpful":2,"score":6,"level":2,"label":"Solid"}},{"title":"GRAPE: RoPE, ALiBi, and FoX are the same construction","description":"Positional encoding is a zoo of tricks — rotary embeddings, linear ALiBi biases, forget gates — each justified on its own terms. GRAPE (ICLR 2026, Princeton/UCLA/Tsinghua) shows they are one thing: a position n acting through a group action G(n) = exp(nωL). Pick a rank-2 skew generator and you get a rotation in SO(d) — that is exactly RoPE. Pick a rank-1 nilpotent generator and you get a shear in GL that adds a linear bias — that is exactly ALiBi, and FoX is its path-integral. A walk through the group theory, the closed forms, the honest (small-scale) results, and what the framework does and doesn't yet buy you.","date":"2026-07-16","tags":["llm","attention","positional-encoding","transformers","explainer"],"draft":false,"cover":"/articles/grape-position-encoding/fig1.png","featured":false,"interest":5,"helpful":4,"kind":"articles","slug":"grape-position-encoding","body":"Self-attention has no idea what order its tokens came in — permute the sequence and the raw\nattention scores are unchanged. So every transformer bolts on a *positional encoding*, and over the\nyears these have multiplied into a small zoo of unrelated-looking tricks. [Rotary embeddings\n(RoPE)](/articles/attention-mechanisms) rotate queries and keys by angle-per-position. **ALiBi**\nsubtracts a linear penalty proportional to how far apart two tokens are. The **Forgetting Transformer\n(FoX)** multiplies in a per-token forget gate. 
Each is derived on its own terms, with its own\nintuition, and the folklore treats them as competing families: multiplicative *phase* versus additive\n*bias*.\n\n**GRAPE** — *Group Representational Position Encoding*, from Princeton, UCLA and Tsinghua's IIIS\n(ICLR 2026) — makes the deflationary claim that all of them are the **same construction seen through\ndifferent generators**. A position $n$ acts on the query/key space through one group action,\n\n$$\\mathbf{G}(n) = \\exp(n\\,\\omega\\,\\mathbf{L}),$$\n\nand *which kind of matrix* you put in the generator $\\mathbf{L}$ decides the family. A rank-2\nskew-symmetric $\\mathbf{L}$ gives a **rotation**, and RoPE falls out exactly. A rank-1 nilpotent\n$\\mathbf{L}$ gives a **shear** that injects a linear bias, and ALiBi and FoX fall out exactly. Scrub a\nposition below and flip the generator to watch the two behaviours emerge from one law:\n\n<GeneratorAction />\n\nThe payoff of the unification is not a new record — it's a *design space*. Once RoPE is \"the rotation\nwith the canonical basis and a log-uniform spectrum,\" you can ask what the *learned* basis does; once\nALiBi is \"the rank-1 unipotent action with a fixed slope,\" you can ask what a *content-dependent* slope\ndoes. GRAPE names and tries both.\n\n## One law, two generators\n\nThe organizing principle is that a positional map should respect an **exact relative law**:\n\n$$\\mathbf{G}(t-s) = \\mathbf{G}(s)^{-1}\\,\\mathbf{G}(t), \\qquad \\mathbf{G}(n+m) = \\mathbf{G}(n)\\,\\mathbf{G}(m).$$\n\nThis is what makes attention translation-invariant: if you apply $\\mathbf{G}(i)$ to query $i$ and\n$\\mathbf{G}(j)$ to key $j$, the score $\\tilde{\\mathbf{q}}_i^\\top\\tilde{\\mathbf{k}}_j =\n\\mathbf{q}_i^\\top\\mathbf{G}(j-i)\\mathbf{k}_j$ depends only on the offset $j-i$, never on absolute\nposition. 
Any one-parameter subgroup $\\mathbf{G}(n) = \\exp(n\\mathbf{L})$ satisfies it automatically —\nso the whole design question collapses to *which generator $\\mathbf{L}$ to exponentiate*. GRAPE\nidentifies the two generator types that keep the exponential cheap and the geometry clean:\n\n- **Rank-2 skew** $\\mathbf{L} = \\mathbf{a}\\mathbf{b}^\\top - \\mathbf{b}\\mathbf{a}^\\top \\in\n  \\mathfrak{so}(d)$ exponentiates to an **orthogonal** map — a norm-preserving rotation in\n  $\\mathrm{SO}(d)$. This is *Multiplicative GRAPE*.\n- **Rank-1 nilpotent** $\\mathbf{A}$ with $\\mathbf{A}^2 = \\mathbf{0}$ exponentiates, in one term, to a\n  **unipotent** map $\\mathbf{G}(n) = \\mathbf{I} + n\\omega\\mathbf{A}$ in the general linear group $\\mathrm{GL}$\n  — a shear that translates a feature and shows up as an additive logit bias. This is *Additive GRAPE*.\n\n<Figure\n  src=\"/articles/grape-position-encoding/fig1.png\"\n  alt=\"Overview diagram of the GRAPE framework. A top box states the general relative law G(t−s)=G(s)^{-1}G(t) and the map G(n)=exp(nω·Generator). Two arrows fork down to a blue Multiplicative GRAPE panel (Operation: Rotation, Manifold SO(d), rank-2 skew generator L=ab^T−ba^T with a Rodrigues closed form, a rotating-vector inset, recovers RoPE, extends to learned bases) and a red Additive GRAPE panel (Operation: Translation, Manifold GL(d+k) unipotent lift, rank-1 nilpotent generator A with A²=0 so exp(A)=I+A, a descending bias-vs-position inset, recovers ALiBi and FoX, extends to path integral).\"\n  caption=\"One framework, two generator types: a rank-2 skew generator gives a norm-preserving rotation (recovering RoPE); a rank-1 nilpotent generator gives a unipotent shear/additive bias (recovering ALiBi and FoX). Both obey the same relative law (Zhang et al., 2026, Figure 1).\"\n/>\n\n## Multiplicative GRAPE: RoPE is a rotation with a fixed basis\n\nBuild the generator from two vectors $\\mathbf{a},\\mathbf{b}\\in\\mathbb{R}^d$. 
With\n$\\alpha = \\|\\mathbf{a}\\|^2$, $\\beta = \\|\\mathbf{b}\\|^2$, $\\gamma = \\mathbf{a}^\\top\\mathbf{b}$ and\n$s = \\sqrt{\\alpha\\beta - \\gamma^2}$, the rank-2 skew $\\mathbf{L}$ squares to\n$\\mathbf{L}^2 = -s^2\\,\\mathbf{P}_{\\mathcal{U}}$ on the plane $\\mathcal{U} = \\mathrm{span}\\{\\mathbf{a},\\mathbf{b}\\}$.\nThat single fact collapses the matrix exponential to a **Rodrigues-type closed form**:\n\n$$\\exp(\\mathbf{L}) = \\mathbf{I} + \\frac{\\sin s}{s}\\mathbf{L} + \\frac{1-\\cos s}{s^2}\\mathbf{L}^2,$$\n\na pure rotation by angle $s$ inside the plane $\\mathcal{U}$, computable in $O(d)$ flops with no matrix\never materialized. Stack $d/2$ of these on disjoint coordinate pairs with frequencies $\\theta_i$ and\nthey commute, so\n\n$$\\mathbf{G}(n) = \\prod_{i=1}^{d/2}\\exp(n\\theta_i\\mathbf{L}_i)\n  = \\mathrm{blockdiag}\\big(\\mathbf{R}_2(n\\theta_1),\\dots,\\mathbf{R}_2(n\\theta_{d/2})\\big).$$\n\nThat block-diagonal of $2\\times2$ rotations *is* RoPE — the paper's Proposition 3.1 states RoPE is\n**exactly** commuting multi-subspace GRAPE-M with the canonical coordinate pairs and a log-uniform\nspectrum. What GRAPE adds is the freedom RoPE gives up: the planes and spectrum can be **learned**\n(commuting subspaces at $O(d)$ per head), or you can allow a compact **non-commuting** mixture (at\n$O(rd)$ per head) so different feature subspaces can *couple* — geometry the fixed RoPE basis cannot\nexpress.\n\n## Additive GRAPE: ALiBi and FoX are shears in a lifted space\n\nTo get an *additive* bias out of a *multiplicative* group, GRAPE uses the classic trick of a\n**homogeneous lift**: augment $\\mathbf{x}\\in\\mathbb{R}^d$ to $\\hat{\\mathbf{x}}\\in\\mathbb{R}^{d+k}$ and\nwork in $\\mathrm{GL}(d+k)$ with a nilpotent generator. Because $\\mathbf{A}^2 = \\mathbf{0}$, the\nexponential is just $\\mathbf{G}_{\\mathrm{add}}(n) = \\mathbf{I} + n\\omega\\mathbf{A}$. 
With an asymmetric\nlift $\\hat{\\mathbf{q}}_i = [\\mathbf{q}_i;1;0]$, $\\hat{\\mathbf{k}}_j = [\\mathbf{k}_j;0;1]$ and the rank-1\ngenerator $\\mathbf{A}_h = -\\beta_h\\,\\mathbf{e}_{d+2}\\mathbf{e}_{d+1}^\\top$, the score becomes\n\n$$\\hat{\\mathbf{q}}_i^\\top\\,\\mathbf{G}_{\\mathrm{add},h}(j-i)^{-\\top}\\,\\hat{\\mathbf{k}}_j\n  = \\mathbf{q}_i^\\top\\mathbf{k}_j + (j-i)\\,\\beta_h,$$\n\nwhich is **exactly ALiBi** with head slope $\\beta_h$. The nilpotent structure is not decoration: it is\nwhat guarantees the exact relative law and clean streaming (cache the rotated keys once). GRAPE then\ngeneralizes the *slope*. Replace the constant $\\beta_h$ with non-negative softplus gates on the query\nand key, and the bias becomes **content-dependent**:\n\n$$\\tilde{\\mathbf{q}}_i^\\top\\tilde{\\mathbf{k}}_j\n  = \\mathbf{q}_i^\\top\\mathbf{k}_j + (j-i)\\,\\omega\\big[\\mathrm{softplus}(\\mathbf{v}^\\top\\mathbf{q}_i/\\sqrt{d})\n  + \\mathrm{softplus}(\\mathbf{u}^\\top\\mathbf{k}_j/\\sqrt{d})\\big].$$\n\nThis is **GRAPE-A-QK**: a learnable, content-adaptive linear bias derived from first principles rather\nthan hand-set per head. Drag the gate below to see the fixed ALiBi head-fan give way to a\ncontent-driven slope:\n\n<AdditiveBias />\n\nThe **Forgetting Transformer** falls out of the same picture. FoX's per-token forget gates accumulate\na bias $b_h(t,j) = \\sum_{\\ell=j+1}^{t}\\log f_{\\ell,h}$, which is precisely a *path product* of unipotent\nfactors $\\prod_\\ell(\\mathbf{I} + \\log f_{\\ell,h}\\,\\mathbf{E}) = \\mathbf{I} + b_h(t,j)\\,\\mathbf{E}$. So\nFoX is an exact instance of **Path-Integral Additive GRAPE (GRAPE-AP)** — the endpoint-dependent\nversion that keeps row-wise composition and prefix-sum streaming.\n\n## The whole map\n\nPut the pieces together and every named scheme is a leaf on one tree. 
Click through them — each is\neither *recovered exactly* by a specific generator or sits just past a known method as a GRAPE\n*extension*:\n\n<FamilyMap />\n\n## Does the extra freedom help?\n\nHere is where the honesty starts. GRAPE is validated at **small scale**: 353M and 770M models trained\non 50B tokens of FineWeb-Edu, context length 4,096, in a nanoGPT/Llama-style setup, evaluated 0-shot on\na standard NLU suite (ARC, HellaSwag, OBQA, PIQA, WinoGrande, SciQ). The training curves are close, but\nGRAPE's additive variants hold a persistent small edge, and the authors note RoPE showed a training\ninstability at 770M that GRAPE did not:\n\n<Figure\n  src=\"/articles/grape-position-encoding/fig2.png\"\n  alt=\"Two line charts for the medium 353M model on FineWeb-Edu, training loss (left) and validation loss (right), versus training tokens from 0 to 50 billion. Four curves — RoPE (blue), ALiBi (green), FoX (orange), GRAPE-AP (red) — all decline from above 3.1 toward about 2.55–2.6 and stay tightly bunched, with GRAPE-AP and FoX slightly lower than RoPE late in training.\"\n  caption=\"Training and validation loss for the 353M model across positional encodings; the curves are close, with GRAPE-AP tracking at or slightly below RoPE and ALiBi throughout (Zhang et al., 2026, Figure 2).\"\n/>\n\nOn downstream average, the ordering is consistent but the margins are small. 
For the 353M models,\nGRAPE-AP (path-integral) is the best of the eight variants, edging FoX and ALiBi, with plain RoPE last:\n\n<BenchBars\n  title=\"353M models · average over 7 NLU tasks (0-shot, %)\"\n  unit=\"%\"\n  bars={[\n    { label: \"RoPE\", value: 51.73 },\n    { label: \"ALiBi\", value: 52.87 },\n    { label: \"FoX\", value: 52.96 },\n    { label: \"GRAPE-AP\", value: 53.25, highlight: true },\n  ]}\n/>\n\nThe 770M models tell the same story — GRAPE-AP first, RoPE last — again by roughly a point:\n\n<BenchBars\n  title=\"770M models · average over 7 NLU tasks (0-shot, %)\"\n  unit=\"%\"\n  bars={[\n    { label: \"RoPE\", value: 55.76 },\n    { label: \"FoX\", value: 56.30 },\n    { label: \"ALiBi\", value: 56.44 },\n    { label: \"GRAPE-AP\", value: 56.91, highlight: true },\n  ]}\n/>\n\n<Callout type=\"warn\">\n**Read the wins narrowly.** (1) *Small scale, standard benchmarks.* Everything is 353M/770M on 50B\ntokens at 4K context, on ARC/HellaSwag-style tasks — there are **no long-context or\nlength-extrapolation experiments** (no RULER, no retrieval), which is striking given the paper motivates\nitself with long-context and ALiBi's extrapolation. (2) *The rotational story didn't pay off\nempirically.* The Multiplicative variants that generalize RoPE — GRAPE-M-ctx/nonctx — actually\n**underperform RoPE** on the large models (54.7–54.8 vs 55.76 avg); all the downstream gains come from\nthe **additive** family, so the framework's practical dominance rests on the ALiBi/FoX side, not the\nrotation side. (3) *Margins are ~0.3–1.5 average points* over strong baselines, and the ranking flips\nunder the KV-shift setting: with KV-shift enabled, FoX edges GRAPE-AP at 770M (57.09 vs 56.86). (4)\n*No efficiency measurements.* The $O(d)$/$O(rd)$-per-head costs are stated, not timed — there are no\nwall-clock or FLOP comparisons. (5) Baselines are the authors' own reimplementations; there is no\ncomparison to tuned production models. 
The contribution is the **unifying theory and design space**,\nlightly validated — not a demonstrated accuracy or efficiency SOTA.\n</Callout>\n\n## The take\n\nGRAPE's real product is conceptual compression. Positional encoding stops being a list of tricks and\nbecomes a single knob — *which generator do you exponentiate?* — with RoPE, ALiBi and FoX as three\nspecific settings and a labelled space of alternatives (learned rotation bases, non-commuting mixtures,\ncontent-gated slopes, path-integral biases) in between. That is genuinely clarifying, and the exact\nrecoveries are proved, not hand-waved: RoPE as commuting rank-2 rotations, ALiBi as a rank-1 unipotent\naction, FoX as its path integral. What the paper does *not* yet show is that the new freedom the map\nopens up buys much at scale — the strongest empirical variant is a modest improvement on the *additive*\nside, the rotation-generalizing side trails plain RoPE, and the long-context claims the framing invites\ngo untested. As a theory it's a clean unification worth knowing; as a recipe, GRAPE-AP is a small,\nhonest win over FoX-style biases, and the rest is an invitation to experiment.\n\n---\n\n*Built on [Group Representational Position Encoding](https://arxiv.org/abs/2512.07805) (Zhang, Chen,\nLiu, Qin, Yuan, Xu, Yuan, Gu, Yao; Princeton / UCLA / Tsinghua IIIS, ICLR 2026). Equations, tables and\nfigures are quoted from the paper (353M and 770M models, FineWeb-Edu, 0-shot lm-evaluation-harness);\nthe interactive diagrams are illustrations of the mechanism, not measured data. 
Related reading:\n[how attention works](/articles/how-transformers-attention-works),\n[a tour of attention mechanisms](/articles/attention-mechanisms),\n[MiniMax Sparse Attention](/articles/minimax-sparse-attention), and\n[how LLM inference works](/articles/how-llm-inference-works).*\n","readingTimeMins":9,"url":"https://ai.thesatyajit.com/articles/grape-position-encoding","signal":{"interest":5,"helpful":4,"score":9,"level":5,"label":"Essential"}},{"title":"Bonsai 27B: a 27B model at 1.125 bits, small enough for a phone","description":"PrismML's Bonsai 27B takes a full-precision Qwen3.6-27B and quantizes it end to end — embeddings, attention, MLPs, and the LM head — into a ternary (1.71 bits) or 1-bit (1.125 bits) model that shrinks 54 GB down to 3.9 GB and runs on an iPhone. The capability is Qwen's; the achievement is the extreme low-bit compression and the kernels that make it run. This is a walk through the bit encodings, why pushing low precision through the *whole* network is the hard part, and the honest, uneven cost — math survives almost intact while agentic tool-calling and vision fall much harder.","date":"2026-07-15","tags":["quantization","inference-optimization","on-device","multimodal","explainer"],"draft":false,"featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"bonsai-27b","body":"Most of the work that makes a large language model *usable* on your own hardware is not a\nbetter model — it's a smaller one that behaves like the big one. **Bonsai 27B**, from PrismML,\nis a clean example: it takes a full-precision **Qwen3.6-27B** and re-encodes its weights at\nclose to one bit each, ending up small enough to load inside a phone's memory budget. The\nintelligence is Qwen's. What Bonsai contributes is the **extreme low-bit representation** —\nrun end to end, not just on the easy layers — and the custom kernels that make it fast.\n\nThe headline is the collapse in size. 
A 27B model at FP16 is 54 GB; Bonsai ships two quantized\nvariants, and the smaller one is **3.9 GB** — \"27B-class capability at a footprint smaller than\na full-precision 2B model,\" as PrismML puts it. Flip between the precisions:\n\n<Footprint />\n\n## Two encodings, close to one bit each\n\nThe two variants differ only in how each weight is stored. The **ternary** model uses the\nthree-value set `{−1, 0, +1}` with an FP16 scale shared across a group of weights — that works\nout to **1.71 effective bits per weight** and a 5.9 GB model. The **1-bit** model drops the\nzero, storing `{−1, +1}` plus the group scale, for **1.125 bits** and 3.9 GB. (The theoretical\nfloors are log₂3 ≈ 1.58 and 1.0 bits; the group-wise scales are the small overhead on top.)\nEverything else — the hybrid-attention architecture, the 262K-token context window, the\nApache-2.0 license — is inherited from the base.\n\n## The hard part: every block, not just the MLPs\n\nQuantizing a transformer to a couple of bits is not new. What usually happens is that the\n*sensitive* parts — the token embeddings, the attention projections, the LM head — are kept at\nhigher precision, and only the big feed-forward MLPs get squeezed. That protects quality, but\nit also means the footprint only partly shrinks: a model is not small until its embeddings and\nhead are small too. Bonsai's claim is that the low-bit representation \"runs end to end across\nthe language network, embeddings, attention, MLPs, and the LM head,\" with a compact **4-bit\nvision tower** alongside. Toggle between the two philosophies:\n\n<PrecisionMap />\n\nPushing 1-bit weights through the parts everyone else keeps in FP16 is exactly where accuracy\nusually falls off a cliff — which is why the interesting question is not the size, but what it\ncosts. 
This is the *inference-time* mirror image of [native FP4 training](/articles/nemotron-nvfp4):\nthere the goal was to keep the math stable during training while deliberately holding some layers\nhigher precision; here the goal is to serve an already-trained model with nothing held back. If\nyou want the mechanics of why low-precision inference is memory-bound in the first place, the\n[how LLM inference works](/articles/how-llm-inference-works) piece sets that up, and\n[TurboQuant](/articles/turboquant-kv-cache) covers the complementary problem of quantizing the KV\ncache rather than the weights.\n\n## What survives — and what doesn't\n\nHere is the honest part, and it's the part a size-and-speed announcement tends to bury. PrismML\nreports that ternary keeps **~95%** of full-precision quality and 1-bit keeps **~90%**, averaged\nover a 15-benchmark suite in thinking mode. Both averages check out against their own table — but\nthe average hides a wide spread. Pick a category and watch the three precisions, then read the\nper-category retention strip:\n\n<Retention />\n\nMath is remarkably robust: the 1-bit model holds ~96% of the full-precision score. But\n**agentic tool-calling** falls from 80.0 to 66.0 and **vision** from 72.6 to 59.6 — roughly 82%\nretention each, nearly a fifth of the capability gone. Long, multi-step tool use and multimodal\nperception are precisely the workloads that lean on the fine-grained information that one-bit\nweights throw away. The overall number:\n\n<BenchBars\n  title=\"Overall score · 15-benchmark suite (thinking mode)\"\n  bars={[\n    { label: \"Qwen3.6-27B (FP16)\", value: 85.0 },\n    { label: \"Ternary Bonsai (5.9 GB)\", value: 80.5, highlight: true },\n    { label: \"1-bit Bonsai (3.9 GB)\", value: 76.1, highlight: true },\n  ]}\n/>\n\n## On the device\n\nThe point of all this is where it runs. 
Bonsai reports up to **163 tok/s** for the 1-bit variant\non an RTX 5090 (134 for ternary), and up to **87 tok/s** on an Apple M5 Max (58 for ternary) — and,\nthe flashiest claim, that the 3.9 GB model fits inside an iPhone 17 Pro's app-memory budget, making\nit \"the first 27B-class model to run on a phone.\" PrismML frames this as *intelligence density* — a\ncoined score-per-GB metric on which the 1-bit model scores 0.53/GB, which they call more than 10×\nthe full-precision baseline. It ships with weights on Hugging Face, an MLX path for Apple silicon and\nCUDA for NVIDIA, and speculative-decoding support for lossless draft-and-verify acceleration.\n\n<Callout type=\"warn\">\nRead the numbers for what they are. Bonsai's **capability is Qwen3.6-27B's** — this is a compression\nand kernels result, not a new model. Every score above is **vendor-reported on PrismML's own\n15-benchmark suite in \"thinking mode,\"** so treat the suite and mode as chosen, not neutral. The\n\"~90–95% retained\" headline is a real average that **masks much larger, uneven drops**: math barely\nmoves, but agentic tool-calling and vision lose ~18% at 1-bit — so the right variant depends entirely\non your workload. \"Intelligence density,\" \"first 27B on a phone,\" and \"10×\" are marketing framings\n(intelligence-density is a coined score-per-GB metric), and the throughput figures are specific to an\nRTX 5090 and an M5 Max. No independent evaluation exists yet.\n</Callout>\n\n## The takeaway\n\nBonsai is a bet that for a large slice of real use — on-device assistants, privacy-sensitive tasks,\nhybrid deployments that route only the hard cases to a frontier API — a model that keeps 90% of a 27B's\nquality while fitting in 3.9 GB beats a bigger model you can't run locally at all. 
That bet is strongest\nwhere quality degrades gracefully (math, general reasoning) and weakest where it doesn't (agentic, vision).\nThe genuinely impressive engineering is the end-to-end part: getting one-bit embeddings and a one-bit LM\nhead to work is what turns \"quantized MLPs\" into a model that actually fits on the phone in your pocket.\n","readingTimeMins":5,"url":"https://ai.thesatyajit.com/articles/bonsai-27b","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"Inkling: an open-weights multimodal MoE built to be adapted","description":"Thinking Machines Lab's Inkling is a 975B-total / 41B-active mixture-of-experts foundation model with encoder-free text, image and audio, up to 1M-token context, and open weights. The lab is unusually blunt that it's 'not the strongest overall model' — it's a customizable base tuned for broad adaptation, not a SOTA claim. The interesting parts are the mechanics: controllable effort that hits a reference model's Terminal-Bench score at ~1/3 the tokens, RL reward that scaled log-linearly over 30M+ rollouts while chain-of-thought got shorter on its own, and a 5:1 sliding-window/global attention hybrid. All benchmarks are vendor-reported on their own suite, so scope the numbers accordingly — but the weights are open, which is the part that's checkable in time.","date":"2026-07-15","tags":["llm","mixture-of-experts","multimodal","reinforcement-learning","attention","explainer"],"draft":false,"featured":false,"interest":3,"helpful":3,"kind":"articles","slug":"inkling","body":"Most model launches lead with a leaderboard. Thinking Machines Lab's **Inkling** does the opposite: the\nannouncement states plainly that it is **\"not the strongest overall model,\"** and is instead **\"designed\nfor broad adaptation through fine-tuning.\"** That framing is the right lens for everything below. 
Inkling\nis an **open-weights**, multimodal **[mixture-of-experts](/articles/mixture-of-experts-from-scratch)**\nfoundation model — **975B total parameters, 41B active** — with text, image and audio in one stack and up\nto a **1M-token** context. A smaller companion, **Inkling-Small (276B total / 12B active)**, ships in\npreview. The pitch is a customizable *base*, not a frontier trophy.\n\nWhat makes it worth a close read is the mechanics: an attention design tuned for long context, an\nencoder-free multimodal path, a reinforcement-learning run whose reward scaled *log-linearly* while the\nmodel's reasoning got *shorter* on its own, and a knob that lets you dial how many tokens the model spends\nper query. Let's take them in turn — and keep the honest caveats in view throughout.\n\n## The backbone: sparse experts, hybrid attention\n\nInkling is a **66-layer** decoder-only transformer. Two forms of sparsity run through it. In the\nfeed-forward path, every layer is a mixture-of-experts: **256 routed experts plus 2 shared experts**, with\n**6 routed experts active per token**. A **sigmoid-based router** with an **auxiliary-loss-free\nload-balancing** bias decides which six fire — the same \"drop the aux loss, use a bias term\" trick that\nhas become standard for keeping expert utilization even without a loss that fights the main objective. The\n2 shared experts are always on, giving every token a common backbone of computation. Net effect: only\n**41B of the 975B** parameters do work on any given token.\n\nThe attention path is a **hybrid**: of the 66 layers, **55 are sliding-window (512-token) local and 11 are\nglobal** — an interleaved **5:1 ratio** — with **64 query heads** tied to **8 KV heads**, over a **6144-dim**\nresidual stream. 
Five cheap local layers pass for every one exact global layer — the same local/global\nbargain that makes long-context serving affordable in\n[MiniMax's sparse attention](/articles/minimax-sparse-attention) and\n[MiMo-V2-Flash](/articles/mimo-v2-flash).\n\nThen a cluster of small but telling choices — the kind you only catch by reading the config, not the\nlaunch post:\n\n- **Relative position bias**, not [RoPE](/articles/how-llm-inference-works) (`d_rel=16`, `rel_extent=1024`)\n  — a learned bias on relative distance, chosen for cleaner extrapolation past the trained length.\n- **Short depthwise convolutions** (kernel size 4) in *several* places — after the key and value\n  projections and on the residual branches — a cheap way to blend a little local context into each token\n  before attention even runs. Convs-inside-a-transformer is a recurring \"free lunch\" for stability.\n- An easy-to-miss one: a **separate RMSNorm on the token embeddings**, applied *before* the usual\n  per-block RMSNorms (`use_embed_norm=true`). The residual stream is normalized at *entry*, not only inside\n  each layer — extra insurance on embedding scale that most decoder-only stacks skip.\n\nNone of these are headline features; together they read as the fingerprint of a team tuning the backbone\nfor stable long-context training rather than chasing a benchmark. Scrub the stack to see both sparsities at\nonce — which layers are global, and which experts a token lights up:\n\n<ArchitectureStack />\n\n## Encoder-free multimodal\n\nThe multimodal design is deliberately minimal: **no separate vision or audio encoder**. Instead every\nmodality is turned into tokens the transformer reads directly. **Audio** becomes **discrete dMel\nspectrogram** tokens; **images** are cut into **40×40-pixel patches** and lifted by a small **four-layer\nhMLP** patch encoder; all modalities land in the **shared hidden space** and flow through the same experts\nand attention. 
There's no bolted-on CLIP-style tower whose representation you have to align — the model\nlearns text, image and audio in one backbone. That is part of why it's pitched as an adaptation base:\nfine-tuning touches one stack, not a federation of encoders.\n\n## Controllable effort — the signature move\n\nInkling can vary how much it \"thinks.\" The **system message plus a per-token cost** let you trade accuracy\nfor token spend: turn effort down and it answers tersely; turn it up and it reasons at length, approaching\nits ceiling. The headline result is on **Terminal-Bench-2.1**, where the lab reports Inkling reaching\n**Nemotron-3-Ultra-equivalent** accuracy at **roughly one-third the generated tokens**. Drag the effort\nknob and read the tie line — the same score sits about **3× further right** on the reference curve:\n\n<EffortCurve />\n\nThe efficiency framing matters more than any single point on the curve. A model that lets the *caller*\nchoose the accuracy/latency trade-off, per request, is a different product from one with a fixed thinking\nbudget — especially for the fine-tuning-and-deploy audience Inkling targets, who care about tokens-per-task\ncost at scale. (The curve shape above is illustrative; the ~63.8% Terminal-Bench-2.1 plateau and the\n~1/3-token match are the real, vendor-reported anchors.)\n\n## The RL story: log-linear reward, self-shortening reasoning\n\nPost-training leaned on **large-scale asynchronous [reinforcement learning](/articles/ring-zero-trillion-scale-rl)** —\n**over 30 million rollouts**. Two findings stand out. First, the **aggregate held-out eval reward rose\nlog-linearly** across those rollouts, climbing from **0.264** at the SFT-initialised checkpoint to\n**0.356** at release — a straight line on a log-rollouts axis, i.e. more RL compute kept paying off\npredictably rather than saturating. 
Second, and more surprising: with **no brevity objective** in the\nreward, the model's **chain-of-thought became more concise on its own**, \"dropping grammatical overhead\nwhile remaining comprehensible.\" Reasoning compression emerged as a side effect of optimizing for correct\nanswers. Drag the marker to watch reward climb as thought-length falls:\n\n<RlScaling />\n\nThis connects back to controllable effort: a model whose reasoning is naturally terser is cheaper to run at\nany accuracy target, and the effort knob then lets you push that further.\n\n## Training and release\n\nPretraining ran on **45 trillion tokens** of mixed text, image, audio and video, optimized with **[Muon](/articles/muon-optimizer)\nfor the large matrix weights and Adam for everything else** (weight decay coupled to the squared learning\nrate), on **NVIDIA GB300 NVL72** systems. Alongside the standard weights, Thinking Machines released\n**[NVFP4](/articles/nemotron-nvfp4)** weights for Blackwell — the same 4-bit format NVIDIA used to train\nNemotron. The release is genuinely open: **weights on Hugging Face** (both standard and NVFP4), an **API on\nTinker plus Together, Fireworks, Modal, Databricks and Baseten**, day-one **vLLM / SGLang / llama.cpp**\nintegration, and a public **Playground**.\n\n## Results — read them as vendor-reported\n\nHere are the headline numbers from Thinking Machines' own suite (at high effort). 
Reasoning first:\n\n<BenchBars\n  title=\"Inkling — reasoning (vendor-reported, %)\"\n  unit=\"%\"\n  bars={[\n    { label: \"AIME 2026\", value: 97.1, highlight: true },\n    { label: \"GPQA-Diamond\", value: 87.2, highlight: true },\n    { label: \"HLE (with tools)\", value: 46.0 },\n    { label: \"HLE (text only)\", value: 29.7 },\n  ]}\n/>\n\nAnd the agentic / coding side, where the effort story is most relevant:\n\n<BenchBars\n  title=\"Inkling — agentic & coding (vendor-reported, %)\"\n  unit=\"%\"\n  bars={[\n    { label: \"SWEBench-Verified\", value: 77.6, highlight: true },\n    { label: \"BrowseComp\", value: 77.1 },\n    { label: \"MCP-Atlas\", value: 74.1 },\n    { label: \"Terminal-Bench-2.1\", value: 63.8, highlight: true },\n  ]}\n/>\n\nMultimodal and safety round it out: **VoiceBench 91.4%**, **MMAU 77.2%**, **MMMU-Pro 73.5%**,\n**Global-MMLU-Lite 88.7%**; on safety, **FORTRESS Benign 95.9% / Adversarial 78.0%** and **StrongREJECT\n98.6%**. Inkling-Small tracks the big model closely on several evals (HLE text-only **29.6%**, HLE with\ntools **46.6%**).\n\n<Callout type=\"warn\">\n  **Scope the numbers.** Every score here is **vendor-reported on Thinking Machines' own evaluation\n  suite**, and the comparison set (GPT-5.6 Sol, Claude Fable 5, GLM 5.2, Nemotron-3-Ultra, and others) is\n  **provider-selected** — so treat this as a self-report, not a neutral head-to-head. The \"~1/3 the\n  tokens,\" the log-linear RL scaling (0.264 → 0.356), and the emergent reasoning-compression claim are all\n  **their measurements on their evals**: real and interesting, but not independently verified. 
And this is\n  a **company blog and model card, not a peer-reviewed paper** — there is no external methodology to audit.\n  The genuine mitigant is that it's an **open-weights** release, so the architecture and the claims become\n  independently checkable over time in a way a closed model's never are.\n</Callout>\n\n## The take\n\nInkling's most refreshing feature is its honesty about what it is. Thinking Machines did not build the\nmodel to top a leaderboard; they built a broad, open, multimodal base and tuned the *ergonomics of\nadapting and running it* — controllable effort so callers own the accuracy/cost trade-off, a 5:1\nlocal/global attention hybrid and relative positions so 1M-token context stays affordable, an encoder-free\nmultimodal path so fine-tuning touches one stack, and NVFP4 weights so it deploys cheaply on Blackwell. The\ntwo research results worth remembering are the **log-linear RL reward** (evidence that the post-training\nrecipe kept scaling) and the **emergent reasoning compression** (shorter chains-of-thought with no brevity\nreward) — both their own measurements, both the kind of thing open weights will let others probe. Judge it\nnot as \"is this the best model\" — the lab already answered no — but as a base you can take, fine-tune, and\nserve. On that axis, an open 975B/41B MoE with these ergonomics is a substantial thing to hand the\ncommunity.\n\n---\n\n*Built on Thinking Machines Lab's [Inkling announcement](https://thinkingmachines.ai/news/introducing-inkling/)\nand [model card](https://thinkingmachines.ai/model-card/inkling/), plus the\n[Hugging Face release](https://huggingface.co/thinkingmachines/inkling). All benchmark and scaling figures\nare vendor-reported; the interactive diagrams are illustrations of the mechanism, with real endpoints\nnoted inline. 
Related reading:\n[mixture-of-experts from scratch](/articles/mixture-of-experts-from-scratch),\n[NVFP4 training](/articles/nemotron-nvfp4),\n[MiniMax sparse attention](/articles/minimax-sparse-attention),\n[MiMo-V2-Flash](/articles/mimo-v2-flash),\n[trillion-scale RL](/articles/ring-zero-trillion-scale-rl),\n[the Muon optimizer](/articles/muon-optimizer), and\n[how LLM inference works](/articles/how-llm-inference-works).*\n","readingTimeMins":8,"url":"https://ai.thesatyajit.com/articles/inkling","signal":{"interest":3,"helpful":3,"score":6,"level":2,"label":"Solid"}},{"title":"LOTUS: reasoning in the hidden states, not the token stream","description":"Explicit chain-of-thought writes every reasoning step out token-by-token, which is slow; latent CoT reasons in hidden states instead — but past 1B parameters it has always trailed explicit CoT, and the gap grew with scale. LOTUS closes it at 3B with a looped padded Transformer: reuse the same weights for R passes over a fixed latent region, refine all K blocks in parallel, and supervise each latent position against its gold CoT-step token. A walk through the loop, the parallel supervision, the real GSM8K numbers, and the honest limits of a fixed thinking budget.","date":"2026-07-15","tags":["llm","reasoning","chain-of-thought","latent-reasoning","inference-optimization","explainer"],"draft":false,"cover":"/articles/lotus-latent-reasoning/fig1.png","featured":true,"interest":5,"helpful":4,"kind":"articles","slug":"lotus-latent-reasoning","body":"The way a reasoning model earns its answer is by writing out its work: an explicit **chain of thought (CoT)**,\none token at a time, before it commits to a final answer. That is where the latency goes. 
Each of those\nintermediate tokens is a full sequential decode step — a [memory-bound pass over the growing KV\ncache](/articles/how-llm-inference-works) — so the more the model thinks, the slower it answers.\n\n**Latent CoT** is the tempting alternative: do the multi-step reasoning inside the model's *hidden states*,\nreplacing decoded tokens with continuous representations, and skip the token-by-token bottleneck entirely.\nThe problem is that it has never quite worked at scale. Methods like Coconut, CODI, and SIM-CoT match\nexplicit CoT on small models, but **beyond 1B parameters no latent method keeps up on math reasoning, and\nthe gap widens as the backbone grows**. LOTUS — *Looped Transformers with parallel supervision on latents*,\nfrom Ying Fan, Anej Svete, and Kangwook Lee — is, to the authors' knowledge, the first latent-CoT method\nto close that gap at the **3B** scale, while cutting the thought phase by **2.5×–6.9×**.\n\nIt gets there by fixing the two things the authors argue were holding latent CoT back.\n\n- **(P1) Sequential generation.** Coconut, CODI, and SIM-CoT still produce their latent tokens\n  *autoregressively* — the sequential bottleneck is still there, just moved into latent space.\n- **(P2) No CoT grounding.** Without supervision that aligns each latent position to a specific gold\n  reasoning step, the latent trace drifts and destabilizes as the model gets bigger.\n\n## The loop: one weight set, R passes, K blocks in parallel\n\nLOTUS builds a **padded latent region** into the prompt. Between two learnable delimiters `⟨BoT⟩` and\n`⟨EoT⟩` it inserts $K$ blocks of $c$ shared, learnable `⟨lat⟩` tokens — a fixed $K\\cdot c$ latent positions\n(the deployed config is $K=6$, $c=25$, so 150 positions). The question $Q$ sits before `⟨BoT⟩`; the answer\n$A$ comes after `⟨EoT⟩`.\n\nThe reasoning then happens by **looping the base language model over that region**. Let $E$ be the learnable\nlatent embeddings and $f_\\theta$ the ordinary LM backbone. 
Starting from the latents, LOTUS reuses the *same\nweights* for $R$ iterations, adding the previous pass's output back in each time:\n\n$$\nh^{(0)} = f_\\theta\\!\\big(E \\mid C_{\\text{pre}}\\big), \\qquad\nh^{(t)} = f_\\theta\\!\\big(E + h^{(t-1)} \\mid C_{\\text{pre}}\\big), \\quad t = 1,\\dots,R\n$$\n\nwhere $C_{\\text{pre}}$ is the reused KV cache of the question. This is a **recurrent-depth (looped)\nTransformer**: it adds computation depth by reusing parameters, not by adding them. The crucial property is\nthat all $K\\cdot c$ latent positions are refined **together** on each pass — so the whole thought phase is\n$R$ sequential forward passes, not one pass per generated token. Scrub the loop and watch the latents sharpen,\nthen read out at the final iteration:\n\n<LoopedForward />\n\nThat parallelism is the answer to **(P1)**. Where an autoregressive latent method spends a forward pass per\nlatent token — the same shape of bottleneck that makes [multi-token prediction](/articles/multi-token-prediction)\nand [diffusion language models](/articles/illada-diffusion-language-model) attractive — LOTUS spends only $R$\npasses for the entire trace, regardless of how many tokens that trace would have been.\n\nThe other way to see this is as a **network**. A looped Transformer is a recurrent-depth\nnetwork: roll it up and it is one block with a loop-back edge; unroll it and it is an effective\n$R$-deep stack of the *same* weights, with the latents $E$ fed back in at every pass. The depth is\nreal — each pass is a full forward through the backbone — but the parameter count never grows past\n$1\\times$. 
Toggle between the rolled and unrolled views, and drag the unroll depth:\n\n<LoopUnroll />\n\nThat is the whole trick behind \"add computation depth by reusing parameters, not by adding them\": a\n3B backbone reasons at a depth its parameter budget alone would not buy, because depth here is\n$R$ passes through shared weights rather than $R$ times the weights.\n\n## Parallel supervision: grounding each latent in its gold step\n\nThe loop alone is not enough; the latents need to be told *what to compute*. This is the answer to **(P2)**,\nand it is the part that makes LOTUS more than \"a looped model.\" After the final iteration, LOTUS reads each\npost-loop latent position $h^{(R)}_{i,j}$ **through the base LM head** $f_{\\text{head}}$ and trains it, with\ncross-entropy, toward the gold CoT-step token that belongs in that slot. Each gold CoT step $i$ is tokenized\nand padded/truncated to $c$ tokens $T_{i,\\cdot}$, and every position is supervised at once:\n\n$$\n\\mathcal{L}_{\\text{step}} = \\frac{1}{N_{\\text{step}}}\\sum_{i=1}^{K}\\sum_{j=1}^{c}\n\\operatorname{CE}\\!\\big(f_{\\text{head}}(h^{(R)}_{i,j}),\\, T_{i,j}\\big)\n$$\n\nThis is direct, position-aligned supervision to real reasoning tokens — much like ordinary explicit-CoT\nsupervision — rather than the indirect hidden-state or KV-cache distillation used by earlier parallel-latent\nmethods (PCCoT, KaVa). A separate final forward pass then supervises the answer against the latents:\n\n$$\n\\mathcal{L} = \\mathcal{L}_{\\text{ans}} + \\lambda_{\\text{step}}\\,\\mathcal{L}_{\\text{step}}\n$$\n\n<Figure\n  src=\"/articles/lotus-latent-reasoning/fig2.png\"\n  alt=\"LOTUS architecture, two panels. (a) Looped forward: the input row is Q, BoT, a run of lat tokens grouped into block 1 through block K, then EoT. A single base LM f_theta box is applied with a curved loop arrow labelled ×R, producing post-loop hidden states h(R), one per latent position, which are read through f_head up to a row of gold CoT token boxes. 
(b) Final forward: the post-loop latents are inserted back into the sequence and one more pass through f_theta produces answer hidden states z, read through f_head to answer tokens A1, A2, supervised by L_ans.\"\n  caption=\"LOTUS refines K blocks of latent tokens in parallel over R looped passes of one shared base LM, then reads each latent position out through the base LM head to its gold CoT-step token (L_step). A final forward pass produces the answer (L_ans) (Fan et al., 2026, Figure 2).\"\n/>\n\nThe paper frames why both losses are needed with a **Parallel Chain Likelihood (PCL)** view. Because the\nstep loss factorizes over positions rather than autoregressively, it induces\n\n$$\np_\\theta^{\\text{PCL}}(T\\mid Q) = \\prod_{i=1}^{K}\\prod_{j=1}^{c} p_\\theta(T_{i,j}\\mid Q)\n$$\n\nThe two losses then split the work: $\\mathcal{L}_{\\text{step}}$ provides **coverage** — it puts probability\nmass on the correct gold token at each position — while $\\mathcal{L}_{\\text{ans}}$ provides **selection** —\nit forces the jointly-computed latents to actually support the right answer. The ablation makes the split\nconcrete: with only $\\mathcal{L}_{\\text{step}}$, the model's post-loop latents recover the gold top-1 token\njust **9.1%** of the time; with only $\\mathcal{L}_{\\text{ans}}$, **9.4%**; with **both**, **70.9%**\n(NLL 3.07 vs 9.29 and 5.97). Neither loss alone builds a readable, correct latent trace.\n\n## Why it is fast\n\nThe efficiency story is simple once the loop is clear. Explicit CoT's thought-phase latency scales with **how\nmuch it writes**; LOTUS's scales with $R$, which is fixed. So the win grows exactly when the rationale gets\nverbose. 
On Llama-3.2-3B the paper measures the thought phase directly — drag the playhead and watch LOTUS\nfinish while explicit CoT is still decoding:\n\n<ThoughtLatency />\n\nOn compact math-expression CoT the thought phase drops from **338.8 ms to 133.0 ms** (2.5×), and total\nlatency from 384.2 ms to 181.2 ms (about 2.1× overall). Swap in verbose **natural-language** rationales and\nexplicit CoT balloons to **963.6 ms** while LOTUS barely moves to **140.8 ms** — a **6.9×** thought-phase\nspeedup, at essentially the same accuracy (68.13% vs 68.41%). The query-prefill and answer phases are nearly\nidentical across methods; only the thought phase moves.\n\n## The numbers\n\nHeld against explicit CoT and the strongest latent baselines on GSM8K (Llama-3.2-3B, in-domain), LOTUS lands\nwithin about a point and a half of explicit CoT and clearly ahead of the latent baselines:\n\n<BenchBars\n  title=\"GSM8K accuracy, Llama-3.2-3B in-domain (%)\"\n  unit=\"%\"\n  bars={[\n    { label: \"Explicit CoT\", value: 71.5 },\n    { label: \"LOTUS\", value: 70.0, highlight: true },\n    { label: \"LOTUS + CODI\", value: 70.6, highlight: true },\n    { label: \"CODI + SIM-CoT\", value: 62.3 },\n  ]}\n/>\n\nThe pattern holds across backbones — GPT-2 (LOTUS 44.1 vs explicit 42.7), Llama-1B (57.3 vs 58.4), Llama-3B\n(70.0 vs 71.5) — so unlike prior latent methods the gap does **not** widen with scale. 
And on the\n**out-of-domain** average (GSM-Hard, MultiArith, SVAMP), LOTUS actually edges ahead of explicit CoT, **63.9\nvs 62.1**, led by a near-perfect 99.9% on MultiArith and 75.7% on SVAMP.\n\n<BenchBars\n  title=\"Thought-phase speedup vs explicit CoT, Llama-3B (×)\"\n  unit=\"×\"\n  bars={[\n    { label: \"math CoT\", value: 2.5, highlight: true },\n    { label: \"natural-language CoT\", value: 6.9, highlight: true },\n    { label: \"total (math)\", value: 2.1, highlight: true },\n  ]}\n/>\n\n## How deep does the loop need to be?\n\nLoop depth $R$ is LOTUS's compute knob, and reasoning turns out to need a real amount of it. Train the model\nat increasing $R$ and accuracy climbs steeply — a shallow loop simply cannot fit multi-step arithmetic:\n\n<LoopDepth />\n\nTwo honest wrinkles live in this chart. First, depth is **not** free test-time compute you can dial up after\ntraining: take the model trained at $R=6$ and run it at $R=7$ and accuracy *dips* to 69.3% — LOTUS reasons\nbest at the depth it was trained for. Second, the parallel **width** $c$ is nearly free where depth is not:\nsweeping $c$ from 1 to 50 tokens per block (a 50× change in latent positions) moves the thought phase by only\nabout 30 ms — 110.9 ms to 141.2 ms — because those positions are processed in parallel. A single token per\nstep (c=1) is too narrow (51.4%), but moderate widths (c=25–30) saturate at 70%.\n\n## Does it actually reason, or memorize?\n\nBecause LOTUS reads its latents through the ordinary LM head, you can literally decode the thought — the same\ntrick behind interpretability tools that [unembed intermediate activations](/articles/jacobian-lens). The\npost-loop latents recover the gold CoT at **70.9% top-1 / 85.8% top-5**. 
More telling is a multi-path test:\nfor a question with a *trained* gold chain (G) and an *unseen-but-valid* alternative chain (U), the readout\norders their negative log-likelihoods $G \\ll U \\ll \\text{random}$ (NLL 0.07, 4.28, 8.16), assigning graded probability to\nvalid-but-never-seen reasoning. That ordering is the paper's evidence that the latents encode reasoning\nstructure, not just a memorized string.\n\n<Callout type=\"warn\">\n**Read the setup before believing the headline.** (1) *Scope is narrow:* every result is **math word\nproblems** (GSM8K-family, trained on GSM8k-Aug); the authors explicitly flag transfer to other domains as\nopen. (2) *The budget is fixed:* $K$, $c$, and $R$ are hyperparameters set to cover the expected step count —\nchains **longer than $K$ steps fall back to autoregressive completion**, and making the budget adaptive is\nlisted as future work, not a solved problem. (3) *\"Bridges the gap\" means parity, not a win:* LOTUS is ~1.5\npoints **behind** explicit CoT in-domain at 3B (70.0 vs 71.5); it needs the LOTUS+CODI combo to get within a\npoint, and it leads only on the OOD average. (4) *Speedups are the paper's own measurements* on Llama-3B\n(H-class GPU), and the flattering 6.9× is specifically the verbose natural-language regime; the compact-math\nnumber is 2.5×. (5) *Baselines are author-selected* latent methods (Coconut, CODI, SIM-CoT, PCCoT, KaVa) and\none explicit-CoT reference — not a broad frontier-model comparison.\n</Callout>\n\n## The take\n\nLOTUS's real contribution is a clean recombination: take a **looped padded Transformer** (depth from weight\nreuse, all latent positions refined in parallel) and give it **direct, position-aligned supervision** to gold\nCoT tokens through the base LM head. The loop kills the sequential bottleneck that every prior latent method\nkept; the supervision kills the drift that made latent reasoning fall apart at scale. 
Together they do\nsomething no earlier latent-CoT method managed — **stay on the explicit-CoT accuracy curve at 3B** — while\nturning a variable, write-everything thought phase into a fixed $R$-pass one.\n\nThe honest frame is that this is a *parity-with-a-speedup* result on math, bounded by a thinking budget you\nhave to choose in advance. Whether a fixed $K\\cdot c\\cdot R$ box holds up when problems demand more steps than\nyou budgeted — and whether the story survives outside arithmetic — are the open questions the authors name\nthemselves. But as a demonstration that latent reasoning can finally keep pace with explicit reasoning at\nscale, and finish the thought phase 2.5–6.9× faster while doing it, LOTUS is the first latent-CoT method that clears the bar.\n\n---\n\n*Built on [Bridging the Gap Between Latent and Explicit Reasoning with Looped\nTransformers](https://arxiv.org/abs/2606.31779) (Fan, Svete, Lee; 2026). All accuracy, latency, and ablation\nfigures are quoted from the paper (Llama-3.2-3B unless noted; GSM8K-family math benchmarks; deployed config\n$K=6$, $c=25$, $R=6$). The interactive diagrams are illustrations of the mechanism; the gold-CoT tokens in\nthe loop diagram are illustrative.*\n","readingTimeMins":10,"url":"https://ai.thesatyajit.com/articles/lotus-latent-reasoning","signal":{"interest":5,"helpful":4,"score":9,"level":5,"label":"Essential"}},{"title":"Ring-Zero: what a trillion-parameter model learns from reward alone","description":"Zero RL — reinforcement learning from verifiable rewards, no human labels, no SFT — has mostly been studied on small models. Ring-Zero runs it on a 1-trillion-parameter (63B-active) MoE and reports three things: scale sharply raises the ceiling and sample-efficiency, training splits into a 'discovery' then a 'sharpening' phase, and advanced reasoning behaviors emerge on their own. 
The engineering that makes it stable is a four-stage pipeline and one quiet fix — a training-inference ratio correction that stops the importance weight from exploding. A detailed walk through the method, the real math, the numbers, and the honest caveats: it trails frontier models, and most ablations are run at 104B, not 1T.","date":"2026-07-15","tags":["llm","reinforcement-learning","reasoning","mixture-of-experts","explainer"],"draft":false,"cover":"/articles/ring-zero-trillion-scale-rl/fig1.png","featured":true,"interest":5,"helpful":3,"kind":"articles","slug":"ring-zero-trillion-scale-rl","body":"**Zero RL** is the stripped-down recipe behind the reasoning-model boom: take a pretrained base model, give it math problems whose answers can be *checked*, reward it for getting them right, and let reinforcement learning grow the chain-of-thought on its own — no supervised fine-tuning, no human-written reasoning traces, no reward model. DeepSeek-R1 made it famous. But almost every published zero-RL study runs on models small enough to fit a modest cluster, which leaves the interesting question open: **what happens when you do this to a genuinely large model?**\n\nRing-Zero is that experiment. The authors run zero RL directly on **Ling-2.5-1T-Base** — a **1-trillion-parameter** Mixture-of-Experts model with **63B active** parameters per token — and report what changes at scale. The paper frames its results as a vindication of the *bitter lesson*: with enough scale, hand-crafted heuristics for \"good reasoning\" become unnecessary because the model develops them itself. The contribution is less a single trick than a **stable, four-stage training pipeline** that survives trillion-scale RL, plus a careful account of the training dynamics and the behaviors that emerge.\n\n<Figure\n  src=\"/articles/ring-zero-trillion-scale-rl/fig1.png\"\n  alt=\"Overview diagram of Ring-2.5-1T-Zero. 
Top row: a four-box training pipeline — First-stage RL (token-level loss, stability strategies), Self-Distillation (CoT compression, train-infer gap reset), Second-stage RL (sample-level loss, remove KL penalty), Third-stage RL (tier-based training, Low/Medium/High). Bottom left: infrastructure optimization (mixed-precision control with FP32 attention and LM head; context-parallel optimization for MLA and Lightning Attention). Bottom right: four emergent behaviors — anthropomorphism, structured format, parallel reasoning, context anxiety, each with a quoted trace snippet.\"\n  caption=\"The whole system: a four-stage RL pipeline over a 1T MoE base, the infrastructure that keeps it stable, and the cognitive behaviors that appear without supervision (Tang et al., 2026, Figure 1).\"\n/>\n\n## The pipeline is the method\n\nThere is no single loss here. Ring-Zero's real content is a **sequence of four stages**, each one repairing a failure mode the previous stage creates. Click through them:\n\n<PipelineStages />\n\nThe logic of the sequence is worth stating plainly. **First-stage RL** uses a *token-level* loss — the per-response loss is deliberately **not** divided by length — so a longer correct trace earns more total credit and the model learns to think at length. That works, but it also teaches the model to pad (more on that below). **Self-distillation** then samples from the stage-1 expert, keeps the *shortest correct* trace, self-filters redundant steps, and fine-tunes the base model on the result — compressing the bloat and, crucially, **resetting the gap between the training and inference engines**. **Second-stage RL** switches to a *sample-level* (length-normalized) loss so gradients no longer reward length, and drops the KL penalty now that the model is a strong starting point. 
**Third-stage RL** adds three difficulty tiers with their own prompts so one checkpoint can reason short or long on demand.\n\n## The setup\n\nThe base is **Ling-2.5-1T-Base**, a hybrid MoE combining **MLA** (multi-head latent attention) and **Lightning Attention** layers, trained *from scratch with no SFT*. The smaller **Ling-2.5-flash-Base** (104B total, 7.4B active) is used as the scaling foil and — importantly — as the workhorse for most ablations. Training runs on **320 × H200** GPUs with Megatron for updates and SGLang for rollouts. Each step draws **G = 16** rollouts per question at temperature 1.0; the reward is dead simple and rule-checkable:\n\n$$\nr_i = r_{\\text{acc},i} + r_{\\text{format},i}, \\qquad r_{\\text{acc},i},\\, r_{\\text{format},i} \\in \\{0, 1\\}\n$$\n\nwhere $r_{\\text{format}}$ checks for well-formed `<think>...</think>` and `<answer>...</answer>` tags and $r_{\\text{acc}}$ is rule-based matching early on, LLM-as-judge (Qwen3-Next-80B) later. That is the *entire* supervision signal — no human labels, no learned reward model. This is what \"zero\" means. It builds directly on ideas we have covered before: the MoE backbone (see [Mixture of Experts, from scratch](/articles/mixture-of-experts-from-scratch) and [Switch Transformers](/articles/switch-transformer)) and RL from verifiable rewards on reasoning (see [Leanstral](/articles/leanstral-formal-proofs)).\n\n## The objective, precisely\n\nStage-1's policy objective is a **clipped-importance** RL loss with a group-normalized (GRPO-style) advantage:\n\n$$\n\\mathcal{J}(\\theta)=\\mathbb{E}\\!\\left[\\sum_{i=1}^{G}\\sum_{t=1}^{|o_i|}\\operatorname{sg}(\\hat{\\rho}_{i,t})\\;\\hat{A}_{i,t}\\;\\log \\pi^{M}_{\\theta}\\!\\left(o_{i,t}\\mid q,\\,o_{i,<t}\\right)\\right]\n$$\n\nThe $\\operatorname{sg}(\\cdot)$ is a **stop-gradient**: the importance weight $\\hat{\\rho}$ scales each token's contribution but gradient flows only through $\\log\\pi_\\theta$. 
The weight is a *clipped* importance ratio:\n\n$$\n\\rho_{i,t}=\\frac{\\pi^{M}_{\\theta}\\!\\left(o_{i,t}\\mid q,\\,o_{i,<t}\\right)}{\\pi^{S}_{\\theta_{\\text{old}}}\\!\\left(o_{i,t}\\mid q,\\,o_{i,<t}\\right)},\n\\qquad\n\\hat{\\rho}_{i,t}=\\operatorname{clip}\\!\\left(\\rho_{i,t},\\,\\epsilon_{\\text{low}},\\,\\epsilon_{\\text{high}}\\right)\n$$\n\nwith $\\epsilon_{\\text{high}} = 5.0$ and **no lower bound**. Read the numerator and denominator carefully, because this is the paper's quiet but load-bearing fix. The denominator $\\pi^{S}_{\\theta_{\\text{old}}}$ is the probability the **inference engine (SGLang)** assigned when it *generated* the rollout. The numerator $\\pi^{M}_{\\theta}$ is the probability the **training engine (Megatron)** assigns now. The two engines disagree by tiny floating-point amounts, and a naive ratio that mixes engines lets that disagreement compound until the ratio explodes and training collapses. The **training-inference ratio correction** is simply: put the *training* engine in the numerator, so the ratio measures the update you actually make.\n\n<RatioCorrection />\n\nThis is the same disease diagnosed — from the routing angle — in [Rollout Routing Replay](/articles/rollout-routing-replay): when the engine that generates a rollout and the engine that computes the gradient disagree, the importance ratio blows up and MoE RL diverges. Ring-Zero attacks the numerical side of it (and adds a small KL leash, $\\beta = 10^{-4}$, with the reference model refreshed every 400 steps, plus **mixed-precision control** — BF16 everywhere except FP32 in the attention softmax and the LM head, the two places rounding error is worst).\n\n## From token-level to sample-level: killing length inertia\n\nThe token-level loss has a side effect the paper names **length inertia**. 
Because the loss is not normalized by length, the model discovers a lazy shortcut: emitting more tokens is mathematically safer, so responses inflate *even on easy problems it already solves on the first try*. The fix is the Stage-2 loss, identical to Stage-1 except for one factor:\n\n$$\n\\mathcal{L}_{\\text{II}}(\\theta)=-\\mathbb{E}\\!\\left[\\sum_{i=1}^{G}\\frac{1}{|o_i|}\\sum_{t=1}^{|o_i|}\\operatorname{sg}(\\hat{\\rho}_{i,t})\\;\\hat{A}_{i,t}\\;\\log \\pi_{\\theta}\\!\\left(o_{i,t}\\mid q,\\,o_{i,<t}\\right)\\right]\n$$\n\nThat $\\tfrac{1}{|o_i|}$ makes the gradient magnitude **independent of response length**, so there is no longer a gradient reason to ramble. Paired with the self-distillation step that actively trims traces, it holds length flat while accuracy keeps climbing.\n\n## One model, three depths\n\nStage-3 trains three difficulty tiers jointly — Low (4k budget), Medium (16k), High (64k) — each with its own system prompt $p_k$, so a single checkpoint routes its reasoning depth by prompt:\n\n$$\n\\mathcal{L}_{\\text{III}}(\\theta)=-\\sum_{k\\in\\{l,m,h\\}}\\mathbb{E}\\!\\left[\\sum_{i=1}^{G}\\frac{1}{|o_i|}\\sum_{t=1}^{|o_i|}\\operatorname{sg}(\\hat{\\rho}_{i,t})\\;\\hat{A}_{i,t}\\;\\log \\pi_{\\theta}\\!\\left(o_{i,t}\\mid p_k,\\,q,\\,o_{i,<t}\\right)\\right]\n$$\n\nPick a tier and watch the budget, the tokens actually spent, and the accuracy move together:\n\n<AdaptiveDepth />\n\n## Does it work? The numbers\n\nThe headline is **scaling**. On the first stage of RL alone, the 1T model clears the 104B model by wide margins on every math benchmark. 
First-stage 1T scores (with the 104B flash model in prose for contrast): AIME 2024 **89.1%** (flash 71.2), AIME 2025 **83.3%** (63.5), AIME 2026 **84.2%** (65.3), HMMT Feb 2026 **66.2%** (50.3), IMOAnswerBench **59.3%**.\n\n<BenchBars\n  title=\"Ring-2.5-1T-Zero, first-stage RL only (pass@1, %)\"\n  unit=\"%\"\n  bars={[\n    { label: \"AIME 2024\", value: 89.1, highlight: true },\n    { label: \"AIME 2025\", value: 83.3, highlight: true },\n    { label: \"AIME 2026\", value: 84.2, highlight: true },\n    { label: \"HMMT Feb26\", value: 66.2, highlight: true },\n    { label: \"IMOAnswerBench\", value: 59.3, highlight: true },\n  ]}\n/>\n\nThe full pipeline (second-stage RL, with a 2× YaRN context extension) pushes those to **94.1% / 92.3% / 93.2%** on AIME 2024/25/26. The scaling advantage is visible not just in the endpoint but in the *slope* — the 1T model learns faster per step:\n\n<Figure\n  src=\"/articles/ring-zero-trillion-scale-rl/fig2.png\"\n  alt=\"Line chart of AIME 2024 accuracy versus training step. A red curve (Ling-2.5-1T-Base) rises from ~21% to ~89% over 3600 steps, staying well above a blue curve (Ling-2.5-flash-Base) that rises from ~14% to ~72% over 5200 steps. The 1T curve is consistently steeper.\"\n  caption=\"Model-scale effect: the 1T base (red) reaches a higher accuracy ceiling and gets there in fewer steps than the 104B flash base (blue) on AIME 2024 (Tang et al., 2026, Figure 10a).\"\n/>\n\nHonesty check on where this lands. Ring-Zero's best AIME 2026 number (**93.2%**) is genuinely strong — but it still **trails the frontier** models the authors themselves list. 
This is a zero-RL-at-scale study, not a SOTA claim:\n\n<BenchBars\n  title=\"AIME 2026: Ring vs frontier models (pass@1, %)\"\n  unit=\"%\"\n  bars={[\n    { label: \"GPT-5.5\", value: 98.3 },\n    { label: \"Gemini 3.1 Pro\", value: 98.2 },\n    { label: \"Qwen3.7-Plus\", value: 97.0 },\n    { label: \"Kimi K2.6\", value: 96.4 },\n    { label: \"Claude Opus 4.8\", value: 95.7 },\n    { label: \"Ring-2.5-1T-Zero\", value: 93.2, highlight: true },\n  ]}\n/>\n\nWhere Ring-Zero does claim an edge is **CoT quality**, measured three ways. Its traces win LLM-as-judge *comprehensibility* comparisons against GLM-5.1, Kimi-k2.6, MiniMax-M2.7 and Qwen3.5-397B. They *reproduce* better under distillation: fine-tuning student models on only **100K** Ring-Zero traces beats distilling **800K** DeepSeek-R1 traces —\n\n<BenchBars\n  title=\"Distillation into students: Ring-CoT (100K) vs DeepSeek-R1 (800K)\"\n  unit=\"%\"\n  bars={[\n    { label: \"Qwen32B · Ring\", value: 78.4, highlight: true },\n    { label: \"Qwen32B · R1\", value: 72.6 },\n    { label: \"Llama70B · Ring\", value: 74.5, highlight: true },\n    { label: \"Llama70B · R1\", value: 70.0 },\n  ]}\n/>\n\n— and they are *efficient*: on problems both solve, Ring-Zero averages **6,368 tokens**, less than half its baselines' length.\n\n## Two phases: discovery, then sharpening\n\nThe second finding is about *how* RL improves the model over time. Track two quantities: **pass@1024** (can the model solve a problem in *any* of 1024 attempts — a measure of coverage) and **pass@1** (does it nail it on the first try — reliability). They move on different schedules.\n\n<DiscoverySharpening />\n\nCoverage saturates early — pass@1024 flattens around step 800 — meaning RL has already surfaced essentially every reasoning pattern it will ever use (the **discovery** phase). But pass@1 keeps rising long after (the **sharpening** phase): the model is not finding new tricks, it is becoming *reliable* at the ones it has. 
The paper reads this as evidence for a sharper claim in its discussion — that zero RL **optimizes within a boundary set by pretraining** rather than expanding it. Which is exactly the honest ceiling in its limitations: RL cannot invent a proof technique the base model never saw.\n\n## Behaviors nobody programmed\n\nThe third finding is the paper's \"bitter lesson\" payoff: with scale, the model **spontaneously develops** cognitive behaviors that smaller-model work usually has to elicit with hand-crafted prompts or rewards. The paper documents five:\n\n- **Anthropomorphism** — traces narrate themselves (\"I might have a brain fart here\", \"let me not wing it\", \"genius idea\"), artifacts of the pretraining corpus surfacing as reasoning scaffolding.\n- **Structured formatting** — spontaneous \"Step 1: / Step 2: / Verify:\" scaffolds appear with no formatting instruction, hinting at a higher-level action space above raw tokens.\n- **Parallel reasoning** — the model branches into competing strategies within a single rollout, compares outcomes, and commits only when evidence converges (tree-of-thought, self-taught).\n- **Self-verification** — it re-checks assumptions, substitutes answers back into the problem's constraints, learned because that is what secures the correctness reward.\n- **Context anxiety** — approaching its token limit, the model *strategically aborts* deep reasoning to guarantee a well-formatted answer, revealing an implicit awareness that format compliance is also rewarded.\n\nThe claim is not that these are magic; it is that at 1T scale they arrive *for free*, making the elaborate reasoning-elicitation machinery of small-model RL redundant.\n\n## What the ablations actually establish\n\nSeveral design choices are backed by ablations — with one caveat that matters (see below): they are run on the **104B flash** model, not the 1T model.\n\n- **RL algorithm.** Comparing GRPO / DAPO / CISPO / GSPO reveals a **speed-stability tradeoff**: amplifying 
low-probability tokens (CISPO, DAPO) learns fastest but its entropy collapses; GRPO is most stable but slowest. Ring-Zero's clipped-importance-plus-corrections scheme is the attempt to get both.\n- **KL penalty.** Remove it in stage 1 and the log-prob gap diverges, entropy collapses, and reward crashes within ~2,000 steps. With $\\beta = 10^{-4}$ it stays healthy.\n- **Ratio correction.** The naive ratio collapses near step 800; a clip-only patch delays collapse to ~2,700 steps; the training-engine-numerator correction trains indefinitely. This is the single most important stability result.\n- **Format reward.** A single opening `<think>` tag lets length explode with no reward gain; requiring properly closed tags with an EOS token is what makes stopping — and therefore credit — well-defined.\n- **Hyperparameters.** Robust to learning rate over $\\{1,2,3\\}\\times10^{-6}$; $G=32$ is fastest per step but $G=8$ fastest in wall-clock; token-level loss grows length, sample-level keeps it flat — motivating the stage-1 to stage-2 switch.\n\n<Callout type=\"warn\">\n**Read the scope before the headline.** (1) The efficiency and stability *ablations* — RL-algorithm choice, KL, ratio correction, format reward, hyperparameters — are run on the **104B flash** model, not the 1T model, for cost. The conclusions are *assumed* to transfer up. (2) On raw accuracy, Ring-Zero **trails the frontier** models it lists (GPT-5.5, Gemini 3.1 Pro, Claude Opus 4.8, Qwen3.7-Plus, Kimi K2.6 all sit at 95.7–98.3% on AIME 2026 vs Ring's 93.2%); the win it claims is CoT *quality*, not peak score, and those quality judgments lean on LLM-as-judge. (3) The five \"emergent behaviors\" are **qualitative** — quoted trace snippets and interpretation (\"context anxiety\"), not quantified frequencies. (4) The frontier comparison numbers are **as reported by the authors** for competitors. 
(5) Adaptive depth carries **negative transfer**: the jointly-trained High tier (93.2%) sits *below* the dedicated second-stage model (94.1%) — flexibility costs a little peak. (6) The model, weights, and 320×H200 infrastructure are **not released**, so the 1T result is not externally reproducible. (7) Training is capped at **64k context** by hardware; the paper expects longer windows to unlock more, i.e. this is not the ceiling of the recipe.\n</Callout>\n\n## The take\n\nRing-Zero's value is not a new loss function — it is a **demonstration and an engineering recipe**. The demonstration: run the simplest possible RL signal (right/wrong plus format) on a trillion-parameter base, and you get sharp gains, a clean two-phase learning dynamic, and reasoning behaviors that smaller models have to be coaxed into. The recipe: a four-stage pipeline where token-level RL grows reasoning, self-distillation compresses it and resets the engine gap, sample-level RL sustains it without length bloat, and tiered RL makes depth controllable — all held together by a training-inference ratio correction that is easy to overlook and, per the ablation, the difference between training and diverging.\n\nThe honest frame is the one the paper itself offers in its limitations: zero RL **sharpens the reasoning already latent in pretraining; it does not transcend it**. That is why coverage saturates while reliability climbs, and why a bigger base — not a cleverer reward — is what moves the ceiling. As a controlled study of what pure reward does at scale, it is unusually candid; as a frontier-accuracy claim, it is not one, and it does not pretend to be.\n\n---\n\n*Built on [Ring-Zero: Scaling Zero RL to a Trillion Parameters for Emergent Reasoning](https://arxiv.org/abs/2607.12395) (Tang, Cao, Liu et al., 2026; CC BY 4.0), the team behind the Ling / Ring models. 
All benchmark, efficiency, and ablation figures are quoted from the paper; the interactive diagrams are illustrations of the mechanism, not reruns of the experiments. Ablations are on the 104B flash model unless noted; the 1T model and infrastructure are not publicly released.*\n","readingTimeMins":12,"url":"https://ai.thesatyajit.com/articles/ring-zero-trillion-scale-rl","signal":{"interest":5,"helpful":3,"score":8,"level":4,"label":"High"}},{"title":"Mach-Mind-4-Flash: specialize, then integrate","description":"Li Auto's Mach-Mind-4-Flash is a 35B / 3B-active MoE that reaches 100B-class scores through post-training alone — no extra pre-training compute. The recipe is specialize-then-integrate: train a dozen domain RL experts in parallel, then fuse them into one generalist with Multi-Teacher On-Policy Distillation (MOPD), a routed reverse-KL objective that kills the see-saw degradation of mixed-reward RL. A separate stage, HMPO, uses a median-length budget to cut reasoning tokens 19-46% for ≤0.7pp accuracy loss. A walk through both mechanisms, the numbers, and the honest gaps.","date":"2026-07-13","tags":["llm","reinforcement-learning","knowledge-distillation","mixture-of-experts","explainer"],"draft":false,"cover":"/articles/mach-mind-4-flash/cover.png","featured":false,"interest":3,"helpful":3,"kind":"articles","slug":"mach-mind-4-flash","body":"The dominant recipe for a better model is still *scale the pre-training* — more parameters, more\ntokens, more compute. Mach-Mind-4-Flash, from Li Auto's Foundation Model team, is a bet on the other\naxis. It starts from an existing compact base — **Qwen3.5-35B-A3B**, a Mixture-of-Experts model with\n35B total parameters but only **3B active per token** — and pushes it toward the score band of\n100B-class models using **post-training only**: reinforcement learning, expert fusion, and\ninference-time efficiency. 
No extra pre-training compute.\n\nThe one idea to leave with is the shape of that post-training: **specialize, then integrate.** Rather\nthan run one big mixed-reward RL job — which tends to rob Peter to pay Paul across capabilities — the\nteam trains **more than ten domain specialists in parallel** (across Reasoning, General, and Agent\ntracks), then fuses them into a single deployable generalist. The fusion is the paper's headline\ncontribution, **Multi-Teacher On-Policy Distillation (MOPD)**, and a second stage, **HMPO**, trims the\nmodel's reasoning length without paying for it in accuracy.\n\n<Figure\n  src=\"/articles/mach-mind-4-flash/fig1.png\"\n  alt=\"The post-training pipeline. A base model goes through Overall SFT, then fans out into three parallel RL tracks: Reasoning RL (Math, Code, STEM experts), General RL (Instruct-Following, Writing, Safe experts), and Agent RL (Code Agent, Tool Use, DeepSearch, Claw Agent experts). All the resulting experts feed into a MOPD block, then a Token Efficiency RL block, producing Mach-Mind-4-Flash.\"\n  caption=\"Specialize-then-integrate. From an SFT base, a dozen domain experts are trained in parallel across three tracks, fused by MOPD into one student, then compressed by token-efficiency RL (Foundation Model Team, 2026, Figure 4).\"\n/>\n\n## The fusion problem, and MOPD\n\nIf you train separate RL experts — a math expert, a code-agent expert, a safety expert — each is\nexcellent in its lane. The naive way to get one model with all of those skills is to mix every\ndomain's reward into a single RL objective. In practice that fails in a specific way the paper names\n**see-saw degradation**: the gradients from different rewards collide, so a gain on one capability is\n\"routinely offset by regressions on others.\" You climb one hill by sliding down another.\n\nMOPD sidesteps the collision. 
Every training sample carries a **routing key** $k$ that deterministically\nselects the one frozen domain teacher $\\pi_{T_k}$ that should supervise it. The student generates a\nrollout under its *own* policy, and each token is pulled toward its routed teacher with a **token-level\nreverse-KL**:\n\n$$\n\\mathcal{L}_{\\text{MOPD}}(\\theta) = \\mathbb{E}_{(x,k)\\sim\\mathcal{D}}\\;\\mathbb{E}_{y\\sim\\pi_\\theta(\\cdot\\mid x)}\\!\\left[\\frac{1}{|y|}\\sum_{t=1}^{|y|} D_{\\mathrm{KL}}\\!\\big(\\pi_\\theta(\\cdot\\mid x,y_{<t})\\;\\big\\|\\;\\pi_{T_k}(\\cdot\\mid x,y_{<t})\\big)\\right]\n$$\n\nTwo things matter here. It is **on-policy** — the distillation runs on the student's own generations,\nnot a fixed teacher corpus — which is what makes it behave like RL rather than plain SFT. And it is\n**routed**, so each domain's gradient stays clean: no averaging of conflicting rewards. (Under the\nhood the reverse-KL is optimized with a single-sample $k_1$ estimator and a clipped policy-gradient\nsurrogate; domains are mixed in a strict 1:1 ratio.) Flip between the two regimes below and drag the\ntraining progress — watch how mixed reward lets the weakest capability stall while MOPD lets all three\nclimb together:\n\n<MopdFusion />\n\nMOPD sits inside a **unified RL/OPD objective**, $\\mathcal{L} = \\alpha\\,\\mathcal{L}_{\\text{OPD}} +\n\\beta\\,\\mathcal{L}_{\\text{RL}}$, so the same framework can run pure RL ($\\alpha=0$), pure distillation\n($\\beta=0$), or a joint blend — and new teachers register as config nodes with \"zero intrusion into\nthe framework's core logic.\" A pilot on the tool-agent domain shows the mechanism converging: the\nteacher-student top-$K$ token-overlap rate climbs from **0.73 to 0.84** over training.\n\nFusion is not perfectly lossless, and the paper is candid about it. 
Across the three tracks it reports\nthree distinct outcomes: **capability anchoring** for Reasoning (the frozen expert prevents the\nstudent from regressing during fusion), **full retention** for General, and **mixed results** for\nAgent — where the fused model sometimes lands *below* its own expert teacher (SWE-bench Verified\n71.1 after fusion vs. 73.8 for the standalone expert). Long-horizon agent behavior is the hardest\nthing to distill without smoothing away.\n\n## HMPO: pay for correct-and-short\n\nThe second stage attacks **overthinking** — reasoning chains far longer than the task needs, which\ninflate latency and serving cost for no accuracy gain. **Hybrid Median-length Policy Optimization\n(HMPO)** is a single-stage token-efficiency method with a neat trick for the length budget: don't set\na threshold, *measure* one. For each query the policy samples a group of $G=10$ rollouts, and the\nbudget $b$ is the **median length of the correct ones**:\n\n$$\nb = \\operatorname{median}\\{\\,n_i \\mid i \\in \\mathcal{C}\\,\\}\n$$\n\nwhere $\\mathcal{C}$ is the set of correct rollouts. The token reward is a cosine decay that starts at\n1 and fades toward $\\lambda$ as a correct trace grows, then cliffs to zero the instant it runs over\nbudget — and any incorrect trace earns zero at any length:\n\n$$\nR_{\\text{token}} = \\begin{cases}\\min\\!\\big(1,\\ \\cos(\\tfrac{\\pi n}{2b}) + \\lambda\\big) & \\text{if correct and } n < b\\\\[4pt] 0 & \\text{otherwise}\\end{cases}\n\\qquad\nR_{\\text{final}} = R_{\\text{acc}}\\cdot R_{\\text{token}}\n$$\n\nThe multiplicative composition enforces a strict **correctness-first, length-second** hierarchy:\nwrong or over-budget traces get exactly zero reward, so efficiency gradients never flow to bad\nanswers. And because $b$ is the group median, it **self-tightens** as the policy gets more concise —\nan implicit curriculum with, in the authors' words, zero tuning. 
Drag the candidate length, the\ntraining progress, and $\\lambda$:\n\n<HmpoBudget />\n\n<Figure\n  src=\"/articles/mach-mind-4-flash/fig2.png\"\n  alt=\"HMPO overview. Left: a query goes into the policy model, which samples a group of G rollouts, each labelled correct or incorrect with its length. Right: a token-level reward curve that decays from 1 to lambda over the budget then drops to zero, and a length-distribution histogram with a red dashed median line b separating a 'prefer shorter' positive-reward region from a 'no reward' region. The final reward is R_acc times R_token, with default lambda = 0.8.\"\n  caption=\"HMPO derives the length budget b from the median length of the group's correct rollouts, then rewards short-and-correct traces on a cosine decay and zeroes everything over budget or incorrect (Foundation Model Team, 2026, Figure 13).\"\n/>\n\nTrained on a compact set of ~6.5K math problems (group size $G=10$, $\\lambda=0.8$), HMPO cuts\ngeneration length by **19-46% with at most a 0.7-percentage-point accuracy drop** — and although the\ntraining is math-only, the learned length control **generalizes** to unseen domains: code generation,\nscience QA, and instruction following. As a single-pass method it also costs **1.5-2.5× fewer\nGPU-hours** than the multi-stage length-control baselines it replaces.\n\n## The infrastructure that makes it cheap\n\nNone of this is free unless the training loop is fast, and the third contribution is the plumbing: a\nunified RL/OPD framework with **operator-level acceleration** reported at a **17% end-to-end training\nspeedup**. 
The wins are Hopper-specific kernel work — a deep integration of *SonicMoE* into Megatron\nthat implements an efficient **Indexed Grouped GEMM** for the MoE MLPs (using TMA copy, warp\nspecialization, and multi-stage producer-consumer pipelines), a **gate-up fusion**, and a **segmented\nfusion with the shared expert** that overlaps communication and computation by splitting the shared\nexpert into AllGather / compute / ReduceScatter stages staggered against the routed experts. It also\nleans on [multi-token prediction](/articles/multi-token-prediction) and multi-dimensional hybrid\nparallelism. This is the same category of problem as\n[stabilizing MoE RL](/articles/rollout-routing-replay) and [cutting RL's\ncost](/articles/frontier-rl-cheaper): the algorithm is only as good as the systems that let you run it\nat scale.\n\n## The numbers\n\nThe result is a 3B-active model that trades blows with much larger ones. On raw reasoning it is\nstrong but *not* the frontier — on AIME'26 it lands ahead of the 122B-parameter Qwen3.5 but behind the\n309B MiMo-V2-Flash and the 1T-parameter Kimi-K2.5:\n\n<BenchBars\n  title=\"AIME'26 accuracy by model (%)\"\n  unit=\"%\"\n  max={100}\n  bars={[\n    { label: \"Mach-Mind (3B act)\", value: 92.7, highlight: true },\n    { label: \"Qwen3.5 (35B)\", value: 91.9 },\n    { label: \"Qwen3.5 (122B)\", value: 91.7 },\n    { label: \"Nemotron-3 (120B)\", value: 89.9 },\n    { label: \"MiMo-V2 (309B)\", value: 93.8 },\n    { label: \"Kimi-K2.5 (1T)\", value: 93.3 },\n  ]}\n/>\n\nWhere it genuinely *leads* the pack — beating both the 122B-parameter Qwen3.5 and the 1-trillion-parameter\nKimi-K2.5 — is on instruction-following, safety, tool use, and Chinese web search. 
These are the axes\nthe specialize-then-integrate recipe was built to lift:\n\n<BenchBars\n  title=\"Benchmarks where the 3B-active model leads (%)\"\n  unit=\"%\"\n  max={100}\n  bars={[\n    { label: \"IFBench\", value: 82.8, highlight: true },\n    { label: \"Behavioral-Safety\", value: 80.7, highlight: true },\n    { label: \"BFCL-v4\", value: 75.8, highlight: true },\n    { label: \"LexInstructEval\", value: 74.6, highlight: true },\n    { label: \"BrowseComp-zh\", value: 72.3, highlight: true },\n  ]}\n/>\n\nFor context on those: IFBench 82.8 vs. 76.1 (Qwen 122B) and 67.4 (Kimi 1T); Behavioral-SafetyBench\n80.7 vs. 29.9 and 67.8; BFCL-v4 75.8 vs. 72.2 and 74.5. Elsewhere it is competitive rather than\ndominant — GPQA-Diamond 83.1, LiveCodeBench-V6 80.9, SWE-bench Verified 70.6, $\\tau^2$-bench 80.0 —\nsolidly in the mix for a model activating a fraction of its rivals' parameters. And the efficiency\nstory is where the whole thing pays off: on AIME'26, HMPO puts Mach-Mind-4-Flash at the **upper-left**\nof the accuracy-vs-tokens frontier, matching frontier accuracy at far fewer tokens per trajectory than\nmodels of much larger active scale.\n\n<Figure\n  src=\"/articles/mach-mind-4-flash/fig3.png\"\n  alt=\"A scatter plot of accuracy versus average tokens per trajectory on AIME'26, where upper-left is better. Mach-Mind-4-Flash (a gold star) sits at about 92.5% accuracy and ~15.3K tokens, to the left of MiMo-V2-Flash-309B and Kimi-K2.5-1T at similar accuracy but ~16.5K tokens, and well left of Qwen3.5-35B and GLM-4.7-Flash which use 19-20K tokens at lower accuracy.\"\n  caption=\"Token efficiency on AIME'26 (upper-left is better). The HMPO-trained model reaches near-frontier accuracy using fewer tokens per trajectory than models with far larger activated parameter counts (Foundation Model Team, 2026, Figure 14).\"\n/>\n\n<Callout type=\"warn\">\nRead the wins with their scope. 
**All numbers are the authors' own** — a single-vendor technical\nreport, not an independent evaluation. The \"100B-class performance\" framing is real but selective:\nMach-Mind-4-Flash reliably beats the *122B-parameter* Qwen3.5 it's compared against, yet it **trails the\n1T-parameter Kimi-K2.5** on AIME'25 (92.1 vs. 96.1), GPQA-Diamond (83.1 vs. 87.6), LiveCodeBench (80.9\nvs. 85.0) and SWE-bench Verified (70.6 vs. 76.8) — this is *not* a SOTA claim. Fusion has costs: the\nAgent track shows \"mixed results,\" with the fused model landing below its own standalone expert on\nsome agentic tasks. HMPO's compression is real but measured only on **single-turn** reasoning, and the\ncross-domain generalization of its math-only training rests on the authors' own evaluation. The **17%** speedup is Hopper-specific\ninfrastructure, not a modeling result. The paper's own limitations are blunt: MOPD leaves \"a small but\nconsistent gap on extremely long-horizon tasks such as repository-level software engineering\"; HMPO\ndoes not yet extend to multi-turn agentic trajectories; and persistent web browsing (DeepSearch) plus\nlong-context comprehension \"remain the weakest axes for compact models.\"\n</Callout>\n\n## The take\n\nMach-Mind-4-Flash's contribution isn't a new architecture — it reuses an off-the-shelf 35B-A3B MoE —\nit's a **post-training recipe that composes cleanly**. MOPD turns \"train many experts, ship one model\"\nfrom a lossy averaging problem into a routed distillation where each capability keeps its own gradient,\nand the see-saw that plagues mixed-reward RL mostly disappears. HMPO is the tidy companion: make the\nlength budget a measured group-median instead of a tuned hyperparameter, gate it behind correctness,\nand reasoning gets shorter for almost free. 
Set against the field's default answer — spend more\npre-training compute — this is the argument that a lot of headroom is still sitting in the\n*post*-training stack, reachable by a compact model that only lights up 3B parameters at a time.\nWhether a 35B model can truly close the last gap to a trillion-parameter frontier on the hardest\nlong-horizon tasks is the open question the paper's own limitations point at — but as a demonstration\nthat specialize-then-integrate scales to a dozen domains without falling apart, it's a clean result.\n\n---\n\n*Built on the [Mach-Mind-4-Flash Technical Report](https://arxiv.org/abs/2607.09375) (Foundation Model\nTeam, Li Auto Inc., 2026; CC BY-NC-ND 4.0). All benchmark and efficiency figures are quoted from the\nreport; the MOPD and HMPO interactives are illustrations of the mechanism, and the capability curves in\nthe fusion diagram are illustrative, not measured. Related on this site:\n[Mixture of Experts, from scratch](/articles/mixture-of-experts-from-scratch),\n[Rollout Routing Replay](/articles/rollout-routing-replay), and\n[MiMo-V2-Flash](/articles/mimo-v2-flash), which also leans on multi-teacher distillation.*\n","readingTimeMins":10,"url":"https://ai.thesatyajit.com/articles/mach-mind-4-flash","signal":{"interest":3,"helpful":3,"score":6,"level":2,"label":"Solid"}},{"title":"Soofi S: a sovereign 3B-active model that keeps its cache near-constant","description":"A German consortium (KI Bundesverband, DFKI, Fraunhofer, TU Darmstadt, funded by the BMWE) built Soofi S 30B-A3B: an open, hybrid Mamba-Transformer MoE for German and English that activates ~3.2B of 31.6B parameters per token and keeps only 6 of 52 layers as attention — so decode throughput stays nearly flat as context grows while dense models decay. Pretrained on ~27T tokens with German deliberately up-weighted, on a sovereign German B200 cloud. 
This is a walk through the hybrid architecture, the near-constant-cache mechanism, the German-up-weighted data mixture, and the honest caveats — it is an author-reported consortium tech report, not peer-reviewed, and 'matches dense 14–27B' is an active-vs-total claim.","date":"2026-07-13","tags":["llm","mixture-of-experts","pretraining","inference-optimization","explainer"],"draft":false,"cover":"/articles/soofi-s/fig1.png","featured":false,"interest":3,"helpful":3,"kind":"articles","slug":"soofi-s","body":"Most open-model releases are open in name only: you get weights and an aggregate token count, not the\ndata, recipe, or checkpoints needed to audit or rebuild them. Most general-purpose multilingual models\nspread their capacity thinly across dozens of languages, leaving German underrepresented relative to its\neconomic weight. And most are dense full-attention Transformers, whose per-sequence [KV\ncache](/articles/how-llm-inference-works) grows with context and drags throughput down exactly in the\nlong-context, high-concurrency regime that costs the most to serve. **Soofi S 30B-A3B** — from a German\nconsortium coordinated by the KI Bundesverband (DFKI, Fraunhofer IAIS/IIS, TU Darmstadt, and others),\nfunded by the German BMWE — sets out to close all three gaps at once, and to do it on sovereign European\ninfrastructure.\n\nThe design that ties those goals together is a **hybrid Mamba–Transformer Mixture-of-Experts**: 31.6B\ntotal parameters, but only ~3.2B active per token, and a backbone that is mostly linear-time Mamba-2 with\nattention in just 6 of 52 layers. That last choice is the whole serving story — decode throughput stays\nnearly flat as context grows, where dense baselines fall off a cliff:\n\n<ThroughputScaling />\n\n<Figure\n  src=\"/articles/soofi-s/fig1.png\"\n  alt=\"Two panels. 
Left: Capability Index versus aggregate decode TPS per GPU at 40K context on a log x-axis; Soofi S 30B-A3B sits top-right (high capability, highest throughput) as a star, above clusters of international and European open models. Right: aggregate decode TPS per GPU versus context length from 4K to 256K; the Soofi S line stays flat near the top while dense baselines decay steeply and Qwen3.5 decays gently.\"\n  caption=\"Soofi S pairs frontier-level capability with the highest measured aggregate long-context decode throughput, and unlike dense full-attention baselines holds it as context grows to 256K. Throughput is measured TP=1, one B200, batch 32, latency-subtraction; the Capability Index is an author-defined average of five benchmark groups, each normalized to the best plotted model (Soofi S Pretraining Report v1.0, Figure 1).\"\n/>\n\n## The architecture: hybrid backbone, sparse experts\n\nSoofi S reuses the openly published **Nemotron 3 Nano** reference design without modification — a\ndeliberate choice, so the effect of the German–English data recipe can be measured against an\narchitecture-identical control (the same-arch [Nemotron](/articles/nemotron-nvfp4) baseline). The\nbackbone is 52 layers: **23 Mamba-2** sequence-mixing layers, **23 granular MoE** layers, and only **6\nGrouped-Query Attention** layers, distributed sparsely through the depth. Scrub the stack — and note how\nfew layers actually hold a cache:\n\n<HybridStack />\n\nThe Mamba-2 layers carry most of the sequence mixing with a *fixed-size recurrent state*; the attention\nlayers give exact long-range recall but are the only ones whose cache grows. The capacity lives in the\nMoE layers, and this is where \"30B at the cost of 3B\" comes from. 
Each MoE layer has **128 routed\nexperts** plus **2 shared experts**; a learned, sigmoid-gated router activates just **6 routed experts**\nper token, with the 2 shared always on:\n\n<MoeRouting />\n\nThe exact config, for reproducibility (Table 1): model dimension 2688, 32 attention query heads over just\n2 KV heads (head dim 128), Mamba-2 state dimension 128, expert dimension 1856, squared-ReLU MoE\nactivation, RMSNorm, no positional embeddings, untied embeddings. Total 31.6B parameters, ~3.2B active\nper token (~3.6B including embeddings).\n\n## Why the cache stays near-constant\n\nDecoding is memory-bandwidth bound: every generated token must re-read the model weights **and** the\nattention cache of every sequence in the batch. In a dense full-attention model that per-sequence cache\ngrows with context, so at tens or hundreds of thousands of tokens, served many-at-once, the KV reads come\nto dominate and throughput decays. Soofi's hybrid backbone attacks exactly this. Only 6 of 52 layers keep\na KV cache, with 2 KV heads each, so the incremental attention-cache footprint is about **6 KB per token\nper sequence** — the report puts that at **11–53× lower** than the dense models in its comparison. As\ncontext grows, only that small attention component scales with length; the Mamba-2 recurrent state stays\nconstant-size.\n\nThe measured payoff: at 40K context and batch 32, Soofi sustains **4.82k aggregate decode TPS/GPU**, a\nreported **9.2×** over Ministral 3 14B, while fitting the weights and all 32 sequence states on a single\nGPU. Across 4K→256K the aggregate decode rate stays essentially flat (no point more than ~34% below the\n4K value), where dense throughput decays with context. Among the comparison models only Qwen3.5 — itself\na Gated-DeltaNet hybrid — scales similarly, and its 35B-A3B variant still measures ~1.9× slower than Soofi\nat 40K. 
The prefill side shows the same shape: time-to-first-token at 256K is 372.7s for Soofi versus\n2,058.9s for dense Ministral 3 14B and 6,428.6s for a dense Qwen3 32B control.\n\n## The data: ~27T tokens, German on purpose\n\nSoofi S was pretrained on approximately **27 trillion tokens** (~26.68T actually consumed) under a\nthree-phase Warmup–Stable–Decay curriculum: ~20T of diverse, quality-tiered pretraining, ~6.58T of\nhigh-quality annealing, and a ~0.10T long-context extension that pushes the usable window to 1M tokens.\nThe defining move is that **German is deliberately up-weighted** — to 7.2% of the stable phase and 15.32%\nof the annealing mixture, the latter more than triple the ~5% total multilingual share of the reference Nemotron\nrecipe, and concentrated in a single language rather than spread across dozens.\n\n<Figure\n  src=\"/articles/soofi-s/fig2.png\"\n  alt=\"A flow (Sankey) diagram tracing seven data categories — English Web, Academic & Wiki, SFT, Reasoning, Code, Math, German — across three training phases. Phase 1 (~23T effective tokens) is dominated by English Web at 50.3%; Phase 2 (~6T) shifts toward skill data and raises German to 15.3%; Phase 3 (~188B) branches SFT into General, Code, and Math SFT for long-context extension.\"\n  caption=\"The effective-token mixture across the three phases. Phase 1 maximizes diversity (50.3% English web); Phase 2 concentrates skill-oriented data and raises German's share to 15.3%; Phase 3 is a length-bucketed long-context pool. Band width is each source's share of that phase's tokens (Soofi S Pretraining Report v1.0, Figure 3).\"\n/>\n\nThe corpus is documented at the granularity of individual source datasets — raw tokens, epoch multiplier,\neffective tokens, and even sources that were evaluated and *excluded* — so the mixture can be audited and,\nwhere licenses permit, rebuilt. 
German coverage combines naturally occurring web and document text (HPLT,\nGerman Commons, Genios, German FinePDFs/FineWiki) with machine-translated and synthetic German, since\nhigh-quality native German text is far scarcer than English. The report identifies that German data\npipeline as the principal bottleneck for further gains.\n\nJust as notable is *where* it ran. Soofi S was trained end-to-end on the **Industrial AI Cloud** operated\nby Deutsche Telekom in Munich — up to **512 NVIDIA B200 GPUs** (64 DGX B200 nodes), ~253,000 B200\nGPU-hours from 24 March to 13 May 2026, on a facility powered by renewable energy and cooled with water\nfrom the Eisbach canal. Training on German soil under European data-protection rules is itself part of the\n\"sovereign\" claim.\n\n## Results, and how to read them\n\nAgainst a set of large open-source models (Alia 40B, EuroLLM 22B, Apertus 70B, Olmo 3 32B), Soofi S is\nthe strongest in the set: highest **German aggregate** and, among fully open models, the highest English\nand German evaluation scores — ahead of Olmo 3 32B and Apertus 70B despite activating a fraction of their\nparameters.\n\n<BenchBars\n  title=\"German aggregate — open-source comparison (%)\"\n  unit=\"%\"\n  bars={[\n    { label: \"Soofi S 30B-A3B\", value: 79.1, highlight: true },\n    { label: \"Apertus 70B\", value: 72.8 },\n    { label: \"EuroLLM 22B\", value: 70.6 },\n    { label: \"Olmo 3 32B\", value: 69.2 },\n    { label: \"Alia 40B\", value: 68.4 },\n  ]}\n/>\n\nThe point the authors most want to land is capability-per-active-parameter: Soofi matches dense 14–27B\nmodels on English and German aggregates while activating only ~3.2B parameters per token.\n\n<BenchBars\n  title=\"Active parameters per token (B) — Soofi vs the open-source baselines\"\n  unit=\"B\"\n  bars={[\n    { label: \"Apertus 70B\", value: 70 },\n    { label: \"Olmo 3 32B\", value: 32 },\n    { label: \"EuroLLM 22B\", value: 22 },\n    { label: \"Alia 40B\", value: 40 },\n    { 
label: \"Soofi S (active)\", value: 3.2, highlight: true },\n  ]}\n/>\n\nIt also posts the best code aggregates in that comparison (HumanEval 73.8, MBPP 70.2, HumanEval-DE 65.5,\nMBPP-DE 84.2 — first on four of five code benchmarks), and leads the set on mathematics (GSM8K 86.1).\n\n<Figure\n  src=\"/articles/soofi-s/fig3.png\"\n  alt=\"A grouped bar chart, 'Base model evaluation overview', comparing Soofi S 30B-A3B (highlighted) against Alia 40B, EuroLLM 22B, Apertus 70B and Olmo 3 32B across English aggregate, German aggregate, Code EN, Code DE, Math EN, Math DE, MMLU-Pro and GPQA-D-DE. Soofi S is marked #1 on every group shown, with values 70.1, 79.1, 72.0, 74.9, 82.8, 71.5, 51.4 and 41.9.\"\n  caption=\"Base-model evaluation overview for the open-source comparison: Soofi S is #1 on every aggregate shown against Alia, EuroLLM, Apertus and Olmo 3. Aggregates are harness-level suite means; results are author-reported on the authors' selected suite (Soofi S Pretraining Report v1.0, Figure 4).\"\n/>\n\nRead against the *open-weight* set, the picture is more measured and the report says so: Soofi is **not**\nthe top model on aggregate — Qwen3.5 35B-A3B leads (English 74.6, German 81.6), and Soofi's English\naggregate (70.1) essentially ties Gemma 3 27B and Ministral 3 14B (both 70.3). Its clearest, cleanest\nresult is the architecture-identical comparison: versus Nemotron 3 Nano 30B-A3B (same backbone, different\ndata), the German–English recipe lifts German aggregate +4.2, held-out English +6.7, GPQA-Diamond +9.6,\nand German-language proficiency (GLP-DE) +15.1 — while English capability is preserved or improved, the\nusual price of monolingual specialization avoided.\n\n## The honest caveats\n\n<Callout type=\"warn\">\n  This is a **consortium tech report, not a peer-reviewed paper**. 
Every number is **author-reported**,\n  the \"Capability Index\" in Figure 1 is **author-defined** (an average of five benchmark groups, each\n  normalized to the best-plotted model), and the throughput figures use an **author-selected baseline set\n  and measurement protocol** (TP=1, one B200, batch 32, latency-subtraction). Soofi S is a **3B-active**\n  model: \"matches dense 14–27B\" is active-vs-total, not 30B-dense compute. \"Best/highest among fully open\"\n  and \"outperforms every European sovereign baseline\" are **scoped to their comparison and eval suite**,\n  and deliberate German up-weighting shapes the aggregates — so these are not unqualified SOTA claims.\n</Callout>\n\nThe report is candid about its own limitations. **Competition-style math in German** is the clearest gap\nto the frontier: Minerva MATH-DE 56.0 trails Qwen3.5 35B-A3B (76.5) and Gemma 3 27B (65.6). **Open-domain\nfactual recall** is capacity-limited — NaturalQuestions 79.0 trails the largest dense baselines (Gemma 3\n27B 83.5), consistent with storing world knowledge in only ~3B active parameters (the authors expect\nretrieval-augmentation to close this in practice). And on openness itself the report draws an explicit\nline: it satisfies the OSI's OSAID 1.0 (weights, checkpoints, training and eval code, exact per-source\ndata accounting, all under permissive licenses), but falls short of the stricter \"every training token\nmust be redistributable\" bar on exactly one component — the commercially licensed Genios corpus (1.3% of\nPhase 1) — so ~99% of the mixture, not 100%, can be independently reconstructed.\n\n## The take\n\nSoofi S's contribution is less a new mechanism than a **thesis about deployment cost, executed\ntransparently**. The near-constant cache is the load-bearing idea: by keeping only 6 of 52 layers as\nattention and letting Mamba-2 carry the rest with a fixed-size state, decode throughput stops caring about\ncontext length — which is where dense models bleed. 
Wrap that in a sparse MoE (3.2B active of 31.6B) and\nyou get a model that serves like a 3B but scores like a 14–27B dense on its target languages. Set the\nknobs honestly — author-reported numbers, an author-defined capability index, a scoped baseline set, a 3B\nactive budget, real gaps in German competition math and factual recall — and what remains is genuinely\nnotable: a fully documented, per-source-audited, German–English pretraining run on sovereign European\nB200 hardware, released with checkpoints and code. As a template for \"open in substance, efficient by\narchitecture, and built at home,\" it is a clean and unusually legible bet.\n\n---\n\n*Built on the [Soofi S Pretraining Report v1.0](https://huggingface.co/Soofi-Project) (\"A Sovereign,\nOpen-Source Foundation Model for German and English\", the Soofi-Team; consortium coordinated by the KI\nBundesverband, funded by the German BMWE). Architecture, data, and benchmark figures are quoted from the\nreport for commentary; the interactive diagrams are illustrations of the mechanism, and the throughput\ncurves use the report's measured endpoints with an illustrative in-between shape. Related: the\narchitecture-shared [Nemotron in NVFP4](/articles/nemotron-nvfp4), [mixture-of-experts from\nscratch](/articles/mixture-of-experts-from-scratch), [how LLM inference\nworks](/articles/how-llm-inference-works), and [large-scale\npretraining](/articles/megatrain-single-gpu-training).*\n","readingTimeMins":10,"url":"https://ai.thesatyajit.com/articles/soofi-s","signal":{"interest":3,"helpful":3,"score":6,"level":2,"label":"Solid"}},{"title":"Colibri: running a 744B model on a 25 GB machine by streaming experts from disk","description":"Colibri is a pure-C, zero-dependency engine that runs GLM-5.2 — a 744B-parameter MoE — on a consumer box with ~25 GB of RAM, by keeping only the int4 dense core resident and streaming the routed experts from disk on demand. 
It works because the model is extremely sparse and heavily quantized. But be honest about what 'runs' means: this is a feasibility feat, not a usable-speed setup — cold decode is 10–20 seconds per token, and even the best community result is ~3 seconds per token. And '25 GB of RAM' only holds if you also have ~370 GB of fast NVMe.","date":"2026-07-10","tags":["inference-optimization","mixture-of-experts","systems","explainer","llm"],"draft":false,"featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"colibri","body":"[GLM-5.2](/articles/glm-5-2) is a 744-billion-parameter model. Loaded the normal way, its\nweights want hundreds of gigabytes of fast memory — data-center territory. [Colibri](https://github.com/JustVugg/colibri)\nis a single-file, pure-C engine, zero external dependencies, that runs that same model on a\nconsumer machine with about **25 GB of RAM**. Not a distillation, not a smaller sibling — the\nfull 744B checkpoint, answering correctly, on a box that costs less than one H100's cooling fan.\n\nBefore the \"how,\" the honest headline: **this is a feasibility feat, not a usable-speed setup.**\nCold, Colibri decodes at roughly **0.05–0.1 tokens per second** — that is *10 to 20 seconds per\ntoken*. Warm, with every trick engaged, the best real community result is about **0.37 tok/s**,\nstill ~3 seconds per token. And \"25 GB of RAM\" is only true if you *also* have ~**370 GB of fast\nNVMe** for the experts. Keep both numbers in the same sentence or the claim is misleading.\n\n<Callout type=\"warn\">\n**Read the fine print before you get excited.** (1) It is *slow* — 0.05–0.1 tok/s cold is 10–20\nseconds **per token**; the best community case, ~0.37 tok/s, is still ~3 s/token. (2) \"Runs on 25 GB\nRAM\" **requires ~370 GB of fast disk** for the experts — the RAM number alone is misleading. (3) The\nnumbers here are **community-reported / single real test cases**, not official benchmarks — treat them\nas existence proofs, not spec sheets. 
(4) It works only because GLM-5.2 is an **extremely sparse MoE**\n*and* because of **int4/int8 quantization** — remove either and the trick collapses.\n</Callout>\n\n## Why this is even possible: extreme sparsity\n\nColibri does not compress 744B parameters into 25 GB. It exploits the fact that, at any moment,\nalmost none of those parameters are doing work. GLM-5.2 is a\n[Mixture of Experts](/articles/mixture-of-experts-from-scratch): each MoE layer holds **256 experts**,\nbut a router picks only a small **top-k** of them per token. Across the model, only about **40B of the\n744B parameters activate for a given token**, and of those, only ~**11 GB of weights** actually *change*\nfrom one token to the next — the routed experts. Everything else (attention, shared experts, embeddings —\nthe \"dense\" ~17B) is used on *every* token.\n\nThat split is the whole design. Split the model by how often each piece is touched:\n\n- **The dense core (~17B params)** is touched every token → keep it **resident in RAM**, int4-quantized\n  to **~9.9 GB**.\n- **The routed experts (21,504 of them: 75 MoE layers × 256, plus the MTP head, ~19 MB each at int4)**\n  are touched rarely and unpredictably → leave them **on disk (~370 GB)** and fetch only the handful a\n  token actually routes to.\n\nThe second enabler is quantization. The original FP8 checkpoint is ~**756 GB**; Colibri's offline\nconverter requantizes it to int4 (the experts) so the dense core fits in ~9.9 GB of RAM and the expert\nstore shrinks to ~370 GB on disk. Sparsity says *you rarely need most experts*; quantization says *the\nones you do need are small enough to stream*. You need **both**.\n\n## The three-tier memory, made visible\n\nSo the memory hierarchy has three tiers: the **int4 dense core resident in RAM**, an **LRU cache of hot\nexperts in RAM**, and the **full expert store on disk**. When the router picks an expert, one of two\nthings happens. 
If that expert is already in the RAM cache (or pinned), it is a **hit** — served at\nmemory speed. If not, it is a **miss**: a disk read of ~19 MB that sits **on the critical path** of that\ntoken. The token cannot finish until the bytes arrive.\n\nRoute a token through one MoE layer and watch it resolve. Scrub the token, resize the cache, toggle\nhot-expert pinning, and watch the LRU fill and evict:\n\n<ExpertStreaming />\n\nThis is the entire performance story in one picture. The router picks its experts; the cache absorbs the\nones you keep re-using; everything else is a disk fetch. A cold token — nothing cached — reads about\n**11 GB from disk** (75 layers × ~8 experts × ~19 MB). At a typical NVMe rate of ~1 GB/s, that read alone\nis ~10 seconds, which is exactly why cold decode lands at 0.05–0.1 tok/s. **Decode is disk-bound**, not\ncompute-bound: the CPU sits waiting on I/O. This is the same memory-wall intuition as ordinary\n[LLM inference](/articles/how-llm-inference-works), pushed to its limit — except the \"memory\" the decode\nwaits on is a spinning queue of NVMe reads instead of GPU HBM.\n\nColibri softens the disk with the usual systems tricks: an **LRU cache** so repeat experts stay hot, an\n**async readahead** that reads the *next* block of experts while the current one is still multiplying (so\ncompute and I/O overlap), the **OS page cache** acting as a free second-level cache, and a **RAM safety\nbudget** — the cache is auto-sized from `MemAvailable` at startup so it fills spare memory without ever\ntriggering an OOM kill. None of these change the physics of a cold miss; they just make misses rarer.\n\n## Budget the RAM and the disk together\n\nBecause the \"25 GB\" number is the seductive, misleading one, it is worth drawing to scale. On a 25 GB\nmachine the resident footprint is the 9.9 GB int4 core plus an auto-sized LRU expert cache plus a safety\nheadroom — and the 370 GB of experts sit on disk, roughly **15× larger than the entire RAM**. 
Slide the\ncache size and see how little of the model is ever in memory at once:\n\n<MemoryBudget />\n\nThat 15× gap is the point. Colibri did not shrink GLM-5.2 to fit in RAM; it arranged for only the\nin-use ~4% to be in RAM at any instant, and made disk the backing store for the rest. Take away the fast\nNVMe and there is no engine — which is why quoting the RAM figure without the disk figure sells a fiction.\n\n## Buying speed back: warm cache, MTP, and pinning\n\nCold decode is the floor, not the experience. Three things stack on top of it, and it is worth being\nprecise about what each one is.\n\n**A warm cache** is just locality: real prompts re-route to the same experts, so after a few tokens the\nhottest experts live in RAM and the miss rate drops. **Pinned hot experts** make that permanent — Colibri\nrecords which experts your usage actually routes to (a `.coli_usage` file) and pins the hottest ones in\nspare RAM, so the engine *literally gets faster the more you use it*.\n\n**MTP** is the subtle one. GLM-5.2 ships a native [multi-token-prediction](/articles/multi-token-prediction)\nhead — a lightweight draft model that proposes several future tokens, which the main model then *verifies*\nin one batched forward. Colibri runs it natively at **int8** (this matters: at int4 the draft head's\npredictions are so degraded that acceptance collapses to 0–4% and speculation never engages; at int8 it\nreaches ~39–59% acceptance, community-measured). The payoff is **2.2–2.8 tokens per forward** — but note\nthat is a *speculation/acceptance rate*, the number of tokens you get out of one main-model pass, **not**\ntokens per second. It amortizes the fixed per-forward cost; it does not make the disk faster. 
On a *cold*\ncache MTP can even be a net loss, because verifying extra draft tokens routes to *more* experts\n(~660 → ~1100 expert-loads/token) — so speculation only pays once the cache and pins are warm.\n\nStack them, and honestly label which rungs are measured versus an illustrative split:\n\n<ThroughputLadder />\n\nThe endpoints are real community numbers; the per-factor decomposition in the middle is illustrative\n(Colibri's README reports the endpoints, not the split). The takeaway is the seconds-per-token column: the\nbest real result, **0.37 tok/s on a Ryzen AI 9 Framework 13** with a warm cache, MTP and pinning, is still\nabout **one token every three seconds**. A different community machine — an M5 Max with **128 GB of RAM** —\nreaches **1.06 tok/s**, but only because far more RAM lets far more experts stay resident, which just\nconfirms the thesis: *the disk is the bottleneck, and RAM buys you out of it.*\n\n<BenchBars\n  title=\"Community-reported decode throughput (tok/s) — single test cases, not benchmarks\"\n  unit=\"\"\n  bars={[\n    { label: \"Cold (no cache)\", value: 0.08 },\n    { label: \"Core Ultra 7 · 24 GB\", value: 0.11 },\n    { label: \"Ryzen AI 9 · 128 GB · warm+MTP+pin\", value: 0.37, highlight: true },\n    { label: \"M5 Max · 128 GB\", value: 1.06 },\n  ]}\n/>\n\n## Colibri is the engine, not the model\n\nKeep the two things distinct. [GLM-5.2](/articles/glm-5-2) is the *model* — the 744B MoE, its routing, its\nMTP head, its training. **Colibri is an *engine*** that runs that model's forward pass in pure C on tiny\nhardware. The cleverness is entirely in the *systems* layer: how to lay out weights on disk, when to read\nthem, what to cache, how to overlap I/O with compute, how to quantize the head so speculation survives.\nIt reimplements GLM-5.2's forward pass faithfully — MLA attention with a compressed KV cache (~57× smaller\nthan dense), DeepSeek-V3-style routing, native MTP — but adds nothing to the model's *capability*. 
Same\nweights, same answers; the contribution is fitting them onto a laptop.\n\n## The take\n\nColibri is a lovely demonstration of a real principle: **a sparse MoE's active footprint, not its parameter\ncount, is what a runtime has to hold in fast memory.** Because GLM-5.2 fires only a few of its 256 experts\nper layer per token, and because int4/int8 quantization shrinks both the resident core and each streamed\nexpert, the working set collapses from hundreds of gigabytes to ~10 GB of RAM plus on-demand disk reads.\nThat is genuinely clever, and it is pure C with zero dependencies, which makes it a beautiful object to read.\n\nBut be clear-eyed about what it buys. It buys *access*, not *speed*: you can hold a conversation with a\n744B frontier model on a 25 GB machine, at the pace of a few seconds per token, provided you also own a\n370 GB fast disk. The bottleneck is not going away — it is the physics of pulling ~11 GB across NVMe for\nevery cold token — and the caching, pinning and speculation only push against it. As an existence proof\nthat extreme sparsity plus aggressive quantization can put a frontier model on consumer hardware, Colibri\nis compelling. As a way to actually *use* one at interactive speed, it is not there, and it is refreshingly\nhonest about that.\n\n---\n\n*Built from the [Colibri repository](https://github.com/JustVugg/colibri) (JustVugg; Apache-2.0) and its\nREADME. All throughput and hardware figures are **community-reported single test cases**, not official\nbenchmarks: cold ~0.05–0.1 tok/s and MTP 2.2–2.8 tok/forward on the dev machine; 0.37 tok/s (Ryzen AI 9,\nFramework 13), 1.06 tok/s (M5 Max, 128 GB), 0.11 tok/s (Core Ultra 7, 24 GB) from community runs.\nArchitecture figures — 744B total, ~17B dense, 9.9 GB int4 core, 21,504 experts, ~370 GB on disk — are from\nthe repository README. 
The interactive diagrams are illustrations of the mechanism, not measurements.*\n","readingTimeMins":9,"url":"https://ai.thesatyajit.com/articles/colibri","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"One small box, 64 users at once: continuous batching on a DGX Spark","description":"A single DGX Spark — a GB10 Grace-Blackwell edge box with 128 GB of unified memory — serves dozens of concurrent chat users on a 35B model. The trick isn't a faster chip; it's vLLM's continuous batching: every decode step gathers all live streams into one fused forward pass, so aggregate throughput scales far above single-stream while each user's own speed stays modest. A precise, honest walk through the mechanism, the numbers I could verify against the spark-bench results, and the ones I couldn't.","date":"2026-07-10","tags":["inference-optimization","llm","systems","explainer","moe","quantization"],"draft":false,"featured":false,"interest":3,"helpful":4,"kind":"articles","slug":"dgx-spark-batching","body":"A [DGX Spark](/articles/deepseek-dspark) is not a datacenter. It's a small GB10 Grace-Blackwell\nbox with **128 GB of unified memory** — the kind of thing that sits on a desk. And yet one of them,\nrunning [vLLM](https://github.com/vllm-project/vllm), will hold a conversation with **dozens of people\nat the same time** on a 35B-class model. The surprising part isn't the model or the silicon. It's a\nscheduling trick called **continuous batching**, and it's the single most important reason a small box\ncan feel like a shared server.\n\nThe instinct is to picture the users taking turns — one prompt finishes, the next begins. That's not\nwhat happens. At every decode step the server bundles **all** the currently-active conversations into\n**one** forward pass through the GPU, and appends exactly one new token to each. Sixty-four people, one\npass. 
Drag the concurrency and step the clock to see it move:\n\n<ContinuousBatch />\n\n## Why decoding one token is a waste of a GPU\n\nTo see why batching helps so much, you have to look at what generating a single token actually costs.\nLLM inference splits into two phases — the [prefill and decode](/articles/how-llm-inference-works) —\nand it's the **decode** phase, one token at a time, that dominates a chat session.\n\nDecoding one token for one user means: load the model's weights out of memory, multiply the single\ncurrent token's vector through them, read that user's entire [KV cache](/articles/turboquant-kv-cache)\nto do attention, and emit one token. The arithmetic is tiny — one token — but you had to stream **all\nthe weights** through the chip to do it. That makes single-stream decode **memory-bandwidth-bound**:\nthe GPU's compute units sit almost idle, waiting on memory. You paid to load the weights and barely\nused them.\n\nContinuous batching is the fix that falls out of that observation. If you're loading the weights\nanyway, run *more* tokens through them on the same load. Gather N users' current tokens, stack them,\nand do one fused matmul against the weights you already fetched. The weight-load cost — the expensive\npart — is now **amortized over N streams instead of one**. Aggregate throughput climbs steeply, and\nkeeps climbing until you run out of either compute or KV-cache memory.\n\n<ThroughputScaling />\n\nThat's the trade in two curves. **Aggregate** tokens/second rises and saturates; **per-user**\ntokens/second falls the whole way. Both are true at once, and conflating them is the most common way\nthese numbers get oversold.\n\n## \"Continuous,\" not just \"batched\"\n\nPlain static batching — wait for N requests, run them together, wait for all N to finish — would be\nuseless for chat, because requests arrive at random times and finish at wildly different lengths. 
One\nlong generation would stall everyone.\n\nvLLM's batching is **iteration-level** (also called *in-flight* batching): the batch is re-formed\n**every single decode step**. A request whose prompt just arrived joins the batch on the next step; a\nrequest that just emitted its stop token leaves it and frees its memory immediately. The GPU never\nblocks waiting for the slowest member — that's the \"continuous\" part, and it's what the join/leave\nchurn in the first diagram is showing. Streams flow through a batch that is constantly being rebuilt.\n\nThe enabling piece underneath is **PagedAttention**. Each user's KV cache is stored not as one\ncontiguous slab but as a list of fixed-size **blocks** (the paged rows in the diagram), allocated on\ndemand from a shared pool — exactly like virtual memory pages. Without it, fitting 64 independent,\ndifferent-length caches into one memory space would fragment badly and you'd waste most of it. With it,\n64 caches pack tightly, and a finished request's blocks return to the pool for whoever's next. KV-cache\nmemory — not FLOPs — is what ultimately caps how many users fit.\n\n## The numbers, and which ones I trust\n\nHere's where honesty matters. The story above is mechanism, and it's solid. The specific figures need\nsorting into what the public [spark-bench](https://github.com/Weschera/spark-bench) results actually\ncontain versus what's a reported run.\n\nThe model is **Qwen3.6-35B** — specifically an **A3B mixture-of-experts**: ~35B total parameters but\nonly **~3B active per token**. That's half of why it's fast (you only compute a fraction of the network\neach step) and why it fits comfortably (the weights are also **quantized** — the committed throughput\nruns use vLLM with **NVFP4**, a [4-bit format](/articles/nemotron-nvfp4), and an FP8 variant). 
A 35B\nmodel serving this briskly on a 128 GB box is a *quantized MoE*, not a dense fp16 35B — and that\ndistinction is load-bearing, not a footnote.\n\nThe committed `spark_bench.csv` sweeps concurrency **1 → 16** and shows the batching curve directly. On\nthe NVFP4 build, aggregate decode throughput roughly triples from a single user to sixteen:\n\n<BenchBars\n  title=\"Aggregate decode throughput climbs with concurrency — Qwen3.6-35B, NVFP4 (committed)\"\n  unit=\" tok/s\"\n  bars={[\n    { label: \"1 user\", value: 72 },\n    { label: \"8 users\", value: 161 },\n    { label: \"16 users\", value: 217, highlight: true },\n  ]}\n/>\n\n…while each individual user's stream slows by roughly a factor of four (85 → 22 tok/s); you're trading\npersonal latency for collective capacity:\n\n<BenchBars\n  title=\"Per-user throughput falls as the batch grows — same runs (committed)\"\n  unit=\" tok/s\"\n  bars={[\n    { label: \"1 user\", value: 85 },\n    { label: \"8 users\", value: 33 },\n    { label: \"16 users\", value: 22, highlight: true },\n  ]}\n/>\n\n<Callout type=\"warn\">\n**Read the caveats before quoting a headline number.**\n\n- **700+ tok/s is _aggregate_, across all streams — not per user.** At 64 users that's ≈ **11 tok/s each**,\n  a modest personal reading speed. Continuous batching raises *throughput*, not single-stream latency;\n  it does not make any one user's tokens arrive faster (time-to-first-token actually rises with batch\n  size, from ~0.3 s single-stream to ~2.5 s at concurrency 16 in the committed runs).\n- **The 64-user / ~700 tok/s / 32,768-tokens-in-54-seconds figures are a _reported_ run, not one I could\n  confirm in the repo.** The committed sweep for Qwen3.6-35B tops out at concurrency **16** (~217 tok/s\n  median, ~450 in the best single run). A ~723 tok/s aggregate *does* appear in the committed data — but\n  at concurrency **32**, and for a *different* model ([laguna](/articles/laguna-model-factory)), not this\n  one. 
So the 64-user number is consistent in magnitude with where the curve is heading, but treat it as\n  reported, not verified.\n- **The ~38 W figure I could not verify at all** — there is no power column anywhere in the committed\n  results. The DGX Spark's whole-box envelope is well above 38 W under load, so whatever that number\n  measures (a sub-component? an idle draw?), take it as reported until someone publishes the methodology.\n- **Quantization is doing real work.** These rates are for a **4-bit (NVFP4) / 3B-active MoE**, not a\n  dense fp16 35B. Different precision, different story.\n</Callout>\n\n## The take\n\nContinuous batching is the quiet reason \"local inference\" and \"serving other people\" stopped being\nmutually exclusive. The mechanism is honest and general: decode is memory-bound, a lone stream wastes\nthe GPU, so amortize the weight-load across as many live streams as KV-cache memory will hold, rebuilding\nthe batch every step so nobody waits on anybody. On a DGX Spark that turns a desk-sized box into a\nsmall shared server — genuinely dozens of concurrent users on a quantized 35B MoE.\n\nWhat it is *not* is a speedup for the person on the other end. Each user's tokens come at a human\nreading pace, and that pace drops steadily as the room fills up (roughly 85 → 22 tok/s per user from\n1 to 16 users in the committed runs). The DGX Spark headline — one small\nbox, many users — is real and impressive; it's just a statement about **aggregate** capacity and clever\nscheduling, not about raw single-stream speed. Hold both halves of that at once and the number stops\nbeing a magic trick and becomes what it actually is: good systems engineering.\n\n---\n\n*Mechanism (continuous / in-flight batching, PagedAttention) is standard\n[vLLM](https://github.com/vllm-project/vllm). 
Throughput and latency figures are read from the committed\n[spark-bench](https://github.com/Weschera/spark-bench) `results/spark_bench.csv` (Qwen3.6-35B-A3B, vLLM\nNVFP4/FP8, concurrency 1–16); the 64-user / ~700 tok/s / ~38 W figures are a reported run and are labelled\nas such above. The interactive diagrams illustrate the mechanism; their tok/s curve is a saturating fit\npinned to the committed points and the reported endpoint.*\n","readingTimeMins":7,"url":"https://ai.thesatyajit.com/articles/dgx-spark-batching","signal":{"interest":3,"helpful":4,"score":7,"level":3,"label":"Notable"}},{"title":"KAT-Coder-V2.5: training a coding model to live inside a repository","description":"Most 'coding models' are single-turn code generators. KAT-Coder-V2.5 is trained to act autonomously inside real, executable repositories — so the hard part isn't the model, it's the data-and-environment stack around it. This is a walk through that stack: AutoBuilder rebuilding real repos into verifiable sandboxes, a hint-boosted data flywheel that recovers near-miss trajectories without leaking hints, harness randomization that stops the policy overfitting its scaffold, an asymmetric PPO whose critic peeks at the future, and a multi-teacher distillation that fuses five specialists into one. It lands second only to Opus 4.8 on repository-level SWE — with honest caveats about internal baselines and benchmarks.","date":"2026-07-10","tags":["llm","agents","reinforcement-learning","code-generation","explainer"],"draft":false,"cover":"/articles/kat-coder-agentic-training/fig2.png","featured":false,"interest":3,"helpful":3,"kind":"articles","slug":"kat-coder-agentic-training","body":"A \"coding model\" usually means a next-token predictor you paste a function stub into. 
**KAT-Coder-V2.5**, from the Kwaipilot / Kuaishou team, is trained for a different job: to *act* — to open a real repository, run its tests, read the failures, edit files, and iterate until the tests pass, across dozens of turns. Once that's the target, the model stops being the hard part. The hard part is manufacturing enough **precisely specified, executable, objectively verifiable** tasks to train on, and building an RL loop that survives long, sparse-reward trajectories. This report is mostly about that manufacturing stack. It pairs with our pieces on [agent harnesses](/articles/agent-harness) and on [near-frontier code RL](/articles/swe-1-7); KAT-Coder is a full-stack answer to the same question those raise.\n\nThe one idea to leave with: **an autonomous coding agent is an environment-and-data problem before it is a modeling problem.** Everything below — AutoBuilder, the data flywheel, harness randomization, the asymmetric critic, multi-teacher distillation — exists to feed a policy verifiable practice inside realistic scaffolds, without letting it overfit any single one.\n\n<Figure\n  src=\"/articles/kat-coder-agentic-training/fig2.png\"\n  alt=\"Two-row pipeline diagram. Top row, 'Environment Scaling Engine': repo artifacts (repositories, issues, pull requests, commits) feed task mining (code patch and test patch into problem statement, requirements, interface constraints), a clarity check funnel, AutoBuilder (build-and-verify agent, config script, dependencies, clean checkout, sandbox execution), verification (parsed outputs, >90% tests collected, reproducible pass-to-pass/fail-to-pass), producing 100K+ environments across 12 languages at a 16.5%-to-57.2% success rate. 
Bottom row, 'Data Scaling Flywheel': rollout trajectories (failed, near-miss, passing) go through hint-boosted rollout (success 0% to 20%), hint-free replay, process filtering (exploration, localization, fidelity, minimality, verification, honesty), and harness robustness (randomized interfaces and injected perturbations). A right-hand column lists the outputs: precise tasks, executable envs, validation tests, verified patches, robust trajectories, reward signals, harness-invariant behavior.\"\n  caption=\"The agentic software-engineering data pipeline: an Environment Scaling Engine turns real repositories into verifiable sandboxes, and a Data Scaling Flywheel turns raw rollouts into high-value training signals (Huang et al., 2026, Figure 2).\"\n/>\n\n## AutoBuilder: turning real repos into verifiable sandboxes\n\nThe raw material is public repositories, but an issue title and a merged PR are not a task. Two problems have to be solved. First, **task mining**: raw issue/PR text is ambiguous and often misaligned with what actually got merged, so AutoBuilder regenerates a structured spec from the *golden patch* and *test patch* — a **problem statement** (from the golden patch), **requirements** (from the test patch), and **interface constraints** (from both) — then runs a clarity check to ensure it's self-contained.\n\nSecond, **environment construction**: a build agent writes a configuration script and a separate verification agent runs it in an isolated sandbox, accepting the environment only when it can collect **>90% of the expected tests** with reproducible fail-to-pass and pass-to-pass outcomes (exit codes and log-scraping are explicitly rejected as too easy to fool). Combining base images, language templates, and retrievable build recipes, this reaches a **57.2% environment-construction success rate** (up from a 16.5% starting point) and yields **>100,000 verifiable environments across 12 languages**. 
That corpus of executable tasks is the substrate everything else trains on.\n\n## The data flywheel: recovering near-misses without leaking hints\n\nVerifiable environments are necessary but not sufficient: on genuinely hard tasks the model's raw pass rate is near zero, so rollouts produce no learning signal. KAT-Coder's answer is a two-stage recovery loop. First, inject **process-level hints** to lift near-misses over the line, raising the pass rate from ~0% to **~20%**. But a trajectory that only succeeds *because it was handed a hint* teaches the model to expect hints it won't have at deployment — so the second stage **replays the same tasks hint-free** and keeps only the trajectories that still recover. The final data carries no hint leakage and stays faithful to the original distribution. Drag the hint strength and step the stages:\n\n<DataFlywheel />\n\nWhat survives then passes a **process-score filter** that scores each trajectory on exploration, localization, pre-edit reasoning, specification fidelity, repository conventions, patch minimality, verification quality, recovery, and honesty — down-weighting exploitative or unstable behavior even when the tests happen to pass. A parallel **harness-robustness** step randomizes tool names, argument conventions, and output formats and injects realistic perturbations (missing dependencies, transient failures, truncated outputs, noisy logs), so trajectories don't encode a single scaffold's quirks.\n\n## KwaiClawEnv: general agentic tool use\n\nRepository work isn't the only skill. 
**KwaiClawEnv** is a three-layer pipeline that manufactures general tool-use trajectories: a **Service layer** builds callable capabilities from human-authored Skills and LLM-generated Services (>90% generation success from open-source community Skills), a **Task layer** expands real task seeds into variants with configurable difficulty and tool-chain length, and an **Eval layer** converts rollouts into SFT-ready samples and feeds quality signals back upstream. It yields **>100,000 high-quality instances** with an **average of 15 tool calls** per task and the longest **exceeding 100 steps** — genuinely long-horizon agentic data, validated through a three-stage checker (service reachability, task-schema legality, sandboxed execution) and two-layer filtering (hard rules plus LLM-as-judge).\n\n## Reinforcement learning, hardened for long horizons\n\n### Harness randomization\n\nAn agentic policy trained inside one fixed scaffold overfits the *surface* of that scaffold, not the task. The report names three failure modes: **format overfitting** (anchoring to one action format, so parsing breaks when the protocol changes), **context-structure overfitting** (depending on how history is concatenated), and **control-flow overfitting** (relying on a fixed reflection/stop schedule). The fix is to train across many harnesses that vary along three axes — tool-invocation protocol, context management, and control flow — spanning **white-box** harnesses (like mini-swe-agent: simple, uncompressed, clean signal) and **black-box** production harnesses (Claude Code, Codex, OpenClaw, OpenHands: with compression and context reorganization). Flip the axes and watch the rendered action change while the task — and its reward — stay put:\n\n<HarnessRandomizer />\n\nUnderneath, the sandbox itself had to be hardened: container-image and environment-variable bugs meant **~16% of trajectories** initially failed for reasons unrelated to the model. 
Fixing disk pressure (95% → 60% usage) and timeouts drove sandbox-related failures to **below 2%**, and invalid-rollout rates from 6–7% to under 1%.\n\n### Asymmetric PPO with a hindsight critic\n\nLong-horizon tasks are sparse-reward: the signal lands only at the end, when the tests pass or fail. That makes the critic's job — estimating the value of an intermediate state — brutally high-variance, and a noisy value function means a noisy advantage, which destabilizes training. KAT-Coder's move is an **asymmetric actor–critic**: the actor sees only the normal harness state $s_t$ (so it behaves identically in training and deployment), while the critic is given a *privileged* **hindsight context** $c_t$ — the eventual reward, test outcomes, coverage signals, patch-level diffs, trajectory statistics, and subsequent turns. Scrub the turn and toggle hindsight to see the value estimate tighten:\n\n<HindsightCritic />\n\nConcretely, the standard clipped PPO objective is optimized:\n\n$$\n\\mathcal{J}_{\\mathrm{PPO}}(\\theta)=\\mathbb{E}_{q,\\,o}\\left[\\frac{1}{|o|}\\sum_{t=1}^{|o|}\\min\\!\\left(r_t\\hat{A}_t,\\ \\mathrm{clip}(r_t,1-\\epsilon,1+\\epsilon)\\hat{A}_t\\right)\\right], \\quad r_t=\\frac{\\pi_\\theta(a_t\\mid s_t)}{\\pi'(a_t\\mid s_t)},\n$$\n\nwith advantages from GAE, $\\hat{A}_t=\\sum_{l=0}^{T-t-1}(\\gamma\\lambda)^l\\delta_{t+l}$ and $\\delta_t=r_t+\\gamma V'(s_{t+1})-V'(s_t)$, where the $r_t$ inside $\\delta_t$ is the per-step reward rather than the probability ratio above (the symbol is reused). The asymmetry is entirely in the value function: the critic conditions on the hindsight context, so it fits $V_\\psi(s_t,c_t)$ rather than $V_\\psi(s_t)$ to the target $R_t$:\n\n$$\n\\mathcal{L}_{\\mathrm{critic}}^{\\mathrm{asym}}(\\psi)=\\mathbb{E}_{(s_t,c_t,R_t)}\\left[\\left(V_\\psi(s_t,c_t)-R_t\\right)^2\\right].\n$$\n\nBecause $c_t$ contains information the actor can't see, the value estimate is far less of a blind guess — lowering advantage variance without ever contaminating the deployed policy. 
Reward itself is a two-part framework: a **rule-based** reward with a core task score (all fail-to-pass *and* pass-to-pass tests must pass for full credit), eight behavioral constraints (duplication, garbled output, tool-call accuracy and placement, redundant calls, parallelism, debug-artifact cleanup), and failure-path incentives (file-search $F_2$, unit-test pass rate); plus a **model-based** generative reward model scoring fault diagnosis, post-fix validation, and execution strategy. The reported SWE training curve rises stably throughout.\n\n<Figure\n  src=\"/articles/kat-coder-agentic-training/fig3.png\"\n  alt=\"Architecture diagram of the RL training infrastructure. A top row labelled 'any agent harness' shows Claude Code, Codex CLI, OpenHands, mini swe-agent, and SWE-agent. They connect to 'Kwai Env', a dashed box containing a Gateway Server (middleware speaking Anthropic, OpenAI Chat, and OpenAI Responses protocols) linked to an Environment Module with Sandbox and Container, and an Experience Buffer. 
On the left, a Rollout Engine (N workers) exchanges token-in/token-out with the gateway and receives weight sync from a Train Engine (M workers), which is fed request-level samples from the experience buffer.\"\n  caption=\"The agentic RL infrastructure: a gateway server lets any agent harness drive sandboxed environments over multiple protocols, while rollout and train engines exchange trajectories and synced weights (Huang et al., 2026, Figure 4).\"\n/>\n\n### Multi-Teacher On-Policy Distillation\n\nRather than run five separate RL specialists and hope they compose, KAT-Coder fuses them with **Multi-Teacher On-Policy Distillation (MOPD)**: the student generates on-policy, and for each domain $d$ its distribution is pulled toward that domain's expert teacher $\\pi_{T_d}$ under a reverse-KL objective —\n\n$$\n\\mathcal{L}_{\\mathrm{MOPD}}(\\theta)=\\mathbb{E}_{(x,d)}\\,\\mathbb{E}_{y\\sim\\pi_\\theta(\\cdot\\mid x)}\\left[\\sum_{t=1}^{|y|}w_t\\,\\mathrm{KL}\\!\\left(\\pi_\\theta(\\cdot\\mid x,y_{<t})\\,\\middle\\|\\,\\pi_{T_d}(\\cdot\\mid x,y_{<t})\\right)\\right],\n$$\n\nfusing five experts — agentic software engineering, general agentic reasoning, terminal use, web coding, and general knowledge — into one model without the usual \"see-saw\" of gains in one domain costing another. 
Two stabilizers make on-policy distillation behave: an **off-policy cold start** (ordinary teacher-forced cross-entropy on teacher samples) to bootstrap before going on-policy, and **drift-aware dynamic truncation** that measures the top-$k$ overlap $\\rho_t=|\\mathcal{T}_t^k\\cap\\mathcal{S}_t^k|/k$ between teacher and student predictions and truncates where they diverge too far.\n\n## The numbers\n\nEvaluated under a unified Claude Code harness against a panel of frontier models (GLM-5.1, GLM-5.2, Kimi-K2.6, and Opus 4.8), KAT-Coder-V2.5 posts the **top score on PinchBench (94.9)** and ranks **second only to Opus 4.8** on repository-level SWE — SWE-Bench Pro **65.2** (Opus 4.8 leads at 69.2) and the team's own KAT Code Bench **53.1** (Opus 4.8 57.3):\n\n<BenchBars\n  title=\"Agentic coding benchmarks — KAT-Coder-V2.5 vs frontier panel (%)\"\n  unit=\"%\"\n  bars={[\n    { label: \"PinchBench (Avg)\", value: 94.9, highlight: true },\n    { label: \"Opus 4.8 · Pinch\", value: 93.5 },\n    { label: \"SWE-Bench Pro\", value: 65.2, highlight: true },\n    { label: \"Opus 4.8 · SWE-Pro\", value: 69.2 },\n    { label: \"KAT Code Bench\", value: 53.1, highlight: true },\n    { label: \"Opus 4.8 · KAT-Code\", value: 57.3 },\n  ]}\n/>\n\nThe picture is not uniform. On tool-use and repo-level SWE, KAT-Coder is at or near the frontier; on **Terminal-Bench 2.1 it scores 60.7** — behind GLM-5.2 (77.9), Kimi-K2.6 (73.0), and Opus 4.8 (84.6) — and on **SciCode 50.3**, behind GLM-5.2 (50.5) and both Kimi-K2.6 and Opus 4.8 (53.5). On KAT Claw Bench it reaches 85.5, with GLM-5.2 (86.8) and Opus 4.8 (90.7) ahead. The efficiency of the *training stack* — not inference — is the story; the paper reports no wall-clock or cost comparison against the frontier panel.\n\n<Figure\n  src=\"/articles/kat-coder-agentic-training/fig1.png\"\n  alt=\"Six grouped bar charts, one per benchmark (SWE-Bench Pro, KAT Code Bench, PinchBench, KAT Claw Bench, Terminal-Bench 2.1, SciCode). 
Each chart plots KAT-Coder-V2.5 in green against GLM-5.1, GLM-5.2, Kimi-K2.6, and Opus 4.8 in grey. KAT-Coder leads on PinchBench and is second on SWE-Bench Pro and KAT Code Bench, but trails on Terminal-Bench 2.1 and SciCode.\"\n  caption=\"KAT-Coder-V2.5 (green) against a frontier panel across six SWE and agent benchmarks: leading on PinchBench, second on repository-level SWE, and behind on terminal and scientific coding (Huang et al., 2026, Figure 1).\"\n/>\n\n<Callout type=\"warn\">\nHonest caveats. **The panel and two of the six benchmarks are author-chosen.** KAT Code Bench and KAT Claw Bench are the team's *own* new benchmarks; the comparison excludes other open agentic-coding methods and reports no ablations isolating what each component (harness randomization vs. the hindsight critic vs. MOPD) actually contributes — the \"16% → &lt;2%\" sandbox and \"0% → ~20%\" hint numbers are engineering deltas, not controlled ablations. The wins are **real but scoped**: first on PinchBench, second on SWE-Bench Pro, but **behind every panel model on Terminal-Bench 2.1 and behind three of four on SciCode** — the report itself flags terminal and scientific tasks as open weaknesses. The whole stack (AutoBuilder, KwaiClawEnv, the benchmarks) is internal infrastructure, so transfer to non-standard or closed repositories is unproven, and long-horizon credit assignment is \"partially addressed\" but still called a challenge.\n</Callout>\n\n## The take\n\nKAT-Coder-V2.5 is best read not as a model but as a **recipe for the surrounding machine**. Its most transferable ideas are architectural in the systems sense: verify environments by actually collecting tests (not scraping logs), recover near-misses with hints and then *strip the hints* so the data stays honest, randomize the harness so the policy learns the task instead of the scaffold, and hand the critic — but never the actor — a view of the future to tame sparse-reward variance. 
Set against [rollout-stability work like Routing Replay](/articles/rollout-routing-replay) and [cheaper frontier RL](/articles/frontier-rl-cheaper), the throughline is the same: at long horizons, most of the win is in the plumbing, not the loss function. Whether the fixed recipe generalizes past the team's own repositories and benchmarks — and closes the terminal and scientific gaps — is the open question; but as an account of what it takes to make a coding model *live inside a repository*, it's unusually complete.\n\n---\n\n*Built on the [KAT-Coder-V2.5 Technical Report](https://arxiv.org/abs/2607.05471) (Huang, Li, Xu et al.; Kwaipilot / Kuaishou, 2026). All benchmark values are quoted from the paper's Table 4 (unified Claude Code harness); the interactive diagrams are illustrations of the mechanism, not measured traces. PinchBench averages are the paper's, retrieved from pinchbench.com on 2026-07-02.*\n","readingTimeMins":11,"url":"https://ai.thesatyajit.com/articles/kat-coder-agentic-training","signal":{"interest":3,"helpful":3,"score":6,"level":2,"label":"Solid"}},{"title":"MusaCoder: teaching a model to write GPU kernels with execution-feedback RL","description":"Generating a correct, fast CUDA/MUSA kernel from a PyTorch reference is a task where almost every early attempt fails to compile, so execution-based RL starves on sparse rewards, reward-hacks with PyTorch fallbacks, and destabilizes. MusaCoder is a full-stack recipe — kernel-oriented data synthesis, diversity-preserving rejection fine-tuning, and RL against the MooreEval verifier — held together by three fixes: PrimeEcho anchors multi-turn rewards to the first turn, Buffered Dynamic Retry recovers signal from all-failed samples, and MirrorPop masks off-policy sequences the vanilla filter misses. The 27B model reaches 93.2 Pass@8 on KernelBench, ahead of the closed frontier models the paper tests. 
A walk through the reward, the three stabilizers, the numbers, and the honest caveats.","date":"2026-07-10","tags":["llm","reinforcement-learning","gpu","cuda","code-generation","explainer"],"draft":false,"cover":"/articles/musacoder-gpu-kernels/fig1.png","featured":false,"interest":3,"helpful":3,"kind":"articles","slug":"musacoder-gpu-kernels","body":"Ask a language model to turn a PyTorch module into a hand-written CUDA (or [MUSA](https://en.wikipedia.org/wiki/Moore_Threads), Moore Threads' CUDA-alike) kernel and you hit a task that punishes it at every turn. The output has to compile against a real toolchain, launch without an illegal memory access, produce numerically-matching results across dtypes and shapes, **and** run faster than the reference — or it is worthless. Most first attempts fail outright, which is exactly what makes the obvious training recipe, execution-based reinforcement learning, so hard: when nearly every rollout scores the same failing reward, there is no gradient to learn from. Worse, the model quickly discovers it can \"pass\" a correctness check by quietly calling the very PyTorch operator it was asked to replace.\n\n**MusaCoder** (Cheng et al., Moore Threads, 2026) is a full-stack answer to that. It is less a single trick than an assembled pipeline — data synthesis, supervised and rejection fine-tuning, and RL against an execution verifier — with three stabilization mechanisms that keep the RL from collapsing. The result: a 9B model built on Qwen3.5-9B that matches the frontier closed models the paper evaluates, and a 27B model (on Qwen3.6-27B) that tops them on the paper's KernelBench protocol.\n\n<Figure\n  src=\"/articles/musacoder-gpu-kernels/fig1.png\"\n  alt=\"MusaCoder training pipeline. Left: raw sources feed six data corpora — PyTorch-to-CUDA generation, GPU kernel knowledge QA, profiling analysis, optimization rewrite, and kernel review/repair. 
Middle-bottom: auxiliary augmentation (shape/stride hints, unit tests, metadata, weak-op upsampling), LLM-based multi-agent synthesis, and the MooreEval evaluator that parses, compiles, correctness-checks, anti-hacking-checks and benchmarks, producing a verified kernel-oriented corpus. Right: multi-task SFT, then diversity-preserving RFT, then two-stage RL (single-turn warmup then multi-turn) with the PrimeEcho, MirrorPop and Buffered Dynamic Retry optimization methods.\"\n  caption=\"The full MusaCoder pipeline: kernel-oriented data synthesis and the MooreEval verifier feed a verified corpus, then multi-task SFT → diversity-preserving RFT → two-stage RL with three stabilizers (Cheng et al., 2026, Figure 2).\"\n/>\n\n## The reward is where correctness gets enforced\n\nEverything downstream depends on one design choice: the scalar reward `s(c)` that MooreEval — the paper's distributed compile/execute/verify environment — assigns to a candidate kernel `c`. MooreEval first produces a structured verdict `V(c,x) = (compiled, correct, legal, speedup, category, detail)`, and collapses it into a **correctness-first** reward (Equation 2):\n\n$$\ns(c)=\\begin{cases}\n-1, & \\text{extraction / compile / runtime fails},\\\\\n-1, & \\text{a disallowed PyTorch/aten::* fallback is detected},\\\\\n-1, & q=0,\\\\\n-0.5+0.5\\,q, & 0<q<1,\\\\\n1+\\lambda\\cdot\\min\\!\\big(\\max(\\nu-1,0),\\,\\nu_{\\max}\\big), & q=1 \\text{ and legal},\n\\end{cases}\n$$\n\nwhere `q` is the fraction of test cases passed and `ν` the measured speedup over the PyTorch baseline. The shape of this function *is* the anti-hacking policy. A kernel that cheats by falling back to `aten::*` scores exactly the same `−1` as one that never compiled; partial correctness earns a bounded, still-negative shaping term so the model can climb out of \"totally broken\"; and **only** a fully correct, legal, native kernel crosses zero, at which point a clipped speedup bonus applies. 
Drag the verdict below and watch where each outcome lands:\n\n<RewardLadder />\n\nThat single wall at zero is what keeps the model honest. Speed is never rewarded until correctness is banked, and forbidden fallbacks are punished as hard as a crash — so the fastest way to positive reward is to actually write the kernel.\n\n## Data and fine-tuning: getting off the floor before RL\n\nRL only works if the base model clears the bar *sometimes*. MusaCoder spends most of its pipeline manufacturing that starting competence. A three-stage data engine expands the PyTorch-to-CUDA/MUSA workload distribution (real modules, cleaned GitHub projects, and NNSmith-generated computation graphs across ~162 operators), injects tensor **shape/stride/contiguity** hints extracted with `torch.fx`/`torch.export`, and enforces a six-step structured-reasoning template before any code is written. Auxiliary corpora add GPU-kernel knowledge Q&A and a kernel-**reviewer** task that must emit `VERDICT: CORRECT` or `INCORRECT`.\n\nTwo fine-tuning stages follow. Multi-task **SFT** teaches canonical kernel patterns and error-diagnosis (with loss masking so feedback tokens are context-only). Then a **diversity-preserving rejection fine-tuning (RFT)** step deliberately breaks with convention: standard RFT keeps only the single fastest correct sample, which collapses entropy; MusaCoder instead retains a *heterogeneous* set of verified-correct implementations, preserving the exploration diversity that RL will need. That choice alone is worth 2.2 points of Pass@8 (SFT 84.8 → 82.6 without RFT, Table 2). 
If you have not met [`torch.profiler`](/articles/torch-profiler) — the same tool MusaCoder uses to diagnose which operator families the base model is weak on — it is worth a detour.\n\n## The three stabilizers\n\nRL runs in two stages: a single-turn warmup to establish basic execution understanding, then multi-turn feedback RL where a failed kernel gets MooreEval's error log appended and the model tries again. On top of a GRPO objective, three mechanisms keep it from falling over.\n\n### PrimeEcho — anchor the reward to the turn that ships\n\nIn multi-turn RL the naive reward is the best score across all turns, `max_k s_k`. But the model only ever *deploys* its first-turn kernel, and rewarding best-of-turns teaches it to defer correctness — ship something broken, then \"fix\" it once the verifier hands it the error. PrimeEcho blends the two (Equation 9):\n\n$$\nR_{\\tau} = \\alpha\\,s_{1} + (1-\\alpha)\\max_{1\\le k\\le K} s_{k} + b_{\\text{early}}(\\tau),\n$$\n\nwith an early-success bonus `b_early = β₁·1[success at turn 1] + β₂·1[fail at 1, success at 2]`. Keeping α high anchors the reward to zero-shot quality while still letting later turns supply exploration signal. Slide α and watch a deliberately-late trajectory get *more* reward as the anchor weakens — the exact hack PrimeEcho suppresses:\n\n<PrimeEcho />\n\n### Buffered Dynamic Retry — rescue the all-failed groups\n\nGRPO normalizes advantages within a group of `G` rollouts. When a task is hard enough that **all** `G` samples fail — `r_i = −1` for every `i` — the advantages are all zero and the sample contributes **no gradient**, so the hardest tasks teach nothing. Buffered Dynamic Retry (BDR) composes a *repair task* `x' = Compose(x, c⁻, f⁻)` from a failed kernel and its feedback, pushes it into a FIFO buffer `B`, and mixes buffered repair tasks back into training with probability `p_buf`. It turns a dead rollout group into a feedback-conditioned second chance. 
In the paper's isolated test (Table 3) BDR lifts Pass@8 from 59.6 → 62.4 on a Qwen3-8B checkpoint (~16% of previously-failed tasks recovered) and 73.2 → 74.4 on Qwen3.5-9B (~28% recovery).\n\n### MirrorPop — catch the off-policy sequences that cancel\n\nMusaCoder's rollouts are generated asynchronously, so the rollout policy drifts from the training policy and the per-token importance ratio `ρ_t` no longer sits at 1. Vanilla sequence-level masking scores a response by its *signed* mean log-ratio — but a response that is badly off-policy with roughly equal positive and negative deviations averages to ≈0 and slips through as if it were on-policy (the paper's Figure 11 \"cancellation\" case). MirrorPop instead uses the mean **absolute** log-ratio, which every token can only push upward, and masks the sequence when it exceeds a threshold δ (Equation 21):\n\n$$\nM_i^{\\text{mirrorpop}} = \\mathbf{1}\\!\\left[\\frac{1}{L_i}\\sum_{t=1}^{L_i}\\big|\\log \\rho_{i,t}\\big| \\le \\delta\\right].\n$$\n\nToggle the two responses below — an on-policy one and a drifted one whose ratios cancel — and see which filter catches the drift:\n\n<MirrorPop />\n\nThis is the same failure mode that [Rollout Routing Replay](/articles/rollout-routing-replay) fixes at its source for MoE routers and that [async frontier-RL setups](/articles/frontier-rl-cheaper) wrestle with generally — here it is handled at the masking layer. Of the three stabilizers, MirrorPop is the one whose removal hurts most.\n\n## The numbers\n\nMusaCoder is evaluated under its own strict MooreEval protocol on KernelBench, split into Level 1–3 by difficulty. **Pass@8** asks whether at least one of 8 samples passes verification; **Avg.@8** is the mean correctness rate across the 8; **Faster Rate** counts a candidate only if it is correct, legal, *and* beats the baseline by more than 1.1×. 
The headline is overall correctness — MusaCoder-27B-RL reaches **93.2 Pass@8 / 88.6 Avg.@8**, ahead of every model the paper tests:\n\n<BenchBars\n  title=\"KernelBench correctness — Avg.@8, Overall (%)\"\n  unit=\"%\"\n  bars={[\n    { label: \"MusaCoder-27B\", value: 88.6, highlight: true },\n    { label: \"MusaCoder-9B\", value: 77.2, highlight: true },\n    { label: \"Claude Opus 4.7\", value: 77.3 },\n    { label: \"GLM-5.1\", value: 76.25 },\n    { label: \"Kimi K2.6\", value: 69.1 },\n    { label: \"DeepSeek-V4-Pro\", value: 54.9 },\n  ]}\n/>\n\n<Figure\n  src=\"/articles/musacoder-gpu-kernels/fig2.png\"\n  alt=\"Grouped bar chart of KernelBench correctness (Avg.@8 correct rate, %) for GLM-5.1, Kimi K2.6, DeepSeek-V4-Pro, Claude Opus 4.7, and MusaCoder (Ours) across Overall, Level 1, Level 2, and Level 3. MusaCoder's orange bars are highest in every group, most dramatically on Level 3 where it reaches 65.8 versus 38–40 for the baselines.\"\n  caption=\"KernelBench correctness (Avg.@8) by difficulty level: MusaCoder leads in every tier and pulls away on the hardest Level 3 (Cheng et al., 2026, Figure 1).\"\n/>\n\nThe gap widens on the hardest tier. On **Level 3**, MusaCoder-27B-RL scores **72 Pass@8 / 65.75 Avg.@8** against Claude Opus 4.7's 54 / 39.25 and GLM-5.1's 54 / 38.50. That puts the RL model's average correctness at about 1.7× that of the frontier baselines on the tasks where kernels are hardest to get right. Notably the 9B model essentially ties Claude Opus 4.7 (77.2 vs 77.3 Avg.@8) despite being a fraction of the size, and the RL stage is decisive: MusaCoder-27B jumps from 79.4 (SFT) to 88.6 (RL) Avg.@8.\n\nCorrectness is the easy win; **speed is much harder**. 
Even a correct kernel rarely clears the 1.1× bar over the PyTorch baseline, so absolute Faster Rates are low across the board, but MusaCoder-27B still leads against eager execution:\n\n<BenchBars\n  title=\"KernelBench Faster Rate vs eager, Overall (%)\"\n  unit=\"%\"\n  bars={[\n    { label: \"MusaCoder-27B\", value: 15.0, highlight: true },\n    { label: \"Claude Opus 4.7\", value: 11.8 },\n    { label: \"GLM-5.1\", value: 7.4 },\n    { label: \"DeepSeek-V4-Pro\", value: 5.2 },\n    { label: \"MusaCoder-9B\", value: 5.4, highlight: true },\n  ]}\n/>\n\nAgainst `torch.compile` (a tougher bar), MusaCoder-27B-RL's Faster Rate is 9.2% vs Claude Opus 4.7's 7.5%. On the authors' ported **MUSA KernelBench** (Table 4), the 27B model leads on both correctness and speed (92.4 Pass@8 / 81.7 Avg.@8 / 12.5 Faster) over DeepSeek-V4-Pro (92.0 / 56.9 / 5.7) and GLM-5.1 (88.0 / 66.4 / 6.9).\n\nThe ablation (Table 2) confirms each stabilizer earns its place, measured as removals from the full RL model (93.2 Pass@8):\n\n<BenchBars\n  title=\"Ablation — Overall Pass@8 as each piece is removed (%)\"\n  unit=\"%\"\n  bars={[\n    { label: \"full RL\", value: 93.2, highlight: true },\n    { label: \"− single-turn warmup\", value: 90.8 },\n    { label: \"− BDR\", value: 88.6 },\n    { label: \"− PrimeEcho\", value: 88.4 },\n    { label: \"− MirrorPop\", value: 86.0 },\n  ]}\n/>\n\nDropping MirrorPop costs the most (93.2 → 86.0), consistent with off-policy drift being the dominant instability in asynchronous kernel-generation RL.\n\n## The honest caveats\n\n<Callout type=\"warn\">\nThe comparison is **provider-run and provider-defined**. MooreEval, the strict verification protocol, the difficulty split, and the *MUSA* KernelBench variant are all authored by the same team as the model; the closed frontier baselines (Claude Opus 4.7, DeepSeek-V4-Pro/-ProMax, GLM-5.1, Kimi K2.6) were evaluated by the authors under that protocol, not self-reported. 
Read the numbers as \"MusaCoder wins on the bench MusaCoder built,\" which is a real result but not a neutral one.\n</Callout>\n\nA few more things worth stating plainly:\n\n- **The base models are already strong.** MusaCoder-27B starts from Qwen3.6-27B (67.2 Pass@8) and MusaCoder-9B from Qwen3.5-9B — the recipe adds a lot on top, but this is not a from-scratch capability.\n- **Speed remains the weak axis.** A ~15% Overall Faster Rate means the *large majority* of even correct generated kernels do not beat the reference. The framing is correctness-first for a reason; treat the speedups as a bonus, not the story.\n- **The reward's tuning knobs aren't disclosed.** The paper leaves `λ` (performance weight) and `ν_max` (speedup clip) as symbols; the values in the reward interactive above are illustrative, chosen to show the shape, not read from the paper.\n- **No dedicated limitations section.** The paper does not enumerate its own failure modes or generalization limits, so the boundaries of the approach — how it holds up on operator families outside the ~162 synthesized, or on GPUs beyond the CUDA/MUSA pair — are left for the reader to infer.\n\n## The take\n\nMusaCoder's real contribution is not any one of its parts but the recognition that execution-feedback RL for kernel generation fails in *three specific, nameable ways* — sparse rewards, multi-turn reward hacking, and off-policy cancellation — and a targeted fix for each. The correctness-first reward makes cheating pointless; PrimeEcho keeps the model honest about the turn that ships; BDR rescues the hardest tasks from the dead-gradient zone; MirrorPop stops async drift from poisoning the update. It is careful RL engineering more than a new algorithm, and the payoff — a 9B model tied with the closed frontier and a 27B model ahead of it on the paper's bench — is the kind of result that only shows up when every stage of the pipeline is doing its job. 
Whether the lead survives a neutral, third-party benchmark is the open question the provider-run setup leaves on the table.\n\n---\n\n*Built on [MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU](https://arxiv.org/abs/2606.04847) (Cheng, Lu, Liao et al.; Moore Threads, 2026; CC BY-SA 4.0). All benchmark numbers are quoted from the paper's tables; the interactive diagrams illustrate the reward and stabilization mechanisms and use illustrative parameter values where the paper leaves them unspecified.*\n","readingTimeMins":11,"url":"https://ai.thesatyajit.com/articles/musacoder-gpu-kernels","signal":{"interest":3,"helpful":3,"score":6,"level":2,"label":"Solid"}},{"title":"Switch Transformers: route every token to exactly one expert","description":"The 2021 paper that simplified Mixture-of-Experts by routing each token to a single expert instead of the usual two — cutting router math and cross-device communication, and scaling to a 1.6-trillion-parameter sparse model. This is a walk through the Switch layer, the capacity buffer that drops overflow tokens, the load-balancing loss and fp32 router that make top-1 routing stable, and the honest costs: sparse-not-dense compute, sample-efficiency-not-wall-clock speedups, and dropped tokens. The foundational simplification the modern MoE zoo descends from.","date":"2026-07-10","tags":["mixture-of-experts","llm","architecture","deep-learning","explainer"],"draft":false,"cover":"/articles/switch-transformer/fig1.png","featured":false,"interest":3,"helpful":4,"kind":"articles","slug":"switch-transformer","body":"Every modern giant open-weights model — [LongCat 2.0](/articles/longcat-2) at 1.6T\nparameters, and the rest of the sparse-MoE zoo — runs on one idea: don't run all the\nparameters on every token. Keep a big pile of experts, and for each token light up\nonly a few. 
The mechanism, built from a router and a sparse forward pass, is walked\nthrough in [Mixture of Experts, from scratch](/articles/mixture-of-experts-from-scratch).\n**Switch Transformers** (Fedus, Zoph & Shazeer, 2021) is the paper that made that idea\n*simple* enough to scale — and it did so with one deliberately blunt move: route each\ntoken to **exactly one** expert.\n\nThat sounds like a footnote. It was the whole contribution. The prior MoE recipe\n(Shazeer et al., 2017) argued you needed to route each token to at least the **top-2**\nexperts — the reasoning being that comparing two experts gives the router a gradient\nsignal to learn *which* is better. Switch throws that out and keeps only the **top-1**,\nthe single argmax expert. Flip the toggle below and watch the routing collapse from two\nconnectors per token to one:\n\n<SwitchRouting />\n\n<Figure\n  src=\"/articles/switch-transformer/fig1.png\"\n  alt=\"Switch Transformer encoder block. Left, a standard transformer block with a Switching FFN Layer replacing the dense feed-forward network. Right, the layer expanded: two tokens x1 and x2 each pass through self-attention, then a Router that selects a single FFN — token x1 goes to FFN 2 with probability 0.65, token x2 goes to FFN 1 with probability 0.8 — and the chosen expert's output is scaled by that gate probability before the add-and-normalize.\"\n  caption=\"The Switch layer swaps the dense FFN for a set of expert FFNs; the router sends each token to just one expert (x1→FFN 2 at p=0.65, x2→FFN 1 at p=0.8) and scales that expert's output by the gate probability (Switch Transformers, Fedus et al. 2021, Figure 2).\"\n/>\n\nWhy is top-1 worth a paper? Because the second expert you drop is not just spare\ncompute: it is a second copy of every token that has to be **dispatched to another\ndevice**. Experts are sharded across accelerators; routing a token to an expert means\nsending its activation over the network. 
Halving the experts-per-token roughly halves\nboth the router's arithmetic and the all-to-all communication volume — the actual\nbottleneck at scale. Switch's claim is that with the right guardrails, one expert per\ntoken loses little quality while making the whole thing dramatically cheaper to run.\n\n## The capacity buffer, and the tokens it drops\n\nThe catch with routing is that it is dynamic — you don't know until runtime how many\ntokens will pick each expert, but hardware needs **fixed** tensor shapes. Switch solves\nthis by giving every expert a fixed buffer:\n\n$$\n\\text{expert capacity} = \\frac{\\text{tokens per batch}}{\\text{number of experts}} \\times \\text{capacity factor}\n$$\n\nIf routing were perfectly uniform, a capacity factor of `1.0` would give each expert\nexactly its fair share of slots. It never is uniform. When more tokens route to an\nexpert than it has slots, the overflow tokens are **dropped** — they skip the layer\nentirely and pass through the residual connection unchanged. Drag the load imbalance and\nthe capacity factor and watch tokens overflow into red:\n\n<CapacityDrop />\n\nThis is the core tradeoff, made concrete. A higher capacity factor (the paper tests\n`1.0`, `1.25`, and `2.0`) means fewer dropped tokens — but every empty slot is compute\nand memory spent on nothing. A lower factor is cheaper but throws away more tokens. The\nwhole point of good load balancing is to flatten the routing so a *small* buffer\nsuffices.\n\n## The two tricks that make top-1 stable\n\nBlunt top-1 routing would collapse — the router would learn to send everything to a\nhandful of experts, starving the rest. Two mechanisms hold it together:\n\n- **Differentiable load-balancing loss.** An auxiliary loss added at every Switch layer,\n  scaled by $\\alpha = 10^{-2}$, is minimized when tokens are spread **uniformly** across\n  experts. 
It's the product of the fraction of tokens dispatched to each expert and the\n  router's average probability mass on that expert, summed over experts — a smooth\n  penalty that pushes the router toward balanced assignment without hard constraints.\n- **Selective precision.** Large sparse models train in `bfloat16` for speed, but the\n  router's `exp`/softmax over expert logits is numerically fragile — small perturbations\n  flip the argmax and destabilize training. Switch casts **only the router's internal\n  computation to `float32`**, keeping everything else in `bfloat16`. The fp32 stays local\n  to the router (it isn't communicated across devices), so it buys stability at no\n  bandwidth cost.\n\nAdd **expert dropout** at fine-tuning time — a higher dropout rate of `0.4` inside the\nexpert layers versus `0.1` elsewhere — to keep the huge sparse model from overfitting\nsmall downstream datasets, and top-1 routing trains cleanly.\n\n## What it bought: 7× at matched FLOPs\n\nHeld to the **same FLOPs per token** as a dense T5, Switch reaches the same pretraining\nquality far sooner. The 64-expert Switch-Base hits T5-Base's quality in about\n**one-seventh** the training steps; scaled up, Switch-XXL reaches T5-XXL's quality about\n**4×** faster.\n\n<Figure\n  src=\"/articles/switch-transformer/fig2.png\"\n  alt=\"A learning-curve chart with negative log perplexity on the y-axis and training time on the x-axis. Four curves: Switch-Base with 128, 64, and 32 experts all rise well above the T5-Base curve, reaching a given quality much earlier. A horizontal arrow labeled 7x Speedup marks the training-time gap between Switch-Base and T5-Base at equal quality.\"\n  caption=\"At equal FLOPs per token, Switch-Base reaches a target quality about 7× sooner than the dense T5-Base — the sample-efficiency win that motivates the whole design (Switch Transformers, Fedus et al. 
2021, Figure 5).\"\n/>\n\n<BenchBars\n  title=\"Pretraining speedup to match the dense baseline (matched FLOPs, ×)\"\n  unit=\"×\"\n  bars={[\n    { label: \"Switch-Base vs T5-Base\", value: 7.0, highlight: true },\n    { label: \"Switch-XXL vs T5-XXL\", value: 4.0, highlight: true },\n  ]}\n/>\n\nThe other headline is raw scale. By stacking experts, the paper builds **Switch-C** with\n**2,048 experts** and roughly **1.6 trillion** total parameters — while **Switch-XXL**\ntakes a different bet, only 64 experts but a much larger per-expert FFN, at ~395B\nparameters. Both were among the largest models trained at the time.\n\n## Distilling back to dense\n\nA 1.6T-parameter sparse model is impractical to *serve* for many use cases — the\nparameters have to live in memory across many devices even if each token only touches a\nfew. So the paper distills the sparse teacher back into a small **dense** student, and\nfinds you can compress the model by up to **99%** while still keeping about **30%** of the\nquality improvement the sparse model earned over its dense baseline. Not all of it — but\na meaningful slice of the gains survives into a model you can run on modest hardware.\n\n<Callout type=\"warn\">\nRead the wins precisely — none of them is a free lunch.\n\n- **1.6T is sparse, not dense.** Switch-C activates a *single* expert's FFN per token, so\n  the FLOPs and activated parameters per token stay close to the dense T5 backbone —\n  nowhere near 1.6T of compute. The trillion parameters are capacity you *store and\n  communicate*, not compute you *spend* per token. Never read \"1.6T\" as dense-1.6T cost.\n- **7× is sample-efficiency at matched FLOPs, not wall-clock magic.** It means reaching a\n  quality target in fewer steps at equal FLOPs-per-token — bought by spending far more\n  **memory and cross-device communication** on many more parameters. 
On different\n  hardware or with communication-bound serving, the real-world speedup shrinks.\n- **Top-1 is not strictly better than top-2.** Routing to one expert can be lower-quality\n  per FLOP in some settings; Switch's contribution is *showing it works well* once you add\n  the capacity buffer, the load-balancing loss, and the fp32 router — not that fewer\n  experts is always better.\n- **Dropped tokens are lost information.** At a low capacity factor, overflow tokens skip\n  the layer entirely. That's a genuine cost you trade against buffer waste — there's no\n  setting that removes it, only balances it.\n</Callout>\n\n## Why it still matters\n\nAlmost every technique in a modern MoE — [LongCat 2.0](/articles/longcat-2)'s 1.6T /\n~48B-active split, the routing and load-balancing machinery in newer systems — is a\ndescendant of the choices made here. Switch didn't invent Mixture-of-Experts; it made it\n**simple and stable enough to scale**, by proving that the aggressive top-1 route works\nif you surround it with a fixed capacity buffer, a balancing loss, and a precision-safe\nrouter. The interesting later work mostly *pushes back* on the simplifications — smarter\nrouting than pure argmax, softer handling than hard token drops — but they all start from\nthe Switch layer. It's the foundation the zoo is built on.\n\n---\n\n*Built on [Switch Transformers: Scaling to Trillion Parameter Models with Simple and\nEfficient Sparsity](https://arxiv.org/abs/2101.03961) (Fedus, Zoph & Shazeer, 2021).\nFigures are the paper's own (Figures 2 and 5), used for academic commentary; the\ninteractive diagrams are our illustrations of the mechanism. 
Numbers — the α=0.01 loss\ncoefficient, capacity factors, the 7× and 4× speedups, 2,048 experts / 1.6T parameters,\n99% compression retaining ~30% of gains, and 0.4 expert dropout — are quoted from the\npaper.*\n","readingTimeMins":8,"url":"https://ai.thesatyajit.com/articles/switch-transformer","signal":{"interest":3,"helpful":4,"score":7,"level":3,"label":"Notable"}},{"title":"SWE-1.7: near-frontier code RL, and the async training loop behind it","description":"Cognition's SWE-1.7 is an RL-trained software-engineering model built on a Kimi K2.7 base, tuned for long-horizon work inside the Devin harness and served through Cerebras at 1000 tokens/sec. It doesn't top the benchmark table — Claude Opus 4.8 leads every row and it roughly matches GPT-5.5 — but the training write-up is the real payload: asynchronous multi-cluster RL, a top-p 'sampling distribution replay' trick that stops entropy collapse, compressed cross-continental weight sync, and self-compaction for six-hour rollouts. A first-principles walk through all four, with the provider-reported numbers in full.","date":"2026-07-09","tags":["explainer","agents","reinforcement-learning","systems","inference-optimization"],"draft":false,"cover":"/articles/swe-1-7/fig1.png","featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"swe-1-7","body":"**SWE-1.7** is Cognition's newest in-house software-engineering model — the SWE-1 family is the set of models behind Devin and the Windsurf editor. It's trained with reinforcement learning on real SE tasks, starting from a **Kimi K2.7** base that had *already* been through heavy RL post-training. That starting point is the headline claim: Cognition still pulled large gains on top of an RL-saturated base, which they read as evidence against a \"post-training ceiling.\" The model is tuned for long-horizon, asynchronous engineering — the multi-hour tasks Devin runs — and it's served through **Cerebras at 1000 tokens/sec**. 
That speed is the other half of the pitch: near-frontier quality, cheap and fast, moving the cost-performance Pareto curve rather than the top of the leaderboard.\n\n<Callout type=\"warning\">\nEvery number here is **provider-reported**, run under Cognition's own harness (Claude Code for Anthropic models, Codex for OpenAI, Devin CLI otherwise; `timeout=4h`, max reasoning effort). On the published suite SWE-1.7 tops **nothing**: **Claude Opus 4.8** leads all three benchmarks and SWE-1.7 lands roughly at **GPT-5.5**. It is also **not open-weights** — there's no model card or download, only the served model in Devin. Read it as a strong, fast, cheap near-frontier model and a genuinely interesting training write-up — not a new SOTA.\n</Callout>\n\n## The numbers, in full\n\nThree agentic coding benchmarks, pass rate (%). `FrontierCode` is Cognition's own eval; `SWE-Bench Multilingual` is the multi-language slice of the SWE-bench family; `Terminal-Bench 2.1` is agent-in-a-terminal.\n\n| Benchmark | SWE-1.7 | Kimi K2.7 Code (base) | GPT-5.5 | Opus 4.7 | Opus 4.8 | GLM-5.2 | Composer 2.5 | SWE-1.6 |\n|---|---|---|---|---|---|---|---|---|\n| FrontierCode 1.1 Main | **42.3** | 30.1 | 43.0 | 38.5 | 46.5 | 24.5 | 25.6 | 9.4 |\n| Terminal-Bench 2.1 | **81.5** | 72.7 | 84.2 | 83.0 | 86.9 | 81.0 | 76.0 | 39.7 |\n| SWE-Bench Multilingual | **77.8** | 73.5 | 76.8 | 80.5 | 84.4 | 74.5 | 71.6 | 58.3 |\n\nThe load-bearing comparison is the base column. SWE-1.7 vs its own `Kimi K2.7 Code` base is **+12.2** on FrontierCode Main, **+8.8** on Terminal-Bench, **+4.3** on SWE-Bench Multilingual — the largest lift on the hardest eval. 
That gap is the whole \"no post-training ceiling\" argument: it's pure RL on top of a base that was already RL-post-trained.\n\n<BenchBars\n  title=\"FrontierCode 1.1 Main (%) — provider-reported\"\n  unit=\"\"\n  bars={[\n    { label: \"SWE-1.7\", value: 42.3, highlight: true },\n    { label: \"Kimi K2.7 (base)\", value: 30.1 },\n    { label: \"GPT-5.5\", value: 43.0 },\n    { label: \"Opus 4.7\", value: 38.5 },\n    { label: \"Opus 4.8\", value: 46.5 },\n  ]}\n/>\n\n<BenchBars\n  title=\"SWE-Bench Multilingual (%) — provider-reported\"\n  unit=\"\"\n  bars={[\n    { label: \"SWE-1.7\", value: 77.8, highlight: true },\n    { label: \"Kimi K2.7 (base)\", value: 73.5 },\n    { label: \"GPT-5.5\", value: 76.8 },\n    { label: \"Opus 4.7\", value: 80.5 },\n    { label: \"Opus 4.8\", value: 84.4 },\n  ]}\n/>\n\nSo: SWE-1.7 edges GPT-5.5 on SWE-Bench Multilingual (77.8 vs 76.8), trails it by a hair on FrontierCode Main (42.3 vs 43.0) and by a wider margin on Terminal-Bench (81.5 vs 84.2), and sits behind Opus 4.8 everywhere. Against Opus 4.7 the picture is split: SWE-1.7 is ahead on FrontierCode Main (42.3 vs 38.5) and behind on Terminal-Bench (81.5 vs 83.0) and SWE-Bench Multilingual (77.8 vs 80.5). The jump from the previous **SWE-1.6** (9.4 / 39.7 / 58.3) is huge, but read it honestly: most of that comes from the far stronger Kimi K2.7 base, not only the new RL recipe. The rest of this piece is the recipe, which is where the interesting engineering lives. Four ideas stand out.\n\n## 1. Asynchronous RL: don't let the trainer starve\n\nStart with why this is hard. An agentic SE rollout is a long, **variable-length** trajectory: read files, edit, run tests, read the failure, edit again — dozens of tool-calling turns, and with self-compaction (below) some run for **six hours**. Now do RL on batches of these.\n\nIn **synchronous** RL the loop ping-pongs: the actors generate a batch of rollouts, then the learner does one optimizer step, then the next batch starts. The problem is variance. Within a batch, one trajectory finishes in minutes and another runs for hours, and the learner can only step once the **slowest** one lands. 
So the trainer sits idle across most of the wall-clock, and the fast actors idle too, waiting for their batch-mates. Expensive accelerators, starved.\n\n**Asynchronous** RL breaks the ping-pong. Decouple the two fleets: actors continuously generate trajectories against the current-ish policy and push them into a buffer; the learner trains on whatever is ready. Both fleets stay busy. Flip the toggle below between the two modes and watch the learner's GPU-utilization gauge and the weight-update counter over the same wall-clock window:\n\n<SyncVsAsync />\n\nThe catch is honest and specific. Once the learner trains on trajectories the actors generated a few weight-versions ago, those trajectories are **off-policy** — sampled by a stale policy, not the one being updated. Cognition names the failure mode directly: a **KL-divergence mismatch between inference and training**, \"since the trainer policy is usually different from the sampling policy.\" The fix is the standard off-policy toolkit — importance sampling, plus a bounded staleness — and a **buffer policy** after any interruption that \"prevents bias from any imbalance in training-inference throughput.\" It's the same bias/variance trade you always pay for throughput: async keeps the GPUs full, and you spend correction terms to keep the stale gradients honest. Cognition cites [PipelineRL](https://arxiv.org/abs/2509.19128) as the async-RL lineage here. The [Agents-A1 write-up](/articles/agents-a1) is a good companion on where verified agentic trajectories come from in the first place, and the Devin side of \"the environment the model is trained in\" is the [agent harness](/articles/agent-harness).\n\n## 2. Preserving entropy with top-p \"sampling distribution replay\"\n\nThis is the most elegant idea in the post, and it's small. Long RL runs die of **entropy collapse**: a strong policy stops exploring, the distribution sharpens to a spike, and reward plateaus within a few hundred steps. 
Cognition's diagnosis of *why* is worth the walk.\n\nTake three tokens with logits $x_1 > x_2 \\gg x_3$ and softmax probabilities $p_i$. Token 3 is a junk token — sampling it usually means the rollout went off the rails, so the trajectory earns low reward and its advantage is negative, $\\hat{A} < 0$. The policy gradient of its log-prob on the logits is:\n\n$$\n\\nabla \\log p_3 = \\begin{bmatrix} -p_1 \\\\ -p_2 \\\\ p_1 + p_2 \\end{bmatrix}, \\qquad \\Delta x_i \\propto \\hat{A}\\,\\nabla \\log p_3 .\n$$\n\nWith $\\hat{A} < 0$ this becomes\n\n$$\n\\Delta x_1 \\propto |\\hat{A}|\\,p_1, \\qquad \\Delta x_2 \\propto |\\hat{A}|\\,p_2, \\qquad \\Delta x_3 \\propto -|\\hat{A}|\\,(p_1+p_2).\n$$\n\nLook at what that does. Because $p_1 > p_2$, the already-dominant token's logit rises **more** than the runner-up's, and the junk token is pushed down. So *punishing* a junk sample **sharpens** the distribution — every off-track sample bleeds a little entropy. Step the widget below with replay off to watch the bars spike and the entropy gauge fall; then flip **top-p replay** on:\n\n<EntropyCollapse />\n\nThe fix is two moves. First, **top-p sampling**: never sample from the low-probability tail, so junk tokens never become optimization targets in the first place. But top-p naively breaks something else — the trainer computes probabilities over the *full* vocabulary while the rollout sampled from the *top-p subset*, so the two distributions diverge and you're back to a large train/inference mismatch. Second, then, **sampling distribution replay**: record the kept-set mask at rollout time and renormalize the trainer's probabilities over that **same mask**. Sampler and trainer now agree on the support, the mismatch stays bounded, and entropy holds roughly constant.\n\n<Figure\n  src=\"/articles/swe-1-7/fig2.png\"\n  alt=\"Line chart of policy entropy across training. 
The SWE-1.7 recipe (blue) holds entropy roughly constant across the run, while the baseline (orange) rises early then decays steadily toward collapse.\"\n  caption=\"With top-p sampling plus sampling distribution replay, policy entropy stays roughly flat where the baseline collapses (Cognition, policy-entropy figure).\"\n/>\n\n<Figure\n  src=\"/articles/swe-1-7/fig3.png\"\n  alt=\"Line chart of training-inference mismatch across training steps for the SWE-1.7 run; the divergence rises early then stays bounded and flat for the rest of training rather than diverging.\"\n  caption=\"Training-inference divergence stays bounded across the run once the trainer renormalizes over the recorded top-p mask (Cognition, train-inference-mismatch figure).\"\n/>\n\nThere's a free lunch hiding in the mask. A token whose probability already exceeds the top-p threshold has a **keep-set of size one** — itself — so its renormalized probability is a constant 1 and its gradient is **zeroed out**. Cognition finds a large fraction of sampled tokens sit above the threshold, so they drop out of the update entirely. The optimizer stops spending gradient on tokens the model is already sure about and focuses on the genuinely uncertain, high-learning-signal positions. Less gradient noise, for free.\n\n<Callout type=\"note\">\nThis is the same idea as **Rollout Routing Replay (R3)** — [/articles/rollout-routing-replay](/articles/rollout-routing-replay) — one axis over. R3 records the MoE router's rollout-time expert choices and replays them in the trainer to align sampler and trainer on the *routing* axis; sampling distribution replay does it on the *token-sampling* axis. 
Both are \"replay the decision the sampler actually made, so the trainer optimizes the same distribution.\" Cognition stacks these with importance sampling and NVFP4 low-precision rollouts, and reports gains from the [Muon optimizer](/articles/muon-optimizer) and from stripping non-deterministic trainer ops that were quietly widening the mismatch.\n</Callout>\n\n## 3. Multi-cluster training: ship weight deltas, not weights\n\nHere's a structural observation that falls out of async RL: it **decomposes across clusters**. Only the trainer needs to live on a single high-bandwidth fabric — that's the one tightly-coupled, all-reduce-heavy component. The rollout inference engines are self-contained; each one needs nothing but the current weights, so it can run on whatever compute is available, anywhere.\n\nCognition leans all the way into that. SWE-1.7's RL spans **four datacenters across three continents**, mixing their own GPUs with third-party inference compute from **Fireworks**. The hard part is keeping every far-flung inference engine current after each optimizer step, because stale weights mean stale trajectories mean weaker gradients. Broadcasting a full ~1T-parameter model across oceans every few steps is a non-starter, so instead the trainer computes a **compressed weight delta** (XOR diff against the previous weights, then zstd) and streams it through **cloud object storage** as the single source of truth. Each engine prefetches the delta while still serving, then pauses briefly to apply it in-place with the KV cache intact.\n\n<WeightDeltaSync />\n\nThe numbers Cognition reports for this: a delta is **>99% smaller** than the full broadcast, a cross-continental update for a 1T model lands in **1-2 minutes** end-to-end, and inference pauses only **3-4 seconds** to apply it. 
A **Dynamo** router fronts the inference fleet and reroutes trajectories off dead replicas; the trainer checkpoints asynchronously to local disk every step and rebuilds a dead node from peer replicas in seconds, so a hardware failure never stalls the run. The payoff loops back to idea #1: faster weight sync means less trajectory staleness, which buys room for more aggressive learning rates. This is the cost angle a sibling write-up, [why frontier RL is cheaper than you think](/articles/frontier-rl-cheaper), covers head-on — and it's [Fireworks' own post](https://fireworks.ai/blog/frontier-rl-is-cheaper-than-you-think) that Cognition cites, since they literally run on that compute.\n\n## 4. Self-compaction for six-hour rollouts\n\nThe last idea handles the horizon. Two problems come with training on multi-hour tasks. First, a rollout can run far past the raw context window. Second, as DeepSeek-R1 showed, RL on reasoning tasks tends to make responses grow without bound, but I want a model that's terse on easy tasks and only elaborates on hard ones.\n\n**Self-compaction** solves the first: when the agent approaches the context limit, it's asked to **summarize its working state**, and it resumes from its own summary. During training the model learns both halves at once — to write more informative, compact summaries, and to work well *from* them. That's what lets a single rollout stretch to **six hours** without blowing the window. An **alternating length penalty** solves the second: training alternates between *unconstrained* phases (optimize only for task success) and *budget* phases (penalize solutions that exceed a weighted cost over tokens, turns, and tool-call time). Length compresses on tasks the model can already solve, while long-horizon behavior on genuinely hard tasks is preserved.\n\n<Figure\n  src=\"/articles/swe-1-7/fig4.png\"\n  alt=\"Line chart of mean response length across training under the alternating length penalty. 
Response length climbs during unconstrained phases and compresses during the shaded budget phases, with the overall trend rising as the model tackles harder tasks.\"\n  caption=\"Mean response length climbs in unconstrained phases and compresses in budget phases, keeping the model terse on solved tasks without capping hard ones (Cognition, response-length figure).\"\n/>\n\n## What the training left behind\n\nRL this heavy leaves fingerprints on behavior, and they line up with the recipe. SWE-1.7's chain-of-thought is measurably **more condensed** than the Kimi K2.7 base — a much lower function-word ratio and nearly half the words per sentence — which Cognition attributes directly to the budget phases of the length penalty. It also **explores the codebase far more** before acting: more tool calls, file reads, and greps per run than K2.7, Opus 4.8, or GPT-5.5, and more probing of edge cases, adversarial inputs, and unstated requirements. On a bug report it chases the root cause rather than patching the one symptom, and it settles ambiguous semantics by writing a small script to test them instead of guessing. Cognition credits the data pipeline — hard verifiers that reject false positives force end-to-end solutions.\n\nThe honest cost of that thoroughness: **scope creep**. More reasoning means more doing — extra test cases, more files touched than the task strictly needs. It's an industry-wide pattern (more reasoning, wider blast radius) and Cognition flags it as an open axis, not a solved one. That's the right way to report it.\n\n## The take\n\nSWE-1.7 doesn't win the benchmark table, and Cognition doesn't claim it does — Opus 4.8 leads every row and SWE-1.7 sits at roughly the GPT-5.5 line. The pitch is the Pareto curve: near-frontier SE quality, served at 1000 tokens/sec through Cerebras, cheap. If that holds up in a real Devin loop rather than a `timeout=4h` harness, it's a strong practical option.\n\nBut the model is the smaller story. 
The training write-up is the payload, and it's unusually concrete for a launch post: async RL to stop the trainer starving, a top-p **sampling distribution replay** that kills entropy collapse *and* falls out into free gradient denoising, weight-delta streaming that makes cross-continental RL practical, and self-compaction that pushes rollouts to six hours. Each is a clean, separable idea with a plausible mechanism, and together they're a real argument that \"post-training ceiling\" was never a ceiling — just a recipe that hadn't been tuned yet.\n\nThe caveats are the usual ones, stated plainly. Every number is self-run and self-selected; `FrontierCode` is Cognition's own benchmark; there are no open weights to verify anything against; and the sharpest systems claims — >99% delta compression, 1-2 minute cross-continental sync, six-hour rollouts — are provider-reported, not independently measured. Take the leaderboard framing with the usual salt. Take the training ideas seriously.\n\n---\n\n*Built on Cognition's [SWE-1.7 launch post](https://cognition.com/blog/swe-1-7) (July 8, 2026) and its cited [FrontierCode 1.1](https://cognition.com/blog/frontier-code-1.1) eval. SWE-1.7 is served in [Devin](https://devin.ai); benchmark numbers are provider-reported. The interactive diagrams are illustrations of the mechanism, not measured traces; the entropy, mismatch, and response-length charts are reproduced from the launch post for commentary.*\n","readingTimeMins":13,"url":"https://ai.thesatyajit.com/articles/swe-1-7","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"A6B: k-expansion, or what breaks when you force a top-8 MoE router to fire 32 experts","description":"A single-author research log takes Qwen3.6-35B-A3B, overrides its Mixture-of-Experts router from top-8 to top-32 at inference (≈3B → ≈6.6B active params, zero new weights), measures the monotonic damage, and heals it with a router-frozen selective expert fine-tune. 
The core finding: renormalization hands 54% of the gate mass to 24 experts the model never learned to co-activate, accuracy falls at every step, and ESFT-style residual deltas recover the loss into a statistical tie — not a win. Honest, paired-McNemar measurement, negative results included.","date":"2026-07-08","tags":["explainer","llm","architecture","inference-optimization"],"draft":false,"cover":"/articles/a6b-k-expansion/fig1.png","featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"a6b-k-expansion","body":"**a6b-k-expansion** is a single-author research log with a sharp, testable question: a Mixture-of-Experts model already stores far more experts than it fires per token — so what happens if you just *fire more of them* at inference? The author takes **Qwen3.6-35B-A3B** (their stated base: 35B total, ~3B active, 256 routed experts per layer, native top-8 routing) and overrides the router to keep **top-32** instead of top-8. Active compute goes from ~3B to **~6.6B** parameters per token — the regime they call **\"A6B\"** — at **zero new weights**. Then they measure what it costs, and try to heal it.\n\nThe interesting part is that it does not work for free, and the repo says so in the title of its own findings: *the damage from naive k-expansion is smooth and monotonic — there is no free sweet spot.* This is a measurement-first log with paired A/B tests, exact McNemar significance, and the negative results kept in. That honesty is the reason it's worth reading.\n\n<Callout type=\"warn\">\nScope, up front. This is one person's **ongoing research log**, not a model release or a peer-reviewed paper. The base is **Qwen3.6-35B-A3B** as named in the repo — I take that at face value and report the author's own numbers, all of which are self-measured. \"Healing\" is explicitly **demonstrated, not complete**: the point estimates stay negative. Rejection fine-tuning, GRPO, and the Terminal-Bench evaluation are still in progress. 
Read this as a well-instrumented experiment, not a benchmark trophy.\n</Callout>\n\n## The one-line edit\n\nEvery MoE layer scores all 256 experts with a softmax router (in fp32), keeps the **top-k**, and — this is the load-bearing detail — **renormalizes the selected gate weights so they sum to 1**. Renormalization is baked into the architecture. So raising k is not a matter of appending a few experts on the side; it *redistributes the entire gate mass* across four times as many slots.\n\n<Figure\n  src=\"/articles/a6b-k-expansion/fig1.png\"\n  alt=\"Two side-by-side MoE blocks. Left: stock A3B, router keeps top-8 of 256 experts (blue cells), ≈3B active per token, top-8 carry 46% of gate mass. Right: A6B, same weights but router keeps top-32; the 8 blue experts stay and 24 orange experts (ranks 9-32) are added, ≈6.6B active per token, the orange ranks carrying 54% of the renormalized gate mass. Both feed a renormalized combine plus an always-on shared expert.\"\n  caption=\"Identical weights; one routing constant changed. The added experts (orange) are load-bearing — 54% of the renormalized gate mass — but were never trained to collaborate (a6b-k-expansion, Fig 1).\"\n/>\n\nThe experts are stored as packed 3D tensors — `gate_up_proj [256, 1024, 2048]` and `down_proj [256, 2048, 512]` per layer — so widening k adds **compute but no parameters**. If the routing here is unfamiliar, I built the top-k gate up from nothing in [Mixture of Experts, from scratch](/articles/mixture-of-experts-from-scratch); this article assumes that machinery and pokes at one constant inside it.\n\nFormally, for the selected set the renormalized gate weight of expert $i$ is\n\n$$\n\\tilde{g}_i = \\frac{g_i}{\\sum_{j \\in \\text{top-}k} g_j}, \\qquad i \\in \\text{top-}k .\n$$\n\nThe denominator grows with k. So every already-selected expert's $\\tilde{g}_i$ *shrinks* as k rises, and the freed mass flows to the newcomers. 
The author profiled the router at k=32 and found the newcomers are not a rounding error: **ranks 9-32 carry 54.0% of the renormalized gate mass.** More than half the block's output now comes from experts the model never trained to fire together.\n\nSlide k below and watch the mass move off the trained top-8 and onto the untrained tail — with the measured accuracy for each k next to it:\n\n<GateMass />\n\n## No sweet spot\n\nYou might hope for a lucky k — 2-3× the native width, where the extra experts add capacity before they add noise. There isn't one. Changing **only** the inference-time top-k on the frozen base model, both a knowledge benchmark and a math benchmark decline at every single step:\n\n| Benchmark | k = 8 | k = 16 | k = 24 | k = 32 |\n| --- | --- | --- | --- | --- |\n| MMLU  | 0.8433 | 0.8283 | 0.8150 | 0.8067 |\n| GSM8K | 0.8933 | 0.8883 | 0.8783 | 0.8650 |\n\nThe shape is the whole point: monotonic, sweet-spot-free. Every step down is paid the moment more untrained expert combinations switch on, and the noise grows with the mass those combinations carry — a mass that is real (54.0% on ranks 9-32). This is what motivates *healing* the model rather than *searching* for a lucky k. There is nothing to search for.\n\n## Healing without touching the router\n\nThe fix is deliberately surgical, in the spirit of [ESFT — Expert-Specialized Fine-Tuning (Wang et al., 2024)](https://arxiv.org/abs/2407.01906). The recipe:\n\n1. **Profile.** Run the target corpus through the model at k=32, count token-level routing frequency for every expert across all 40 layers.\n2. **Select.** Keep experts by cumulative routing frequency up to **top-p 0.2** — the ones actually carrying the work. That is **833 of the 10,240 layer-experts** (40 × 256).\n3. **Train residual deltas.** Add trainable delta tensors to the selected experts' FFN slices and train **only those deltas** — **2.62B of 35B params (7.5%)**. The router and everything else stay frozen.\n4. 
**Toggle.** Because nothing but the deltas moved, turning them off returns the **exact stock model**. The deltas ship as one **5.2 GB** artifact, patched on or off at load time.\n\nToggle the deltas below to see the surgical footprint and the gap-to-baseline flip from \"significant loss\" to \"statistical tie\":\n\n<EsftSelect />\n\nFreezing the router is not just a cost decision. The deltas are learned *relative to a specific routing distribution* — freeze it and every delta keeps meaning the same thing at inference; let it move and the coordinate system shifts underneath the deltas. Second, a router that drifts under SFT is the classic path to **routing collapse**, where a handful of experts capture all the traffic. A frozen router removes that failure mode entirely.\n\nEach healing generation is a broader training corpus, measured as the **gap to base@k8** on its own machine (negative = still below native top-8), with the verdict from exact McNemar at p = 0.05:\n\n| Generation | Corpus | MMLU Δ | MMLU p | GSM8K Δ | GSM8K p |\n| --- | --- | --- | --- | --- | --- |\n| Gen 0 | naive k32 (no training) | −3.7 pt | 0.002 · loss | −2.8 pt | 0.016 · loss |\n| Gen 1 | agentic-only ESFT | −3.0 pt | 0.010 · loss | −0.8 pt | 0.487 · tie |\n| Gen 2 | mixed + replay ESFT | −2.0 pt | 0.141 · **tie** | −2.2 pt | 0.263 · **tie** |\n\nEach broader corpus removes more of the misalignment: the agentic-only patch already heals math to a tie, and adding coding, tool-calling, math and a small general/knowledge replay is what finally pulls MMLU into a tie too. But read the header honestly — a *tie* is a repair, not a gain. The MMLU point estimate is still −2.0 pt; it is no longer statistically distinguishable from base, but it is not zero.\n\nAnd it does not get better with more steps. A checkpoint trajectory reads **MMLU 0.825 / 0.823 / 0.820** at steps 1200 / 2100 / 3150 — healing **saturates by ~38% of training**. 
The author reads this as a **capacity ceiling of the selective deltas**, not a data-volume problem: more steps on this delta set will not close the gap; a larger trainable surface or a different objective would be needed.\n\nWhere training *does* buy something beyond healing is code. On HumanEval (n = 164), the agentic patch reaches **0.902** — above both base@k8 and naive k32 — while compressing median generation to about a third of the tokens:\n\n<BenchBars\n  title=\"HumanEval (%) — paired, same-machine (Workstation A)\"\n  unit=\"\"\n  bars={[\n    { label: \"agentic patch\", value: 90.2, highlight: true },\n    { label: \"base@k8\", value: 86.6 },\n    { label: \"naive k32\", value: 84.1 },\n    { label: \"coding-only patch\", value: 76.2 },\n  ]}\n/>\n\n## The hazard: style transfers before knowledge\n\nThat fourth bar is the most useful result in the whole repo. A **coding-only** patch — same recipe, single-domain corpus — taught the model a *style* (terse code) faster than it taught it to stay correct. Median generation crashed to **186 tokens**, and accuracy collapsed with it: **HumanEval 0.762** (the worst of every arm) and **GSM8K 0.820**, which is *below even naive k32*, at p = 0.002. You can train a confident, compact, and wrong model this way.\n\n<Callout type=\"warn\">\n**SFT transfers answer style before it transfers knowledge.** Corpus diversity is a safety rail, not a luxury — it is the specific guard against this failure mode, which is why the Gen-2 mix spans five domains (agentic 62%, coding 12%, tool-calling 11%, math 10%, knowledge replay 3%) plus a small replay slice.\n</Callout>\n\nThe mixed patch is not immune to the same pressure, only more resistant. It still compressed MBPP generations (median **531** vs base's **2852** tokens) and lost MBPP by **−10.2 pt** (p < .0001) — even while HumanEval stayed at parity. 
Two benchmarks that both \"test coding\" diverged sharply, and the difference is how much each rewards long, explicit generation, which the patch has learned to suppress. Because this compression grows with training length, an *earlier* checkpoint may beat the final one on generation-heavy tasks; that evaluation is still running.\n\n## Why I trust the numbers\n\nThe measurement discipline is where this log earns its credibility, and it's the part most self-reported results skip:\n\n- **Paired, same-condition A/B.** Every comparison runs an arm against its own baseline on the **same machine**, same prompts, same decoding. Significance is **exact McNemar** on the paired per-item correctness vectors — not an unpaired accuracy-difference test.\n- **n = 600 per benchmark** (164 for HumanEval), fixed shuffle seed so every arm sees the same items in the same order.\n- **Choice-logprob MMLU** — scored by summed log-prob of each answer choice, not by parsing free-form text, so truncation and format quirks can't masquerade as a knowledge gap.\n- **No-think GSM8K** — the measured quantity is the arithmetic answer, not the length or style of the scratch reasoning.\n- **Re-measure on every machine.** The author observed **−0.3 to −0.6 pt cross-machine drift** on identical weights and config, so a base arm and a patched arm are always re-run together on the same box before their delta is trusted. Gen 0-1 ran on a 2× RTX PRO 6000 workstation, Gen 2 on an 8× RTX PRO 6000 Blackwell server, each with its own re-measured base@k8.\n\nTraining data also passed a hard decontamination gate against every benchmark used — exact-match, word-13-gram, short-question containment, HumanEval signature purge, Terminal-Bench instruction match — with the knowledge-replay slice further screened by embedding similarity against the full MMLU test set. Reported result: **0 residual hits**. 
You can disagree with the conclusions, but the instrument is honest about what it measured.\n\n## The take\n\nk-expansion is a clean idea with a clean negative result. Firing 4× the experts at inference is free in weights but not in accuracy, because renormalization is not a bystander — it hands 54% of every MoE block's output to 24 experts per layer that never learned to work together, and the model degrades monotonically for it. There is no lucky k. Router-frozen selective deltas — 2.62B trainable params, toggleable, 833 of 10,240 experts — heal the knowledge loss back to a **statistical tie**, which is a genuine result and an honest one: a tie, saturating at ~38% of training, with the point estimate still negative. The coding axis actually improves (HumanEval 0.902), but generation-length compression is a live hazard, and a single-domain corpus is a fast way to train a compact, confident, wrong model.\n\nWhat I'd take from it, beyond A6B: the failure modes generalize. **Renormalized top-k is not free to widen.** **SFT teaches style before substance.** **Freeze the router or watch the deltas lose their coordinate system.** And measure in pairs on one machine, because a −0.5 pt result that is really cross-machine drift will fool you. Whether A6B ever nets out positive after rejection-FT and GRPO is unsettled — the repo says so plainly — but the instrumentation is the part I'd copy tomorrow.\n\n---\n\n*Built on the [a6b-k-expansion](https://github.com/hikarioyama/a6b-k-expansion) research log (hikarioyama, 2026) — README, `METHOD.md`, and the two HTML reports under `docs/`, MIT-licensed. All numbers are the author's own paired, McNemar-tested measurements on Qwen3.6-35B-A3B; I report them as stated and have not independently reproduced them. 
The interactive diagrams are my illustration of the mechanism, not measured traces; the architecture figure is reproduced from the repo for commentary.*\n","readingTimeMins":11,"url":"https://ai.thesatyajit.com/articles/a6b-k-expansion","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"Agent harnesses: engineering the loop around the model","description":"Lilian Weng argues the harness — the loop and scaffolding that wraps a base model with tools, context management, control flow, and evaluation — matters as much as raw intelligence. A walk through her framing: the coding-harness loop, the tool taxonomy, why durable state belongs in files, the self-improving outer loop, and the failure modes and open problems she names honestly.","date":"2026-07-08","tags":["explainer","agents","llm","systems"],"draft":false,"cover":"/articles/agent-harness/fig1.png","featured":false,"interest":3,"helpful":4,"kind":"articles","slug":"agent-harness","body":"A base model does one thing: given tokens, predict the next ones. Everything an agent actually *does* — read a repo, run a test, spawn a subagent, decide it is finished — happens in the code wrapped around that model. Lilian Weng calls that wrapper the **harness**, and her post [Harness Engineering for Self-Improvement](https://lilianweng.github.io/posts/2026-07-04-harness/) makes a sharp claim about it: the harness is not glue you can ignore. It is \"the system surrounding a base model that orchestrates execution and decides how the model thinks and plans, calls tools and acts, perceives and manages context, stores artifacts, and evaluates results.\" Her thesis in one line: \"the layer between the raw model and the real-world context seems to be as important as the model's raw intelligence.\"\n\nThat reframes a lot of agent engineering. The interesting design surface is not only the model — it is the loop, the tool set, the context policy, and the evaluator you build around it. 
This is a walk through her framing.\n\n<Callout type=\"note\">\nWeng's post is framed around **recursive self-improvement (RSI)** — the idea, dating to I. J. Good (1965) and his \"ultraintelligent machine,\" that a capable system can improve the machinery that produces it. In modern terms the model doesn't rewrite its own weights; it improves the *training pipeline* and the *deployment system* around itself. The harness is that deployment system, which is why it sits at the center of the story. This article follows her two-part structure: first what a harness *is* (design patterns), then how you *optimize* it (the self-improving outer loop).\n</Callout>\n\n## The loop\n\nStrip an agent down and you get a loop. The model emits an action, the harness executes it, the result comes back, the model emits the next action. Weng's simplest picture of it: user input feeds model inference, which either returns a response or calls a tool; tool results feed straight back into the next inference step.\n\n<Figure\n  src=\"/articles/agent-harness/fig2.png\"\n  alt=\"A flow diagram: USER INPUT flows into MODEL INFERENCE, which branches to either AGENT RESPONSE or TOOL CALLS; TOOL CALLS loops back into MODEL INFERENCE.\"\n  caption=\"The minimal agent loop: inference emits either a final response or a tool call, and tool results feed back into the next inference step (Lilian Weng, simplified Codex agent loop).\"\n/>\n\nFor a coding agent that loop takes a concrete, goal-oriented shape: **plan, execute, observe/test, improve, and execute again until the goal is achieved.** Weng draws it as a pipeline — observe the repo, plan, search and read files, edit and write patches, run tests, inspect errors — with a `Done` exit and a repeat arrow when the goal isn't met yet.\n\n<Figure\n  src=\"/articles/agent-harness/fig1.png\"\n  alt=\"A left-to-right pipeline of rounded boxes: Observe repo → Plan → Search/read files → Edit/write patches → Run tests → Inspect errors. 
A green Done box sits above Run tests; a dashed Repeat arrow returns from Inspect errors back toward Plan.\"\n  caption=\"The coding-harness loop: the agent works a repo the way a developer works an IDE, iterating until the tests pass (Lilian Weng, coding harness loop).\"\n/>\n\nThe static figure shows the shape; stepping through it shows the mechanic. Here is that loop walked over one real task — a failing test — where the first patch is too narrow and the harness has to loop back before the suite goes green. Watch what crosses the boundary at each phase: the model emits a **tool call**, the harness runs it, and the **observation** returns into context.\n\n<HarnessLoop />\n\nTwo things are worth pulling out of that trace. First, the model touches nothing directly — every action is a tool call the harness executes, and every result is an observation the harness chooses to feed back. Second, the harness owns the **control flow**: it decides when the loop repeats and when the goal is met and the loop exits. That decision — keep going or stop — is not the model's to make unilaterally, and getting it wrong is a failure mode we'll come back to.\n\n## The tools are the harness\n\nIf the loop is the skeleton, the tool set is the muscle. 
Weng's tour of a modern coding harness is essentially a table of tool categories, and the range is the point — this is a lot more than \"call a function\":\n\n| Category | Representative tools |\n|---|---|\n| **File system** | discovery: `glob`, `grep`, `ls` · read: `read`, `read_many` · modify: `write`, `edit`, `multi_edit`, `apply_patch` |\n| **Shell execution** | `bash`, `PowerShell` |\n| **IO / repo** | `lsp`, `git_status`, `git_diff`, `git_commit` |\n| **External context** | MCP tools, Skills |\n| **Web** | `web_search`, `web_fetch`, browser tools |\n| **Artifacts** | read docs/images; generate HTML/images |\n| **Backend processes** | `CronCreate`, `CronDelete`, `CronList` |\n| **Agent delegation** | `spawn_agent`, `resume_agent`, `wait_agent`, `list_agents`, `close_agent` |\n\nA design note runs through the whole list: the tools are \"deliberately simple and generic to enable generalization.\" They lean on primitives a developer already knows — a file system, a shell, git — rather than bespoke abstractions. That matters because \"learning how to read, write, and edit the file system (commonly via `bash` commands) is a foundation skill for LLMs.\" The model has seen a million shell sessions in training; give it a shell and it already knows the idiom. The last two rows — backend processes and agent delegation — are where a harness stops being a single loop and becomes a small operating system: it can schedule work and fork subagents, which is what makes long-horizon and parallel tasks tractable.\n\n## Context is the scarce resource\n\nHere is the constraint that shapes everything else. A long task produces \"experiment logs, code diffs, paper summaries, error traces, and past rollout trajectories\" that \"often grow much longer than the context window that the model has trained for.\" You cannot keep the whole trajectory in the prompt. 
So Weng's rule is blunt: \"a harness should not carry the entire workflow and all logs in context; instead, it should keep durable state in files.\"\n\nThat single decision — context as a bounded working set, disk as the durable store — is what keeps a long task from strangling on its own history. Step through a run and watch the two strategies diverge:\n\n<ContextLedger />\n\nThe naive strategy appends everything and eventually overflows the window; past that point the model is quietly losing the early details. The file-backed harness keeps context flat and spills the history to disk, where it stays retrievable with `grep` and `read` — the same file tools from the table above, now doing double duty as memory. This is why file-system fluency is the load-bearing skill: the file system *is* the agent's long-term memory.\n\nThe same principle governs parallelism. Weng's guidance is to make it \"explicit and inspectable\" — store subagent outputs as \"files, logs, and status records\" rather than transient chat contexts, so the system can \"recover after interruptions and reason over its own execution history.\" Durable-state-in-files isn't only a context trick; it's what makes an agent restartable.\n\n<Callout type=\"tip\">\nThere's a research lineage here worth naming. **Agentic Context Engineering (ACE)** maintains a \"context playbook\" of itemized bullet points — each with an identifier and description — updated by a Generator / Reflector / Curator trio that appends *structured entries* instead of rewriting the whole prompt. **Meta Context Engineering (MCE)** goes one level up, separating \"mechanism (how to manage context) from artifact content (what is in context).\" Both are the same instinct as keeping state in files: treat context as a managed, structured store, not an ever-growing transcript.\n</Callout>\n\n## Guardrails live outside the loop\n\nGive an agent `bash`, `edit`, and the ability to spawn more agents and you've handed it a lot of reach. 
Weng is direct that this breaks abstraction boundaries: when programs can edit the systems they run on, you need a \"proper design of editable surface\" with \"permission control and security layers outside this loop.\" The guardrail is deliberately *not* another prompt instruction inside the model's context — it's an enforcement layer the model cannot talk its way past. That placement is the whole point: a permission check the agent can edit is not a permission check.\n\n## Optimizing the harness\n\nThe second half of Weng's post asks the recursive question: if the harness matters this much, can the agent improve *its own* harness? That turns the inner task loop into an **outer loop** over harness designs — run the current harness, mine where it failed, propose edits, keep the ones that survive a regression test.\n\n<Figure\n  src=\"/articles/agent-harness/fig3.png\"\n  alt=\"A four-stage cycle. Weakness Mining: run the current harness on tasks, collect execution traces, cluster failure patterns. Harness Proposal: use the failures to propose harness edits like validate-before-conclude, a loop-breaker, a tool-policy update. Proposal Validation: run a regression test and accept or reject each edit. 
Accepted edits produce an Updated Harness that proceeds to the next iteration; if all are rejected, no update.\"\n  caption=\"Self-Harness: an outer loop that mines failure patterns from traces, proposes harness edits, and promotes only the ones that pass a regression test (Lilian Weng, Self-Harness loop).\"\n/>\n\nThis is one instance of a broader family the post surveys — **ADAS**, **AFlow**, **STOP**, **AlphaEvolve**, the **Darwin Gödel Machine** — all variations on \"search over the scaffolding, not the weights.\" The headline result is that it works: the Darwin Gödel Machine's discovered agents went from **20% → 50%** on SWE-bench Verified and **14.2% → 30.7%** on Polyglot, matching or beating handcrafted agents.\n\n<BenchBars\n  title=\"Darwin Gödel Machine — starting agent vs. self-discovered agent\"\n  unit=\"%\"\n  bars={[\n    { label: \"SWE-bench · start\", value: 20 },\n    { label: \"SWE-bench · discovered\", value: 50, highlight: true },\n    { label: \"Polyglot · start\", value: 14.2 },\n    { label: \"Polyglot · discovered\", value: 30.7, highlight: true },\n  ]}\n/>\n\nBut the honest caveat is the more instructive part. **STOP** improved performance when the base model was GPT-4 and *degraded* it with weaker models (GPT-3.5, Mixtral). Weng's reading: \"recursive structure alone is not enough. The base model must be capable enough to improve the mechanism.\" Self-improvement is not free lift from the loop; it's a gain that depends on a model already good enough to reason about its own scaffolding. Below that bar, the outer loop makes things worse.\n\n## Failure modes\n\nThe post catalogs where autonomous agents actually break, drawing on an analysis (Trehan & Chopra, 2026) of auto-research attempts. Six recur:\n\n1. **Training-data defaults.** The model reaches for old libraries, stale commands, and standard formats instead of what the actual repo uses.\n2. 
**Implementation drift.** When the proposed method gets complex, the model quietly slides toward a simpler solution than the one it was asked for.\n3. **Memory degradation.** Long-horizon projects lose critical details — unless the logs were written out as persistent artifacts. (The file-system point again, stated as a failure when you skip it.)\n4. **Over-optimism.** The model declares success on noisy or failed experiments — a pattern Weng names \"p-hacking and eureka-ing.\"\n5. **Insufficient domain intelligence.** It lacks the tacit craft knowledge to judge whether a result is even plausible.\n6. **Weak scientific taste.** The experiments run fine but fail to answer the right question.\n\nNotice how many of these the *harness* is supposed to catch rather than the model: memory degradation is a context-policy failure, over-optimism is an evaluator failure, drift is a control-flow failure. The whole post is an argument that these are engineering problems in the wrapper, not just intelligence gaps in the core.\n\n## What's still hard\n\nWeng closes with the bottlenecks between here and genuine self-improvement, and they read as an honest problem list rather than a roadmap:\n\n- **Weak fuzzy evaluators.** \"Many research claims don't have a fast/precise verifier.\" Taste and novelty are far harder to score than a passing test suite, and an outer loop is only as good as its evaluator.\n- **Context and memory lifecycle.** Managing context growth over long autonomous runs is becoming \"a core part of intelligence,\" not a plumbing detail.\n- **Negative results.** LLMs are biased toward success and struggle to abandon a hypothesis, because their training data is skewed toward things that worked.\n- **Diversity collapse.** Evolutionary and RL loops exploit known high-reward patterns; without pressure for diversity the population collapses into variants of one solution.\n- **Reward hacking.** A self-improvement loop optimizes the signal it's given — including benchmark 
artifacts and vulnerabilities in a judge model.\n- **Long-term success.** Short-horizon optimization ignores maintainability, ownership boundaries, migration cost, backwards compatibility, and debugging burden — the things that decide whether real systems survive.\n- **The human role.** Her framing is that \"humans should move up the stack, not be removed from the loop\" — providing oversight \"at the right time, the right abstraction level,\" not disappearing from it.\n\n## The take\n\nThe useful shift in Weng's framing is where it puts the design surface. Agent quality is not only a function of the model you call; it's a function of the loop you wrap it in, the tools you expose, the context policy you enforce, and the evaluator you trust. Those are engineering decisions, and most of them are decisions about *state* — what stays in context, what spills to disk, what the permission layer refuses, what the regression test has to pass before a change ships. The honest bounds are stated plainly too: self-improving harnesses only help above a base-model capability threshold, the outer loop is only as good as a verifier we mostly don't have for fuzzy goals, and reward hacking and diversity collapse are unsolved. Which lands on a pragmatic note — the harness is where a lot of the near-term gains are, and it's ordinary systems engineering: files, permissions, control flow, and tests, applied to a model instead of a service.\n\n---\n\n*Built on Lilian Weng's [Harness Engineering for Self-Improvement](https://lilianweng.github.io/posts/2026-07-04-harness/) (2026). Quotations and the three figures are reproduced from that post for commentary; the interactive loop and context-budget diagrams are my own illustrations of the mechanism, not measured traces. 
Benchmark numbers (Darwin Gödel Machine, STOP) are as reported in her post.*\n","readingTimeMins":12,"url":"https://ai.thesatyajit.com/articles/agent-harness","signal":{"interest":3,"helpful":4,"score":7,"level":3,"label":"Notable"}},{"title":"Antidoom: breaking doom loops with Final Token Preference Optimization","description":"Liquid AI's Antidoom explains why small reasoning models get stuck repeating a span until the context runs out, and fixes it with Final Token Preference Optimization — a DPO-like method that retrains only the trailing token, spreads probability across several chosen alternatives, and regularizes the rest of the vocabulary in logit space. It cuts the doom-loop rate from 10.2% to 1.4% (LFM2.5-2.6B) and 22.9% to 1% (Qwen3.5-4B), with eval scores rising because the model can finally reach answers it already knew.","date":"2026-07-08","tags":["explainer","llm","reinforcement-learning","training","inference-optimization"],"draft":false,"cover":"/articles/antidoom/fig1.png","featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"antidoom","body":"A **doom loop** is when a model emits a span — usually something like `Wait, let me reconsider…` — and then repeats that same span again, and again, until the context window is exhausted. Liquid AI's **Antidoom** post is about why this happens and how to train it away. Their fix is **Final Token Preference Optimization (FTPO)**: a DPO-like method that retrains a single position, the one where the loop would restart, so the model has somewhere else to go. On an early **LFM2.5-2.6B** checkpoint it takes the doom-loop rate from **10.2% to 1.4%**; on **Qwen3.5-4B** from **22.9% to 1%**. 
The interesting part is *why* eval scores go up when they do it: the training teaches the model nothing new about math or code, it just removes the failure mode that was keeping it from finishing.\n\n<Callout type=\"note\">\nEvery number here is **provider-reported** by Liquid AI (blog + the `LiquidAI/antidoom-mix-v1.0` dataset). The two interactive widgets are my illustrations of the mechanism — schematic distributions and logits, not measured traces. The four embedded charts are the post's own figures, reproduced for commentary.\n</Callout>\n\n## What a loop looks like\n\nThe detector is blunt and effective: a completion is flagged as looping if a section **repeats at least four times, over at least 60 characters**. Small reasoning models hit this most on long thinking traces for hard math and coding — exactly the prompts where the model is uncertain for a long stretch.\n\nThe tokens that *start* a loop are not random. On the early LFM2.5-2.6B checkpoint, Liquid counted which token opens the repeating span:\n\n```text\ncount    share  token\n 2277  11.39%  ' the'\n  902   4.51%  ' So'\n  644   3.22%  'Alternatively'\n  511   2.56%  'Wait'\n  493   2.46%  ' But'\n```\n\nThese are discourse markers and self-reflection tokens. They are not bad tokens — `Wait` or `Alternatively` can mark a genuine change of strategy. The problem is what happens when the model reaches for one *under uncertainty* and then can't get back out.\n\n## Why the loop tightens\n\nThree things stack up, and it's worth keeping them separate because FTPO only attacks the third.\n\n**High priors.** Some tokens carry artificially high prior probability. Liquid points at synthetic training data inflating certain words above their natural human-text frequency — the same effect that made `delve` and `testament` model tells. In reasoning traces, the inflated tokens are the discourse markers above. 
When the model is unsure of the next real step, these dominate the next-token distribution, and it restarts the same local reasoning pattern instead of making progress.\n\n**Self-reinforcing context.** This is the one that turns a stumble into a loop. Once a span is in the context, that span becomes *more* likely to appear again — and with each repetition the probability of every token inside the looping span climbs toward 1. The distribution collapses:\n\n<Figure\n  src=\"/articles/antidoom/fig1.png\"\n  alt=\"Four rows of tokenized text — Pre-loop, Loop 1, Loop 2, Loop 3 — each token shaded by its probability. Pre-loop and later loops are near-uniformly dark (probability ~1); Loop 1 has several lighter, lower-probability tokens. Annotations show the span probability rising 0.815, 0.961, 0.995 and the first-token probability rising 0.412, 0.920, 0.962 across repeats.\"\n  caption=\"The same repeated span, shaded by per-token probability across repeats. The span probability climbs 0.815 → 0.961 → 0.995 and the first token of the repeat climbs 0.412 → 0.920 → 0.962 — by the third pass the whole span is locked in near 1 (Liquid AI, Antidoom blog).\"\n/>\n\n**Greedy decoding has no exit.** At low temperature — and especially at temp 0 — the model takes the argmax. Once self-reinforcement has pushed the loop token's probability close to 1, there is almost no mass left on anything else, so there is nothing to sample instead. Turning up temperature only helps a little: Liquid reports significant looping even at **temp=0.67**, because there just isn't enough probability left on the alternatives to escape.\n\nThe widget below is the whole story in one place. In **base model** mode, step the repeat count and watch the loop token `Wait` climb from ~41% to ~98% while the distribution collapses onto it — greedy decoding restarts the span every time. 
Flip to **after FTPO** once you've read the next section to see the fix.\n\n<DoomLoop />\n\n## Why the easy fixes don't hold\n\nThe usual inference-time patch is `repetition_penalty`, which reweights the output distribution to discourage repeats. It's a band-aid: it fights the symptom at decode time and can degrade quality, and it doesn't touch the priors that caused the collapse. Reinforcement learning *can* target looping, but it needs carefully calibrated rewards and costly online rollouts. FTPO's pitch is to fix the distribution once, offline, at the exact position where the loop begins.\n\n## Final Token Preference Optimization\n\nFTPO is preference optimization in the DPO family. The reference form of DPO trains a policy $\\pi_\\theta$ against a frozen reference $\\pi_{\\text{ref}}$ to prefer a chosen response $y_w$ over a rejected one $y_l$:\n\n$$\n\\mathcal{L}_{\\text{DPO}} = -\\,\\mathbb{E}_{(x,\\,y_w,\\,y_l)}\\!\\left[\\log \\sigma\\!\\left(\\beta \\log \\frac{\\pi_\\theta(y_w\\mid x)}{\\pi_{\\text{ref}}(y_w\\mid x)} - \\beta \\log \\frac{\\pi_\\theta(y_l\\mid x)}{\\pi_{\\text{ref}}(y_l\\mid x)}\\right)\\right]\n$$\n\nFTPO keeps that skeleton and changes four things, each aimed at *not* over-correcting:\n\n1. **Final token only.** It trains the *trailing* token of a sequence that is midway through generation — the single position where the loop would restart — not a whole response. $y_w$ and $y_l$ are one token each, at one place.\n2. **Multiple chosen tokens per sample.** Instead of one $y_w$, it uses a *set* of plausible chosen tokens. This spreads the freed-up probability across several alternatives, so you aren't just replacing one overtrained token with a new overtrained token.\n3. **A KL-like term in logit space.** The reference-anchoring divergence is computed on **logits**, omitting the softmax. That avoids gradient pressure leaking onto unrelated tokens through the normalization.\n4. 
**Two-part regularization.** The tokens it means to move — chosen and rejected — are allowed to travel freely relative to the reference, while the rest of the vocabulary is held tightly near it. Loosen the ones you're fixing, pin everything else.\n\n### Building the training row\n\nA training row is a `[prompt prefix, one rejected token, one or more chosen tokens]` tuple, and Liquid mines it straight from the model's own failures. They generate completions on a loop-eliciting prompt mix (`LiquidAI/antidoom-mix-v1.0`) at low temperature, detect a loop with the ≥4-repeats / ≥60-chars rule, and target the **first token of the first repeat** as the rejected token. At that position they take the base model's top-k log-prob alternatives, filter out short and non-alphanumeric noise, and keep up to **20** plausible substitutes as the chosen set. Before training they regularize the two distributions, because a small set of culprits (`Wait`, `So`, `the`) would otherwise dominate — and over-suppressing them degrades reasoning.\n\n<Figure\n  src=\"/articles/antidoom/fig2.png\"\n  alt=\"A training example. A boxed prompt prefix shows a chat template: user asks who voiced Davy Jones; the assistant thinking trace reads 'Bill Nighy is the voice for Davy Jones. Wait, let me check if there's any other actor. No, Bill Nighy is the one.' Below, the token 'Wait' is labelled Rejected (down arrow), and three tokens 'Let's', 'Yes', 'Ok' are each labelled Chosen (up arrow).\"\n  caption=\"One FTPO training row: the prompt prefix ends where the loop restarts, 'Wait' is the single rejected token, and several plausible continuations ('Let's', 'Yes', 'Ok') are the chosen set (Liquid AI, Antidoom blog).\"\n/>\n\nThe next widget is the same idea in logit space — the mechanism the figure above doesn't draw. 
Toggle **reference → after FTPO** to watch the one trained position: the rejected `Wait` logit driven down, several chosen logits lifted up, and the entire rest of the vocabulary pinned near where it started.\n\n<FinalToken />\n\nThat last property is the reason this works without collateral damage. FTPO isn't teaching the model anything about Davy Jones or about calculus; it's redistributing probability at the exact positions where the model was getting stuck, and leaving the rest of the distribution alone.\n\n## Results\n\nThe headline is the doom-loop rate under greedy decoding. On the early LFM2.5-2.6B checkpoint it drops from 10.2% to 1.4%; on Qwen3.5-4B, from 22.9% to 1%.\n\n<Figure\n  src=\"/articles/antidoom/fig3.png\"\n  alt=\"Bar chart titled 'Doom-loop Rate', benchmark all temp=0, lower is better. LFM2.5-2.6B: base 10.20, anti doom-loop 1.40. Qwen3.5-4B: base 22.90, anti doom-loop 1.00.\"\n  caption=\"Doom-loop rate at temp=0 (lower is better), base vs Antidoom, for both models (Liquid AI, Antidoom blog).\"\n/>\n\n<BenchBars\n  title=\"Doom-loop rate (%, temp=0) — lower is better · provider-reported\"\n  unit=\"%\"\n  bars={[\n    { label: \"LFM2.5-2.6B base\", value: 10.2 },\n    { label: \"LFM2.5-2.6B + FTPO\", value: 1.4, highlight: true },\n    { label: \"Qwen3.5-4B base\", value: 22.9 },\n    { label: \"Qwen3.5-4B + FTPO\", value: 1.0, highlight: true },\n  ]}\n/>\n\nEval scores go up across the board — but Liquid is careful about the causal story, and so am I. The training set teaches the model nothing new about math or code; it removes the failure mode that was preventing the model from reaching answers it could already produce. A completion that used to spiral into `Wait, let me reconsider…` until it ran out of tokens now finishes and gets scored. The gain is recovered credit, not new capability.\n\n### The temperature tradeoff\n\nThis is the honest catch, and it's a genuinely interesting one. 
Break the LFM2.5-2.6B evals out by decoding temperature:\n\n<Figure\n  src=\"/articles/antidoom/fig4.png\"\n  alt=\"Eight small line charts of score vs temperature (0 to 1) for LFM2.5-2.6B early checkpoint, base (grey) vs antidoom (purple): Average, Doom-loop rate, AIME25, GPQA, GSMPlus, IFEval, LiveCodeBench v6, RULER. The antidoom curves start far higher than base at temp 0 on most panels; the base curves rise with temperature and largely catch up by temp 1.0. On the Average panel antidoom peaks around temp 0.33 (~48.7) and falls to ~44.8 at temp 1.0, meeting the base curve.\"\n  caption=\"LFM2.5-2.6B early checkpoint, score vs temperature: base (grey) vs Antidoom (purple) across eight evals. Antidoom leads by a wide margin at low temperature and the two converge near temp=1.0 (Liquid AI, Antidoom blog).\"\n/>\n\nThe average score panel tells it: Antidoom leads by roughly 8 points at temp 0 (≈47 vs ≈38), peaks around temp 0.33, and then falls back to meet the base curve near temp 1.0 (≈45). The base model, meanwhile, climbs steadily with temperature — because sampling was its only escape from the loops. So FTPO effectively **shifts the model's best operating temperature downward**: it makes low-temperature decoding safe, which is where you'd want to run a small reasoning model anyway, but it gives up the high-temperature regime, where the extra randomness now mostly adds noise instead of buying an exit. That cuts against the usual intuition that reasoning models like a bit of temperature.\n\n### Cost and recipe\n\nThe whole thing is cheap, which is the other reason to care. For the early LFM2.5-2.6B checkpoint, generating the training set took about **1 hour on 8× MI325** GPUs (bounded by the model's own loop rate, since generation stops when it catches loops), and training took about **1–2 hours on a single MI325**. 
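\n\nAs a rough tally of those two figures, and assuming all eight GPUs stayed busy for the full generation hour (my assumption, not a number from the post), one round comes to about ten GPU-hours:\n\n$$\n\\underbrace{8 \\times 1\\,\\text{h}}_{\\text{generation}} + \\underbrace{1 \\times (1\\text{ to }2)\\,\\text{h}}_{\\text{training}} \\approx 9\\text{ to }10\\ \\text{GPU-hours}\n$$\n\nA second FTPO round would roughly repeat that bill, though generation is bounded by the loop rate and so should shrink as loops become rarer.\n\n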
The recipe:\n\n```yaml\nmethod: FTPO (DPO-family, final-token)\nepochs: 1\nadapter: LoRA            # rank 128–256 — higher learnability, less degradation\ntrain_modules: [attention_proj, mlp_proj, lm_head]\nlearning_rate: 4e-6 – 2e-5\nearly_stop: chosen_win = 0.35   # fraction of samples where chosen beat rejected\n```\n\nTwo guardrails matter. **Over-training happens easily** — training past the `chosen_win=0.35` stopping point tended to degrade the model and, ironically, spawn *new* doom loops. Stopping at that threshold typically pulled loop rates from 20–30% down to 1–2% with minimal degradation. And FTPO is **iterative by design**: after one round the loop-causing tokens are rejected and probability is reweighted toward the chosen alternatives, but that can expose new failure points where *other* tokens start looping, so a second round targets the newly surfaced loops.\n\n## The take\n\nDoom loops are a small-model, low-temperature, hard-problem failure, and the diagnosis here is clean: overtrained discourse-marker priors plus self-reinforcing context plus greedy decoding equals a distribution that collapses onto one token with no exit. FTPO is a tidy fix because it matches the shape of the problem — retrain the one position that restarts the loop, spread the escape probability across several tokens instead of minting a new favourite, and pin the rest of the vocabulary in logit space so you don't disturb what the model already does well. The results are strong (10.2% → 1.4%, 22.9% → 1%) and, importantly, honestly framed: the eval gains are recovered credit for answers the model could already reach, the win is concentrated at low temperature and fades by temp 1.0, over-training is a real risk with a specific stopping rule, and it can take more than one round. 
For anyone shipping a small reasoning model that greedy-decodes in production, a 1–2 hour LoRA pass that removes a 10–20% failure mode is an easy trade.\n\n---\n\n*Built on Liquid AI's [Antidoom: Reducing Doom Loops with Final Token Preference Optimization](https://www.liquid.ai/blog/antidoom) (2026) and the [`LiquidAI/antidoom-mix-v1.0`](https://huggingface.co/datasets/LiquidAI/antidoom-mix-v1.0) dataset. All benchmark numbers are provider-reported; the four charts are reproduced from the blog for commentary, and the two interactive widgets are illustrations of the mechanism, not measured traces.*\n","readingTimeMins":11,"url":"https://ai.thesatyajit.com/articles/antidoom","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"A field guide to attention mechanisms","description":"One operation, two bills — O(N²) compute and the O(N·layers) KV cache — and a map of how every variant pays them down: MHA, MQA, GQA and MLA on the memory axis; sliding-window, sinks, dilated, block-sparse and content-based NSA on the mask axis; linear/kernelized attention that drops the matrix; and the exact-but-IO-aware systems layer of FlashAttention and PagedAttention.","date":"2026-07-08","tags":["explainer","attention","transformers","long-context","kv-cache"],"draft":false,"featured":true,"interest":4,"helpful":5,"kind":"articles","slug":"attention-mechanisms","body":"Attention is a single operation, and almost everything else in a transformer is plumbing around it. The zoo of named variants — MQA, GQA, MLA, sliding-window, BigBird, NSA, FlashAttention — can read like a pile of unrelated tricks. It isn't. Nearly every one is a targeted answer to a **specific bill** that plain attention runs up, and once you know which bill a mechanism is paying down, the whole field organizes itself.\n\nThis is a map, not an encyclopedia. 
For the fundamentals — what Q, K and V are and why the dot product means \"relevance\" — start with [how transformers attention works](/articles/how-transformers-attention-works). Here I assume that and go wide: the families, the mechanics, the exact costs, and the honest thing each one gives up.\n\n<FamilyMap />\n\n## The one operation, and its two bills\n\nFor one query vector, attention scores every key by dot product, turns the scores into weights with softmax, and returns the weight-blended values:\n\n$$\n\\mathrm{Attention}(Q,K,V) = \\mathrm{softmax}\\!\\left(\\frac{QK^{\\top}}{\\sqrt{d_k}}\\right)V\n$$\n\n$Q \\in \\mathbb{R}^{N\\times d_k}$ are the queries, $K \\in \\mathbb{R}^{N\\times d_k}$ the keys, $V \\in \\mathbb{R}^{N\\times d_v}$ the values, $N$ the sequence length, and $d_k$ the per-head key dimension. The $1/\\sqrt{d_k}$ is not cosmetic: if $q$ and $k$ have unit-variance entries, $q\\cdot k$ has variance $\\approx d_k$, so without the scale the logits grow with head width and push softmax into a near–one-hot corner where its gradient vanishes. Multi-head attention runs $h$ of these in parallel on different learned projections and concatenates — different heads settle on different relationships over the same tokens.\n\n<ScaledDotProduct />\n\nNow the bills. That one line hides two very different costs, and they are the two axes this whole guide is organized around.\n\n**Bill one — the O(N²) score matrix (the compute / *mask* axis).** The product $QK^{\\top}$ is an $N\\times N$ matrix: every query dotted with every key. Time and memory are $O(N^2 d)$. Double the context and the work quadruples. Every *sparse* method is an answer to this bill — a rule for **which (query, key) pairs to actually compute**.\n\n**Bill two — the KV cache (the memory axis).** At inference a decoder generates one token at a time and re-reads the whole past, so it caches the keys and values it already computed. 
That cache is $2 \\cdot n_h \\cdot d_h$ elements **per token, per layer** (one key and one value per head) — it grows linearly with context *and* with depth, and at long context it, not the FLOPs, is what pins you to the hardware. Every *KV-sharing* and *compression* method is an answer to this bill.\n\nA third, quieter cost sits underneath both: attention is **memory-bandwidth bound** on real GPUs — moving the $N\\times N$ scores in and out of HBM dominates. That is not a math problem, and it gets its own family (FlashAttention) that changes *how* the exact same numbers are computed.\n\nKeep the two bills in mind and the families below stop looking like a zoo.\n\n## Bill one: which pairs do we compute? (the mask axis)\n\nThe cleanest way to cut the $O(N^2)$ matrix is to not compute most of it. A **mask** says, for each query, which keys it may read; the rest are dropped. The single diagram below is the shared language for this entire axis — a query cursor lighting exactly the keys it is allowed to see, under seven different patterns. Every later sparse diagram reuses these colors.\n\n<MaskExplorer />\n\n**Bidirectional (encoder) attention** is the no-mask case: every query reads the whole sequence, both directions. It is what BERT-style encoders use, and it is the full $O(N^2)$ bill — appropriate when the input is short and you want maximum mixing.\n\n**Causal (decoder) attention** masks the strict upper triangle: a query at position $q$ may read positions $0\\ldots q$ only, never the future. This is what every autoregressive LLM uses. It halves the constant but the asymptotics are still $O(N^2)$ — the triangle is half of a square.\n\n**Cross-attention** is a different picture entirely, because the queries and the keys come from *different sequences*. The decoder's query reads the encoder's keys and values, with no causal mask, because the whole source is already known. 
It is the join between two sequences — French to English in translation, image patches to caption in a vision-language model.\n\n<CrossAttention />\n\n### Structured sparsity: a fixed, position-based pattern\n\nThe first real savings come from a *fixed* rule that depends only on position.\n\n**Sliding-window attention** (Mistral 7B) lets each query read only the previous $w$ keys, dropping cost to $O(N\\cdot w)$ and — crucially — capping the KV cache at $w$ per layer instead of $N$. Mistral uses $w=4096$ over 32 layers. The catch is obvious: one window layer cannot see past $w$. The rescue is depth. Stacked window layers **compound** their reach — after $k$ layers information can travel $k\\cdot w$ tokens, so Mistral's last layer has a theoretical span of $4096\\times 32 \\approx 131{,}000$ tokens. That \"theoretical\" matters: a window model needs either stacking, or global/sink tokens, or interleaved global layers, or it genuinely loses long-range information.\n\n**Attention sinks** (StreamingLLM) explain *why* a naive sliding window degrades, and fix it cheaply. Softmax weights must sum to 1, so a query with nothing important to attend to still has to put its mass *somewhere* — and models learn to dump that excess onto the first few tokens. Evict those tokens (as a rolling window does) and the whole distribution destabilizes; perplexity explodes. Keeping just **four initial tokens** as always-visible \"sinks\" plus a recent window restores stable streaming to millions of tokens, with a reported up-to-22× speedup over recomputation. The sinks carry almost no information — they are a pressure-release valve for softmax.\n\n**Block-sparse attention** (BigBird) combines three fixed pieces at block granularity — a local **window**, a few **global** tokens every query sees, and a few **random** blocks for mixing — for $O(N)$ cost. 
The theory is the reassuring part: with the global and random pieces, BigBird is still a universal approximator of sequence functions and Turing complete, so the sparsity does not cost you expressive power in principle. Longformer is the same local-plus-global idea for long documents.\n\n**Dilated / strided attention** attacks distance instead of density. The Sparse Transformer factorizes attention into **strided** and **fixed** patterns for $O(N\\sqrt{N})$; LongNet's **dilated attention** grows the stride exponentially with distance so any two tokens connect in a logarithmic number of hops, reaching $O(N)$. A single dilated head skips across the sequence; combine a few at different strides and every position stays reachable.\n\nThese structured patterns are cheap and predictable, and they are the workhorses of production long-context models — usually **interleaved** with occasional full-attention layers so long-range information still has a path. Gemma 2 alternates local:global 1:1 (window 4096); Gemma 3 shifts to 5:1 with a 1024 window, so only about one layer in six caches the full 128K context and KV-cache overhead drops from roughly 60% to under 15% with little quality loss. MiMo-V2-Flash uses one global layer in six; Character.AI reports a similar 5:1, 1024-window design with over 20× KV-cache reduction. Two of these get full treatments here: [Gemma 4's interleaving and KV budget](/articles/gemma-4) and [MiMo-V2-Flash's 5:1 hybrid](/articles/mimo-v2-flash).\n\n### Content-based sparsity: let the query choose\n\nFixed patterns are blind to content — they read the same positions whether or not those positions matter. The 2025–26 frontier is **learned, content-based selection**: score the past cheaply, then read only the blocks that actually matter *for this query*.\n\n**Native Sparse Attention** (DeepSeek, 2025) is the cleanest example. 
For each query it runs three branches in parallel over the same KV — a **compression** branch that squashes the past into coarse block summaries (global gist), a **selection** branch that scores blocks and keeps the top-$n$ at full resolution (the important detail), and a **sliding-window** branch (local coherence) — then a learned gate blends them:\n\n$$\no_t = \\sum_{c\\,\\in\\,\\{\\text{cmp},\\,\\text{slc},\\,\\text{win}\\}} g_t^{\\,c}\\;\\mathrm{Attn}\\!\\left(q_t,\\,\\tilde K_t^{\\,c},\\,\\tilde V_t^{\\,c}\\right),\n\\qquad g_t^{\\,c}\\in[0,1]\n$$\n\nThe gate scores $g_t^{\\,c}$ come from a small MLP-plus-sigmoid on the query. The reason this matters is in the name: NSA is **natively trainable** — all three read paths are differentiable, so the sparsity is learned end-to-end rather than bolted onto a dense checkpoint at inference time, and it is designed to be hardware-aligned (Tensor-Core-friendly block sizes).\n\n<NsaBranches />\n\nTwo siblings ship the same idea in production models, and I have covered both in depth: [MiniMax Sparse Attention](/articles/minimax-sparse-attention) scores the past in 128-token blocks and keeps the top-$k$ whole blocks; [LongCat Sparse Attention](/articles/longcat-2) goes finer with a hierarchical coarse-recall-then-token-select index, shares the index across layers, and reshapes the reads for coalesced memory access. This is the least-settled family in the guide — the shape of \"trained-in sparsity\" is still moving — but it is where the interesting long-context work is happening.\n\n## Bill two: how do we share or shrink K/V? (the memory axis)\n\nThe mask axis leaves attention *exact within what it reads*. The memory axis is orthogonal: keep full attention, but pay less to cache K and V. 
This is the difference between sharing heads and compressing them, and the two are genuinely different axes — you can combine them.\n\n<KvSharing />\n\n**Multi-Query Attention** (MQA) is the blunt version: keep all $h$ query heads but share a **single** key head and value head across them. The cache drops from $2 \\cdot n_h \\cdot d_h$ to $2 \\cdot d_h$ per token — a factor of $n_h$. It targets exactly the decode-time memory-bandwidth bottleneck, and it costs some quality and can destabilize training.\n\n**Grouped-Query Attention** (GQA) is the middle ground almost everyone now uses. Split the query heads into $G$ **groups**; each group shares one key head and one value head, so the cache is $2 \\cdot G \\cdot d_h$. The important thing to get right: GQA interpolates by **KV-head groups**, not by reducing query heads — you keep all $h$ query heads, they just fan into $G$ shared KV heads. $G=1$ is exactly MQA; $G=h$ is exactly MHA. Llama 2 70B adopted it, and the quality is essentially MHA's at a fraction of the cache.\n\n**Multi-head Latent Attention** (MLA, DeepSeek-V2) sits on a *different axis*: it does not share heads, it **compresses** them. K and V are jointly projected down to a shared low-rank **latent** vector $c_t^{KV}$, cached in place of the per-head keys and values, then up-projected back to all heads at compute time. Because RoPE is incompatible with folding the up-projection into the query, MLA carries positional information on a small **decoupled** key dimension. DeepSeek-V2 caches $\\tfrac{9}{2}\\,d_h$ per token — about what GQA with 2.25 groups would cost — while reporting quality at or above full MHA. (Its headline \"93.3% smaller KV cache\" is measured against DeepSeek 67B, itself a GQA model, not against full MHA; the clean comparison is the $\\tfrac{9}{2}\\,d_h$ figure.) 
Sharing versus compressing is the real distinction between GQA and MLA.\n\n## Drop the matrix entirely: linear and kernelized attention\n\nBoth axes above still compute a softmax. **Linear attention** asks whether we need it at all. Softmax puts a non-linearity between $Q$ and $K$, which is exactly what forces the $N\\times N$ matrix to exist. Replace it with a kernel feature map $\\phi$ and the product reassociates:\n\n$$\n\\mathrm{softmax}(QK^{\\top})V \\;\\longrightarrow\\; \\phi(Q)\\big(\\phi(K)^{\\top}V\\big)\n$$\n\nComputing $\\phi(K)^{\\top}V$ first gives a $d\\times d$ matrix, never an $N\\times N$ one. For autoregressive decoding this becomes a running state updated once per token, exactly like an RNN:\n\n$$\nS_t = S_{t-1} + \\phi(k_t)\\,v_t^{\\top},\n\\qquad \\mathrm{out}_t = \\frac{\\phi(q_t)^{\\top} S_t}{\\phi(q_t)^{\\top} z_t},\n\\qquad z_t = z_{t-1} + \\phi(k_t)\n$$\n\n$S_t \\in \\mathbb{R}^{d\\times d}$ is the fixed-size state, $z_t$ the normalizer. Time is $O(N d^2)$, memory is **constant** in $N$, and context is unbounded. The diagram below is the whole pitch: the softmax triangle grows quadratically as tokens stream in, while the linear state just updates in place.\n\n<LinearAttention />\n\nThe honest cost is real, and I want to state it plainly: linear attention is an **approximation** of softmax, and pure linear models are usually **weaker on recall** — pulling one exact fact out of a long context is precisely where a compressed fixed-size state struggles. Performer's FAVOR+ makes the approximation principled (random features that provably, unbiasedly approximate the softmax kernel); \"lightning\" and gated-linear variants add a decay so old state fades. 
In practice the winning form today is **hybrid** — MiniMax-01 interleaves seven lightning (linear) blocks per one softmax block, buying linear-time bulk with periodic exact attention to restore recall.\n\n## Exact, but IO-aware: the systems layer\n\nThis family is different in kind, and it is worth being explicit: **FlashAttention and PagedAttention are not new attention functions.** They compute the same softmax attention with no approximation; results match the standard implementation up to floating-point rounding. They change *where the bytes move*. I include them because \"attention is slow\" is usually a memory-traffic statement, not a FLOP statement, and these are the fix.\n\n**FlashAttention** attacks the fact that materializing the $N\\times N$ scores in slow HBM is the real bottleneck. It **tiles** the computation: a block of queries stays in fast on-chip SRAM while blocks of K and V stream past it, and it maintains an **online softmax** — a running max $m$ and running sum $\\ell$ — so it can produce the exact softmax result without ever writing the full matrix to HBM.\n\n<FlashTiling />\n\nThe payoff is an IO complexity of $\\Theta(N^2 d^2 / M)$ HBM accesses, where $M$ is the SRAM size, versus $\\Theta(Nd + N^2)$ for the standard implementation — many-fold fewer round-trips for typical $d$ and $M$, and the reason a long-context forward pass stopped being memory-bound. FlashAttention-2 and -3 push the same idea with better GPU work-partitioning and FP8 on Hopper. It is exact; the only thing it \"gives up\" is the naive implementation's simplicity.\n\n**PagedAttention** (vLLM) does for the KV cache what FlashAttention does for the scores — a systems fix, not a math one. Before it, a serving engine reserved one *contiguous* buffer per request sized to the maximum output length, so a short generation left most of its reservation wasted (internal fragmentation), and two requests could share nothing. 
PagedAttention stores the cache in fixed-size **blocks** with a per-sequence **block table**, exactly like OS virtual memory: blocks are handed out on demand (near-zero fragmentation), and a shared prompt prefix maps to the **same physical blocks** via copy-on-write.\n\n<PagedKv />\n\nThe result is many more concurrent sequences per GPU with identical model outputs — which is why paged KV is now table stakes for serving.\n\n## A quality move, not an efficiency one: differential attention\n\nNot every variant is about cost. **Differential attention** (Microsoft, 2024) targets a *quality* failure: softmax spends attention mass on irrelevant tokens because the weights are forced to sum to 1 — the same pressure that creates attention sinks also creates broadband \"attention noise.\" The fix borrows from differential amplifiers: compute two softmax maps and return their difference.\n\n$$\n\\mathrm{DiffAttn}(X) = \\Big(\\mathrm{softmax}\\!\\big(\\tfrac{Q_1 K_1^{\\top}}{\\sqrt{d}}\\big) - \\lambda\\,\\mathrm{softmax}\\!\\big(\\tfrac{Q_2 K_2^{\\top}}{\\sqrt{d}}\\big)\\Big)V\n$$\n\nThe irrelevant mass is roughly common to both maps, so it subtracts away; the genuine peaks, which differ between the maps, survive. $\\lambda$ is learned per head (reparameterized, initialized around 0.8), and the reported effect is sparser attention and better long-context retrieval and in-context recall.\n\n<DifferentialAttention />\n\nIt costs roughly double the attention compute and cache (two maps), and it is still $O(N^2)$ — this buys accuracy, not efficiency. Worth it when the failure mode is a model that gets \"distracted\" in long context.\n\n## The whole map, in one table\n\nComplexities are per attention layer; $N$ is sequence length, $d$ the model width, $n_h$ heads, $d_h$ head dim, $w$ window, $k$ selected blocks, $M$ SRAM size. \"KV cache\" is per token per layer.\n\n| Mechanism | Bill it pays | Time | KV cache | Exact? 
| Quality / recall | Reach for it when |\n|---|---|---|---|---|---|---|\n| MHA (full) | baseline | $O(N^2 d)$ | $2\\,n_h d_h$ | exact | reference | short context; training |\n| MQA | memory | $O(N^2 d)$ | $2\\,d_h$ | exact | small drop | decode memory-bound |\n| GQA | memory | $O(N^2 d)$ | $2\\,G\\,d_h$ | exact | ≈ MHA | default for large models |\n| MLA | memory | $O(N^2 d)$ | $\\tfrac{9}{2}\\,d_h$ | exact | ≥ MHA (reported) | long context + quality |\n| Sliding-window | compute | $O(N w d)$ | capped at $w$ | exact in window | loses distance unless stacked/interleaved | cheap long context |\n| Sink / StreamingLLM | compute + memory | $O(N w)$ | sinks + window | exact in kept set | stable, not true long-range recall | unbounded streaming |\n| Dilated (LongNet) | compute | $O(N)$ | bounded | exact in pattern | pattern-limited | extreme length |\n| Block-sparse (BigBird) | compute | $O(N)$ | bounded | exact in pattern | near-full w/ global+random | long documents |\n| Content-based (NSA / MSA / LSA) | compute | $O(N k)$ | blockwise | exact in selection | near-full if selection is good | trained-in long context |\n| Linear / Performer | compute + memory | $O(N d^2)$ | $d\\times d$ state | approximate | weaker recall | very long, recall-tolerant |\n| FlashAttention | systems (IO) | $O(N^2 d)$, $\\Theta(N^2 d^2/M)$ HBM | same as base | exact | none (identical) | always — default kernel |\n| PagedAttention | systems (memory) | same as base | block-paged | exact | none (identical) | serving many sequences |\n| Differential | quality | $O(N^2 d)$ (≈2×) | ≈2× | exact | better retrieval | reduce attention noise |\n\n## What's settled, what's still moving\n\nSome of this is infrastructure now. **GQA** is the default attention for large models; **FlashAttention** is the default kernel; **PagedAttention** is the default cache manager; **sliding-window interleaved with periodic global (or sink) layers** is the standard recipe for cheap long context. 
If you are building a model today, those four are choices you make without much agonizing.\n\nThe frontier is the content-based sparse family — **NSA**, **MiniMax Sparse Attention**, **LongCat Sparse Attention** — where the model *learns what to read*. The promise is compelling (near-full quality at a fraction of the reads, trained end-to-end) but the designs are still diverging on granularity, how the index is shared across layers, and how to make the reads hardware-friendly; there is no settled winner yet. **Linear and kernelized attention** remains the most tantalizing and the most caveated: constant-memory unbounded context is exactly what you want, and the recall gap is exactly why pure-linear models have not displaced softmax — hybrids are the pragmatic answer for now. And attention lives inside a larger design space: whether to specialize behavior at the **head** level rather than the layer level is its own question, which I dig into in [HydraHead](/articles/hydrahead).\n\nThe map is stable even as the territory shifts. Every new mechanism you meet is answering one of the same two questions: *which (query, key) pairs do we compute*, and *how do we pay for the K/V we keep*. Place it on those axes and you already understand most of what it does — and what it gives up.\n\n---\n\n*The interactive diagrams are illustrations of each mechanism, not measured traces; real windows, blocks, and head counts are far larger than what fits on screen. 
Primary sources: Attention Is All You Need (Vaswani et al., 2017, arXiv 1706.03762); Fast Transformer Decoding / MQA (Shazeer, 2019, arXiv 1911.02150); GQA (Ainslie et al., 2023, arXiv 2305.13245); DeepSeek-V2 / MLA (DeepSeek-AI, 2024, arXiv 2405.04434); Mistral 7B (Jiang et al., 2023, arXiv 2310.06825); StreamingLLM (Xiao et al., 2023, arXiv 2309.17453); Longformer (Beltagy et al., 2020, arXiv 2004.05150); BigBird (Zaheer et al., 2020, arXiv 2007.14062); Sparse Transformer (Child et al., 2019, arXiv 1904.10509); LongNet (Ding et al., 2023, arXiv 2307.02486); Native Sparse Attention (Yuan et al., 2025, arXiv 2502.11089); Differential Transformer (Ye et al., 2024, arXiv 2410.05258); Transformers are RNNs / linear attention (Katharopoulos et al., 2020, arXiv 2006.16236); Performer / FAVOR+ (Choromanski et al., 2020, arXiv 2009.14794); FlashAttention (Dao et al., 2022, arXiv 2205.14135) and FlashAttention-2/-3 (arXiv 2307.08691, 2407.08608); PagedAttention / vLLM (Kwon et al., 2023, arXiv 2309.06180); Gemma 2 and 3 (Gemma Team, 2024/2025, arXiv 2408.00118, 2503.19786); MiniMax-01 (2025, arXiv 2501.08313).*\n","readingTimeMins":17,"url":"https://ai.thesatyajit.com/articles/attention-mechanisms","signal":{"interest":4,"helpful":5,"score":9,"level":5,"label":"Essential"}},{"title":"Frontier RL is cheaper than you think: ship deltas, not mega-clusters","description":"Fireworks argues frontier reinforcement learning does not need one co-located mega-cluster. Because more than 98% of a model's weights stay bit-identical between adjacent RL checkpoints, you can ship a ~2% delta across regions instead of the full ~1 TB on every update — cutting cross-region weight traffic ~94% over a 50-step window and letting asynchronous rollouts run on scattered GPU capacity. 
A walk through the delta-compression and async-RL argument, the numbers, and where it stops working — with the vendor framing named.","date":"2026-07-08","tags":["explainer","reinforcement-learning","systems","inference-optimization"],"draft":false,"cover":"/articles/frontier-rl-cheaper/fig1.png","featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"frontier-rl-cheaper","body":"Fireworks' argument in [*Frontier RL Is Cheaper Than You Think*](https://fireworks.ai/blog/frontier-rl-is-cheaper-than-you-think) is narrow and load-bearing: the belief that reinforcement-learning post-training needs one giant co-located cluster rests on a single assumption — that **every policy update ships the full ~1 TB checkpoint to the rollout fleet**. It does not. Between adjacent RL checkpoints most weights do not change at all, so you can ship a compressed **delta** — a couple of percent of the model — and keep a rollout fleet fresh over ordinary cross-region links. That is the whole claim, and everything else follows from it.\n\n<Callout type=\"warn\">\nThis is a **vendor blog**. Fireworks sells the training and rollout-serving platform this argument recommends, so read the framing accordingly. The numbers below are **Fireworks-reported** from one sample setup, not an independent benchmark. What is *not* vendor-specific is the underlying physics — weight-update sparsity in RL — which a separate paper ([arXiv 2602.03839](https://arxiv.org/abs/2602.03839)) reports independently. I have kept the two apart throughout.\n</Callout>\n\n## RL has two jobs, not one\n\nThe mega-cluster instinct comes from pretraining, where the systems problem is keeping **one** huge synchronous job saturated. RL is a different shape. An RL run has two coupled jobs:\n\n- The **trainer** runs forward, reward computation, backward, and the optimizer step. 
It wants dense, tightly-coupled hardware — the pretraining kind.\n- The **rollout fleet** samples trajectories from the *current* policy — that is, it runs inference on the latest weights, across many parallel requests. It wants inference throughput, and it can live anywhere.\n\nPretraining only has the first job. RL has both, and the awkward part is the seam between them: how do you keep a large rollout fleet generating from a *fresh enough* policy without stalling on checkpoint transfers every step? That coupling is the whole systems problem, and it is why RL cost lives somewhere non-obvious. The same two-job structure shows up whenever RL trains agentic models — the trajectory-generation half is exactly the rollout fleet here (I wrote about the trajectory side in [Agents-A1](/articles/agents-a1)).\n\n<Figure\n  src=\"/articles/frontier-rl-cheaper/fig1.png\"\n  alt=\"Diagram of a cross-region RL weight-update loop. A policy trainer runs forward/backward plus an optimizer step and emits a base checkpoint every N=25 steps, with changed weights of about 2% per step. A weight-update handoff ships full checkpoints plus compact deltas to three rollout regions — US Ohio (+43 ms), US Virginia (+58 ms), EU Frankfurt (+145 ms) — each receiving a 2.0% compressed delta of weights. A 50-step sample window at the bottom shows one full checkpoint resetting the chain, then deltas of about 2% in between, for more than 98% less cross-region traffic.\"\n  caption=\"The loop: one trainer emits a full checkpoint every N steps and a ~2% delta in between; every rollout region reconstructs the same checkpoint from a shared delta chain over ordinary links. >98% less cross-region traffic than shipping the full model each step (Fireworks AI, Fig 1).\"\n/>\n\n## The 1 TB problem\n\nA frontier checkpoint is around 1 TB. 
If every policy refresh really required shipping that whole tensor to the rollout fleet, the conclusion writes itself: keep trainer and inference on the same RDMA-class fabric, avoid long-distance transfers, treat remote capacity as second class. That is the mega-cluster story, and its side effect is economic — frontier RL looks like a market only a handful of companies with a co-located supercluster can enter.\n\nThe premise is the full-checkpoint transfer. Break it and the conclusion goes with it.\n\n## The key insight: exploiting 98% sparsity\n\nBetween nearby RL checkpoints, most weights barely move. Fireworks reports that **more than 98% of weights in `bf16` remain bit-equivalent between consecutive checkpoints**, and the unchanged fraction is higher still at lower precision. Their explanation is mechanical: RL delivers a very sparse learning signal — a few bits of reward per rollout — so training runs a small learning rate, and most parameters shift so little in `fp32` that they never cross the threshold to change their 16-bit representation. The independent paper *Understanding and Exploiting Weight Update Sparsity for Communication-Efficient Distributed RL* ([arXiv 2602.03839](https://arxiv.org/abs/2602.03839)) reports the same phenomenon, often **around 99%** sparsity in practical RL settings.\n\nIf only ~2% of the bits change, you should move ~2% of the bits. 
The figure's illustrative per-tensor sample makes the sparsity concrete — the change is spread thinly, and even the busiest tensor moves under 1%:\n\n<BenchBars\n  title=\"Per-tensor change between adjacent checkpoints (%) — illustrative sample\"\n  unit=\"%\"\n  bars={[\n    { label: \"embed_tokens\", value: 0.0 },\n    { label: \"attn.q_proj\", value: 0.2 },\n    { label: \"attn.o_proj\", value: 0.4 },\n    { label: \"mlp.gate_up\", value: 0.7, highlight: true },\n    { label: \"mlp.down_proj\", value: 0.2 },\n    { label: \"final_norm\", value: 0.1 },\n  ]}\n/>\n\nThe mechanism is a periodic **full base checkpoint every N steps** to reset the chain, then a **compressed delta** in between. Only changed weight slices survive into the payload; each carries reconstruction metadata (`prev_snapshot_id`, `tensor_checksum`, `shape + dtype`) so every rollout cluster rebuilds the exact next checkpoint **losslessly** from shared object storage, then verifies the checksum before swapping.\n\n<Figure\n  src=\"/articles/frontier-rl-cheaper/fig2.png\"\n  alt=\"Three-panel diagram of delta-compressed weight updates. Panel 1, identify changed weights: a table of tensors (embed_tokens 0%, attn.q_proj 0.2%, attn.o_proj 0.4%, mlp.gate_up 0.7%, mlp.down_proj 0.2%, final_norm 0.1%) showing adjacent checkpoints differ in a few chunks, not everywhere, so unchanged chunks stay out of the transfer. Panel 2, package changed tensors: keep only the changed chunks, bit-pack and encode them into a delta payload, and attach reconstruction metadata (prev_snapshot_id, tensor_checksum, shape and dtype). 
Panel 3, reconstruct and swap: fetch the previous snapshot, decode the payload, apply the delta, verify the checksum, then swap weights in place.\"\n  caption=\"Identify the changed weights, bit-pack only those into a compact payload with reconstruction metadata, then rebuild and checksum-verify the exact checkpoint on the rollout side before an in-place swap (Fireworks AI, Fig 2).\"\n/>\n\nIn Fireworks' sample setup a full checkpoint is **1024 GiB**, the average delta between adjacent checkpoints is **20.3 GiB — 1.98% of the model**, and over a 50-step window that cuts cross-region transfer volume by **about 94%** versus moving the full model every time. The arithmetic is worth writing down. With window $W$ steps, a full checkpoint every $N$ steps ($f=\\lceil W/N\\rceil$ fulls), delta fraction $\\delta$, and $R$ regions, the total cross-region volume for a checkpoint of size $C$ is:\n\n$$\nV_\\text{full} = W \\cdot C \\cdot R, \\qquad V_\\text{delta} = \\bigl(f\\,C + (W-f)\\,\\delta C\\bigr)\\cdot R\n$$\n\nso the fraction saved is $1 - \\bigl[f + (W-f)\\,\\delta\\bigr]/W$, independent of both $C$ and $R$. Plug in $W=50$, $N=25$, $\\delta=0.0198$: $f=2$, and you move $[2 + 48\\cdot0.0198]/50 \\approx 5.9\\%$ of the naive volume — the reported ~94% cut. Drag the delta size and the cadence and watch where the crossover sits:\n\n<DeltaCost />\n\nTwo things the model makes obvious. The **percentage** saved does not depend on how many regions you feed — but the **absolute** bytes you stop moving scale with every region, which is the entire point of going distributed. And push the delta slider toward 100% and the two curves converge: at full-checkpoint-every-step you are back to the mega-cluster premise, where a co-located RDMA fabric is the only thing that can absorb the traffic.\n\n## Async RL, and why the delta size decides it\n\nSmall deltas are necessary but not sufficient. 
The other half is **asynchronous RL** (also called Pipeline RL): deliberately let the rollout fleet run a little **off-policy** so that generation and training overlap instead of taking turns. Idle samplers are the expensive failure mode; a few steps of staleness is usually an acceptable price to keep them busy.\n\nThat trade only works if the handoff is fast. Delta-compressed updates keep it small: Fireworks reports distributing a new checkpoint across globally-distributed rollout clusters takes **a few minutes end-to-end**, and the actual in-GPU-memory **weight swap stays well under a minute** because download and decompression are pipelined ahead of the swap. The trainer side is pipelined too — every step uploads to shared object storage, each rank caches its previous upload and transmits only the diff, upload is sharded across training GPUs, download across inference replicas, and compression plus transfer plus signaling run in the background so training never blocks.\n\nThe payoff is where the wall-clock goes: less time waiting on checkpoint movement, more time generating rollouts on fresh weights. But the async win depends on the delta being small enough to hide behind a generation window. Drag the payload up and watch the \"warm\" fleet start to stall and fall off-policy:\n\n<RolloutTimeline />\n\nThis is the honest coupling: async RL alone does not save you, and delta compression alone does not save you. It is the two together — a handoff small enough to overlap with generation — that turns a distributed fleet into usable capacity.\n\n## A note on staleness\n\nRunning trainer and fleet asynchronously means the fleet always serves a policy a few steps behind the trainer. That gap is **staleness**, and it is a real tradeoff, not a free lunch. The systems layer does not remove it — the *algorithm* still has to tolerate off-policy data. 
What delta compression buys is a staleness that is **bounded and predictable**: policy movement becomes a routine background operation instead of a stop-the-world full-checkpoint transfer. If your RL algorithm cannot stomach any off-policy data, none of this applies.\n\n## Multi-region rollout capacity\n\nHere the systems point turns strategic. Most teams do not have one contiguous idle supercluster for rollouts; they have GPUs scattered across regions, clouds, and availability zones, and stitching them into one co-located sampler fleet is painful even when the aggregate count exists. Once weight updates are small, that fragmented capacity becomes usable: each rollout cluster independently pulls and reconstructs weights from the same shared delta chain, with **no direct connection back to the trainer**. Add, remove, or rebalance clusters while they all track the same stream of policy updates.\n\nFireworks cites this in production: they say they ran Cursor's **Composer 2** RL training this way, with Federico Cassano describing the run as [\"distributed across 3 (sometimes 4) different clusters around the world\"](https://x.com/ellev3n11/status/2034778708163404102). Treat that as a vendor-provided data point — a single external quote, not a controlled measurement — but it is at least a concrete one, and it is a coding-agent model, the same agentic-RL regime as [Agents-A1](/articles/agents-a1).\n\nThe approach is not new territory Fireworks invented alone: it names [AReaL](https://arxiv.org/abs/2505.24298) for async RL and rollout-training disaggregation, and engineering notes from [Kimi](https://moonshotai.github.io/checkpoint-engine/) and [MiniMax](https://www.minimax.io/news/forge-scalable-agent-rl-framework-and-algorithm) on RL parameter updates and async scheduling. 
The contribution is running the delta-compressed, multi-region version in production.\n\n## When this argument stops working\n\nThe blog is unusually clear about its own boundaries, and the caveats matter:\n\n- **Small models.** If trainer and rollout inference fit on one node or a compact cluster, bandwidth was never the bottleneck and the simpler co-located setup wins. The whole argument is about the ~1 TB regime.\n- **Very frequent checkpoints.** If the trainer emits updates faster than the delta pipeline can distribute and apply one, staleness becomes the limiting factor and tighter co-location can make more sense again.\n- **Entangled rollout stacks.** If your rollout workers don't cleanly separate inference from training, treating them as a standard inference fleet is a poor fit and the disaggregated design loses its appeal.\n\nThere is also an assumption baked into the headline sparsity number: it is the fraction of `bf16` weights that stay **bit-identical**. That is exactly the right metric for a lossless delta, but it is a property of the *representation*, not just the math — a run at higher precision, a larger learning rate, or a more aggressive RL objective could move more bits and shrink the win. The ~2% is a measured sample, not a guarantee.\n\n## The take\n\nStrip the vendor framing and the load-bearing claim is a clean systems observation: RL post-training updates are sparse enough that the weight-sync between trainer and rollout fleet — the thing everyone assumed forced co-location — is ~2% of what you feared. That is corroborated independently ([arXiv 2602.03839](https://arxiv.org/abs/2602.03839)), and the engineering that follows (delta compression + checksummed reconstruction + async overlap + sharded pipelined transfer) is the ordinary, correct way to cash it in. 
The reproducible headline — ~1 TB checkpoint, 20.3 GiB average delta, ~94% less cross-region traffic over 50 steps — comes from one Fireworks sample setup, and the Composer 2 case is a single external quote, so treat the specific figures as illustrative rather than benchmarked. But the shape of the argument survives that discount: if the weights barely change, the mega-cluster was never load-bearing for RL. That is a genuinely useful thing to know before you go shopping for a co-located supercluster.\n\n---\n\n*Built on Fireworks AI's [Frontier RL Is Cheaper Than You Think](https://fireworks.ai/blog/frontier-rl-is-cheaper-than-you-think) (published 2026-03-23). All quantitative figures are Fireworks-reported from a sample setup unless attributed to [arXiv 2602.03839](https://arxiv.org/abs/2602.03839); the `DeltaCost` and `RolloutTimeline` widgets are my own cost models of their argument (relative/illustrative units), and the two reproduced diagrams are from the source post for commentary. The per-tensor bar chart uses the figure's own illustrative sample values.*\n","readingTimeMins":11,"url":"https://ai.thesatyajit.com/articles/frontier-rl-cheaper","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"Gemma 4: an open multimodal family, tuned to the KV-cache budget","description":"Google DeepMind's Gemma 4 is a family of open-weight, natively multimodal models from 2.3B to 31B — dense plus a 26B-A4B MoE — released under Apache 2.0 with a thinking mode. The engineering worth reading is the memory work: a 5:1 local-to-global attention ratio, pp-RoPE, and reusing keys as values on global layers cut the global KV cache ~37.5%; the 12B drops its vision and audio encoders for raw-patch projection; and an MTP drafter head does speculative decoding without prefill. 
This is a first-principles walk through those choices with the paper's own numbers.","date":"2026-07-08","tags":["explainer","llm","architecture","kv-cache","long-context"],"draft":false,"cover":"/articles/gemma-4/fig1.png","featured":false,"interest":3,"helpful":3,"kind":"articles","slug":"gemma-4","body":"**Gemma 4** is Google DeepMind's next open-weight family: natively multimodal (text, image, audio), released under **Apache 2.0**, with sizes from **2.3B to 31B**. Four dense models — effective **2.3B (E2B)**, **4.5B (E4B)**, **12B**, **31B** — plus a Mixture-of-Experts variant, **26B-A4B**, with **~3.8B activated** of 26B total. The headline features read like a checklist: a **thinking mode**, an **encoder-free 12B**, and a raft of efficiency work aimed at long context. I care about the last part most, because it's the part that decides whether you can actually serve these on the hardware you have.\n\n<Callout type=\"note\">\nEvery benchmark here is **first-party** — Google's own report, thinking mode on unless noted, and the Gemma 3 27B column it compares against is **non-thinking**, so some of the generational jump is thinking-mode-vs-not rather than raw capability. The Arena ranking is real human eval but self-cited. 
Read it as a strong *open* family for its size, not an all-around SOTA claim: on the open leaderboard, several larger MoE models still sit above it.\n</Callout>\n\n## Where the sizes land\n\nThe family splits by how much of the model runs per token, and by how \"effective\" and \"total\" parameters differ:\n\n| Model | Type | Total | Runs / token | Note |\n|---|---|---|---|---|\n| E2B | dense | 5B | 2.3B effective | per-layer embeddings (Gemma 3n trick) |\n| E4B | dense | 8B | 4.5B effective | per-layer embeddings |\n| 12B | dense | 12B | 12B | the encoder-free unified model |\n| 31B | dense | 31B | 31B | leading dense open model on Arena |\n| 26B-A4B | MoE | 26B | ~3.8B activated | sparse routing |\n\n\"Effective\" is the Gemma 3n move: **E2B** and **E4B** keep 5B and 8B parameters on disk but stream **per-layer embeddings** so only 2.3B / 4.5B are resident in the compute path. It's a memory-placement trick for on-device, not a routing one — orthogonal to the MoE sparsity in 26B-A4B, where a router activates ~3.8B of 26B per token.\n\n## Long context is a memory problem\n\nThe expensive part of a long prompt isn't the matmul — it's the **KV cache**. Every past token leaves a key and a value in every attention layer, and under full attention that cache grows linearly with context on every layer:\n\n$$\n\\text{KV}_{\\text{full}} \\;\\propto\\; 2 \\cdot n_{\\text{layers}} \\cdot L \\cdot d_{\\text{kv}}\n$$\n\nThe factor of 2 is K *and* V; $L$ is the context length and $d_{\\text{kv}}$ is the per-layer key/value width. At 128k tokens this is what fills your accelerator's memory. Gemma 4 attacks it structurally. Most layers are **local** — sliding-window attention that only looks back a fixed window $W$ — and in most models only **1 in 6** is **global**, attending over the whole context. The local:global ratio is **5:1** for most models, **4:1** (1 in 5 global) for E2B:\n\n<AttnStrip />\n\nThat interleave changes the scaling. Local layers stop growing once the context passes $W$; only the global layers keep paying for length.
Then two more moves shrink the global term itself. On global layers Gemma 4 **reuses keys as values** ($V = K$), storing one tensor where full attention stores two — except on E2B/E4B. Position uses **pp-RoPE** ($p = 0.25$) on global layers and ordinary RoPE (frequency 10k) on local ones, with global frequency 1M. Combined with KV-cache sharing, the report puts the **global KV-cache cut at 37.5%**. So the split KV cost looks like:\n\n$$\n\\text{KV}_{\\text{g4}} \\;\\propto\\; \\underbrace{2\\,n_{\\text{loc}}\\,\\min(L,W)\\,d_{\\text{kv}}}_{\\text{local: bounded by }W} \\;+\\; \\underbrace{n_{\\text{glb}}\\,L\\,d_{\\text{kv}}}_{\\text{global: }V{=}K\\text{, no factor of 2}}\n$$\n\nSlide the context up and watch which term dominates — the sliding window is the structural win, `values=keys` shaves what's left:\n\n<KvBudget />\n\nThe payoff shows up in the real memory table. At 32k context the int8 KV cache adds only **+0.05 GB** (E2B), **+0.28 GB** (12B / 26B-A4B), **+1.10 GB** (31B) on top of the quantized weights — small enough that the weights, not the cache, stay the budget:\n\n| Model | Weights (bf16) | Quantized | + int8 KV @ 32k |\n|---|---|---|---|\n| E2B | 4.6 GB | 0.8 GB | +0.05 GB |\n| E4B | 9.0 GB | 2.3 GB | +0.14 GB |\n| 12B | 24.0 GB | 7.65 GB | +0.28 GB |\n| 26B-A4B | 52.0 / 7.6 GB | 16.2 / 2.8 GB | +0.28 GB |\n| 31B | 64.0 GB | 19.2 GB | +1.10 GB |\n\nThe quantized column is **quantization-aware training** (QAT), not a post-hoc round: the model trains with fake-quant in the loop so int4/int8 weights land with minimal quality loss. That's how 31B fits in ~19 GB.\n\n## The encoder-free 12B\n\nThe most unusual model is the **12B**, trained from scratch with **no vision or audio encoder at all**. Normally a multimodal LLM bolts a frozen ViT and a speech encoder onto the token stream. 
Gemma 4 12B replaces them with projections:\n\n- **Vision:** it takes raw **48×48×3 RGB patches** and projects them with a **single 35M-parameter matmul** — standing in for the **550M** ViT the larger models use. 2D coordinate embeddings and a LayerNorm carry spatial position.\n- **Audio:** the **305M USM conformer encoder is discarded entirely**. Raw audio is cut into **40ms chunks at 16kHz** (640 samples each, used as 640-dim vectors) and projected straight into the embedding space. Audio is already temporal, so no extra positional encoding.\n\nThe bet: give a large enough LLM the raw patches and it learns the encoder's job internally, for less memory and no separate frozen tower to serve. The report's Table 8 argues the 12B stays competitive on audio-text tasks without the dedicated encoder — a genuinely different design point from the encoder-plus-LLM norm, and the honest caveat is that the report only does this for one size.\n\nFor the models that *keep* a vision encoder, the input pipeline is worth a look too. Gemma 4 supports **variable aspect ratios** rather than square-cropping, resizing an image to fit a token budget $N_{\\max} \\in \\{70, 140, 280, 560, 1120\\}$ while mostly preserving shape:\n\n<Figure\n  src=\"/articles/gemma-4/fig2.png\"\n  alt=\"A 572x1024 portrait image of an astronaut otter is resized by a mostly aspect-ratio-preserving algorithm (parameters k=3, sl=10, ps=16) into a 96x192 grid: 72 patches that become 8 tokens.\"\n  caption=\"Aspect-ratio-preserving resize: a 1:1.79 image maps to a token grid instead of a square crop, so tall or wide inputs keep their proportions (paper, Figure 2).\"\n/>\n\n## Thinking mode\n\nGemma 4 adds a **thinking mode**: before answering, the model can emit a reasoning trace, which lifts math and coding.
It's a post-training addition on top of a Gemma 3-style recipe, toggled by a control token in a leading system turn:\n\n```text\n<|think|>              # activates the reasoning trace for this turn\n...user turn...\n# IT models close a turn with <turn|>; base (PT) models emit <eos>\n```\n\nBecause the trace is optional, you pay for it only on the prompts that need it. The flip side, for anyone reading the tables: the Gemma 3 27B baseline is non-thinking, so a row like AIME (89.2 vs 20.8) is partly measuring the mode, not only the model.\n\n## MTP drafter: speculative decoding without prefill\n\nDecoding is memory-bandwidth-bound — one token per forward pass, weights re-read each step. The usual fix is **speculative decoding**: a small draft model proposes several tokens, the big model verifies them in one pass, and accepted tokens are free. Gemma 4 ships a **multi-token-prediction (MTP) drafter head** for exactly this.\n\n<Figure\n  src=\"/articles/gemma-4/fig1.png\"\n  alt=\"Diagram of the autoregressive MTP drafter. The main model (gray blocks) processes token t1 and produces last-layer activations and a KV cache. The drafter (blue blocks) — an input embedding, a concat plus down-projection, four stacked MTP layers, and an up-projection into unembed plus softmax — consumes those activations and cross-attends to the main model's KV to autoregressively emit t3, t4.\"\n  caption=\"The autoregressive MTP drafter (blue) reads the main model's (gray) last-layer activations and KV cache, then emits future tokens by cross-attending to the main KV — no separate prefill (paper, Figure 1).\"\n/>\n\nThe drafter is a **4-layer Transformer block** (model dim 256 for E2B/E4B, 1024 for 26B-A4B/31B; three local and one global attention layers). It reuses the main model's last-layer activations and **cross-attends to the main model's KV cache**, so it needs **no prefill of its own** and supports any draft length. 
On E2B/E4B there's a further trick: instead of projecting the draft over the full **262k** vocabulary, it does a top-k over token clusters, cutting the final matmul from $d \\times 262\\text{,}000$ to $d \\times 4096$ (roughly 64× smaller) at a similar acceptance rate. The report doesn't publish an end-to-end speedup, so I won't invent one — but the design (no prefill, cross-attention to the live cache) is the part worth copying.\n\nIf speculative decoding and MTP are new, I built the idea up from scratch in [Multi-Token Prediction](/articles/multi-token-prediction).\n\n## The numbers\n\nStart with human eval, since it's the least gameable. On **Arena Text** (blind side-by-side, Elo, as of June 2026), Gemma 4 31B is the **top dense open model** — but the leaderboard around it is mostly much larger MoE systems, and a closed model tops it:\n\n| Rank | Model | Elo | Open | Params / active |\n|---|---|---|---|---|\n| 1 | Claude Fable 5 | 1508 | no | – |\n| 15 | GLM 5.1 | 1475 | yes | 744B / 40B |\n| 38 | DeepSeek V4 Pro | 1456 | yes | 1.6T / 49B |\n| **43** | **Gemma 4 31B** | **1451** | **yes** | **31B dense** |\n| 61 | Gemma 4 26B-A4B | 1438 | yes | 26B / 4B |\n| 157 | Gemma 3 27B | 1366 | yes | 27B dense |\n\nAn Elo of 1451 at **31B dense**, only 5 to 24 points behind the 744B–1.6T open MoE models ranked above it, is the real story: it's punching well above its parameter count, not topping the board.
On static reasoning benchmarks the family scales cleanly, and the jump over Gemma 3 27B is large (thinking-mode caveat noted):\n\n<BenchBars\n  title=\"AIME 2026, no tools (%) — first-party, thinking mode\"\n  unit=\"\"\n  bars={[\n    { label: \"Gemma 4 31B\", value: 89.2, highlight: true },\n    { label: \"Gemma 4 26B-A4B\", value: 88.3 },\n    { label: \"Gemma 4 12B\", value: 77.5 },\n    { label: \"Gemma 4 E4B\", value: 42.5 },\n    { label: \"Gemma 4 E2B\", value: 37.5 },\n    { label: \"Gemma 3 27B\", value: 20.8 },\n  ]}\n/>\n\n<BenchBars\n  title=\"GPQA Diamond (%) — first-party, thinking mode\"\n  unit=\"\"\n  bars={[\n    { label: \"Gemma 4 31B\", value: 84.3, highlight: true },\n    { label: \"Gemma 4 26B-A4B\", value: 82.3 },\n    { label: \"Gemma 4 12B\", value: 78.8 },\n    { label: \"Gemma 4 E4B\", value: 58.6 },\n    { label: \"Gemma 4 E2B\", value: 43.4 },\n    { label: \"Gemma 3 27B\", value: 42.4 },\n  ]}\n/>\n\nThe rest of the text suite tells the same story — the 31B leads its own family, the 26B-A4B tracks close behind at a seventh of the active params, and both clear Gemma 3 27B by a wide margin:\n\n| Benchmark | 31B | 26B-A4B | 12B | E4B | E2B | Gemma 3 27B |\n|---|---|---|---|---|---|---|\n| MMLU Pro | 85.2 | 82.6 | 77.2 | 69.4 | 60.0 | 67.6 |\n| LiveCodeBench v6 | 80.0 | 77.1 | 72.0 | 52.0 | 44.0 | 29.1 |\n| Codeforces Elo | 2150 | 1718 | 1659 | 940 | 633 | 110 |\n| SciCode | 43.0 | 40.0 | 38.0 | 24.0 | 21.0 | 21.0 |\n| IFEval | 98.9 | 98.5 | 97.2 | 96.7 | 94.6 | 90.4 |\n| MMMLU | 88.4 | 86.3 | 83.4 | 76.6 | 67.4 | 70.7 |\n\nVision holds up (MMMU Pro 76.9 / MATH-Vision 85.6 / InfographicVQA 92.0 for 31B at full resolution), and long context does what the KV work promises — **RULER at 128k** stays high where Gemma 3 27B falls off:\n\n| Long-context @ 128k | 31B | 26B-A4B | 12B | E4B | Gemma 3 27B |\n|---|---|---|---|---|---|\n| RULER (accuracy) | 96.4 | 89.8 | 91.2 | 86.6 | 66.0 |\n| LOFT (recall@k) | 79.5 | 66.3 | 66.4 | 58.5 | 8.6 |\n\n## 
The take\n\nGemma 4's contribution isn't a benchmark crown — it's **efficiency engineering shipped in the open**. The pieces compose: a 5:1 local:global ratio and `values=keys` on global layers keep the KV cache flat enough that a 128k prompt costs single-digit gigabytes of cache; QAT puts 31B in ~19 GB; the encoder-free 12B deletes two frozen towers; the MTP drafter does speculative decoding without a second prefill. None of these is individually novel, but the combination is a serious on-device and single-accelerator story, released under **Apache 2.0** with a **31B dense model that is the top open dense entry on Arena**.\n\nThe honest caveats are the first-party ones. The benchmarks are Google's own, thinking mode is on for Gemma 4 and off for the Gemma 3 baseline it's measured against, and \"leading open model\" is true only in the **dense** category — larger open MoE models (GLM 5.1, DeepSeek V4) rank above it on the same board. The encoder-free design is proven at exactly one size (12B), and there's no published MTP speedup to hold them to. For a team that wants an open, multimodal, long-context model that fits the hardware it already owns, the 12B and 31B are the ones I'd reach for — and the KV-cache design is the part I'd study regardless of which model I ended up serving.\n\n---\n\n*Built on the [Gemma 4 Technical Report](https://arxiv.org/abs/2607.02770) (Gemma Team, Google DeepMind, 2026), Apache 2.0 model license. All benchmark numbers are first-party from the report (thinking mode unless noted; the Gemma 3 27B baseline is non-thinking). 
The interactive diagrams are schematic illustrations of the mechanism, not measured traces; the two paper figures are reproduced for commentary.*\n","readingTimeMins":11,"url":"https://ai.thesatyajit.com/articles/gemma-4","signal":{"interest":3,"helpful":3,"score":6,"level":2,"label":"Solid"}},{"title":"Hunyuan Hy3: Tencent's 295B-A21B MoE, and the community 1M GGUF","description":"Tencent's Hy3 (preview) is a 295-billion-parameter Mixture-of-Experts model that activates only 21B per token, ships open weights with a 256K native context, and pairs grouped-query attention with a multi-token-prediction draft head for speculative decode. This is a first-principles walk through the architecture, the hybrid fast/slow-thinking training, the honest gap between the trained 256K window and the community YaRN 1M GGUF, the GGUF quant ladder for local inference, and the full provider-reported benchmark suite.","date":"2026-07-08","tags":["explainer","llm","mixture-of-experts","long-context","quantization","inference-optimization"],"draft":false,"cover":"/articles/hunyuan-hy3/fig1.png","featured":true,"interest":4,"helpful":3,"kind":"articles","slug":"hunyuan-hy3","body":"**Hy3** is Tencent Hunyuan's newest open model — released as a *preview* checkpoint, `Hy3 preview (295B A21B)`. It is a Mixture-of-Experts language model with **295 billion total parameters** that activates **21 billion per token**, ships **open weights** on Hugging Face and ModelScope, and serves a **256K native context**. Tencent frames it as a *hybrid fast-and-slow-thinking* model: one set of weights, a `reasoning_effort` knob that goes from `no_think` to `high`. 
The pitch is efficiency — \"intelligence comparable to flagship models with two to five times its parameter scale.\"\n\nI read all three primary sources for this: the [research page](https://hy.tencent.com/research/hy3), the [`Tencent-Hunyuan/Hy3-preview` repo](https://github.com/Tencent-Hunyuan/Hy3-preview), and a community [`Hy3-1M-GGUF`](https://huggingface.co/satgeze/Hy3-1M-GGUF) build that YaRN-extends the context to 1,048,576 tokens for local inference. Two things are worth pinning down before any of the benchmark charts: the \"1M\" context is a **community artifact**, not a Tencent release, and the whole benchmark suite is **provider-reported**. Both matter, and I keep them separate below.\n\n<Callout type=\"warn\">\nEverything numeric here is **provider-reported** — Tencent's own harness for the model charts, the community author's own samples for the GGUF. Hy3 tops **no** frontier benchmark: on the hardest STEM and agentic tasks **GPT-5.4**, **Gemini-3.1-Pro**, and **Claude Opus 4.6** all lead. Read it as a strong model *for its 21B active size*, competitive with **GLM-5** and **Kimi-K2.5**. The \"comparable to 2–5× larger models\" line and the \"270-expert blind eval (2.67/4 vs GLM-5.1's 2.51/4)\" are provider claims I cannot independently verify. And the **1M context is a community YaRN extension** of a model trained to **256K** — explicitly experimental and not needle-certified.\n</Callout>\n\n## Where the 295B lives\n\nThe parameter accounting is the whole economic argument, so start there. Hy3 is **80 decoder layers** of MoE, plus **1 extra multi-token-prediction (MTP) layer** (3.8B params). Each MoE layer holds **192 experts**; the router keeps the **top-8** per token. So of 295B total, only ~**21B is active** on any given token — roughly a **7% activation rate**. 
That sparsity is what lets a 295B model serve at the cost of a ~21B dense one.\n\nAttention is **grouped-query attention (GQA)**: **64 query heads** but only **8 KV heads**, `head_dim` 128, hidden size 4096. The 8-way sharing is not a detail — it is an 8× cut in KV-cache memory versus full multi-head attention, and it is the reason a 256K window is affordable at all (more on that below). The MTP layer drafts the next token so the main model can **verify several tokens per step** — speculative decoding, built into the weights rather than bolted on.\n\nWalk one token through a block. Flip stages to see the KV grouping, the top-8 route, and the draft head:\n\n<Hy3Arch />\n\n<Diagram\n  ascii={`\nHy3 preview — 295B total / 21B active (top-8 of 192 experts)\n\n  token id ─▶ embed 4096 (vocab 120,832, RMSNorm)\n                │\n                ▼\n  ┌────────────────────────────────────────────┐\n  │  × 80 decoder layers                        │\n  │                                             │\n  │   GQA:  64 query heads ── share ──▶ 8 KV    │   head_dim 128\n  │         (KV cache is 8× smaller than MHA)   │\n  │                                             │\n  │   MoE:  router → top-8 of 192 experts       │   ~21B active\n  │         intermediate 13312                  │\n  └────────────────────────────────────────────┘\n                │\n                ▼\n  MTP layer (3.8B) ─▶ draft next token ─▶ main model verifies\n`}\n  caption=\"Hy3 preview block: GQA attention + top-8/192 MoE, ×80, then a shared MTP draft head for speculative decode.\"\n/>\n\nIf the MoE routing here is new, I built it from nothing in [Mixture of Experts, from scratch](/articles/mixture-of-experts-from-scratch) — the router, the top-k gate, and why activating a sparse subset is the entire reason a model this large is cheap to run. 
The MTP head is the same idea I unpacked in [Multi-Token Prediction](/articles/multi-token-prediction): predict more than one token so a cheap draft can be verified in a single forward pass.\n\nOne honest architecture note Tencent does not hide but does not emphasize: this is the **preview** release. The card labels it `295B A21B`, and the community GGUF card rounds the same weights to `~299B / ~17B active`. I use Tencent's official figures (295B / 21B) throughout; the ~2% discrepancy is a community rounding, not a second model.\n\n## Training: one model, two speeds\n\nTencent is thin on the training story, and I will not pad it. What the sources actually say: Hy3 was built with \"strengthened reinforcement learning and enhanced data quality and diversity,\" and refined \"through use by global developers and across Tencent's large-scale real-world business scenarios.\" No token count, no data-mix percentages, no stage breakdown. Treat the training narrative as a claim.\n\nThe concrete, testable part is the **hybrid thinking** interface. A single `reasoning_effort` parameter switches inference mode:\n\n- `no_think` — direct answer, no chain-of-thought (the default, and the cheapest).\n- `low` — moderate CoT.\n- `high` — deep CoT for hard reasoning.\n\nThat is exposed at serve time, so you pay for reasoning only when the task needs it. On an OpenAI-compatible endpoint you set it per request:\n\n```python\n# after deploying Hy3 behind vLLM / SGLang (OpenAI-compatible)\nresp = client.chat.completions.create(\n    model=\"hy3-preview\",\n    messages=[{\"role\": \"user\", \"content\": \"prove there are infinitely many primes\"}],\n    temperature=0.9,      # Tencent's recommended default\n    top_p=1.0,\n    extra_body={\"chat_template_kwargs\": {\"reasoning_effort\": \"high\"}},\n)\n```\n\nThe pretrained base model is where the numbers are cleanest, because base-model evals are the least harness-sensitive. 
On general knowledge Hy3-Base sits a hair under the larger open models — **MMLU 87.42** (Kimi-K2 88.24, GLM-4.5 87.73, DeepSeek-V3 87.68) — but it *leads* its comparison set on several reasoning and math rows:\n\n| Benchmark | Hy3-Base | Kimi-K2 | DeepSeek-V3 |\n|---|---|---|---|\n| MMLU | 87.42 | **88.24** | 87.68 |\n| MMLU-Pro | 65.76 | **65.98** | 63.98 |\n| MATH | **76.28** | 71.20 | 59.37 |\n| GSM8K | **95.37** | — | — |\n| LiveCodeBench-v6 | **34.86** | — | — |\n| CRUXEval-I | **71.19** | — | — |\n| SuperGPQA | **51.60** | — | — |\n| MMMLU (multilingual) | **80.15** | 77.63 | 79.54 |\n\nThe pattern holds across the suite: Hy3-Base trails the biggest models slightly on broad knowledge, then pulls ahead on math (MATH 76.28 vs DeepSeek-V3's 59.37 is not close) and code reasoning. For a model activating 21B, that is the interesting shape.\n\n## Long context: 256K trained, 1M borrowed\n\nHere is where the honesty matters most. Hy3's **trained** context is **256K**. The headline \"1M\" in this article's third source is a **community** build (`satgeze/Hy3-1M-GGUF`) that applies **YaRN** to stretch positional encoding out to **1,048,576 tokens**. YaRN rescales RoPE frequencies so the model can *index* positions it never saw in training — it extends reach, it does not re-train competence. The GGUF card says so plainly: the 1M window is \"unverified at full length,\" experimental, and **not yet needle-certified**.\n\nSo the useful mental model is two numbers: 256K you can lean on, and a 1M ceiling you should measure before trusting. Drag the window below and watch the KV-cache cost — and the point where you cross from *trained* into *extrapolated*:\n\n<Hy3Context />\n\nThe KV-cache arithmetic is worth doing by hand, because it explains both the GQA choice and why 1M is expensive regardless of quality. 
Per token, the cache stores K and V for every layer:\n\n$$\n\\text{bytes/token} = 2 \\times n_{kv} \\times d_{head} \\times L \\times b\n= 2 \\times 8 \\times 128 \\times 80 \\times b\n$$\n\nHere $n_{kv}$ is the number of KV heads, $d_{head}$ the head dimension, $L$ the number of decoder layers, and $b$ the bytes per cached element. The product $2 \\times 8 \\times 128 \\times 80$ is 163,840 elements per token, so with $b = 2$ bytes (fp16) that is 327,680 bytes, or **320 KiB/token**. Multiply by context:\n\n- **256K** tokens → ~**80 GiB** of KV cache (fp16), ~40 GiB at fp8.\n- **1M** tokens → ~**320 GiB** (fp16), ~160 GiB at fp8 — *on top of* the weights.\n\nNow the GQA payoff is obvious. Full MHA would use **64 KV heads**, not 8, so every one of those numbers would be **8× larger** — 640 GiB of fp16 cache at 256K. GQA is not a quality trick here; it is what makes the window fit in memory at all. If you want to push the cache down further, that is exactly the territory of [TurboQuant KV-cache quantization](/articles/turboquant-kv-cache) and why [how LLM inference works](/articles/how-llm-inference-works) spends so long on the cache.\n\nTencent's own long-context numbers land where you'd expect for a 256K-trained model: on **LongBench v2** Hy3 scores **65.4** (up from Hy2's 56.4), matching Kimi-K2.5 (65.6) and edging GLM-5 (62.5), while GPT-5.4 (67.4) and Gemini-3.1-Pro (67.1) lead. On **AA-LCR** it's **66.3**. Competitive at its size, not a long-context leader.\n\n<Figure\n  src=\"/articles/hunyuan-hy3/fig3.png\"\n  alt=\"Five grouped bar panels — AdvancedIF, AA-LCR, LongBench v2, CL-bench, CL-bench Life — comparing Hy3 preview and Hy2 (blue) against Gemini-3.1-Pro, GLM-5, Kimi-K2.5, and GPT-5.4. Hy3 improves clearly over Hy2 in every panel and matches GLM-5 and Kimi-K2.5, but GPT-5.4 and Gemini-3.1-Pro top most panels.\"\n  caption=\"Context-learning and long-context suite. Hy3 (dark blue) over Hy2 (light blue); frontier models still lead the hardest panels (Tencent Hunyuan, Fig 3).\"\n/>\n\n## Quantization and running it locally\n\nThe base weights are ~**590 GB** in BF16 — multi-GPU territory.
Three paths bring that down.\n\n**FP8, at serve time.** vLLM quantizes the loaded BF16 weights online with `--quantization fp8`, roughly **halving the footprint to ~295 GB** (sources conflict on whether a standalone `Hy3-FP8` checkpoint also ships — the runtime path is the one I'd rely on). Tencent's `AngelSlim` toolkit adds low-bit quantization and speculative-sampling support on top.\n\n**GGUF, for CPU/Mac.** This is what the community `Hy3-1M-GGUF` build is for: `llama.cpp`-style quantization that runs on a single machine with lots of RAM instead of a rack of GPUs. The quant ladder trades size for quality. Pick a RAM budget and see what fits:\n\n<Hy3Quant />\n\nThe sizes on that chart are the exact ones the card reports; the quality pip is an *illustrative* ordering from bits-per-weight, not a measured score — the card is explicit that even the good quants prove \"coherence and basic instruction-following, not reasoning, long-context retrieval, or factual accuracy.\" The practical reads:\n\n- **IQ1_M (62 GB)** fits a 128 GB Mac but is visibly weaker (it dropped list formatting in the author's samples).\n- **IQ2_M (~92 GB)** is the recommended baseline; **MTP-IQ2_M (~100 GB)** bakes in a `q8_0` draft head for speculative decode.\n- **Q4_K_M (183 GB)** is the highest-quality GGUF and needs a **192 GB+** box.\n\nRunning it is a `llama.cpp` server with the model's chat template and the extended context:\n\n```bash\n# 256K context; needs a hy_v3-capable llama.cpp build\nllama-server -m hy3-1M-IQ2_M.gguf -c 262144 -np 1 --jinja \\\n  --chat-template-file chat_template_llamacpp.jinja\n\n# MTP speculative decode (draft head)\nllama-server -m hy3-1M-MTP-IQ2_M.gguf -c 262144 --jinja \\\n  --spec-type draft-mtp --spec-draft-n-max 3 --spec-draft-p-min 0.75\n```\n\nReported throughput: **24–25 tok/s** generation on a **MacBook Pro M3 Max (128 GB)**, dropping to 17–19 tok/s in long conversations. 
The MTP draft head gives **+26–37%** on CUDA (an RTX 5090) but is roughly **neutral on Apple Silicon**. Speculative decode pays off when the verifier has compute to spare, so that checking several drafted tokens in one pass costs little more than decoding a single token; the GPU has that headroom and the Mac mostly doesn't. That is an honest, useful asymmetry: don't expect the draft head to speed up your laptop.\n\nFor the datacenter path, the official serving stacks are vLLM and SGLang, both with the MTP/EAGLE draft and Hunyuan's tool + reasoning parsers:\n\n```bash\nvllm serve tencent/Hy3-preview \\\n  --tensor-parallel-size 8 \\\n  --speculative-config.method mtp --speculative-config.num_speculative_tokens 1 \\\n  --tool-call-parser hy_v3 --reasoning-parser hy_v3 \\\n  --enable-auto-tool-choice --served-model-name hy3-preview\n```\n\n## The benchmarks, in full\n\nThe clearest way to read Hy3 is as a **trajectory**: what changed from Hy2 (the previous generation, Nov 2025) to Hy3 preview (Apr 2026). On the agentic suite the jump is large, and it lands Hy3 in the pack with GLM-5 and Kimi-K2.5 — below Claude Opus 4.6.\n\n<Figure\n  src=\"/articles/hunyuan-hy3/fig1.png\"\n  alt=\"Four line panels — SWE-bench Verified, Terminal-Bench 2.0, BrowseComp, WideSearch — plotting each model from 2025-11 to 2026-04. Hy3 preview (blue) rises steeply from Hy2 to land near GLM-5 and Kimi-K2.5, below Claude Opus 4.6, on every panel.\"\n  caption=\"Agent-benchmark trajectory from Hy2 to Hy3 preview against Kimi-K2/K2.5, GLM-4.7/GLM-5, and Claude Opus 4.5/4.6 (Tencent Hunyuan, Fig 1).\"\n/>\n\nThe four agent numbers, provider-reported:\n\n- **SWE-bench Verified**: Hy2 53.0 → **Hy3 74.4**. Field: GLM-4.7 73.8, Kimi-K2.5 76.8, GLM-5 77.8, Claude Opus 4.6 80.8.\n- **Terminal-Bench 2.0**: Hy2 23.2 → **Hy3 54.4**. Kimi-K2.5 50.8, GLM-5 56.2, Claude Opus 4.6 65.4.\n- **BrowseComp**: Hy2 28.7 → **Hy3 67.1**. GLM-4.7 67.5, Kimi-K2.5 74.9, GLM-5 75.9, Claude Opus 4.6 84.0.\n- **WideSearch**: Hy2 53.9 → **Hy3 70.2**.
GLM-5 69.8, Kimi-K2.5 72.7, Claude Opus 4.6 77.2.\n\nThe two coding-agent panels tell the \"competitive, not leading\" story cleanly. Hy3 tracks GLM-5 and Kimi-K2.5, and trails the Claude Opus 4.6 line:\n\n<BenchBars\n  title=\"SWE-bench Verified (%) — provider-reported\"\n  unit=\"\"\n  bars={[\n    { label: \"Hy3 preview\", value: 74.4, highlight: true },\n    { label: \"Kimi-K2.5\", value: 76.8 },\n    { label: \"GLM-5\", value: 77.8 },\n    { label: \"Claude Opus 4.6\", value: 80.8 },\n  ]}\n/>\n\n<BenchBars\n  title=\"Terminal-Bench 2.0 (%) — provider-reported\"\n  unit=\"\"\n  bars={[\n    { label: \"Hy3 preview\", value: 54.4, highlight: true },\n    { label: \"Kimi-K2.5\", value: 50.8 },\n    { label: \"GLM-5\", value: 56.2 },\n    { label: \"Claude Opus 4.6\", value: 65.4 },\n  ]}\n/>\n\nOn raw reasoning and STEM the shape splits. Hy3 **matches or leads** its size class on GPQA-Diamond (87.2, vs GLM-5 86.0, Kimi-K2.5 87.6) and the Chinese-curriculum exams (China High School Biology Olympiad 87.8 — the top bar), but the frontier models pull away on the very hardest sets: on **HLE** (Humanity's Last Exam) Hy3 is **30.0** against GPT-5.4's 39.8 and Gemini-3.1-Pro's 44.4, and on the IMO Answer Bench it's 84.3 vs 89–92 for the closed pair. Note the asterisks in the figure: domestic models are scored on a text-only subset, so cross-vendor HLE/CHSBO numbers are not strictly like-for-like.\n\n<Figure\n  src=\"/articles/hunyuan-hy3/fig2.png\"\n  alt=\"Six STEM bar panels — FrontierScience Olympiad, IMO Answer Bench, HLE, GPQA-Diamond, Tsinghua Qiuzhen math PhD qualifier, CHSBO 2025 — with Hy3 preview and Hy2 in blue against Gemini-3.1-Pro, GLM-5, Kimi-K2.5, and GPT-5.4. Hy3 tops GPQA and CHSBO in its class; GPT-5.4 and Gemini-3.1-Pro lead HLE and IMO.\"\n  caption=\"STEM and reasoning suite. Hy3 leads its size class on GPQA and the Chinese-curriculum exams; the frontier pair leads the hardest sets. 
Note: domestic models on a text-only subset (Tencent Hunyuan, Fig 2).\"\n/>\n\nThe agentic **Claw** benchmarks (tool-use) are the honest ceiling: Hy3 improves substantially over Hy2 (ClawEval 32.4 → **55.0**, WildClawBench 33.7 → **45.3**) and edges Kimi-K2.5, but Claude Opus 4.6 is clearly ahead (ClawEval 66.3, WildClawBench 60.4).\n\n<Figure\n  src=\"/articles/hunyuan-hy3/fig4.png\"\n  alt=\"Two bar panels — WildClawBench and ClawEval — with Hy3 preview and Hy2 in blue against GLM-5, Kimi-K2.5, and Claude Opus 4.6. Hy3 rises well above Hy2 and edges Kimi-K2.5, but Claude Opus 4.6 is the tallest bar in both.\"\n  caption=\"Claw agentic tool-use benchmarks. Hy3 rises well above Hy2 and matches GLM-5 / Kimi-K2.5; Claude Opus 4.6 leads both (Tencent Hunyuan, Fig 4).\"\n/>\n\n## The take\n\nHy3's real claim isn't a leaderboard crown — it's **efficiency at 21B active**. It roughly matches GLM-5 and Kimi-K2.5 across coding, search, and STEM while activating a fraction of their parameters, and it packages the pieces that make a MoE cheap to serve: **GQA** (8 KV heads → 8× smaller cache), an **MTP** draft head for speculative decode, a **256K** trained window, open weights, and a `reasoning_effort` knob so you pay for chain-of-thought only when you need it. That is a coherent systems story, and the base-model math scores (MATH 76.28) back the reasoning pitch.\n\nThe caveats are the usual open-weights ones, stated plainly. Every number comes from Tencent's own harness; independent reproduction reports treat them as upper bounds. This is a **preview** checkpoint. Hy3 trails **Claude Opus 4.6**, **GPT-5.4**, and **Gemini-3.1-Pro** on the hardest agentic and STEM tasks — sometimes by a wide margin (HLE 30.0 vs 44.4). And the eye-catching **1M context is a community YaRN extension**, not a trained window: 256K is what I'd trust, 1M is what I'd measure. The community GGUF is a genuinely useful gift for anyone with a 128 GB Mac and patience — but the card is right to call it experimental. 
For a team that wants an open, agent-capable model at a real inference discount, and can either run 8×GPU tensor-parallel or a fat single box, Hy3 preview earns a look. As \"flagship intelligence at 2–5× smaller\" — that part is Tencent's claim, and worth checking yourself.\n\n---\n\n*Built from the [Hy3 research page](https://hy.tencent.com/research/hy3), the [`Tencent-Hunyuan/Hy3-preview` repo](https://github.com/Tencent-Hunyuan/Hy3-preview) (295B-A21B, 256K context), and the community [`satgeze/Hy3-1M-GGUF`](https://huggingface.co/satgeze/Hy3-1M-GGUF) build (YaRN 1M, experimental). All benchmark numbers are provider-reported; the four figures are reproduced from Tencent's model card for commentary. The interactive diagrams are illustrations of the mechanism, not measured traces — the architecture walk-through, the GGUF size ladder, and the KV-cache calculator all use the published configs and reported sizes, but the quality ordering in the quant explorer is illustrative, not benchmarked. 
The community 1M window is a third-party artifact, not a Tencent release.*\n","readingTimeMins":14,"url":"https://ai.thesatyajit.com/articles/hunyuan-hy3","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"The Jacobian lens: reading the residual stream with a derivative","description":"Anthropic's Jacobian lens transports an intermediate residual-stream activation into the model's final-layer basis using the corpus-averaged input–output Jacobian, then unembeds it into a ranked token list — surfacing what a layer is disposed to say, with the honest caveat that a single averaged linear map only approximates a nonlinear stack.","date":"2026-07-08","tags":["explainer","interpretability","transformers","llm"],"draft":false,"cover":"/articles/jacobian-lens/fig1.png","featured":false,"interest":5,"helpful":3,"kind":"articles","slug":"jacobian-lens","body":"The **Jacobian lens** answers one question: *given a model's internal state at some layer, what is that state disposed to make the model say?* It is a small piece of code — Anthropic's [`jacobian-lens`](https://github.com/anthropics/jacobian-lens) repo, Apache-2.0, \"reference implementation, not maintained\" — that fits on any open-weights decoder transformer and reads intermediate activations out as ranked vocabulary tokens. The whole method is one line:\n\n$$\n\\operatorname{lens}_\\ell(h) = \\operatorname{unembed}\\!\\big(J_\\ell \\, h\\big), \\qquad J_\\ell = \\mathbb{E}\\!\\left[\\frac{\\partial h_{\\text{final}}}{\\partial h_\\ell}\\right]\n$$\n\nTwo moves. $J_\\ell$ **transports** a residual-stream vector $h$ at layer $\\ell$ into the final-layer basis. Then the model's own **unembedding** decodes it into logits over the vocabulary. The transport matrix is a *Jacobian* — the derivative of the final-layer residual with respect to layer $\\ell$ — averaged over a corpus. 
That is the entire idea, and it is worth being precise about, because the derivative is doing something the [logit lens](#a-lens-is-a-choice-of-transport) never could.\n\n<Callout type=\"note\">\nThis is the companion tool for the paper [*Verbalizable Representations Form a Global Workspace in Language Models*](https://transformer-circuits.pub/2026/workspace/index.html) (Anthropic, 2026). Everything below is drawn from the repo README, the `jlens.fitting` source, and that paper. The causal numbers I quote are the paper's own, measured on Anthropic's models (Haiku/Sonnet/Opus 4.5). I have not reproduced them; I mark them as paper-reported.\n</Callout>\n\n## Why a derivative\n\nA residual stream is not written in the output vocabulary. The activation $h_\\ell$ at some middle layer lives in a basis the model has been rotating and rewriting layer by layer; the unembedding $W_U$ only makes sense against the *final* layer. So you cannot just apply $W_U$ to a middle activation and expect a sensible token — the coordinates don't line up. That is the flaw in the logit lens, and the Jacobian lens is the fix: it first maps $h_\\ell$ **through** the layers above it, then unembeds.\n\nMapping through the layers exactly would mean running the rest of the network — nonlinear, and no longer a \"lens.\" The Jacobian is the linear stand-in. It is the best linear approximation to \"run layers $\\ell{+}1 \\dots L$\" around an operating point, and the repo takes that operating point to be the *average* over a text corpus. One matrix per layer, fit once, applied everywhere.\n\n<Figure\n  src=\"/articles/jacobian-lens/fig1.png\"\n  alt=\"Three-panel method figure. Panel A: computing the lens by backpropagating from the final-layer residual stream at present and future token positions back to an activation h at layer l, forming the d_model by d_model Jacobian matrix, then aggregating over token positions and dataset examples into J_l. 
Panel B: reading with the lens replaces all layers above l with the single matrix J_l followed by the unembedding, producing a ranked token readout such as Mars, color, planet, fourth. Panel C: intervening in J-space by swapping projections onto two lens vectors, with the patched-activation formula.\"\n  caption=\"The Jacobian lens: (A) compute J_ℓ by backprop from the final layer to h_ℓ and average over positions and prompts; (B) read by replacing everything above ℓ with J_ℓ then unembed; (C) intervene by swapping J-space coordinates (Anthropic, transformer-circuits.pub, Figure 4).\"\n/>\n\n## How the matrix is estimated\n\nThe estimator has one subtlety worth getting right. For a source position $p$ at layer $\\ell$, the influence on the final layer is not a single vector — it is spread over the current position and every *future* position (a decoder is causal, so $h_\\ell[p]$ can affect the output at $p, p{+}1, \\dots$). The repo sums that influence over all target positions, then averages over source positions and over prompts. 
In the paper's own pseudocode:\n\n```python\n# Compute J_ℓ for all layers ℓ.\n# h_ℓ[t] : residual stream at layer ℓ, position t\n# z[t]   : residual stream at the target layer L (final by default)\nfor each prompt p in corpus:\n    run forward pass; cache h_ℓ[t] for all ℓ, t\n    for i in 1..d_model:                    # one backward pass per output dim (batched)\n        grad_z = e_i ⊗ 1_T                  # inject ∂/∂z_i = 1 at every position\n        for each layer ℓ:\n            G_ℓ = ∂(Σ_t z[t]) / ∂h_ℓ        # autodiff → shape [T, d_model]\n            J_ℓ^(p)[i, :] = mean_t G_ℓ[t, :]  # mean over source positions\nfor each layer ℓ:\n    J_ℓ = mean over prompts of J_ℓ^(p)      # aggregate the average Jacobian\n# apply:\nlens(h_ℓ) = softmax( W_U · norm( J_ℓ · h_ℓ ) )\n```\n\nBecause $\\sum_t z[t]$ is differentiated, a one-hot cotangent lands at every target position at once; causality makes $\\partial z[p']/\\partial h_\\ell[p]$ vanish for $p' < p$, so what survives is exactly the sum over current-and-future targets. The paper's lenses use **1000 sequences of 128 tokens**; the README notes quality \"saturates quickly\" and ~100 prompts is already usable. Cost is dominated by the model's own backward pass — one per output dimension, batched — which is why fitting is embarrassingly parallel across corpus slices (`JacobianLens.merge()`).\n\nThe rows of $W_U J_\\ell$ are the interesting object: the paper calls them the **J-lens vectors**, one per vocabulary token, each a direction in residual-stream space \"associated with a single token.\"\n\n## A Jacobian is a tangent\n\nHere is the mechanism I want to build intuition for, because it is also the method's main limitation. A Jacobian is a *local linear map*. Collapse the stack to one scalar activation coordinate and one token's logit, and the true function is a curve; the lens is its tangent line. 
Slide the probe below.\n\n<LensTangent />\n\nIn `local` mode the tangent is re-taken at the probe, so it always touches — that is the textbook Jacobian, exact at the point and good nearby. But the real lens is `corpus-average` mode: **one** tangent, taken once at the corpus operating point, then used for every activation you feed it. Near that point the linear readout is faithful; far from it the error grows. This is the honest cost of turning \"run the rest of the network\" into a single matrix. The lens tells you what an activation is disposed to say *to first order, on average*; it is not a faithful simulation of the model from layer $\\ell$ onward.\n\n## Then it is just a dot product\n\nThe second half, `unembed(·)`, is the easy half. Unembedding is a matrix of one row per token; a logit is that row dotted with the transported vector. Softmax ranks them, and the lens *readout* is the top of the list — the token whose direction $J_\\ell h$ points most toward. Rotate the transported vector and watch the readout hand off between neighbouring concepts:\n\n<UnembedReadout />\n\nThe superscript rank you see on a real slice page — `nose³`, `smile¹⁰⁴` — is exactly a token's position in this sorted vocabulary. Rank, not just top-1, is what makes the lens useful: a concept can be climbing toward the top for several layers before it ever wins.\n\n## What it surfaces: the ASCII-face\n\nThe repo ships one example that makes the point in a single screenshot. Give the model an ASCII-art face and ask what it depicts. The `^` character is the nose. Select that position and the lens, at *middle* layers, reads out **nose** — a word that never appears in the prompt.\n\n<Figure\n  src=\"/articles/jacobian-lens/fig2.png\"\n  alt=\"A layer-by-position slice page for an ASCII-art face. A grid shows the lens top-1 token at each position and layer, with superscript ranks. 
At the caret (nose) position, column 28, the mid layers around layer 42 read out 'nose' at rank 1, roughly 10 percent. A rank heatmap below shows a bright hotspot for 'nose' concentrated at that position and mid depth, and rank-versus-layer and rank-versus-position line charts track 'nose' and 'smile' peaking mid-stack.\"\n  caption=\"The ASCII-face slice: at the caret (nose) position, the lens reads out 'nose' at mid layers (rank 1, ≈10%) though the word is absent from the prompt — the model parsed the drawing spatially (Anthropic, jacobian-lens repo, assets/slice_vis.png).\"\n/>\n\nReading the page: each cell is the lens top-1 word at that `(position, layer)`; the bottom row is the model's actual output; the heatmap and line charts track a pinned concept's rank across the grid. The signal for \"nose\" is not at the output layer — it is a hotspot in the *middle*, then it fades as the model resolves what to actually say. That is the whole pitch: the lens shows intermediate content the output distribution has already moved past.\n\n## A lens is a choice of transport\n\nEvery \"lens\" is the same unembedding applied to a transported activation; they differ only in the transport $J_\\ell$.\n\n| lens | transport $J_\\ell$ | how it's obtained | early-layer behaviour |\n|---|---|---|---|\n| **logit lens** | identity $I$ | none | assumes one basis for all layers; recovers little early content |\n| **tuned lens** | learned linear map | trained to match the output distribution (correlational) | tends to \"skip ahead\" to the output |\n| **Jacobian lens** | $\\mathbb{E}[\\partial h_{\\text{final}}/\\partial h_\\ell]$ | fit by autodiff over a corpus | corrects for cross-layer basis change by construction |\n\nThe logit lens is the $J_\\ell = I$ special case — it works only where the residual basis already matches the final layer, i.e. the last few layers, and the paper notes the J-lens \"agrees closely\" there and diverges earlier. 
The tuned lens also fits per-layer linear maps, but on a *correlational* objective (match the output), which the paper finds \"skips ahead\" and buries exactly the unverbalized intermediates you wanted to see. The Jacobian's objective is the derivative itself, which is why it recovers interpretable content at depths where the logit lens does not.\n\n## Does the readout mean anything causally\n\nA readout is a correlation until you intervene on it. The paper's stronger claim is that the J-lens directions are *causally* privileged, and it tests this by swapping coordinates in the space those vectors span (\"J-space\", panel C above). The headline: split a concept vector into its J-space component and the orthogonal remainder, then swap along each.\n\n<BenchBars\n  title=\"Top-5 concept-swap success (%) — paper-reported, Anthropic models\"\n  unit=\"%\"\n  bars={[\n    { label: \"J-lens vectors\", value: 88, highlight: true },\n    { label: \"concept's J-space part\", value: 59 },\n    { label: \"orthogonal remainder\", value: 5 },\n  ]}\n/>\n\nThe striking part is the budget: that J-space component carries a **median of only 6–7%** of the concept vector's variance, with ~93% in the orthogonal remainder — yet the small component drives the swap (59%) and the remainder barely moves it (5%). A few percent of the variance is doing almost all of the *reportable* work. 
Intervening at intermediate layers also propagates: swapping a concept mid-stack flips the model's top-1 output on a majority of trials, and more often on the two larger models than on the smallest.\n\n<BenchBars\n  title=\"Intermediate-swap success (%) — top-1 output flips, paper-reported\"\n  unit=\"%\"\n  bars={[\n    { label: \"Haiku 4.5\", value: 54 },\n    { label: \"Sonnet 4.5\", value: 70 },\n    { label: \"Opus 4.5\", value: 70 },\n  ]}\n/>\n\nThe companion result is ablation: project the top J-space contents out of the residual stream and multi-hop reasoning collapses toward zero, while shallow tasks — classification, comparison, factual recall — are essentially unaffected. Read together, the lens is not just labelling activations; the directions it finds carry content the model actually uses for the harder, chained computations.\n\n## What it can and cannot tell you\n\nHonest boundaries, several of them in the authors' own words:\n\n- **It is a linear approximation, and an average one.** One Jacobian per layer, averaged over the corpus, stands in for a nonlinear stack. The `LensTangent` widget is the whole caveat — faithful near the operating point, progressively wrong away from it.\n- **Single tokens only.** Each J-lens vector is tied to one vocabulary token. Concepts that span multiple tokens are not directly captured (the appendices discuss extensions). The lens sees \"Mars,\" not \"the fourth planet.\"\n- **Approximate and incomplete.** The paper states plainly that the J-lens \"only approximately and incompletely captures the model's underlying workspace structure,\" and that a \"true workspace\" may operate in layers the lens misses.\n- **A readout is not a mechanism.** The lens says what an activation is *disposed* to output; it does not tell you the circuit that put it there. 
The causal swaps above are what upgrade a readout from suggestive to load-bearing — do the intervention before you trust the picture.\n- **It is a reference implementation.** Not optimized, not maintained; fitting is dominated by the model's backward pass. Fine for research, not a production probe.\n\n## Running it\n\nThe API is two calls — fit (or download) a lens, then apply it at chosen positions:\n\n```python\nimport transformers, jlens\n\nhf  = transformers.AutoModelForCausalLM.from_pretrained(\"org/model\").cuda()\ntok = transformers.AutoTokenizer.from_pretrained(\"org/model\")\nmodel = jlens.from_hf(hf, tok)\n\nlens = jlens.JacobianLens.from_pretrained(\"org/lens-repo\", filename=\"model/lens.pt\")\nlens_logits, model_logits, _ = lens.apply(\n    model, \"Fact: The currency used in the country shaped like a boot is\",\n    positions=[-2])\nfor layer, logits in sorted(lens_logits.items()):\n    print(layer, [tok.decode([t]) for t in logits[0].topk(5).indices])\n```\n\nFitting your own is `jlens.fit(model, prompts=...)`; the `walkthrough.ipynb` notebook goes end to end and renders a slice page like the ASCII-face one.\n\n## The take\n\nThe Jacobian lens is a clean idea executed narrowly. Swap the logit lens's implicit identity transport for the real averaged derivative, and you get a readout that works in the middle of the network, where the interesting, not-yet-verbalized computation lives. The mechanism is a first-order Taylor term — a tangent — and the honest framing is exactly that: a good local, on-average picture of what a layer is disposed to say, not a faithful replay of the layers above it. What earns it more than \"nice visualization\" is the causal follow-through: a component holding ~6–7% of a concept's variance drives most of the reportable behavior, and ablating the J-space directions specifically breaks multi-hop reasoning while leaving shallow tasks intact. 
That is a real, testable claim about which internal directions the model uses to talk — reached with a derivative and an unembedding, and not much else.\n\n---\n\n*Built on the [`jacobian-lens`](https://github.com/anthropics/jacobian-lens) reference implementation (Anthropic, Apache-2.0) and the paper [Verbalizable Representations Form a Global Workspace in Language Models](https://transformer-circuits.pub/2026/workspace/index.html). Figures reproduced from the paper (Figure 4) and the repo (`assets/slice_vis.png`) for commentary. The `LensTangent` and `UnembedReadout` widgets are my own illustrations of the mechanism, not measured traces; all quantitative results are paper-reported on Anthropic's own models and I have not independently reproduced them.*\n","readingTimeMins":12,"url":"https://ai.thesatyajit.com/articles/jacobian-lens","signal":{"interest":5,"helpful":3,"score":8,"level":4,"label":"High"}},{"title":"J-space in the open: a CKA map of workspace geometry across 38 models","description":"Elie Bakouch's interactive J-lens CKA explorer computes the geometry of token-steering directions at every layer of 38 open models, then asks whether two layers — inside one model or across unrelated families — arrange those directions the same way. The answer is a sensory / workspace / motor block structure that sits at nearly the same relative depth in Gemma, Qwen, Llama, and OLMo alike. This is a walk through what the map plots, the exact CKA it computes, and an honest read of how universal the pattern really is.","date":"2026-07-08","tags":["explainer","interpretability","transformers","llm"],"draft":false,"cover":"/articles/jspace-open/fig1.png","featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"jspace-open","body":"[**jspace-open**](https://eliebak.com/viz/jspace-open) is an interactive built by **Elie Bakouch**. 
It takes one idea from Anthropic's [*Verbalizable Representations Form a Global Workspace in Language Models*](https://transformer-circuits.pub/2026/workspace/) — the **Jacobian lens**, which reads out the concepts a layer is disposed to say — and runs it across **38 open-weights models**, **1,411 layers** in total, from GPT-2 and Pythia up to Qwen3-32B and Llama-3.3-70B. Then it asks a single geometric question, cell by cell: *do two layers arrange the vocabulary the same way?* The map that falls out is oddly regular. Every model grows the same three-block layout — a sensory front, a workspace middle, a motor tail — and unrelated families put those blocks at nearly the same **fraction of depth**.\n\n<Callout type=\"note\">\nTwo honest scoping notes up front. First, this visualizes **open models**; the causal story — that these directions are *verbalizable*, that steering them changes what the model reports, that ablating them breaks multi-hop reasoning — was established on **Claude** in the paper, not re-proven here. The open map shows the *geometry* echoes across families; it does not re-run the interventions. Second, the lens weights are pre-fitted checkpoints from [`neuronpedia/jacobian-lens`](https://huggingface.co/neuronpedia/jacobian-lens) (Anthropic companion code, fit on ~1,000 WikiText prompts) — and `qwen3-32b`'s public fit is an **80-prompt checkpoint**, so read that block with more caution than the rest.\n</Callout>\n\n## The J-lens vector: one steering direction per token, per layer\n\nStart with what sits in each cell. The Jacobian lens asks: at layer $\\ell$, which vocabulary tokens is this activation pushing the model toward *eventually* saying? 
It answers with a linearized map from the layer's hidden state to the final residual stream, averaged over contexts:\n\n$$\nJ_\\ell \\;=\\; \\mathbb{E}_{t}\\!\\left[\\frac{\\partial h_{\\text{final},\\,t'}}{\\partial h_{\\ell,\\,t}}\\right]\n$$\n\nPush a hidden state $h_\\ell$ through it and read it out on the vocabulary:\n\n$$\n\\operatorname{lens}(h_\\ell) \\;=\\; \\operatorname{softmax}\\!\\big(W_U\\,\\operatorname{norm}(J_\\ell\\, h_\\ell)\\big)\n$$\n\nwhere $W_U$ is the unembedding and $\\operatorname{norm}$ is the final normalization. Turn that around and every token gets a **direction**. For token $t$, its J-lens vector is the row of $(W_U\\,\\gamma)\\,J_\\ell$ — the direction in activation space that, added to $h_\\ell$, most raises the model's disposition to verbalize $t$ downstream ($\\gamma$ is the final-norm gain). It is a steering direction, indexed by *(token, layer)*. The explorer probes each layer with the same **4,096 token strings** shared by all 38 tokenizers, so the stack of directions at layer $\\ell$ is a matrix\n\n$$\nV_\\ell \\;=\\; (W_U[\\text{ids}]\\,\\gamma)\\,J_\\ell \\;\\in\\; \\mathbb{R}^{4096\\times d}.\n$$\n\nOne row per probe token, $d$ the model width. Different models have different $d$ and different layer counts — which is exactly the problem CKA is built to sidestep.\n\n## What each cell measures: linear CKA on the token geometry\n\nYou cannot compare $V_i$ and $V_j$ coordinate-by-coordinate — different layers (and certainly different models) live in different bases and scales. So the explorer never compares coordinates. It compares the **pairwise geometry** of the 4,096 tokens. 
Center each layer's directions, then form its token-by-token Gram matrix:\n\n$$\n\\bar V_\\ell = V_\\ell - \\operatorname{mean\\ row}, \\qquad K_\\ell = \\bar V_\\ell\\,\\bar V_\\ell^{\\top} \\in \\mathbb{R}^{4096\\times 4096}.\n$$\n\n$K_\\ell$ tabulates *which tokens' steering directions align with which* at layer $\\ell$ — a pure relational fingerprint, independent of the basis. The cell value is the cosine between two such fingerprints — **linear Centered Kernel Alignment**:\n\n$$\n\\operatorname{CKA}(i,j) \\;=\\; \\frac{\\langle K_i, K_j\\rangle_F}{\\lVert K_i\\rVert_F\\,\\lVert K_j\\rVert_F}, \\qquad 1 = \\text{identical geometry},\\; 0 = \\text{unrelated}.\n$$\n\nBecause it only ever touches Gram matrices, CKA is invariant to rotation and isotropic scaling of each activation space. That invariance is the whole reason a 12-layer GPT-2 and a 64-layer Qwen — different widths, different depths, different training data — can share one axis at all.\n\n## Reading the big matrix\n\nThe headline view stacks all 38 models into one grid. Each block is a model against itself or against another; each cell is the CKA above.\n\n<Figure\n  src=\"/articles/jspace-open/fig1.png\"\n  alt=\"A large square heatmap in the viridis colormap. Bright yellow squares run down the diagonal — each is one model compared to itself, sub-divided by red outlines into family groups (gemma-2, gemma-3, qwen3, qwen3.5, etc.). Off-diagonal blocks compare two different models and show fainter teal 45-degree bands. Row and column labels list all 38 models from pythia-70m and gpt2-small up to qwen3-32b and llama3.3-70b-it.\"\n  caption=\"The full 38-model J-lens CKA matrix — 1,411 layers × 1,411 layers. Bright diagonal blocks are within-model geometry; red outlines frame each sub-family; faint 45-degree bands in the cross blocks are two models organizing the vocabulary the same way at the same relative depth (Elie Bakouch, Fig 1).\"\n/>\n\nThree things stand out. 
On each model's own diagonal, bright squares mark **stretches of layers that hold one geometry** — the paper's sensory / workspace / motor regions. In the cross blocks, a bright **45° band** means two models line up at matched depth. And the red outlines separate within-family from across-family, so you can see that the band survives even when you leave a family — Llama next to OLMo, Gemma next to Qwen.\n\nThe interactive below rebuilds a single model's diagonal block so you can see the structure directly. Scrub the depth marker; switch the model. The values are a deterministic reconstruction of the pattern, not the measured matrix — but the geometry it encodes is the point:\n\n<CkaBlocks />\n\nThe layer count jumps from 32 to 64 as you switch models, yet the two block boundaries barely move in *relative* terms. That is the first surprise: the layout is a function of fractional depth, not layer index.\n\n## The reindex trick: same layout at the same relative depth\n\nIf the structure lives at relative depth, then to compare two models you have to put them on a common depth axis. The explorer's **reindexed** mode does exactly that — it resamples every block onto a shared 0–100% grid (bilinear), so *matched relative depth becomes the 45° diagonal of every block*. Raw mode keeps true layer counts; reindexed mode makes the alignment legible.\n\nThe widget below is the intuition without the heatmap. Six models, wildly different depths, each split into the three stages at the relative boundaries the explorer reports. Flip between raw and reindexed:\n\n<DepthReindex />\n\nRaw, the boundaries scatter — a 12-layer model finishes its sensory phase in a handful of layers, a 64-layer model takes dozens. Reindexed, they snap onto the same two guides. 
That shared relative layout is precisely what shows up in the cross blocks as a diagonal band.\n\nThe explorer also summarizes each pair with a single number: the mean CKA along its **matched-depth diagonal**, $j(i) = \\operatorname{round}\\!\\big(i\\,\\tfrac{L_B-1}{L_A-1}\\big)$, so \"does layer 30% of A do the job of layer 30% of B?\" collapses to one scalar per pair. Laid out as a model-by-model matrix, it is the same story at lower resolution:\n\n<Figure\n  src=\"/articles/jspace-open/fig2.png\"\n  alt=\"A smaller square heatmap titled 'pair summary — matched-depth CKA'. Each cell is one model pair; the diagonal (a model versus itself) is bright yellow, off-diagonal cells are teal-green, and red outlines group families. A bright diagonal ridge runs corner to corner.\"\n  caption=\"Pair summary: mean CKA along each pair's matched-depth diagonal, one cell per model pair (the model diagonal is the within-model mean), scaled 0.2–1.0 (Elie Bakouch, Fig 2).\"\n/>\n\n## How universal is it, really?\n\n\"Weirdly universal\" is Elie's phrase, and the map earns it — but the honest version needs the numbers, because a bright block can hide a modest effect. 
For the full 38-model selection the explorer reports these cross-model means, averaged over all $\\binom{38}{2} = 703$ pairs:\n\n| stat | value | what it is |\n|---|---|---|\n| off-diagonal block CKA | **0.548** | $\\operatorname{BLK}=\\tfrac{1}{L_AL_B}\\sum_{i,j}C(a_i,b_j)$ — the depth-independent floor two models share |\n| matched-depth CKA | **0.588** | $\\operatorname{MD}=\\tfrac{1}{L_A}\\sum_i C(a_i,b_{j(i)})$ — the 45° diagonal only |\n| depth-alignment gain | **+0.040** | $\\operatorname{MD}-\\operatorname{BLK}$ — similarity that is *specifically* at the right depth |\n| block separation | **+0.209** | a model resembles itself more than it resembles others |\n| depth order $\\rho$ | **0.83** | rank correlation of each layer's best-match depth (1.0 = order perfectly preserved) |\n\nTwo of those numbers deserve a hard look. Most of the cross-model similarity is the **lexical backbone**: a floor of $0.548$ that every model shares simply because every model puts \"dog\" near \"dogs.\" The part that is *specifically* about matched depth — the diagonal band over that floor — is only **+0.040**.\n\n<BenchBars\n  title=\"cross-model CKA (0–1) — provider tool, 38-model mean\"\n  unit=\"\"\n  max={1}\n  bars={[\n    { label: \"block floor (lexical)\", value: 0.548 },\n    { label: \"matched-depth\", value: 0.588, highlight: true },\n  ]}\n/>\n\nSo the strong claim — \"layer 30% of Llama and layer 30% of OLMo compute the *same thing*\" — is not what the number supports. What the map actually shows is subtler and, I think, more interesting: the **ordering** is shared. Depth-order $\\rho = 0.83$ says that as you walk down one model, the layer in another model that best matches you almost always walks down in step. Block separation $+0.209$ says the three-stage structure is a real, self-similar object, not an artifact of the lexical floor. 
The vocabulary geometry reorganizes in the same sequence, at the same relative pace, across families that never saw each other's data. That is the finding — a shared *itinerary*, more than a shared computation.\n\n<Callout type=\"warn\">\nOne more caveat the map makes visible: the exact widths do not match the paper. On Claude the workspace runs roughly layers 38–92% of depth; the explorer's block-finder puts the open-model sensory end at ~**46.5%** and motor start at ~**64.1%** — a much narrower workspace. The *three-block shape and its order* replicate across open models; the specific fractions do not transfer from the Claude measurement. Universality of structure, not of numbers.\n</Callout>\n\n## Where the pattern bends\n\nThe interesting parts of a \"universal\" map are the exceptions, and Elie flags a few. **Base vs instruct** checkpoints (Gemma-4 is the clearest) diverge most in the **early, sensory** layers — instruction tuning rewrites low-level parsing more than it touches the workspace middle, which stays put. **Qwen3-32B** reads as architecturally odd, looking like it skips or compresses an early phase — though that block is also the 80-prompt fit, so I would not over-read it. And on tokenizers: the shared 4,096 probe strings could in principle bias the comparison, but Elie checked with random token sampling and the pattern held, which is the right control to run.\n\n## The take\n\n`jspace-open` is a good piece of interpretability tooling: it takes an Anthropic method that only Anthropic could run on Claude, points it at weights anyone can download, and lets you check the geometry yourself instead of taking it on faith. The honest read is a shared *structure* — three blocks, in order, at matched relative depth, with $\\rho = 0.83$ and clean block separation — rather than a shared *function*, since the matched-depth lift over the lexical floor is only +0.040 and the block widths drift from the paper's Claude numbers. 
That is still a real result: open models from unrelated families grow the same coarse workspace itinerary. What the map does *not* do is re-establish that these directions are causally verbalizable — that remains a claim about Claude, and the mechanism behind the lens is its own story, covered in [the Jacobian-lens explainer](/articles/jacobian-lens). Here, the contribution is the map: a way to *see* that the structure travels.\n\n---\n\n*Primary source: [jspace-open](https://eliebak.com/viz/jspace-open) (Elie Bakouch, 2026), the J-lens CKA explorer. Built on Anthropic's [*Verbalizable Representations Form a Global Workspace*](https://transformer-circuits.pub/2026/workspace/) and the [`neuronpedia/jacobian-lens`](https://huggingface.co/neuronpedia/jacobian-lens) lens weights. The two figures are screenshots of the explorer, reproduced for commentary; all stats are read directly from the tool's 38-model summary. The interactive diagrams are deterministic reconstructions of the pattern, not the measured CKA matrix.*\n","readingTimeMins":10,"url":"https://ai.thesatyajit.com/articles/jspace-open","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"Muon: orthogonalizing the update for hidden layers","description":"Muon takes the momentum update for a 2D weight matrix and orthogonalizes it with a few Newton-Schulz iterations before applying it, so no singular direction dominates the step. 
This is a first-principles walk through the update rule and the quintic iteration (with the real coefficients), why it runs only on hidden 2D layers, the NanoGPT speedrun wins Keller Jordan reports, and Moonshot's 'Muon is Scalable' result — the ~2x compute-efficiency claim, the weight-decay and update-RMS fixes needed at scale, and the Moonlight model.","date":"2026-07-08","tags":["explainer","training","deep-learning","llm"],"draft":false,"cover":"/articles/muon-optimizer/fig1.png","featured":false,"interest":4,"helpful":4,"kind":"articles","slug":"muon-optimizer","body":"**Muon** is an optimizer for the **hidden layers** of a neural network. The idea is small and specific: take the momentum update you would feed to SGD, and — before applying it to a 2D weight matrix — **orthogonalize** it. Replace the raw update direction with its nearest orthogonal matrix, so every singular direction of the step carries the same weight and no single direction dominates. The name spells out the recipe: **M**oment**U**m **O**rthogonalized by **N**ewton-schulz. Everything else — the embeddings, the final head, the norms and biases — stays on AdamW.\n\nThat one change buys two things that are worth taking seriously. Keller Jordan, who introduced Muon, used it to set a string of NanoGPT training-speed records at essentially the same wall-clock cost per step as Adam. And Moonshot AI's *Muon is Scalable for LLM Training* reports **~2× the compute efficiency of AdamW** at LLM scale — once you add two fixes that only matter once the matrices get big. This is a walk through the mechanism from first principles, then the numbers, honestly labelled.\n\n<Callout type=\"note\">\nMuon only touches **2D hidden weight matrices** — attention and MLP projections. Scalars, vectors, embeddings and the output head are optimized by AdamW, and empirically that split matters (more on why below). 
Every speed number here is **author- or provider-reported**: Keller's on NanoGPT/CIFAR speedruns, Moonshot's on their own scaling-law sweep and the Moonlight model. I have not re-run them.\n</Callout>\n\n## The idea: orthogonalize the update\n\nStart with the problem. A momentum buffer for a weight matrix $M \\in \\mathbb{R}^{A \\times B}$ has a spectrum — a set of singular values. Write its SVD:\n\n$$\nM = U \\Sigma V^{\\top}, \\qquad \\Sigma = \\mathrm{diag}(\\sigma_1, \\sigma_2, \\dots, \\sigma_r), \\quad r = \\min(A, B).\n$$\n\nHere $U$ and $V$ hold the left/right **singular directions** and $\\sigma_i$ are the **singular values** — how far the update reaches along each direction. Gradient updates on real transformers are badly conditioned: a few singular values are huge and the rest are tiny. So a plain momentum step lurches the weights along the top singular direction and barely moves the others. The step is *anisotropic*.\n\nMuon's fix is to keep the directions and flatten the spectrum. Set every singular value to 1:\n\n$$\nO = U V^{\\top} \\quad\\text{(the } \\Sigma \\text{ in the middle is replaced by the identity).}\n$$\n\n$O$ is the closest **semi-orthogonal** matrix to $M$. It points the same way as $M$ in every singular direction, but now each direction gets the same magnitude. This is the \"spectrally normalized\" step — no direction dominates. Toggle between the raw momentum step and its orthogonalized version:\n\n<MuonStep />\n\nThat is the whole conceptual move. 
The rest of Muon is about doing $M \\mapsto U V^{\\top}$ **cheaply**, because an SVD every step, on every weight matrix, would be far too slow.\n\n## The update rule\n\nPer step, for each 2D hidden weight $W$:\n\n```python\n# Muon, one step, per 2D hidden weight matrix W\nM = mu * M + G                          # momentum buffer; G = this step's gradient, mu ~ 0.95\nupdate = G + mu * M                     # Nesterov variant (works a bit better in practice)\nO = newtonschulz5(update, steps=5)      # orthogonalize: O ~ U V^T from the SVD of update\nW = W - lr * O                          # apply the spectrally-normalized step\n```\n\n$G$ is the gradient, $M$ the momentum buffer, $\\mu$ the momentum coefficient (~0.95), $\\text{lr}$ the learning rate. Muon puts the momentum **before** the orthogonalization (you orthogonalize the accumulated direction, not the raw gradient), and Keller reports Nesterov-style momentum beats plain SGD-momentum in every case he tested. The only unusual line is `newtonschulz5`.\n\n## Newton-Schulz: orthogonalize without an SVD\n\nThe trick is that $U V^{\\top}$ is what you get if you take the SVD and set every singular value to 1. So you never need $U$, $V$, or $\\Sigma$ explicitly; you just need a function that drives every singular value to 1 while leaving the singular *directions* alone. A matrix polynomial in $M$ does exactly that: applying $p(M) = M \\, q(M^{\\top}M)$ acts on the SVD as $U \\, p(\\Sigma) \\, V^{\\top}$, so it touches only the singular values. Pick the polynomial so that iterating it sends every $\\sigma \\in (0, 1]$ toward 1.\n\nMuon uses a fixed **quintic**, applied $T = 5$ times. 
Here is the exact function, coefficients and all:\n\n```python\ndef newtonschulz5(G, steps=5, eps=1e-7):\n    assert G.ndim == 2\n    a, b, c = (3.4445, -4.7750, 2.0315)     # the tuned quintic coefficients\n    X = G.bfloat16()\n    X /= (X.norm() + eps)                    # normalize so every singular value is in [0, 1]\n    if G.size(0) > G.size(1):\n        X = X.T\n    for _ in range(steps):                   # FIXED count -- no \"until converged\"\n        A = X @ X.T\n        B = b * A + c * A @ A\n        X = a * X + B @ X                    # X <- a*X + b (XX^T)X + c (XX^T)^2 X\n    if G.size(0) > G.size(1):\n        X = X.T\n    return X\n```\n\nOn the singular values this is the scalar map $\\varphi(\\sigma) = a\\sigma + b\\sigma^3 + c\\sigma^5$ applied five times. Two things make it work. First, dividing by the Frobenius norm up front guarantees $\\sigma_{\\max} \\le 1$, so every value lands in $[0, 1]$ where the iteration is designed to converge. Second, the coefficients are tuned so the slope at zero is steep — $\\varphi'(0) = a = 3.4445 > 1$ — which yanks the *smallest* singular values up fast. Watch the spectrum flatten:\n\n<SpectrumCollapse />\n\nThe honest nuance: those coefficients do **not** drive each singular value to exactly 1. $\\varphi(1) = 3.4445 - 4.7750 + 2.0315 = 0.70$, and the iteration settles values into a band roughly $[0.7, 1.2]$ rather than a point. That is deliberate — Keller trades exact convergence for the steep slope at 0, which makes five steps enough. It does not matter: the goal is to *equalize* the singular values so no direction dominates, and the condition number $\\kappa = \\sigma_{\\max}/\\sigma_{\\min}$ collapsing from ~33 to ~1.5 does that. Approximate orthogonalization is all Muon needs. (Moonshot report $T = 10$ gives a cleaner orthogonalization but no downstream gain, so $T = 5$ it is.)\n\nThe reason this is cheap: the iteration is just matrix multiplies in `bfloat16`, no inverses or eigendecompositions. 
Keller bounds the overhead at $T m / B$ FLOPs relative to the forward/backward pass, where $m$ is the model dimension and $B$ the batch size in tokens. Concretely: NanoGPT ($m=768$, $B{=}524{,}288$) pays $5 \\times 768 / 524288 = 0.7\\%$; a Llama-405B-shaped run ($m{=}16384$, $B{=}16\\text{M}$) pays $5 \\times 16384 / 16\\text{M} = 0.5\\%$. Orthogonalization is almost free, and it gets *cheaper* as models grow.\n\n<Callout type=\"tip\">\nMuon is close kin to Shampoo/SOAP. Strip the preconditioner accumulation out of Shampoo and its update collapses to the same orthogonalized gradient $U V^{\\top}$ — but Shampoo computes it with inverse-fourth-roots (an eigendecomposition), where Muon uses the Newton-Schulz iteration. Same target, far lower wall-clock and FLOP overhead.\n</Callout>\n\n## Why only 2D hidden layers\n\nNewton-Schulz needs a matrix — orthogonalization is defined for 2D. So scalars and vectors (LayerNorm gains, biases) have no spectrum to flatten and go to AdamW by construction. Convolutional filters get flattened to 2D and can be included.\n\nThe less obvious rule is that the **embedding** and the **final classifier head** are 2D but should *still* use AdamW — empirically that split improves results. The intuition: those two layers are indexed per-token. Each row is one token's vector, updated only when that token appears, so the gradient is sparse and row-wise, and there is no shared \"direction\" across the vocabulary worth equalizing. Orthogonalizing across the vocab dimension mixes unrelated tokens. The hidden layers are the opposite — dense, shared, badly conditioned — which is exactly where flattening the spectrum pays off. So the working split is: **hidden 2D weights on Muon, everything else on AdamW.**\n\n## The speedruns\n\nKeller's headline result is the NanoGPT speedrun. 
Swapping AdamW for Muon set a new training-speed record on 2024-10-15, improving speed by **35%**, and Muon has held as the optimizer of choice through the twelve NanoGPT records set since, by seven different researchers. On the training-to-target curve it dominates: at roughly Adam's cost per step it reaches a lower validation loss than Adam, DistributedShampoo, and SOAP — in less wall-clock time.\n\n<Figure\n  src=\"/articles/muon-optimizer/fig1.png\"\n  alt=\"Validation loss versus wall-clock time on 8xH100 for the NanoGPT speedrun, comparing Adam, DistributedShampoo at two update frequencies, SOAP, and Muon. Muon (purple) reaches the lowest validation loss in the least wall-clock time; SOAP is far slower per step.\"\n  caption=\"NanoGPT speedrun: validation loss vs wall-clock on 8xH100. Muon reaches the lowest loss fastest, at 142 ms/step vs Adam's 139 ms/step (Keller Jordan, Muon post).\"\n/>\n\nThe per-step overhead is the point. From the legend: Adam runs at **139 ms/step**, Muon at **142 ms/step** — a ~2% tax — while matching-or-beating DistributedShampoo (154–179 ms/step) and SOAP (301 ms/step) on loss. Orthogonalization is nearly free per step and the sample efficiency is better, so wall-clock wins:\n\n<BenchBars\n  title=\"NanoGPT speedrun — ms/step (lower is better)\"\n  unit=\" ms\"\n  bars={[\n    { label: \"Adam\", value: 139 },\n    { label: \"Muon\", value: 142, highlight: true },\n    { label: \"Shampoo (uf=32)\", value: 154 },\n    { label: \"Shampoo (uf=10)\", value: 179 },\n    { label: \"SOAP\", value: 301 },\n  ]}\n/>\n\nThe wins hold beyond NanoGPT, in Keller's own runs: a **1.5B-parameter** transformer to GPT-2-XL-level HellaSwag in **10 × 8×H100-hours** where AdamW needs **13.3** (a 1.33× wall-clock speedup); the FineWeb validation-loss record improved by **1.35×**; and the CIFAR-10-to-94% speed record cut from **3.3 to 2.6 A100-seconds**. 
Small models, but a consistent shape.\n\n## Muon is Scalable: the two fixes\n\nMuon out of the box works at NanoGPT scale. Moonshot AI's *Muon is Scalable for LLM Training* (Liu et al., 2025) is about what breaks when you push it to real LLM training, and the two fixes that close the gap. Both are one-liners once you see them.\n\n**Fix 1: weight decay.** Base Muon has no weight decay, and over a long run the weights (and with them the logits and activation RMS) drift upward until quality suffers. The fix is decoupled weight decay, exactly as in AdamW:\n\n$$\nW_t = W_{t-1} - \\eta_t\\,\\big(O_t + \\lambda\\, W_{t-1}\\big), \\qquad \\lambda = 0.1.\n$$\n\n$O_t$ is the orthogonalized update, $\\eta_t$ the learning rate, $\\lambda$ the weight-decay coefficient. Without it Muon eventually crosses *above* AdamW late in training; with it Muon stays ahead throughout.\n\n**Fix 2: match the update RMS.** This one is about magnitude. AdamW's per-element update has a roughly constant RMS (~0.2–0.4) regardless of a matrix's shape, so a single learning rate works everywhere. Muon's orthogonalized update does not. Moonshot's Lemma 1: for a full-rank $[A, B]$ matrix, the RMS of $O = U V^{\\top}$ is\n\n$$\n\\mathrm{RMS}(O) = \\frac{1}{\\sqrt{\\max(A, B)}}.\n$$\n\nThat **shrinks as matrices get wider**: a $4096 \\times 11008$ MLP weight has RMS $1/\\sqrt{11008} \\approx 0.01$, so its per-element step is ~20× smaller than AdamW's ~0.2 under the same learning rate (a square $4096 \\times 4096$ layer, at $1/64 \\approx 0.016$, is not much better off). At scale the wide matrices barely move. 
The fix scales each Muon update to a fixed target RMS of ~0.2:\n\n$$\nW_t = W_{t-1} - \\eta_t\\,\\Big(0.2 \\cdot O_t \\cdot \\sqrt{\\max(A, B)} + \\lambda\\, W_{t-1}\\Big).\n$$\n\nThe $\\sqrt{\\max(A,B)}$ cancels the shape dependence and the constant 0.2 pins the RMS to AdamW's range, so an AdamW-tuned learning rate transfers to Muon directly — no per-layer retuning.\n\nWith both fixes in, Moonshot's scaling-law sweep (dense models, 0.4B–1.5B, compute-optimal token counts) puts Muon's loss curve cleanly below AdamW's:\n\n<Figure\n  src=\"/articles/muon-optimizer/fig2.png\"\n  alt=\"Scaling-law plot of language-model loss versus compute in PFLOP/s-days, log-x axis, comparing Muon (blue dashed) and AdamW (red dashed) fitted lines with star markers. Muon's line sits below AdamW's throughout; an annotation marks that Muon reaches AdamW's loss at 0.519x the FLOPs.\"\n  caption=\"Fitted scaling laws: Muon reaches AdamW's loss at 0.519x the compute — the ~2x efficiency claim (Liu et al., Figure 1a).\"\n/>\n\nThe fitted curves are $L_{\\text{Muon}} = 2.506\\,C^{-0.052}$ and $L_{\\text{AdamW}} = 2.608\\,C^{-0.054}$, with $C$ the compute budget. Read horizontally, matching AdamW's loss takes Muon about **52% of the training FLOPs** — the \"~2× more compute-efficient\" headline. Worth stating plainly: this is a fit over their own sub-2B sweep, extrapolated; it is a provider result, not an independent one.\n\n## Moonlight\n\nTo show it holds past the toy scale, Moonshot trained **Moonlight** with Muon: a **15.3B-parameter** Mixture-of-Experts model (a DeepSeek-V3-Small-style architecture) that activates **2.24B** parameters per token, on **5.7T** tokens. The training was smooth — no loss or gradient-norm spikes. 
Placed on a compute-vs-MMLU frontier against open models, Moonlight sits *on* the Pareto front, matching models trained with far more compute:\n\n<Figure\n  src=\"/articles/muon-optimizer/fig3.png\"\n  alt=\"Scatter plot of MMLU score versus training FLOPs (log-x) for many open models, with a dashed Pareto frontier. Moonlight-2.4B checkpoints at 1.2T and 5.7T tokens (red stars) sit on the frontier; Moonlight-5.7T reaches ~70 MMLU near Gemma-2-9B, above Qwen-2.5-3B, Llama-3.1-8B, DeepSeek-V2-Lite and others at similar or higher compute.\"\n  caption=\"MMLU vs training compute. Moonlight (red stars) lands on the efficiency frontier, reaching ~70 MMLU near Gemma-2-9B (Liu et al., Figure 1b).\"\n/>\n\nAgainst comparable open models, Moonlight leads most of the standard suite — with one honest exception:\n\n<BenchBars\n  title=\"MMLU (%) — provider-reported\"\n  unit=\"\"\n  bars={[\n    { label: \"Moonlight (Muon)\", value: 70.0, highlight: true },\n    { label: \"Qwen2.5-3B\", value: 65.6 },\n    { label: \"DeepSeek-V2-Lite\", value: 58.3 },\n    { label: \"Llama3.2-3B\", value: 54.7 },\n  ]}\n/>\n\nOn **MATH** Moonlight reaches **45.3** vs Qwen2.5-3B's 42.6, Llama3.2-3B's 8.5 and DeepSeek-V2-Lite's 17.1; on **GSM8K** it is **77.4**, ahead of Llama and DeepSeek-V2-Lite but a hair *behind* Qwen2.5-3B's 79.1. And in Moonshot's own head-to-head — Moonlight (Muon) vs an identical model trained with AdamW at 1.2T tokens — Muon wins across the board, most visibly on code: HumanEval 37.2 vs 29.3, MBPP 52.9 vs 49.2. These are all provider numbers on their own harness, and Moonlight is an MoE while the scaling-law fits were on dense models — so the ~2× claim and the model result are related but not the same experiment.\n\n## The take\n\nMuon is a genuinely small idea with a clean mechanism: orthogonalize the momentum update for 2D hidden weights so the step is spectrally flat, and do it with a five-step Newton-Schulz iteration that costs well under 1% overhead. 
The Newton-Schulz coefficients are tuned for speed, not exactness — they collapse the update's condition number toward 1 rather than nailing every singular value to it, and that is enough. The scope is deliberately narrow: hidden 2D matrices only, everything else on AdamW.\n\nThe wins are real but each carries a caveat. Keller's NanoGPT/CIFAR speedruns are small-scale and self-reported, but they are reproducible and the per-step overhead (142 vs 139 ms) is visible and tiny. Moonshot's ~2× compute-efficiency is a fit over their own sub-2B dense sweep, extrapolated, and it *only* holds with the two fixes — weight decay and update-RMS matching — that Muon out of the box lacks. Moonlight is an MoE and a provider-reported result. Read together, the honest summary is: orthogonalizing the update is a cheap, well-motivated change that clearly helps hidden layers, is nearly free per step, and — with the scale fixes — looks like a real efficiency gain that others can now check for themselves.\n\n---\n\n*Built on Keller Jordan's [Muon: An optimizer for hidden layers in neural networks](https://kellerjordan.github.io/posts/muon) (2024) and J. Liu et al., [Muon is Scalable for LLM Training](https://arxiv.org/abs/2502.16982) (arXiv 2502.16982, 2025). Newton-Schulz code and coefficients are quoted from Keller's post; the update-RMS and weight-decay formulas from Liu et al. The interactive widgets are illustrations of the mechanism, not measured traces. 
Figures are reproduced from the sources for commentary; all speed and benchmark numbers are author- or provider-reported.*\n","readingTimeMins":13,"url":"https://ai.thesatyajit.com/articles/muon-optimizer","signal":{"interest":4,"helpful":4,"score":8,"level":4,"label":"High"}},{"title":"Rollout Routing Replay: stabilizing MoE reinforcement learning","description":"RL post-training on Mixture-of-Experts models collapses because the router picks different experts at rollout time than at update time, so the importance-sampling ratio explodes. Rollout Routing Replay (R3) records the inference engine's routing masks and replays them during training — aligning expert selection while keeping the router's gradient. It cuts the train-inference KL from 1.54 to 0.75 ×10⁻³, near the dense baseline, prevents the collapses that GRPO/GSPO/TIS hit, and adds under 3% rollout overhead.","date":"2026-07-08","tags":["explainer","mixture-of-experts","reinforcement-learning","llm","training"],"draft":false,"cover":"/articles/rollout-routing-replay/fig1.png","featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"rollout-routing-replay","body":"Reinforcement learning is now the standard last stage of training a reasoning model: sample answers, score them, push up the probability of the good ones. It works well on dense models. On **Mixture-of-Experts** models it has a habit of blowing up — the validation curve climbs for 100 steps, then falls off a cliff. 
**Rollout Routing Replay (R3)**, from a Peking University and Xiaomi team, names the cause and fixes it with one idea: the router chooses different experts when you *generate* a rollout than when you *score* it for the update, so make training reuse the routing decisions the rollout already made.\n\n<Callout type=\"note\">\nAll numbers here are the paper's own experiments, on one model family: **Qwen3-30B-A3B** (30B total, ~3B active) for MoE and **Qwen3-8B** for the dense baseline, trained with `VeRL` — **SGLang** for rollout, **Megatron** for the update. The tasks are math RLVR (AIME/AMC/MATH) and one multi-turn SWE-agent task. The interactive diagrams below are illustrations of the mechanism with deterministic toy numbers; the benchmark and KL values are measured and cited to the figure/table they come from.\n</Callout>\n\n## Why RL training runs two different models of the same policy\n\nModern RL frameworks split the work across two engines. An **inference engine** (SGLang, vLLM) generates rollouts fast, with its own fused kernels and quantization. A **training engine** (Megatron, FSDP) recomputes probabilities and applies gradients. They implement the same math but not the same arithmetic, so the probability a rollout token gets from each engine is slightly different.\n\nThat difference matters because PPO-style objectives are off-policy. 
You sample from the inference policy $\\pi_\\text{infer}$ but compute the loss with the training policy $\\pi_\\text{train}$:\n\n$$\nJ(\\theta) = \\mathbb{E}_{x,\\,y\\sim\\pi_\\text{infer}(\\theta_\\text{old})}\\left[\\frac{1}{|y|}\\sum_{t=1}^{|y|}\\min\\!\\big(w_t(\\theta)\\,\\hat{A}_t,\\ \\text{clip}(w_t(\\theta),\\,1-\\varepsilon,\\,1+\\varepsilon)\\,\\hat{A}_t\\big)\\right]\n$$\n\nThe whole update is scaled by the **importance-sampling ratio**\n\n$$\nw_t(\\theta) = \\frac{\\pi_\\text{train}(\\theta)(y_t\\mid x,y_{<t})}{\\pi_\\text{train}(\\theta_\\text{old})(y_t\\mid x,y_{<t})}.\n$$\n\nWhen the engines agree, $w_t \\approx 1$ and the clip does its job. When a token's training probability diverges from its inference probability, $w_t$ drifts away from 1, the clip either saturates or lets a huge ratio through, and gradient variance spikes. Enough such tokens and the run collapses. TIS (truncated importance sampling) and GSPO (sequence-level ratios) are two existing attempts to bound this; R3 attacks the source instead.\n\n## MoE makes the gap an order of magnitude worse\n\nA dense model is a continuous function of its inputs: nudge the activations, the output moves a little. A MoE layer is not. The router computes logits $s = x\\,W_r$, keeps the top-$K$ experts, and routes through only those. Any perturbation that exceeds the top-$K$ margin (the gap between the $K$-th and $(K{+}1)$-th logit, which can be arbitrarily small) flips which expert is selected, and the layer output jumps discretely. 
The two engines' small numeric differences are exactly such perturbations.\n\nThe paper quantifies it on 2,048 math problems (~20M tokens), scored once by SGLang and once by Megatron:\n\n- **~10% of routers** pick a different expert between the two engines (per token, per layer).\n- **94% of tokens** get a different expert in at least one layer somewhere in the stack.\n- Train–inference **KL is 1.54×10⁻³** for the MoE model versus **0.64×10⁻³** for the dense one — more than double.\n- Even running Megatron **twice on the same sequence** gives KL 0.84×10⁻³: the MoE forward pass is not deterministic, so the \"old policy\" itself is noisy.\n\nScatter each token's inference probability against its training probability and the shape of the problem is visible directly. The dense model hugs the diagonal; the MoE model fans out into a wide band with extreme off-diagonal tokens; R3 pulls it back. Switch models below:\n\n<DiscrepancyScatter />\n\nThe extreme tail is what actually breaks training. The paper measures it with $F(\\tau)$, the fraction of tokens whose train/infer probability ratio exceeds $\\tau$. For $\\tau>2$ the MoE model has an order of magnitude more such tokens than the dense model — and R3 removes that excess:\n\n<Figure\n  src=\"/articles/rollout-routing-replay/fig2.png\"\n  alt=\"A log-log plot of F(tau), the fraction of tokens whose training/inference probability ratio exceeds tau, against tau from 1 to 100. Three curves: Qwen3-8B (dense) lowest, Qwen3-30B-A3B (MoE) about an order of magnitude higher across the range, and Qwen3-30B-A3B + R3 dropping back down to sit almost on top of the dense curve.\"\n  caption=\"The extreme-token distribution F(τ): the MoE model (orange) has ~10× more tokens with a large train-inference probability ratio than the dense model (blue); adding R3 (green) collapses it back to the dense baseline (paper, Figure 2).\"\n/>\n\n## R3: replay the rollout routing mask\n\nThe fix is almost anticlimactic once the diagnosis is clear. 
During rollout, record the router's top-$K$ selection mask $I_\\text{infer}$ for every token and every layer. During the training forward pass, **use that mask instead of recomputing one** — but still run the softmax over the *training* logits, so the router's weights keep receiving gradient.\n\nStart from a normal MoE layer on the training side. The router scores experts and keeps the top $K$ as a binary mask:\n\n$$\ns_\\text{train} = x_\\text{train} W_r, \\qquad I_\\text{train} = \\text{TopKMask}(s_\\text{train}, K), \\quad I_\\text{train}\\in\\{0,1\\}^M\n$$\n\nGating weights are a softmax over the *selected* experts' logits, and the output is their weighted sum:\n\n$$\ng_{\\text{train},i} = \\frac{I_{\\text{train},i}\\,\\exp(s_{\\text{train},i})}{\\sum_j I_{\\text{train},j}\\,\\exp(s_{\\text{train},j})}, \\qquad y_\\text{train} = \\sum_{i=1}^{M} g_{\\text{train},i}\\,E_i(x_\\text{train})\n$$\n\nR3 changes exactly one term: replace the training mask with the **inference** mask captured during rollout, $I_\\text{infer} = \\text{TopKMask}(s_\\text{infer}, K)$, while keeping the softmax on the training logits:\n\n$$\ng_{\\text{replay},i} = \\frac{I_{\\text{infer},i}\\,\\exp(s_{\\text{train},i})}{\\sum_j I_{\\text{infer},j}\\,\\exp(s_{\\text{train},j})}, \\qquad y_\\text{replay} = \\sum_{i=1}^{M} g_{\\text{replay},i}\\,E_i(x_\\text{train})\n$$\n\nTwo properties fall out of that single substitution. **Alignment:** the training pass now activates exactly the experts the rollout used, so the layer output matches and $w_t$ returns to ~1. **Gradient survives:** only the discrete mask $I_\\text{infer}$ is borrowed; the softmax still runs over $s_\\text{train}$, so $\\partial/\\partial W_r$ keeps flowing and the router keeps training. You borrow the *decision*, not the *weights*.\n\nBelow is the same mechanism at one layer. 
The rollout router picks its top-2; the training router, on a different engine, recomputes logits and can land on a different top-2 — and when it does, the importance ratio blows up. Toggle **R3 on** to replay the rollout mask and watch every token snap back to $w \\approx 1$:\n\n<RouterReplay />\n\nThe paper's own schematic makes the data flow explicit: the rollout selection `select (1, 4)` is captured once and fed into both later forward passes (the recompute of the old policy and the update of the new one), overriding whatever the training routers would have chosen on their own:\n\n<Figure\n  src=\"/articles/rollout-routing-replay/fig1.png\"\n  alt=\"A three-panel diagram. Left panel: an Inference Engine forward pass through a router that selects experts 1 and 4 from four experts, producing an output. Middle and right panels: Training Engine passes for the old policy and the updated policy, each with its own router whose selection is crossed out with a trash-can icon; green arrows labelled Rollout Routing Replay carry the inference engine's select (1, 4) into both training passes.\"\n  caption=\"R3 captures the routing mask from the rollout (inference) engine and replays it in both the recompute and update passes of the training engine, discarding the training routers' own selections (paper, Figure 1, left).\"\n/>\n\nIf the router and top-$K$ gate are unfamiliar, I built them up from one MLP in [Mixture of Experts, from scratch](/articles/mixture-of-experts-from-scratch) — R3 is a small surgery on exactly that dispatch step.\n\n### What it costs, and the caching trick that makes it free at rollout\n\nStoring a mask per token per layer sounds expensive. It is not: a top-$K$ mask is a handful of small integers, and the paper reports **under 3% latency overhead** during rollout. The neat part is multi-turn. Inference engines already cache the KV of a prefix so repeated turns don't re-prefill; R3 caches the **routing masks alongside that KV**. 
Same prefix, same masks, no recomputation. That is what keeps R3 cheap on agent tasks — [software-engineering and browsing agents](/articles/agents-a1) that interleave many generation and tool-call turns — where re-prefilling to regenerate masks would otherwise dominate.\n\n## Does it work\n\nThree things to check: does it kill the extreme tokens, does it stop the collapse, does it score better.\n\n**Alignment.** Replaying the masks drops the train–inference KL from **1.54×10⁻³ to 0.75×10⁻³**, essentially the dense model's 0.64×10⁻³, and cuts the large-ratio tail by an order of magnitude — the green curve in Figure 2 above.\n\n**Stability.** In the single-mini-step setting, all three runs *without* R3 collapsed. The tell was mechanical: KL and $F(\\tau{=}2)$ climbed together, and once $F(\\tau{=}2)$ passed 0.1 — 10% of tokens differing by more than 2× between engines — the run fell over (SFT + GRPO collapsed at step 60). With R3, $F(\\tau{=}2)$ stayed below 10⁻⁴ for most of training and nothing collapsed. The clearest picture is the validation curve: without R3 it climbs to ~0.62 and then craters; with R3 it climbs smoothly past 0.70.\n\n<Figure\n  src=\"/articles/rollout-routing-replay/fig3.png\"\n  alt=\"A line plot of average validation score against global training step. The red w/o R3 curve rises to about 0.62 by step 100, then collapses sharply to 0.40 around step 110. The green w/ R3 curve rises smoothly and monotonically to about 0.70 by step 170 without any collapse.\"\n  caption=\"Average validation score over training. 
Without R3 (red) the run collapses near step 100; with R3 (green) it climbs smoothly to ~0.70 (paper, Figure 1, bottom right).\"\n/>\n\n**Performance.** On the single-mini-step SFT model, R3 beats TIS by **5.58 points** of average math score — and unlike GRPO and GRPO+TIS, it never crashes:\n\n<BenchBars\n  title=\"Qwen3-30B-A3B-SFT · avg math score, single mini-step (Table 1)\"\n  unit=\"\"\n  bars={[\n    { label: \"GRPO (crashed @60)\", value: 62.23 },\n    { label: \"GRPO+TIS (crashed @105)\", value: 66.24 },\n    { label: \"GRPO+R3\", value: 71.83, highlight: true },\n  ]}\n/>\n\nIn the multi-mini-step setting the story repeats against GSPO: GRPO+R3 edges GSPO by 1.29, and stacking R3 on GSPO adds another 0.95 — while plain GRPO collapsed at step 120:\n\n<BenchBars\n  title=\"Qwen3-30B-A3B-SFT · avg math score, multi mini-step (Table 1)\"\n  unit=\"\"\n  bars={[\n    { label: \"GSPO\", value: 66.76 },\n    { label: \"GRPO+R3\", value: 68.05 },\n    { label: \"GSPO+R3\", value: 69.00, highlight: true },\n  ]}\n/>\n\nAnd it generalizes past math. On a multi-turn SWE-agent task (R2E-Gym train, SWE-bench Verified eval), GRPO collapses at step 90; GRPO+R3 stays stable and finishes **6.8 points higher** on Pass@1:\n\n<BenchBars\n  title=\"SWE-bench Verified · Pass@1, multi-turn RL (Table 2)\"\n  unit=\"\"\n  bars={[\n    { label: \"GRPO (crashed @90)\", value: 31.80 },\n    { label: \"GRPO+R3\", value: 38.60, highlight: true },\n  ]}\n/>\n\n## How it differs from GSPO's routing replay\n\nGSPO already proposed a \"routing replay,\" so it is worth being precise about what R3 changes. An RL step has three forward passes: **rollout** (generate), **recompute** (score the old policy), **update** (score the new policy). GSPO's *Recompute* Routing Replay caches the mask from the recompute pass and replays it in the update pass — it fixes routing drift *caused by the weight update*, but does nothing about the rollout-vs-training **framework gap**. 
R3 caches from the **rollout** pass and replays it in both recompute and update, so it fixes the framework gap *and*, because both training passes now share one mask, the update drift too.\n\nThe distinction bites at `mini_step=1`, where the old and new policies are identical and GSPO's recompute-based replay has nothing to correct — yet the framework gap is still there, and only R3 closes it. One honest caveat from the same experiments: R3 already removes most of the discrepancy, so **stacking TIS on top does not help and can hurt** — TIS+R3 scored 1.69 below R3 alone on the single-mini-step SFT model. If you run R3, drop the importance-sampling patch.\n\n## The take\n\nR3 is the kind of fix that reads as obvious only after someone isolates the cause. The instability everyone attributed vaguely to \"MoE being finicky\" turns out to be a specific, measurable thing — the router selecting different experts in the two engines — and the remedy is to stop letting the training pass re-decide something the rollout already decided. It aligns the two policies at the source rather than clipping the symptom downstream, it costs under 3% at rollout, it caches cleanly for multi-turn agents, and it is orthogonal to GRPO/GSPO/DAPO so you can bolt it on.\n\nThe caveats are the usual single-paper ones: everything is one model family (Qwen3-30B-A3B / Qwen3-8B) on math plus one SWE task, and it needs you to reach into the inference engine to capture and store routing masks — free in principle, real integration work in practice. But the mechanism is clean, the diagnosis is well-measured, and the result — MoE RL that trains as stably as a dense model — is worth the plumbing.\n\n---\n\n*Built on [Rollout Routing Replay](https://arxiv.org/abs/2510.11370) (Ma et al., 2025; arXiv:2510.11370). Figures are reproduced from the paper for commentary. 
The interactive diagrams use deterministic toy numbers to illustrate the mechanism; all KL, $F(\\tau)$, and benchmark figures are the paper's measured values, cited to their source figure or table.*\n","readingTimeMins":11,"url":"https://ai.thesatyajit.com/articles/rollout-routing-replay","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"Reading a torch.profiler trace: overhead-bound vs compute-bound","description":"A working engineer's walk through Hugging Face's torch.profiler guide: the 20-line setup, how to read the key_averages() table and the Perfetto timeline, and how the exact same matmul+add flips from overhead-bound (GPU 23 us of work inside a 2.31 ms wall, ~98% idle) to compute-bound (a 4.285 ms gemm kernel) just by growing the matrices — plus what torch.compile actually fuses.","date":"2026-07-08","tags":["explainer","systems","inference-optimization","training"],"draft":false,"cover":"/articles/torch-profiler/fig1.png","featured":false,"interest":3,"helpful":5,"kind":"articles","slug":"torch-profiler","body":"Every \"my GPU is slow\" bug is one of two things: the GPU is doing too much work, or it is doing nothing while the CPU flails. You cannot tell which by staring at the code. You attach a profiler and read the trace. Hugging Face's [torch.profiler guide](https://huggingface.co/blog/torch-profiler) teaches this with the smallest possible workload — `y = matmul(x, w) + b`, bf16, on an **NVIDIA A100-SXM4-80GB** — and shows the same three lines of code land on opposite ends of that spectrum depending only on the matrix size. This is a walk through what the profiler prints, how to read it, and the two bottleneck regimes it exposes.\n\n<Callout type=\"note\">\nAll numbers here are from the post's runs on one A100 in bf16. Kernel timings drift a few percent run to run (GPU clocks, thermals, power caps), so treat them as representative, not exact constants. 
The interactive diagrams are redrawn from the post's traces to explain the mechanism — they are not live captures.\n</Callout>\n\n## The 20-line setup\n\nThe workload is deliberately trivial so the profiler output is the whole story, not the model:\n\n```python\n# 01_matmul_add.py — profile y = matmul(x, w) + b on an A100, bf16\nimport torch\n\nN = 64  # start small; bump to 4096 later\nx = torch.randn(N, N, dtype=torch.bfloat16, device=\"cuda\")\nw = torch.randn(N, N, dtype=torch.bfloat16, device=\"cuda\")\nb = torch.randn(N, N, dtype=torch.bfloat16, device=\"cuda\")\n\ndef fn(x, w, b):\n    return torch.add(torch.matmul(x, w), b)\n\ndef step():\n    with torch.profiler.record_function(\"matmul_add\"):   # a named region in the trace\n        fn(x, w, b)\n```\n\n`record_function(\"matmul_add\")` is the one line people skip and then regret: it draws a labelled box around your code in the trace so you can find it among thousands of `aten::*` ops. The harness wraps `step()` in the profiler:\n\n```python\nschedule = torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1)\n\nwith torch.profiler.profile(\n    activities=[\n        torch.profiler.ProfilerActivity.CPU,\n        torch.profiler.ProfilerActivity.CUDA,\n    ],\n    schedule=schedule,\n) as prof:\n    for _ in range(5):          # 1 wait + 1 warmup + 3 active = 5 steps\n        step()\n        prof.step()             # advance the schedule one step\n\nprint(prof.key_averages().table(sort_by=\"cuda_time_total\", row_limit=15))\nprof.export_chrome_trace(\"trace.json\")   # open in https://ui.perfetto.dev\n```\n\nTwo knobs matter here. `activities` records both CPU-side dispatch and CUDA kernels — you want both, because the whole point is comparing them. `schedule` decides which steps count.\n\n## schedule: which steps land in the trace\n\n`prof.step()` advances a little state machine. `wait` steps are skipped entirely, `warmup` steps run but get discarded, `active` steps run and get recorded. 
The warmup exists to throw away the first-step cold start — on step zero the CPU sits idle for a couple hundred microseconds before it issues its first launch, and you do not want that artifact averaged into your numbers.\n\n<ScheduleStrip />\n\nWith `wait=1, warmup=1, active=3` over `range(5)`, exactly **3 steps** are recorded — which is why every row in the table below reads `# of Calls = 3`, and why the first recorded box in the trace is labelled `ProfilerStep#2`.\n\n## Reading the table: Self CPU vs Self CUDA\n\n`key_averages().table()` is the first thing to read, before any timeline. Here is the 64x64 run, sorted by CUDA time:\n\n<Figure\n  src=\"/articles/torch-profiler/fig2.png\"\n  alt=\"A torch.profiler key_averages table for a 64x64 bf16 matmul+add. Columns include Self CPU, CPU total, Self CUDA, CUDA total and number of calls. cudaDeviceSynchronize dominates Self CPU at 1.786 ms (77.20%); the ampere bf16 gemm kernel is 14.272 us of CUDA time and a vectorized elementwise kernel is 8.832 us. Self CPU time total is 2.314 ms, Self CUDA time total is 23.104 us.\"\n  caption=\"key_averages() for the 64x64 run: 2.314 ms of CPU against 23.104 us of GPU. cudaDeviceSynchronize alone is 1.786 ms (Hugging Face, Fig 1).\"\n/>\n\nThe two columns that decide everything are **Self CPU** and **Self CUDA**. Self time is a row's own time, excluding its children — so `aten::matmul` shows a large CPU total but ~0 self time (its work lives in the child `aten::mm` and the kernel it launches). Read the self columns and the picture is stark:\n\n- **Self CUDA time total: 23.104 us.** The GPU does 23 microseconds of real work. That splits into two kernels — the GEMM (`ampere_bf16_s16816gemm_bf16_64x64_...`, 14.272 us, ~62%) and a `vectorized_elementwise_kernel` for the add (8.832 us, ~38%).\n- **Self CPU time total: 2.314 ms** — a hundred times larger. 
And 1.786 ms of it (77.20%) is a single row: `cudaDeviceSynchronize`, the CPU blocking to wait for the GPU.\n\n<BenchBars\n  title=\"64x64 run — Self CUDA time by kernel (23.104 us total)\"\n  unit=\" us\"\n  bars={[\n    { label: \"ampere bf16 gemm\", value: 14.272, highlight: true },\n    { label: \"elementwise add\", value: 8.832 },\n  ]}\n/>\n\nWhen Self CPU dwarfs Self CUDA like this, the GPU is starved. The kernel isn't slow; there is barely any kernel. This is **overhead-bound**.\n\n## Reading the trace: two lanes, one dependency\n\nThe table tells you *what* is expensive; the timeline tells you *when* and *why there are gaps*. Export the trace, open it in [Perfetto](https://ui.perfetto.dev), and you get two lanes that matter — the CPU main thread and the GPU stream:\n\n<Figure\n  src=\"/articles/torch-profiler/fig1.png\"\n  alt=\"A Perfetto trace with two lanes. The top CPU lane (main thread) shows three ProfilerStep boxes with a matmul_add region and aten ops, numbered 1, 2, 3, followed by a long cudaDeviceSynchronize block spanning most of the width. The bottom GPU lane (stream 7) is almost entirely empty except for a tiny sliver of kernel work at the far right.\"\n  caption=\"The 64x64 trace: three recorded steps on the CPU lane, then a long cudaDeviceSynchronize, while the GPU lane (stream 7) sits nearly empty (Hugging Face, Fig 2).\"\n/>\n\nThe mental model: **the CPU launches, the GPU executes, and the two lanes are offset in time.** The CPU calls `cudaLaunchKernel`, which returns almost immediately — the kernel is queued, not run. The GPU picks it up a moment later on its own stream. So a fast CPU op and a slow GPU kernel show up as *staggered* boxes, not stacked ones. The dashed dependency in the diagram below is that hand-off.\n\nIn the 64x64 trace the GPU lane is nearly empty: 23 us of kernels scattered in a wall that is milliseconds wide. The GPU is idle roughly **98%** of the time. 
Flip the interactive to see the same lanes fill up when the matrices grow:\n\n<TraceLanes />\n\n## Same code, two regimes\n\nChange one number — `N = 64` to `N = 4096` — and rerun. Nothing else moves. The table inverts:\n\n| run | Self CPU total | Self CUDA total | dominant cost | verdict |\n|---|---|---|---|---|\n| `64 x 64` | 2.314 ms | 23.104 us | `cudaDeviceSynchronize` 1.786 ms (77%) | overhead-bound |\n| `4096 x 4096` | 4.908 ms | 4.495 ms | gemm kernel 4.285 ms (95% of GPU) | compute-bound |\n\nAt 4096, Self CUDA (4.495 ms) finally rivals Self CPU (4.908 ms). One kernel — `ampere_bf16_s16816gemm_bf16_128x256_...`, 4.285 ms, 95.33% of GPU time — is now the entire budget. Note the tile even changed: cuBLAS picks a `128x256` GEMM tile for the big matrices where it used `64x64` for the small ones. `cudaDeviceSynchronize` is still 94% of the CPU wall, but the reading is different: here the CPU is legitimately blocked on real GPU work, not spinning on launch overhead. Same row, opposite meaning — which is exactly why you read both lanes.\n\nThe verdict changes what you do next:\n\n- **Overhead-bound** (64x64): stop launching so many tiny kernels. Batch more work per launch, fuse ops, or hand it to `torch.compile`. Making the kernel faster buys you nothing — it is already 23 us.\n- **Compute-bound** (4096x4096): the gemm *is* the job. Optimize the kernel — lower precision, better tiling, a fused epilogue — or reduce FLOPs. Cutting launch overhead buys you nothing here.\n\n<Callout type=\"tip\">\nThe one-line diagnostic: compare **Self CUDA time total** against **Self CPU time total**. GPU much smaller than CPU means overhead-bound — you are launch- and sync-limited. GPU comparable to or larger than CPU means compute-bound — go optimize kernels. Everything else is detail.\n</Callout>\n\n## What torch.compile actually does here\n\nThe obvious fix for the overhead-bound case is to stop dispatching `matmul` and `add` as two separate ops. 
`torch.compile` does that:\n\n```python\ncfn = torch.compile(fn)\n\ndef step():\n    with torch.profiler.record_function(\"matmul_add\"):\n        cfn(x, w, b)\n```\n\nIn the trace, the two ops collapse into a single `aten::addmm` dispatch — a GEMM with the bias folded into its epilogue instead of a separate elementwise kernel:\n\n<Figure\n  src=\"/articles/torch-profiler/fig3.png\"\n  alt=\"A Perfetto trace of the torch.compiled region. The CPU lane shows a Torch-Compiled Region and a CompiledFxGraph call, under which the matmul and add have fused into a single aten::addmm box, followed by a cudaMemcpyAsync and a cudaLaunchKernel.\"\n  caption=\"torch.compile fuses matmul + add into one aten::addmm dispatch, wrapped in the compiled-graph call and a Device-to-Device memcpy for the bias (Hugging Face, Fig 3).\"\n/>\n\nTwo things are worth seeing honestly. First, there is still a **Device-to-Device `cudaMemcpyAsync`** in the region — the bias has to be staged/broadcast before it folds into the GEMM, so \"fused\" does not mean \"zero extra work\". Second, the compiled path adds its own CPU cost: a `CompiledFxGraph` call plus Dynamo's guard and cache lookup roughly **double** the per-step CPU overhead versus eager. Underneath, it is still the same `ampere` cuBLAS GEMM kernel doing the math.\n\nSo `torch.compile` is a real win when the fusion removes launches across *many* ops or feeds a big epilogue — but on a single `matmul + add` over small inputs, its fixed dispatch overhead is not amortized and can cost more CPU than it saves. The profiler is how you tell the difference instead of guessing. 
Measure both, keep the faster one.\n\n## The workflow, condensed\n\n- **Wrap regions** with `record_function(\"name\")` so you can find your code in the trace.\n- **Use a `schedule`** with at least one `warmup` step; discard the cold start.\n- **Read `key_averages().table()` first.** Compare Self CUDA total vs Self CPU total to get the regime.\n- **Then open the trace in Perfetto.** CPU lane launches, GPU lane executes, offset in time. Empty GPU lane = overhead-bound; a fat kernel filling the GPU lane = compute-bound.\n- **Fix the regime you actually have.** Fuse/batch for overhead; optimize the kernel for compute. Re-profile to confirm the win is real and not a torch.compile tax.\n\nNone of this needs a big model. A 64x64 matmul on an A100 is enough to show the difference between a GPU that is busy and a GPU that is waiting — and that difference is most of GPU performance work.\n\n---\n\n*Built on Hugging Face's [Understanding the torch.profiler](https://huggingface.co/blog/torch-profiler) (2026). All timings are the post's A100 / bf16 runs, reproduced from its `key_averages()` tables and Perfetto traces for commentary; the `TraceLanes` and `ScheduleStrip` widgets are redrawn illustrations of the mechanism, not live captures.*\n","readingTimeMins":9,"url":"https://ai.thesatyajit.com/articles/torch-profiler","signal":{"interest":3,"helpful":5,"score":8,"level":4,"label":"High"}},{"title":"zvec: an in-process vector database, and the ANN search inside it","description":"Alibaba's zvec is an embedded, Apache-2.0 vector database — the 'SQLite for vectors' — built on the Proxima engine, with SDKs for Python, Node, Go, Rust, and Dart. It self-reports 8,475 QPS on VectorDBBench's Cohere 10M set from a 16-vCPU box using HNSW + int8 + a full-precision refiner. 
This is a first-principles walk through the two mechanisms that get you there — greedy graph descent and quantize-then-refine — with the self-reported numbers in full and an honest note on why the comparison isn't apples-to-apples.","date":"2026-07-08","tags":["explainer","vector-search","systems","quantization","information-retrieval"],"draft":false,"cover":"/articles/zvec/fig1.png","featured":false,"interest":4,"helpful":4,"kind":"articles","slug":"zvec","body":"**zvec** is Alibaba's open-source (**Apache 2.0**, Tongyi Lab) **in-process vector database** — it links into your app as a library instead of running as a server. The pitch is \"SQLite for vectors\": no cluster, no network hop, one embedded engine that does approximate nearest-neighbour (ANN) search over embeddings. The core is C++ (built on **Proxima**, Alibaba's older vector-search engine), with SDKs for **Python, Node.js, Go, Rust, and Dart/Flutter** and builds for Linux (x86_64/ARM64), macOS (ARM64), Windows, and Android.\n\nThe headline number is a throughput claim: on VectorDBBench's **Cohere 10M** set (10M vectors, 768-d) a 16-vCPU / 64-GiB instance serves **8,475 QPS**, roughly 2× the next-fastest entry on that board. That is the top bar in the cover figure. Two mechanisms do the work, and neither is unique to zvec — they are the two levers every fast ANN index pulls. This post explains both from first principles, then puts the self-reported numbers up in full.\n\n<Callout type=\"warn\">\nEvery number here is **self-reported** by zvec, measured with [VectorDBBench](https://github.com/zilliztech/VectorDBBench). And the comparison is not apples-to-apples: zvec runs **embedded on local hardware**, so its number has no network in it, while several of the competitors on the same chart (ZillizCloud, Pinecone, Qdrant Cloud) are **managed cloud services** whose QPS includes round-trip latency. The hardware also differs row to row (core counts, node counts, index versions are baked into each label). 
Read it as \"fast for an in-process engine,\" not as a clean head-to-head. No independent third-party run has been published yet.\n</Callout>\n\n## Why nearest-neighbour search is hard\n\nThe naive query is brute force: score the query vector against all $N$ stored vectors, keep the top-$k$. That is $O(N)$ distance computations per query. At $N = 10^7$ and 768 dimensions, each query touches ten million dot products — correct, and far too slow to serve at thousands of QPS.\n\nThe standard fix is a **graph index**. zvec's default is **HNSW** (Hierarchical Navigable Small World): wire every vector to its nearest neighbours, then answer a query by *walking* the graph — start somewhere, repeatedly hop to whichever neighbour is closer to the query, stop at a local minimum. You compute distances only to the nodes you actually visit, and that count grows roughly *logarithmically* with $N$, not linearly. Step through one descent:\n\n<NavGraph />\n\nThat is the first lever: **visit fewer vectors**. The graph decides *which* candidates to score. It trades a small, bounded loss in recall (you can land on an approximate neighbour, not the exact one) for skipping the overwhelming majority of the dataset. `ef-search` is the knob — a larger search frontier means more nodes visited, higher recall, lower QPS. The benchmark run uses `--ef-search 118` with `--m 50` (`m` = neighbours per node in the graph).\n\n## Quantize to make each score cheap\n\nThe graph decides how *many* distances you compute. Quantization decides how *expensive* each one is. A 768-d `fp32` vector is 3072 bytes; scoring it is a 768-wide float dot product. Compress it and both the memory footprint and the per-distance cost drop:\n\n- **int8** (scalar quantization) — 768 bytes/vector, a 4× shrink. Distances become int8 dot products with a tiny quantization error. 
This is what the headline run uses.\n- **int4 / fp16** — zvec also exposes 4-bit and half-precision codes for tighter memory/accuracy tradeoffs.\n- **RaBitQ** (1-bit, added in v0.3.0 via [the SIGMOD 2024 method](https://github.com/gaoj0017/RaBitQ)) — one bit per dimension, 96 bytes for 768-d, a 32× shrink. A distance collapses to a `popcount`. RaBitQ's selling point is a theoretical error bound that, its authors argue, keeps recall high *without* a re-ranking pass.\n\nThe catch: aggressive codes distort distances, so the *ranking* computed from the compressed vectors can be wrong. zvec's answer is the **refiner** (the `--is-using-refiner` flag). Retrieve a broad shortlist using the cheap quantized distances, then **re-score just that shortlist with the original full-precision vectors** and return the exact-scored top-$k$. Coarse pass to go fast, fine pass to stay accurate. Toggle refinement off and watch a true neighbour fall out of the result:\n\n<QuantizeRerank />\n\nThe refiner is why the benchmark can run int8 codes and still report high recall: the int8 pass is only a *filter*, and the returned order is decided by full-precision math on a handful of survivors. RaBitQ is the more aggressive bet — it aims to skip that refine step entirely, which is a stronger claim and the one I'd want independent numbers on before trusting.\n\n## What zvec actually ships\n\nThe two levers above sit inside a fuller engine. 
The index and quantization menu:\n\n| Layer | Options |\n|---|---|\n| Index | HNSW (dense + sparse), IVF, Flat (brute-force), HNSW-RaBitQ, Vamana / DiskANN (on-disk) |\n| Quantization | fp16, int8, int4, RaBitQ (1-bit) |\n| Distance | full-precision refiner pass over any quantized index |\n| Retrieval | dense + sparse vectors, multi-vector queries, full-text search with hybrid fusion |\n\nSystems details that matter for the throughput story:\n\n- **CPU auto-dispatch.** zvec detects `AVX2`, `AVX512`, and `NEON` at runtime (via its `ailego` kernel library) and dispatches the SIMD distance kernels accordingly — so the same binary uses Ice Lake AVX512 on the benchmark box and NEON on ARM. int8 L2 distance is computed in batches.\n- **Persistence.** A write-ahead log (WAL) for crash recovery; **RocksDB** holds metadata and the scalar index; vectors live in auto-scaling mmap'd segment files.\n- **Concurrency.** Many concurrent readers; writes are single-process exclusive — the embedded, single-writer model, same as SQLite.\n- **Filtered search.** Scalar predicates are pushed *into* the HNSW traversal instead of filtering after the fact, so a filtered query doesn't first retrieve then discard.\n\nThe exact config behind the headline run, as published:\n\n```bash\n# VectorDBBench · Cohere 10M (10M × 768-d) · Alibaba Cloud g9i.4xlarge (16 vCPU / 64 GiB)\nzvec-bench \\\n  --index hnsw \\\n  --quantize-type int8 \\\n  --m 50 \\\n  --ef-search 118 \\\n  --is-using-refiner \\\n  --threads 12-20\n# → 8,475 QPS, index build ≈ 1 hour\n```\n\nThe Python surface is the usual embedded-DB shape — a `CollectionSchema` to declare fields and the vector index, `Doc` objects to insert, and a `VectorQuery` to search — so wiring it into a RAG loop is a library import, not a service to stand up.\n\n## The numbers, in full\n\nThe cover chart is the VectorDBBench QPS ranking, with zvec highlighted at the top:\n\n<Figure\n  src=\"/articles/zvec/fig1.png\"\n  alt=\"A horizontal bar chart titled 
'Qps (more is better)'. The top bar, Zvec-16c64g-v0.1, reaches 8,475 and is boxed in red. Below it: ZillizCloud-8cu-perf 3,957; OpenSearch-16c128g-force_merge 1,611; ElasticCloud-8c60g-force_merge 1,520; Pinecone-p2.x8-1node 1,131; then a long tail of managed services from ~505 down to 8.7.\"\n  caption=\"VectorDBBench QPS on Cohere 10M; zvec is the boxed top bar at 8,475. Bars mix embedded and managed-cloud entries on differing hardware — the labels encode each configuration (zvec benchmarks, VectorDBBench).\"\n/>\n\nOnly the top of the field, redrawn so the gap is legible. The label suffixes (`16c64g`, `8cu-perf`, `p2.x8-1node`) are each entry's own hardware and version — this is a leaderboard of different setups, not one controlled sweep:\n\n<BenchBars\n  title=\"VectorDBBench · Cohere 10M · QPS (self-reported; hardware differs per row)\"\n  unit=\"\"\n  bars={[\n    { label: \"zvec 16c64g\", value: 8475, highlight: true },\n    { label: \"ZillizCloud 8cu\", value: 3957 },\n    { label: \"OpenSearch 16c128g*\", value: 1611 },\n    { label: \"ElasticCloud 8c60g*\", value: 1520 },\n    { label: \"Pinecone p2.x8\", value: 1131 },\n    { label: \"Qdrant 16c64g\", value: 446.9 },\n    { label: \"Milvus 16c64g sq8\", value: 437.2 },\n  ]}\n/>\n\nQPS is the throughput axis; VectorDBBench measures it at a matched recall level per entry, and that recall panel isn't shown on this chart, so treat the ranking as \"throughput at comparable accuracy\" rather than raw speed at any accuracy. The `*` rows (OpenSearch, ElasticCloud) use `force_merge`, a build-time optimisation that trades index time for query speed. zvec's own build is the ~1-hour figure above.\n\n## The take\n\nThe genuinely interesting thing about zvec is not the top bar — it's the *form factor*. 
An embedded, Apache-2.0, single-file-ish vector engine with real SDKs across five languages, that runs on-device down to Android, is a useful thing to have for local RAG where standing up Milvus or a managed cloud is overkill. \"SQLite for vectors\" is the right mental model, and the single-writer/many-reader concurrency model matches it exactly.\n\nThe 8,475 QPS is real but oversold by the framing. It is an in-process number — no network — sitting on a chart next to managed services that pay round-trip latency, on hardware that varies row to row. The mechanisms getting it there are the standard two: a graph index that visits a logarithmic slice of the data, and int8 quantization with a full-precision refiner so the cheap codes only *filter* while exact math decides the final order. Both are well-understood; zvec's contribution is a clean, SIMD-dispatched, embeddable implementation of them, plus a newer **RaBitQ** 1-bit path whose \"high recall without re-ranking\" claim is the one I'd hold out for independent verification on. For a team that wants vector search *inside* the application binary, it's worth a real evaluation — just run VectorDBBench yourself, on your hardware, against your recall target, before believing any single bar.\n\n---\n\n*Built on the [zvec release](https://github.com/alibaba/zvec) (Alibaba Tongyi Lab, Apache 2.0) — GitHub README, [zvec.org docs](https://zvec.org/en/docs/db/benchmarks/), and the [v0.3.0 notes](https://github.com/alibaba/zvec/releases/tag/v0.3.0). All QPS numbers are self-reported via [VectorDBBench](https://github.com/zilliztech/VectorDBBench) on Cohere 10M; the two interactive diagrams are illustrations of greedy graph descent and quantize-then-refine, not measured traces. 
The benchmark figure is reproduced from the project's published chart for commentary.*\n","readingTimeMins":8,"url":"https://ai.thesatyajit.com/articles/zvec","signal":{"interest":4,"helpful":4,"score":8,"level":4,"label":"High"}},{"title":"LongCat 2.0: a 1.6T open-weights MoE, and the sparse attention behind it","description":"Meituan's LongCat 2.0 is a 1.6-trillion-parameter Mixture-of-Experts model that activates only ~48B per token, ships under MIT, and was trained end-to-end on AI ASICs over 35T+ tokens. Its headline idea is LongCat Sparse Attention — a hierarchical, cross-layer, streaming-aware indexer that reads only a token-precise slice of a 1M-token context. This is an honest walk through the architecture and the mechanism, with the provider-reported benchmarks in full: a strong open-weights model that beats Gemini 3.1 Pro and GPT-5.5 on some coding tasks while trailing the top closed model on most.","date":"2026-07-05","tags":["llm","mixture-of-experts","attention","inference-optimization","explainer"],"draft":false,"cover":"/articles/longcat-2/fig1.png","featured":true,"interest":4,"helpful":3,"kind":"articles","slug":"longcat-2","body":"**LongCat 2.0** is Meituan's largest open model: a Mixture-of-Experts language model with **1.6 trillion total parameters** that activates only **~48 billion per token**, released with **MIT-licensed weights** on Hugging Face and ModelScope (plus a separate `LongCat-2.0-FP8` artifact for deployment). Two things make it worth a close read. First, the whole training run and serving stack were built on **AI ASIC superpods** rather than GPUs — a claim about frontier-scale training on alternative silicon. Second, its headline architectural bet is **LongCat Sparse Attention (LSA)**, a redesigned sparse-attention indexer aimed squarely at long-context serving.\n\n<Callout type=\"warn\">\nEvery benchmark here is **provider-reported** and, unless marked otherwise, run in-house under Meituan's own harness. 
On the published suite LongCat 2.0 does **not** top a single benchmark against the full frontier set: it is competitive with — and on a few coding/agentic tasks beats — **Gemini 3.1 Pro** and **GPT-5.5**, while the strongest closed model (usually **Claude Opus 4.8**) leads every row. Read it as a strong *open-weights* model, not overall SOTA. The ASIC training story and \"millions of accelerator-days\" are provider details we cannot independently verify.\n</Callout>\n\n## Where the parameters live\n\n1.6T total but ~48B active is a **~3% activation rate** — the sparsity that makes a model this large affordable to run. But LongCat 2.0 has a second, less common parameter store: an **N-gram Embedding** of **135B parameters**, inherited from LongCat-Flash-Lite. Crucially, these are *not* extra experts — they expand parameters along **sparse dimensions orthogonal to the MoE**, adding capacity through a token-n-gram lookup rather than a wider expert pool. Meituan frames it as a scaling principle: once MoE sparsity has \"crossed the sweet spot,\" a bounded slice of N-gram parameters beats simply adding equivalent MoE capacity.\n\n<ScaleBank />\n\nIf the MoE routing here is unfamiliar, we built it up from nothing in [Mixture of Experts, from scratch](/articles/mixture-of-experts-from-scratch) — the router, the top-k gate, and why activating a sparse subset is the whole economic argument for a trillion-parameter model.\n\n## LongCat Sparse Attention\n\nThe expensive part of long context is attention: under full attention every query reads every past token, so a 1M-token context is brutal to serve. The field's fix is to make attention **sparse** — read only a chosen subset of the past. LongCat's starting point is [DeepSeek's Sparse Attention (DSA)](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp) and its \"Lightning Indexer,\" whose weaknesses Meituan names directly: **output discontinuity** and a **quadratic scoring bottleneck**. 
LSA answers with three *orthogonal* improvements.\n\nThe core one is **Hierarchical Indexing (HI)**, and it is best understood against the sparse-attention design we covered earlier — [MiniMax Sparse Attention (MSA)](/articles/minimax-sparse-attention), which scores the past in **128-token blocks** and keeps the top-k *whole blocks*. LSA treats that block selection as only a **coarse recall**: a cheap block-level pass proposes candidate blocks, then a **fine pass selects the most relevant individual tokens** inside them. Same idea of scoring first and reading less — but token-precise, over a smaller candidate set the indexer has to score. Flip the mode below to watch the budget move from whole blocks to individual tokens:\n\n<LsaIndex />\n\nThe second improvement, **Cross-Layer Indexing (CLI)**, cuts cost on a different axis. Deciding what to read costs an *index pass* per layer; but attention saliency is empirically stable across adjacent layers, so LongCat **shares one index every 2 layers** — half the layers reuse a neighbour's selection instead of recomputing it, taught by cross-layer distillation during training. The same trick collapses the model's 3-step [Multi-Token-Prediction](/articles/multi-token-prediction) draft into a single shared pass for speculative decoding:\n\n<CrossLayer />\n\nThe third, **Streaming-aware Indexing (SI)**, is a memory-systems move: it reshapes the token-selection budget to combine hardware-aligned **contiguous** access with dynamic random selection, turning fragmented reads into predictable sequential ones for coalesced HBM bandwidth. Together the three attack the same target from different sides — HI shrinks *what* the indexer scores, CLI shrinks *how often* it runs, SI makes the reads it does issue cheaper.\n\nOne honest deployment note: the public SGLang integration **drops the hierarchical stage for simplicity** and serves LongCat 2.0 on 16× H20 with tensor + expert parallelism. 
So the shipping inference path today is the CLI/SI part of LSA, not the full HI pipeline.\n\n## Training, and what \"1M context\" means\n\nPretraining spans **35T+ tokens** with, Meituan reports, **no rollbacks or irrecoverable loss spikes** — a stability claim about frontier-scale training on ASICs. Long-context ability comes from training on **hundreds of billions of tokens of 1M-context data**. That \"1M\" is a **training-data** figure: the sources describe the data the model saw, and the README does not publish a separate usable-context window or a long-context retrieval eval (e.g. RULER/HELMET) to pin down how far that quality actually holds at inference. Worth keeping the two apart.\n\n## The numbers, in full\n\nLongCat 2.0 is evaluated against Gemini 3.1 Pro, GPT-5.5, and three Claude Opus checkpoints (4.6 / 4.7 / 4.8). The official chart:\n\n<Figure\n  src=\"/articles/longcat-2/fig1.png\"\n  alt=\"Six grouped bar panels — Terminal-Bench 2.1, SWE-bench Pro, SWE-bench Multilingual, FORTE, RWSearch, BrowseComp — each comparing LongCat-2.0 (green) against Gemini 3.1 Pro, GPT-5.5, and Claude Opus 4.6, 4.7, and 4.8. Bars are close in height across models; LongCat is near the leaders on coding and search panels but not the tallest bar in any panel.\"\n  caption=\"LongCat 2.0 vs frontier closed models across code-agent and search benchmarks; bars are unlabeled in the source. 
LongCat is competitive throughout but tops no panel outright (LongCat 2.0 model card, benchmark chart).\"\n/>\n\nOn **SWE-bench Pro** LongCat's 59.5 edges past GPT-5.5 (58.6), Gemini 3.1 Pro (54.2) and Opus 4.6 (57.3) — but the newer Opus checkpoints pull ahead (4.7 = 64.3, 4.8 = 69.2):\n\n<BenchBars\n  title=\"SWE-bench Pro (%) — provider-reported\"\n  unit=\"\"\n  bars={[\n    { label: \"LongCat 2.0\", value: 59.5, highlight: true },\n    { label: \"GPT-5.5\", value: 58.6 },\n    { label: \"Gemini 3.1\", value: 54.2 },\n    { label: \"Opus 4.8\", value: 69.2 },\n  ]}\n/>\n\n**IFEval** shows the opposite shape: LongCat (90.0) trails Gemini 3.1 Pro (96.1) and GPT-5.5 (95.0), but the newest Claude checkpoints regressed here, so LongCat sits *above* Opus 4.8 (86.0):\n\n<BenchBars\n  title=\"IFEval (%) — provider-reported\"\n  unit=\"\"\n  bars={[\n    { label: \"LongCat 2.0\", value: 90.0, highlight: true },\n    { label: \"Gemini 3.1\", value: 96.1 },\n    { label: \"GPT-5.5\", value: 95.0 },\n    { label: \"Opus 4.8\", value: 86.0 },\n  ]}\n/>\n\nThe rest of the suite tells the same \"competitive, not leading\" story. LongCat edges Gemini 3.1 Pro on **Terminal-Bench 2.1** (70.8 vs 70.7) and **FORTE** (73.2 vs 70.3, matching Opus 4.6), leads it on **RWSearch** (78.8 vs 76.3), and beats GPT-5.5 on **IMO-AnswerBench** (81.8 vs 79.5). Yet on each of those a closed model still tops the row (GPT-5.5 = 77.8 on FORTE; 85.3 on RWSearch; Opus 4.8 = 78.9 on Terminal-Bench; Gemini = 90.0 on IMO-AnswerBench, 96.1 on IFEval, 94.3 on GPQA-diamond, where LongCat is 88.9). On **BrowseComp** it clearly trails (79.9 vs Gemini's 85.9). No single number here is a headline win over the field.\n\n## The take\n\nLongCat 2.0's real contribution isn't a benchmark crown — it's the *combination*: a genuinely large (1.6T) MoE, trained end-to-end on non-NVIDIA silicon, shipped as open weights under the **MIT** license, with a sparse-attention design that is a real step past the DSA/MSA lineage. 
LSA's three moves are cleanly separated — **HI** goes finer than MSA's whole-block selection (token-precise, over a coarsely-recalled candidate set), **CLI** amortizes the indexer across layers, **SI** makes the memory access regular — and each targets a distinct cost, which is the kind of engineering that turns theoretical sparsity into wall-clock savings. The honest caveats are the usual open-weights ones: the benchmarks are self-run and self-selected; the model is competitive with Gemini 3.1 Pro and GPT-5.5 on coding/agentic tasks but trails Claude Opus 4.8 on most; the striking \"1M context\" and \"millions of accelerator-days\" are provider claims, not independently measured; and the part of LSA that actually ships today (in SGLang) is CLI/SI, with the hierarchical stage dropped for simplicity. For a team that wants frontier-scale, open, and long-context — and can run 16× H20 — it's a serious option. As a claim that you don't need NVIDIA to train at this scale, it's the more interesting story.\n\n---\n\n*Built on the [LongCat 2.0 release](https://github.com/meituan-longcat/LongCat-2.0) (Meituan, 2026) — GitHub README and [Hugging Face model card](https://huggingface.co/meituan-longcat/LongCat-2.0), MIT license. All benchmark numbers are provider-reported (in-house harness unless marked `*` = official model report); the interactive diagrams are illustrations of the mechanism, not measured traces. The benchmark figure is reproduced from the model card for commentary.*\n","readingTimeMins":7,"url":"https://ai.thesatyajit.com/articles/longcat-2","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"Cosmos 3: a world model that reasons and generates in one sequence","description":"Physical AI needs a model that can both plan and imagine the consequences. 
NVIDIA's Cosmos 3 does it with a Mixture-of-Transformers: an autoregressive Reasoner tower and a diffusion Generator tower sharing one token sequence, joined by dual-stream joint attention so the generator reads the reasoner's keys directly. It plans in pixel space via Action-CoT, and one backbone spawns six task models. Honestly framed: the SOTA claims are scoped to open models — Gemini 3.1 Pro still leads — and some numbers lean on best-of-N and provider-selected harnesses.","date":"2026-07-04","updated":"2026-07-20","tags":["world-models","diffusion","mixture-of-experts","physical-ai","explainer"],"draft":false,"featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"cosmos-world-model","body":"A language model can tell you what will probably happen if you tip a full mug. A **world model** has to\n*show* you — render the next frames, consistent with gravity, contact, and the fact that the coffee ends\nup on the table. That is the bet behind **physical AI**: robots and agents need a model that can both\n**reason** about what to do and **imagine** the visual consequences of doing it. NVIDIA's **Cosmos 3**\n(arXiv 2606.02800) is a world foundation model built around exactly that pairing — and the interesting\npart is *how* it fuses the two, not that it does.\n\nThe usual way to bolt reasoning onto a generator is to run an LLM, get some text, and feed it to a\ndiffusion model as a prompt. Cosmos instead puts both in **one network and one token sequence**, so the\ngenerator can read the reasoner's internal state as it works. The mechanism is a **Mixture-of-Transformers\n(MoT)**.\n\n## Two towers, one sequence\n\nAn MoT keeps **separate weights per modality** but runs **one shared attention** over a single sequence.\nCosmos uses two towers: an **autoregressive Reasoner** that does the planning, and a **diffusion\nGenerator** that produces video and world frames. Each token in the sequence is routed to its tower's\nweights, but they all attend together. 
Step through it:\n\n<DualStream />\n\nThe load-bearing detail is the attention pattern. The Reasoner is autoregressive, so its queries attend\nonly **causally over its own keys** (`K_AR`) — standard next-token reasoning. The Generator is a diffusion\nmodel, and its queries `Q_DM` attend over the **concatenation `[K_AR ; K_DM]`** — the reasoner's keys *and*\nits own. That one cross-stream read is the whole idea: the generator can look directly at what the reasoner\ndecided, so a plan flows into pixels inside a single attention operation rather than through a text\nbottleneck. This is the paper's dual-stream joint-attention design:\n\n<Figure\n  src=\"/articles/cosmos-world-model/fig1.png\"\n  alt=\"Architecture diagram of Cosmos 3. On the left, a Reasoner processes an AR subsequence of vision (ViT) and language tokens through layer norm, a shared multimodal attention block, and an MLP, using causal self-attention Attn(Q_AR, K_AR, V_AR). On the right, a Generator processes a diffusion subsequence of noisy vision (VAE), audio, and action tokens through the same shared attention block but with full attention Attn(Q_DM, [K_AR;K_DM], [V_AR;V_DM]). On the far right, an attention-mask grid shows AR queries attending causally (triangular) only to K_AR, while DM queries attend fully to both K_AR and K_DM.\"\n  caption=\"The Mixture-of-Transformers: an AR Reasoner and a diffusion Generator share one sequence and one attention operation. AR queries attend causally over K_AR; the diffusion queries attend over the full concatenation [K_AR ; K_DM], so reasoning is fused into generation (paper, Figure 5).\"\n/>\n\nIf you have read our [TwoTower explainer](/articles/nemotron-twotower), the silhouette will look familiar —\na frozen autoregressive tower feeding a diffusion denoiser — but the systems solve different problems.\nTwoTower is a *language* model that splits one job (context vs. 
denoising) to speed up text decode; Cosmos\nis a *physical-AI world model* whose two towers span different modalities and are trained jointly, with the\ngenerator reading the reasoner online rather than cross-attending to a frozen copy. Same family tree,\ndifferent animal.\n\nTwo contrasts pin down what MoT *is*. A **dense** transformer would force one set of weights to be good at\nboth causal reasoning and bidirectional denoising — the conflict TwoTower also fights. A [Mixture-of-Experts\nmodel](/articles/mixture-of-experts-from-scratch) routes *tokens* to expert FFNs but shares one attention\nand one modality regime. MoT is the third option: **separate weights per modality/role (the two towers),\none shared attention** — so the split is by *what kind of token this is*, and the fusion happens in\nattention. The diffusion half itself is a standard denoiser; if that machinery is unfamiliar, our\n[diffusion](/articles/set-diffusion) and [diffusion-language-model](/articles/illada-diffusion-language-model)\npieces cover it.\n\n## How the world becomes tokens\n\nBefore any of that attention can happen, five modalities have to become tokens in one sequence — and\nCosmos encodes each differently. Understanding tokens land in the Reasoner subsequence; generation\ntokens land in the Generator subsequence. Trace a modality through its encoder:\n\n<TokenStack />\n\nThe asymmetry is deliberate. The **understanding** path uses a ViT encoder trained *jointly* with the\nbackbone, so the reasoner sees vision the way its Qwen3-VL ancestor did. The **generation** path uses\n*frozen* VAEs — a Wan2.2 video VAE (4× temporal, 32×32 spatial compression) and an audio VAE — so the\ndiffusion tower only has to produce latents a fixed decoder already knows how to render. 
And **actions**\nfrom every embodiment collapse into one shared latent action space, which is what lets a single model be\na policy for arms, humanoids, and vehicles alike.\n\n## Planning in pixel space: Action-CoT\n\nReasoning about *action* is not the same as reasoning in words. To act in the world you need a plan\nexpressed in terms of *where things move*. Cosmos's **Action-CoT** turns an instruction into a **2D motion\nplan on the image plane** — a chain-of-thought drawn as a trajectory — before and while it generates. Pick\nan instruction and scrub the plan into existence:\n\n<ActionCoT />\n\nConcretely: the model predicts a path of waypoints across the frame (the gripper's route to the mug, the\nblock's slide, the drawer's pull), and that trajectory *conditions the diffusion tower*. The frames it\ndenoises then have to realize the motion, not merely look plausible — the chain-of-thought lives in the\nsame coordinate space the physics does. It is a neat answer to a real problem: language is a lossy way to\nspecify a manipulation, and image-plane motion is exactly what a downstream controller can consume.\n\n## One backbone, six models\n\nBecause reasoning and generation share a backbone and everything is tokens, the *same* weights become six\ndifferent task models just by choosing which modalities go in and which come out. Route it:\n\n<OneBackbone />\n\n<Figure\n  src=\"/articles/cosmos-world-model/fig2.png\"\n  alt=\"Overview diagram titled Cosmos 3, an Omnimodal World Model, with modality icons for Language, Image, Video, Audio, and Action. 
Below, six task models each show inputs flowing into a Cosmos 3 box and outputs coming out: Vision-Language Model, Image Generation Model, Audio-Visual Generation Model, Policy/World-Action Model, Forward Dynamics Model, and Inverse Dynamics Model.\"\n  caption=\"One backbone spawns six task models — differing only in which modalities enter and exit, from vision-language to forward and inverse dynamics (paper, Figure 1).\"\n/>\n\nThe two dynamics models are the ones that matter for physical AI. A **forward dynamics** model predicts the\nnext video given the past frames and an action — that is the world model as a *simulator* an agent can plan\nagainst. An **inverse dynamics** model recovers the action that connects two frames — useful for learning\ncontrol from unlabeled video. Both are the same network with the arrows reversed.\n\nCosmos comes in three sizes, all built on the dual-tower MoT: **Edge — 4B total on a dense 2B transformer\ntrained from scratch** (28 layers, a later release), **Nano — 16B total on a dense 8B**, initialized from\n**Qwen3-VL-8B** (36 layers), and **Super — 64B total on a dense 32B**, from **Qwen3-VL-32B** (64 layers).\nThe Qwen initialization is telling — the reasoner tower inherits a strong pretrained VLM, while the generator\nis a **flow-matching** diffusion tower (it predicts a constant velocity, `v* = ε − x₀`) grafted on and trained\nto read it. Everything is openly released under **OpenMDW-1.1**: the Nano and Super checkpoints, the code,\nfive synthetic **SDG** datasets (physics, robots, driving, digital humans, warehouses), and the **Cosmos-HUE**\nevaluation benchmark.\n\n## The training pipeline\n\nTwo towers means two training tracks, joined where it counts. 
Click through the stages:\n\n<TrainingStages />\n\nThe Reasoner is extended from a VLM and fine-tuned on reasoning and Action-CoT data; the Generator is\npre-trained as a flow-matching denoiser, then **mid-trained jointly** with the reasoner — the stage that\nactually wires up the dual-stream attention — before splitting into task-specific post-training for\ntext-to-image, image-to-video, and robot policy. Several headline results lean on **best-of-N sampling\nagainst a learned reward model (WMReward)**, which is worth holding in mind when reading the numbers.\n\n## The numbers, honestly\n\nBy the report's own tally, the post-trained models were the **best open-source Text-to-Image and\nImage-to-Video models on Artificial Analysis**, and the **best policy model on RoboArena**, at the time of\nwriting — and Cosmos leads a physical-AI reasoning leaderboard **among open models**. Those are real, but\nthey're *open-model* rankings, and the scope is easy to lose in a press release. On the reasoning benchmark\nwhere Cosmos Super posts its headline result, a closed frontier model still sits above it:\n\n<BenchBars\n  title=\"Physical-AI reasoning benchmark (%) — scoped comparison\"\n  unit=\"%\"\n  bars={[\n    { label: \"Gemini 3.1 Pro\", value: 77.5 },\n    { label: \"Cosmos Super 64B (best open)\", value: 73.7, highlight: true },\n  ]}\n/>\n\n<Callout type=\"warn\">\nRead the SOTA claims narrowly. **Best *open* model is not best model** — Gemini 3.1 Pro (77.5) beats Cosmos\nSuper (73.7) on the reasoning benchmark, and Veo-3.1 leads on audio generation. Any \"#1\" leaderboard\nposition is a **dated snapshot** that moves as models ship. The text-to-image comparison uses a\n**provider-selected harness** — a setup its authors chose, so treat the framing as favorable. 
And several\nreported results use **best-of-N sampling with Cosmos's own reward model**, not single-shot generation;\nthat is a legitimate technique but not the same as raw one-shot quality.\n</Callout>\n\nNone of that makes the work less interesting — it makes the *claim* precise. As an **open**, openly-licensed\nomnimodal world model that fuses reasoning and generation in one attention operation and plans in image\nspace, Cosmos 3 is a genuinely new capability tier for people building on open weights. It just isn't the\nbest model in the world at everything, and the paper's own numbers say so if you read the parentheses.\n\n## The take\n\nThe idea worth keeping is the **dual-stream joint attention**. Most \"reasoning + generation\" systems chain\ntwo models and pay a text bottleneck between them; Cosmos makes the generator's queries attend over the\nreasoner's keys inside one operation, so the plan reaches the pixels without being flattened into a prompt.\nPair that with **Action-CoT** — chain-of-thought as motion on the image plane — and you get a world model\nwhose reasoning is expressed in the same space its physics has to hold. The Mixture-of-Transformers is the\nenabling structure: separate weights per modality, one shared sequence, fusion in attention. Whether the\nscoped-SOTA numbers hold as the leaderboards churn is beside the point; the architecture is the\ncontribution, and it is a clean one.\n\n---\n\n*Built on **NVIDIA Cosmos 3** (arXiv 2606.02800; OpenMDW-1.1 license). The Reasoner/Generator MoT, dual-stream\njoint attention, and Action-CoT are described in the paper (Figures 5 and 1); the interactive diagrams are\nillustrations of the mechanism. 
Benchmark figures are quoted as scoped comparisons — best among open models,\nwith a closed frontier model (Gemini 3.1 Pro) still ahead — and some results use best-of-N with the\nmodel's own reward model.*\n","readingTimeMins":9,"url":"https://ai.thesatyajit.com/articles/cosmos-world-model","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"MiniMax Sparse Attention: let each query pick its own blocks","description":"Full attention makes every query read the whole context — the cost that caps long-context serving. MiniMax's MSA keeps it exact but sparse: a lightweight index branch scores past KV in blocks, keeps the top-16 (2,048 tokens) per query, and attends only those — per GQA group, so different heads see different blocks. At 1M context that's a 28.4× FLOP cut and a measured 14.2× prefill / 7.6× decode speedup, at on-par quality. A walk through the mechanism, the fixed-budget economics, and the honest caveats.","date":"2026-07-04","tags":["llm","attention","inference-optimization","mixture-of-experts","explainer"],"draft":false,"featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"minimax-sparse-attention","body":"The thing that makes long context expensive is that, under full attention, **every query token reads\nevery past token**. The [KV cache](/articles/how-llm-inference-works) that stores those keys and values\ngrows linearly with sequence length, and the attention itself scales with it — so a 1M-token context is\nbrutal to serve. The field's answer has been to make attention *cheaper*: linear-attention hybrids\n[like HydraHead](/articles/hydrahead), sliding windows [like MiMo-V2-Flash](/articles/mimo-v2-flash), or\nKV compression [like TurboQuant](/articles/turboquant-kv-cache). **MiniMax Sparse Attention (MSA)** takes\na different route: keep attention **exact**, but let each query attend to only a **small, learned subset**\nof the context.\n\nThe trick is to think in **blocks**. 
Partition the past KV into fixed blocks of 128 tokens; for each\nquery, a cheap *index branch* scores every block, keeps only the highest-scoring few, and the main\nattention runs — exactly — over just those. Scrub the query and watch which blocks it actually reads:\n\n<BlockSelect />\n\nTwo design choices make this work. First, the selection happens **per GQA group**: MiniMax uses\ngrouped-query attention with 64 query heads tied to 4 KV heads, so there are 4 groups of 16 query heads,\nand *each group picks its own blocks* off the same keys and values — flip the group above and the arrows\nswing. Second, the index\nis **trained, not heuristic**: an auxiliary KL loss aligns the index branch's block scores with the main\nattention's true distribution over blocks, with a stop-gradient so it only trains the tiny index\nprojections and never disturbs the backbone.\n\n## The mechanism, precisely\n\nEach attention layer splits into two branches:\n\n- **Index branch (selection).** It adds one lightweight index-query head per group and a single shared\n  index-key head. For query `i` it computes `Q_idx · K_idxᵀ`, pools those token scores to the **block**\n  level by max-pooling, and takes the **top-k blocks** — deployed as **block size 128, k = 16**, so a\n  fixed budget of **2,048 tokens** per query. The block the query sits in (the *local* block) is always\n  kept.\n- **Main branch (compute).** Given the selected block indices, it runs **exact** attention over only\n  those blocks. Because every head in a group reuses the same block set, KV reads stay block-contiguous —\n  which is what lets a custom kernel turn the FLOP savings into real wall-clock speedup.\n\n<Figure\n  src=\"/articles/minimax-sparse-attention/fig1.png\"\n  alt=\"MSA architecture diagram. Left, an Index Branch takes hidden states through linear projection, norm and RoPE to index query and key heads, computes a score matrix, applies block max pooling, and emits a Top-K block index. 
Right, a Main Branch selects those Top-K KV blocks and runs exact sparse attention over Q, K, V. Far right, two attention-mask grids for Query Group 1 and Query Group 2 show different selected key blocks per group.\"\n  caption=\"A lightweight Index Branch scores KV blocks (Q·Kᵀ → block max pool → top-k); the Main Branch runs exact attention over only the selected blocks. Each GQA group selects its own blocks, so Group 1 and Group 2 attend to different long-range keys (paper, Figure 1).\"\n/>\n\n## Why the fixed budget is the whole point\n\nBecause every query attends to a fixed budget of **2,048 tokens** no matter how long the context is, the\n*fraction* of the context each query reads collapses as context grows (roughly 0.2% at 1M tokens). The\nmeasured speedup is much smaller than that fraction would suggest, though, because the index branch still\nscans every block. Slide the\ncontext length and watch both numbers:\n\n<SparsityAtScale />\n\n<Figure\n  src=\"/articles/minimax-sparse-attention/fig2.png\"\n  alt=\"Three line charts comparing GQA (blue) and MSA (green) as sequence length grows from 32k to 1M. Left: per-token attention FLOPs — GQA rises steeply while MSA stays nearly flat, annotated 28.4x reduction at 1M. Middle: prefilling latency, annotated 14.2x speedup. Right: decoding latency per token, annotated 7.6x speedup.\"\n  caption=\"Efficiency vs GQA at matched head config (64 query / 4 KV heads; MSA block 128, k=16, 2,048-token budget). At 1M tokens on H800: 28.4× per-token attention-FLOP reduction, 14.2× prefill and 7.6× decode wall-clock speedup (paper, Figure 4).\"\n/>\n\n## The numbers\n\nMSA is trained two ways — from scratch (**MSA-PT**) and by converting a full-attention checkpoint\n(**MSA-CPT**) — and compared against MiniMax's *own* GQA full-attention model at a matched 3T-token\nbudget. The headline is parity, not a free lunch: MSA holds or slightly beats full attention on most of a\n28-benchmark suite, with real regressions on a few. 
The efficiency, meanwhile, is decisive at long\ncontext.\n\nOn quality, against MiniMax's own full-attention model (values in parentheses), MSA-PT holds or edges\nahead: RULER-8K 84.2 (79.8), GSM8K 77.7 (76.2), MMLU 67.2 (67.0), HumanEval 64.0 (61.0), VisualWebBench\n68.4 (55.6):\n\n<BenchBars\n  title=\"MSA-PT vs full attention (%)\"\n  unit=\"%\"\n  bars={[\n    { label: \"RULER-8K\", value: 84.2, highlight: true },\n    { label: \"GSM8K\", value: 77.7, highlight: true },\n    { label: \"MMLU\", value: 67.2, highlight: true },\n    { label: \"HumanEval\", value: 64.0, highlight: true },\n    { label: \"VisualWebBench\", value: 68.4, highlight: true },\n  ]}\n/>\n\nThe efficiency wins that motivate all of this, at 1M context on H800:\n\n<BenchBars\n  title=\"Speedup vs GQA full attention at 1M context (×)\"\n  unit=\"×\"\n  bars={[\n    { label: \"Attn FLOPs (theory)\", value: 28.4, highlight: true },\n    { label: \"Prefill\", value: 14.2, highlight: true },\n    { label: \"Decode\", value: 7.6, highlight: true },\n  ]}\n/>\n\nA few honesty notes. The baseline is **provider-selected and internal** — MSA vs MiniMax's own GQA model,\nnot against other sparse-attention methods (NSA, MoBA) or external frontier models. Quality is \"on par,\"\nand the converted **MSA-CPT** variant does trail full attention on some tasks (GSM8K 73.7 vs 76.2,\nHumanEval 57.9 vs 61.0, HELMET-128K −0.60) — the fixed budget shows up as small losses on\nretrieval-heavy long-context tasks. And the striking efficiency figures are the **1M** extreme with a\nfixed 2,048-token budget on a specific head config; at 32k the advantage is barely 1.6×. The **28.4×** is\na theoretical FLOP count read off a chart, not measured throughput — the honest wall-clock numbers are\n14.2× and 7.6×.\n\n## The take\n\nMSA's contribution is that it makes *selection* a first-class, trainable part of attention rather than a\nbolt-on. 
The block granularity is the quiet key: picking whole 128-token blocks (not individual tokens)\nkeeps memory access regular enough that a kernel can actually realize the savings, and sharing the\nselection across a GQA group keeps it cheap. Set against the other long-context playbooks — linear\nattention trades exactness for O(1) state; sliding windows drop the far past; KV quantization shrinks\neach entry — MSA keeps attention **exact and full-range** and simply reads *less of it*, chosen per query\nand per group. Whether a fixed 2,048-token budget holds up as tasks demand genuinely global reasoning is\nthe open question the retrieval regressions hint at; but as a way to serve a 1M context at a fraction of\nthe cost while staying on the full-attention quality curve, it's a clean, well-engineered bet. The\nproduction model, MiniMax-M3, ships with it (open weights, minimax-community license).\n\n---\n\n*Built on [MiniMax Sparse Attention](https://arxiv.org/abs/2606.13392) (Lai, Xu, Yang et al.; MiniMax,\n2026) and the [MiniMax-M3 release](https://huggingface.co/MiniMaxAI/MiniMax-M3). Benchmark and efficiency\nfigures are quoted from the paper (the 109B-total / 6B-active experimental model; block size 128, top-16);\nthe interactive diagrams are illustrations of the mechanism. Speedups are measured on H800 at 1M context.*\n","readingTimeMins":6,"url":"https://ai.thesatyajit.com/articles/minimax-sparse-attention","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"MrFlow: climb the resolution in pixel space, not diffusion steps","description":"Flow and diffusion image models spend most of their compute running every denoising step at full resolution. MrFlow is a training-free reshuffle of that budget: run most steps at low resolution, do the resolution climb with a cheap pixel-space super-resolution network, re-inject a small closed-form amount of noise, and finish with a single high-resolution refine step. 
It buys real speedups (roughly 4–9× training-free on the good configs) at near-native quality — but the headline numbers are best-case, and how much quality you keep depends heavily on the config and the base model.","date":"2026-07-04","tags":["diffusion","image-generation","inference-optimization","flow-matching","explainer"],"draft":false,"featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"mrflow-diffusion-acceleration","body":"A modern text-to-image model — a flow-matching or [diffusion](/articles/set-diffusion) network like\nFLUX or Qwen-Image — makes a picture by running a stack of denoising steps. The expensive fact about\nthat stack is where the compute lands: **every step runs at the output resolution**. A 20-step sample\nat 1024² pays the full-resolution price twenty times over, and most of those steps are spent on detail\nthe early ones can't even see yet. That's the cost MrFlow goes after — and it does so **without any\nfine-tuning**, purely by rearranging *when* the model works at full size.\n\nThe idea is a budget reshuffle. Structure is decided early and is cheap to compute at small sizes;\nsharp high-frequency detail is what full resolution actually buys you. So MrFlow spends the pricey\ndiffusion steps at **low resolution**, does the climb to full resolution in **pixel space** with an\noff-the-shelf super-resolution network, and then spends just **one** diffusion step at high resolution\nto clean up. Walk the four stages:\n\n<StageStepper />\n\n## The pipeline, precisely\n\nMrFlow is four stages. Only the first and last run the diffusion model, and only the last runs it at full\nresolution:\n\n1. **Low-resolution generate.** Run the flow/diffusion model for ~12 of its 20 steps at a low\n   resolution (e.g. 512²). Steps are cheap at this size, and this is where composition and layout are\n   fixed. Decode to a low-resolution image.\n2. 
**Pixel-space super-resolution.** Upsample that image to full resolution with a **pixel-space SR\n   network** — Real-ESRGAN — in a single forward pass. This is the resolution jump, and crucially it is\n   *not* more latent diffusion steps.\n3. **Low-strength noise re-injection.** Encode the upscaled image back to latent and add a small amount\n   of noise, `σ_t ∈ [0.1, 0.15]`, computed **closed-form** from the flow schedule. No extra network,\n   no training — just enough noise to give the model something to denoise.\n4. **One HR refine step.** A single high-resolution diffusion step blends the upscaled detail back into\n   the model's own distribution and cleans up the SR network's artifacts.\n\n<Figure\n  src=\"/articles/mrflow-diffusion-acceleration/fig1.png\"\n  alt=\"Pipeline comparison. Top left, 'Native' runs 50 diffusion steps at high resolution then a VAE decoder to the final image. Top right, 'Latent Upsampling' runs 20 steps, upsamples the latent, adds noise, runs 10 more steps, then decodes. Bottom, 'MrFlow' runs 12 low-resolution steps and decodes to a low-resolution image, applies an SR GAN to reach high resolution, VAE-encodes, adds a small closed-form noise σ_t·ε, runs a single refine step, and decodes to the final high-resolution image. A legend marks latent space in green and pixel space in orange.\"\n  caption=\"Native runs every step at full resolution; latent-upsampling schemes climb by adding more latent diffusion steps. MrFlow instead climbs in pixel space with an SR network, then spends a single high-resolution refine step — the resolution jump costs one forward pass, not a second block of diffusion (paper, Figure 2).\"\n/>\n\nThe contrast with the usual \"latent upsampling\" trick (middle row of the figure) is the whole point.\nThose schemes also start small, but they climb by running **more latent diffusion steps** at the higher\nresolution — you pay full-resolution diffusion twice. 
MrFlow does the climb with a cheap pixel-space\nnetwork and buys back fidelity with a *single* diffusion step.\n\n## Where the budget actually goes\n\nThe reason this is faster isn't subtle: high-resolution diffusion steps are the expensive line item,\nand MrFlow runs almost none of them. If you count an HR step as roughly 4× an LR step (2× the linear\nresolution is 4× the pixels), a native 20-step HR sample costs 20 × 4 = 80 LR-step units while a MrFlow\n`(12, 1)` config costs 12 + 1 × 4 = 16, about 5× less diffusion compute before the SR pass and the extra\nVAE round trip are counted. Drag the config and watch the budget — and the honest quality tradeoff — move:\n\n<ConfigSplit />\n\nThat schematic is where the **config dependence** lives, and it's the first honest caveat. The speedup\nis the reliable part — fewer, cheaper steps is unambiguously less compute. Quality is the part that\nswings: strip the refine steps and Real-ESRGAN's artifacts survive to the final image; starve the\nlow-resolution stage and the composition never locks in.\n\n<Callout type=\"warn\">\nThe advertised **\"within 1% of native quality\" is a best-case figure**, and it lives at the conservative\nend of that slider. Push the config harder and the gap widens fast — FLUX at the aggressive `(12, 1)`\nsetting drops roughly **18% on OneIG**. Degradation is real, and it is both **config- and\nmodel-dependent**.\n</Callout>\n\n## Why pixel-space SR instead of more diffusion\n\nThe subtle design question is: once you have a low-resolution image, how do you get to full resolution?\nThe latent-upsampling answer is \"more diffusion.\" MrFlow's answer is \"a super-resolution network, then\none diffusion step to fix it.\" The paper compares SR backbones — bilinear interpolation, SwinIR,\nOSEDiff, Real-ESRGAN — with and without the final HR refine step:\n\n<Figure\n  src=\"/articles/mrflow-diffusion-acceleration/fig3.png\"\n  alt=\"Staged super-resolution comparison. Far left, a low-resolution input crop of a shop scene with a chalkboard reading '9am'. 
To the right, a grid: columns are Interpolate, SwinIR, OSEDiff, and Real-ESRGAN; the top row is labelled 'SR' (super-resolution only) and the bottom row 'High Resolution Refine' (SR followed by one diffusion refine step). Zoomed insets of the '9am' text show interpolation stays blurry, and the Real-ESRGAN column with the refine step recovers the sharpest, cleanest lettering.\"\n  caption=\"The SR stage on its own (top row) can leave text and edges blurry or artifact-ridden; a single high-resolution refine step (bottom row) cleans them up. Real-ESRGAN plus the refine step recovers the crispest detail — the refine step exists precisely to fix the SR network's mistakes (paper, Figure 3).\"\n/>\n\nTwo things read off that grid. First, the SR network alone is **not enough** — plain interpolation\nstays soft, and even a strong SR net can hallucinate wrong detail (look at the \"9am\" text). Second, the\nsingle refine step is doing real work: the bottom row is visibly cleaner than the top. Real-ESRGAN can\nintroduce its own artifacts, and the HR refine step is there **precisely to fix them** — the two stages\nare a pair, not alternatives.\n\n## The speed–quality frontier\n\nThe payoff is best seen as a Pareto plot: quality against speedup, versus the obvious baseline of just\nrunning the model with fewer native steps. Toggle the model and walk the refine-step configs:\n\n<SpeedQuality />\n\n<Figure\n  src=\"/articles/mrflow-diffusion-acceleration/fig2.png\"\n  alt=\"Two scatter plots of GenEval score (y-axis) versus speedup ratio (x-axis), for FLUX.1-dev on the left and Qwen-Image on the right. A star marks native quality at speedup 1. A dark 'Native Steps' curve falls steeply as speedup increases. Several acceleration baselines — ToMA, TeaCache, DB-Taylor, RALU, SPEED — sit below and to the left. 
MrFlow's three configs (+1, +2, +3 refine steps) form a red frontier, highlighted with an ellipse, that stays high in quality much further to the right than any baseline.\"\n  caption=\"Quality (GenEval) versus speedup for FLUX.1-dev and Qwen-Image. Just cutting native steps (dark curve) sheds quality fast; other accelerators sit below the frontier. MrFlow's +1/+2/+3 configs (red) hold near-native quality much further to the right — but note the frontier still bends down as speed climbs (paper, Figure 5).\"\n/>\n\nThe shape is the honest summary. On the good configs, MrFlow's frontier dominates both \"fewer native\nsteps\" and prior accelerators like TeaCache and RALU — you get several× speedup while staying close to\nthe native star. But the frontier still slopes downward: more speed costs some quality, and the `+1`\nconfig (fastest) sits measurably below `+3`. Qwen-Image holds its quality far better than FLUX across\nthe same speedups, which is exactly the point about **model dependence** — some base models tolerate the\nreshuffle much better than others.\n\n## The honest headline\n\nThe clean numbers to keep are the training-free ones: roughly **4–9× faster** on the frontier configs at\n**near-native** quality, no fine-tuning required. The bigger figures you may see quoted come with strings:\n\n<Callout type=\"warn\">\nThe **10×/25× speedups are config-dependent**, and the top end is not training-free. The **25× figure is\nreached only with distillation stacked on top** of MrFlow's staged sampling — not by the training-free\npipeline alone. Quote the ~4–9× training-free range if you want the number MrFlow earns on its own.\n</Callout>\n\nA few smaller honesty notes. \"Within 1%\" is a best case measured on forgiving configs and models; the\nsame pipeline pushed to `(12, 1)` on FLUX loses ~18% on OneIG. 
The compute-unit accounting in the\ndiagrams above is a schematic (an HR step is *roughly* 4× an LR step) — the paper's speedups are\nmeasured wall-clock, and they depend on the SR network's own cost, which the unit count glosses over.\nAnd Real-ESRGAN, being a GAN, can invent detail that isn't in the low-resolution image; the refine step\nmitigates but doesn't fully erase this.\n\n## The take\n\nMrFlow's contribution is a reframing more than a new network: the resolution climb doesn't have to be\npaid for in diffusion steps. Set against the usual acceleration playbooks — caching redundant\ncomputation ([TeaCache-style](/articles/how-llm-inference-works)), or simply taking fewer steps — it\nmakes a sharper bet: **structure is cheap and settles early; resolution is expensive and can be borrowed\nfrom a pixel-space SR net; artifacts are cleanable in one diffusion step**. Because every piece is\ntraining-free and closed-form, it drops onto an existing checkpoint with no retraining, which is the\npractical reason to care. The catch is the one every honest acceleration paper carries — the best-case\nheadline is best-case. On a forgiving model at a conservative config it really is near-free; push the\nconfig or pick a brittle model and you pay for the speed in quality. As a way to make an existing\nimage model several times cheaper to sample without touching its weights, though, it's a clean idea,\ncleanly executed.\n\nFor the diffusion background this builds on, see [Set Diffusion](/articles/set-diffusion) and, for how\ndenoising models differ from autoregressive ones, [the diffusion *language* model\nwalkthrough](/articles/illada-diffusion-language-model).\n\n---\n\n*Built on MrFlow (arXiv 2607.01642), a training-free staged-sampling accelerator for flow and diffusion\nimage models. 
Benchmark and speedup figures (GenEval, OneIG; FLUX.1-dev and Qwen-Image) are quoted from\nthe paper; the interactive diagrams are illustrations of the mechanism, and the per-stage compute units\nare a schematic, not measured FLOPs. Speedups are config- and model-dependent; the top-end figures\nrequire distillation on top of the training-free pipeline.*\n","readingTimeMins":8,"url":"https://ai.thesatyajit.com/articles/mrflow-diffusion-acceleration","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"Nemotron in NVFP4: training a frontier model natively in 4-bit","description":"Quantization usually happens after training — you train in BF16, then squeeze the weights for serving. NVIDIA's Nemotron flips that: the forward and backward GEMMs run natively in NVFP4, a 4-bit format, during the training run itself. This is a walk through what NVFP4 actually is (a 4-bit element plus a two-level shared scale), the three tricks that keep 4-bit gradients from diverging, and the honest caveats — it's mixed-precision not end-to-end FP4, the quality claim is a training-loss gap, and the run diverged twice.","date":"2026-07-04","tags":["llm","quantization","training","mixture-of-experts","explainer"],"draft":false,"featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"nemotron-nvfp4","body":"Almost every \"4-bit model\" you have heard of is 4-bit *after the fact*: the network is trained in BF16,\nand then a [post-training quantizer](/articles/turboquant-kv-cache) compresses the finished weights so\nthey are cheaper to serve. Training itself stays in 16-bit, because the gradients are delicate and 4 bits\nis a very small number. NVIDIA's Nemotron does something more aggressive: it runs the actual training\nGEMMs — the big matrix multiplies in the forward *and* backward pass — natively in **NVFP4**, a 4-bit\nfloating-point format, while the model is still learning. 
The headline is a training-loss gap under\n**0.4%** against a BF16 reference. The interesting part is everything that had to be true to get there.\n\n## What NVFP4 actually is\n\nFour bits cannot, on their own, represent the range of values a weight tensor spans. NVFP4's answer is to\nsplit the job: store each value in a tiny 4-bit **element**, and recover dynamic range from a **shared\nscale** that a whole block of elements multiplies through. The element is **E2M1** — 1 sign bit, 2\nexponent bits, 1 mantissa bit. The scale is where NVFP4 differs from the MXFP4 format you may have seen:\nit shares one **FP8 (E4M3)** scale across every **16** elements, plus a single FP32 scale for the whole\ntensor. Flip between the formats:\n\n<BitLayout />\n\nThat two-level scaling is the whole trick. A 4-bit element with an 8-bit block scale over 16 values\nbehaves, in effective dynamic range, more like a ~10-bit (\"E6M4\"-ish) number than a raw 4-bit one — while\ncosting **4.5 bits per element** to store (4 for the element, 8/16 = 0.5 for the scale). MXFP4 uses a\ncoarser power-of-2 (E8M0) scale over a wider 32-element block: cheaper at 4.25 bits, but blunter, because\none outlier drags a bigger block and a power-of-2 scale can only snap to coarse steps. NVFP4's finer block\nand FP8 scale are what make it stable enough to *train* in, not just serve in.\n\n<Figure\n  src=\"/articles/nemotron-nvfp4/fig1.png\"\n  alt=\"Line charts of the relative training-loss difference (percent) between NVFP4 and BF16 segments across the 5T, 10T and 16T-token checkpoints. The gap hovers around 0.3 percent with occasional spikes, and a lower panel shows the gap converging toward zero under longer BF16 training.\"\n  caption=\"The NVFP4-vs-BF16 relative training-loss gap stays around 0.3% — under 0.4% — across checkpoints; longer BF16 continuation closes it toward zero. 
A loss gap, not a downstream benchmark A/B (paper, Figure 3).\"\n/>\n\n## Why 4-bit training normally falls apart — and the three fixes\n\nPost-training quantization only has to preserve the *forward* pass of a frozen network. Training in 4-bit\nis harder on two counts: the **gradients** flow through the same low-precision GEMMs, and any systematic\nerror compounds over trillions of tokens. Nemotron leans on three stabilizers, each aimed at a specific\nfailure:\n\n1. **2D block quantization on weights.** Instead of quantizing weights in 1D strips, quantize them in 2D\n   blocks, so a block's shared scale better fits the local structure of the weight matrix and fewer values\n   get clipped.\n2. **Random Hadamard transform on the wgrad inputs.** The weight-gradient (wgrad) matmul is the one most\n   poisoned by outliers. Multiplying its inputs by a random Hadamard matrix *spreads* those outliers\n   across the block before quantization (and is undone analytically), so no single large value blows out\n   the block scale.\n3. **Stochastic rounding on gradients.** Deterministic rounding biases small gradients toward zero — over\n   trillions of tokens that lost signal is fatal. Rounding gradients *stochastically* is unbiased in\n   expectation, so the gradient direction survives quantization even when individual values don't.\n\nWhere each of these lives is easier to see than to say. 
Here is one path through the stack, colored by\nprecision, with the stabilizers attached to the FP4 GEMMs — flip to the backward pass:\n\n<PrecisionMap />\n\n## The honest part: this is not end-to-end FP4\n\nThe precision map makes the biggest caveat visual: **not all of the network is FP4.** Native FP4 training\nmeans the heavy expert/MLP weight-GEMMs run in NVFP4 — that is the bulk of the FLOPs — but a meaningful\nfraction of the model is deliberately kept at higher precision, because FP4 is most fragile exactly there:\n\n<Callout type=\"warn\">\n  **Kept at higher precision:** the **final ~16 layers**, the **Mamba-2 projection layers**, the **QKV\n  projections**, the **MTP (multi-token-prediction) module**, and the **embeddings**. So \"native FP4\n  training\" is really *mixed-precision* training with FP4 doing the heavy lifting on the compute-bound\n  GEMMs — not a network where every tensor is 4-bit. Read the quality claim the same way: it's a\n  **training-loss gap under 0.4%**, not a downstream task-benchmark parity result. A small loss gap is\n  necessary for parity but does not prove it.\n</Callout>\n\nAnd it was not a smooth ride. The run **diverged twice** — around 8T and 16T tokens — and each time the\nteam had to roll back to an earlier checkpoint and restart the segment (with an FP32-rounding fix and a\nre-annealed learning rate) to recover. Four-bit training is not a free lunch you turn on and forget:\n\n<Figure\n  src=\"/articles/nemotron-nvfp4/fig2.png\"\n  alt=\"Training and validation loss versus training tokens in trillions. The original phase-1 run diverges twice — insets labeled Divergence 1 near 8T tokens and Divergence 2 near 15–16T tokens — where the loss curls upward. 
A rollback run with FP32 rounding and an annealed-learning-rate phase-2 run continue smoothly downward past the divergence points.\"\n  caption=\"Two real loss divergences during the run (near ~8T and ~16T tokens) each required a rollback and restart; the recovered runs (FP32-rounding rollback, then annealed-LR phase 2) continue down cleanly (paper, Figure 5).\"\n/>\n\n## The model underneath\n\nThe precision story rides on a specific architecture: a **550B-total / 55B-active** hybrid that interleaves\n**Mamba-2** state-space blocks, periodic **attention**, and **[LatentMoE](/articles/mixture-of-experts-from-scratch)**\nexpert layers, with a **[multi-token-prediction](/articles/multi-token-prediction)** head on top. The\nMamba-2 blocks carry most of the sequence mixing cheaply; attention appears sparingly for exact long-range\nrecall; the MoE layers are where the parameters (and the FP4 GEMMs) live. The repeating layer pattern:\n\n<Figure\n  src=\"/articles/nemotron-nvfp4/fig3.png\"\n  alt=\"The Nemotron layer pattern: repeating groups of Mamba-2 blocks, occasional Attention blocks, and Latent MoE blocks, with group multipliers x2, x3 and x4 across the stack, bracketed as a repeating hybrid unit.\"\n  caption=\"The hybrid layer pattern — Mamba-2 and attention for sequence mixing, LatentMoE for sparse capacity — repeated across the stack (paper, Figure 2).\"\n/>\n\nStoring that many parameters in 4.5-bit elements is a real memory win, but — honestly — not the clean 4×\nthe \"4-bit\" label implies, once you count the scale overhead. 
Slide the parameter count:\n\n<MemoryCalc />\n\nFor the effective storage cost per element, the formats line up cleanly — and NVFP4's 4.5 bits sits\nbetween MXFP4's leaner-but-blunter 4.25 and BF16's 16:\n\n<BenchBars\n  title=\"Effective storage cost (bits per element · lower = smaller)\"\n  unit=\"b\"\n  bars={[\n    { label: \"BF16\", value: 16 },\n    { label: \"MXFP4\", value: 4.25 },\n    { label: \"NVFP4\", value: 4.5, highlight: true },\n  ]}\n/>\n\n## One more thing not to conflate\n\nThere is a *second* quantization result in this work that is easy to mix up with the training story:\n**inference PTQ** — post-training-quantizing the finished model down for cheaper serving. Those serving\nnumbers are a separate experiment about deployment, measured on the trained checkpoint; they are not\nevidence about training precision. The claim on the table here is narrower and more interesting: that you\ncan run the *training* GEMMs in 4-bit and land within a fraction of a percent of a BF16 loss curve. Keep\nthe two apart.\n\n## The take\n\nThe quiet lesson is that \"native 4-bit training\" is an engineering result about *where* you dare to put\n4 bits, not a claim that the whole network is 4-bit. NVFP4's two-level scale (4-bit element + FP8 block\nscale + FP32 tensor scale) buys back enough dynamic range to make the compute-heavy expert GEMMs\ntrainable in 4 bits; the three stabilizers — 2D block quantization, Hadamard-smeared wgrad inputs, and\nstochastic gradient rounding — keep the gradients honest; and mixed precision quietly protects the fragile\nedges (embeddings, final layers, projections, MTP). The payoff is a real reduction in training compute and\nmemory bandwidth on hardware built for FP4, at a training-loss gap under 0.4%. The caveats are equally\nreal: it is not end-to-end FP4, a loss gap is not benchmark parity, the run diverged twice, and the\ninference-PTQ numbers are a different story. 
As a demonstration that frontier-scale training can leave the\n16-bit comfort zone, though, it is a genuinely aggressive, well-instrumented bet — and it stuck.\n\n---\n\n*Built on NVIDIA's Nemotron NVFP4 training report. NVFP4 is a 4-bit E2M1 element with a per-16-element\nFP8 (E4M3) block scale and an FP32 per-tensor scale; figures are reproduced from the paper for commentary,\nand the interactive diagrams are illustrations of the mechanism. Related: [why quantization is\nhard](/articles/turboquant-kv-cache), [mixture-of-experts from\nscratch](/articles/mixture-of-experts-from-scratch), [multi-token\nprediction](/articles/multi-token-prediction), and [large-scale\ntraining](/articles/megatrain-single-gpu-training).*\n","readingTimeMins":7,"url":"https://ai.thesatyajit.com/articles/nemotron-nvfp4","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"Program-as-Weights: compiling a natural-language spec into a tiny local model","description":"Some functions resist clean rules — 'alert on the log lines that matter', 'repair malformed JSON', 'rank by intent' — so we outsource them to a big LLM API, paying on every call. Program-as-Weights (PAW) compiles such a fuzzy function once, from a plain-language spec into a small LoRA adapter, then runs it on a frozen 0.6B interpreter locally and cheaply forever after. A 0.6B model executing PAW programs matches direct prompting of a 32B model at ~1/50th the memory, 30 tok/s on a MacBook. A walk through the compiler, the economics, and the honest caveats.","date":"2026-07-04","tags":["llm","inference-optimization","fine-tuning","systems","explainer"],"draft":false,"featured":false,"interest":5,"helpful":3,"kind":"articles","slug":"program-as-weights","body":"Some everyday programming tasks refuse to be written as rules. 
Alerting a human on only the log lines\nthat matter, repairing malformed JSON, ranking snippets by intent, deciding whether a message is\nurgent — even a regex for \"parse this messy text\" shatters on edge cases. The modern move is to punt:\ncall a large language model API and let it decide, per input. That works, but you pay for it on\n**every call**, and you give up locality, reproducibility, and control over cost.\n\n**Program-as-Weights (PAW)** — from Waterloo, Cornell, and Harvard — proposes a different deal. Treat\nthat fuzzy function like source code: **compile it once** from a natural-language specification into a\ncompact, locally executable neural artifact, and then *run* that artifact cheaply for every subsequent\ninput. The compiler is a 4B model; the artifact is a small **LoRA adapter**; the thing that executes it\nis a **frozen 0.6B interpreter**. Play with the loop first — pick a spec, compile it, then feed inputs\nthrough the interpreter:\n\n<PawCompiler />\n\nThe headline result is the payoff of that split: a **0.6B Qwen3 interpreter** running PAW programs\n**matches direct prompting of Qwen3-32B**, while using **roughly one-fiftieth of the inference memory**\nand running at **30 tokens/second on a MacBook M3**.\n\n## The compiler–interpreter system\n\nPAW borrows the oldest idea in programming — separate the compiler from the runtime — and instantiates\nit with neural parts. The **compiler** reads a function specification (plus an optional pseudo-program\nsketch) and emits a parameter-efficient adapter. The **interpreter** is a small, *frozen* model that,\nloaded with that adapter, executes the function on real inputs.\n\n<Figure\n  src=\"/articles/program-as-weights/fig1.png\"\n  alt=\"Three-panel PAW architecture. Left, the LoRA Compiler: a trainable Compiler model reads a function spec, a pseudo-program, and prefix tokens. 
Middle, the LoRA Mapper: mean-pooled compiler features go through an MLP that mixes a set of learned LoRA A and B basis matrices into the adapter's low-rank A and B factors. Right, the Interpreter: a frozen model runs the pseudo-program plus input through the injected LoRA to produce output.\"\n  caption=\"PAW's best instantiation, Text-to-LoRA: the compiler's representation of the spec is mapped by a small MLP into a low-rank LoRA adapter (a mixture of learned basis matrices), which is injected into the frozen interpreter (paper, Figure 3).\"\n/>\n\nThe load-bearing piece is the **LoRA mapper**. Rather than have the compiler regress raw adapter\nweights (a huge, unstructured output), it mean-pools the compiler's features and passes them through an\nMLP that produces mixing coefficients over a **set of learned LoRA basis matrices** — the emitted\nadapter is a combination of reusable low-rank factors. That keeps the compiler's output small and\nwell-conditioned, and it's what makes \"spec → weights\" learnable at all. 
(An earlier prefix-tuning\nvariant works too; Text-to-LoRA is just the current best.)\n\nTo train the compiler, the authors built **FuzzyBench**, a **10-million-example** dataset of fuzzy\nfunctions spanning text processing, search and matching, custom classification, code and NL commands,\nsafety and verification, agentic tool use, and format repair — with clean and noisy specification\nvariants so the compiler learns to be robust to sloppy prompts:\n\n<Figure\n  src=\"/articles/program-as-weights/fig3.png\"\n  alt=\"Donut chart of FuzzyBench's 10 million examples by theme: core text processing and NLP 30%, search/matching/web intelligence 18%, custom classification and filtering 15%, code and natural-language commands 12%, safety/verification/domain knowledge 12%, agentic and tool use 8%, format repair and validation 5%.\"\n  caption=\"FuzzyBench: 10M fuzzy-function examples across seven themes, released with the paper, used to train the 4B compiler (paper, Figure 2).\"\n/>\n\n## Why compile at all? The economics\n\nThe reason this framing matters is cost structure. Calling a big model's API is a cost you pay on\n**every input**. PAW pays a **one-time compile** — a single pass of the 4B compiler — and then each\napplication runs locally for almost nothing. So cumulative cost crosses over fast, and the gap only\nwidens. Drag the number of calls:\n\n<CostCrossover />\n\nThat's the thesis in one picture: PAW **reframes the foundation model from a per-input problem solver\ninto a tool builder**. You invoke the big model once per *function definition*, get back a small\nreusable artifact you own, and every *function application* after that is cheap, offline, and\nreproducible. Compile a library of them and each becomes a tiny local endpoint:\n\n<Figure\n  src=\"/articles/program-as-weights/fig2.png\"\n  alt=\"Three-stage flow. Function Specification: three plain-language specs (classify message urgency, fix malformed JSON, remove personal information). 
Neural Programs: the compiler turns each into a pseudo-program plus a colored LoRA adapter (LoRA 1, 2, 3). Local Deployment: each adapter runs on a small local LM as PAW Email Triage, PAW Json Fixer, and PAW PII Redactor, processing inputs into outputs on-device.\"\n  caption=\"Compile a library: each spec becomes a pseudo-program plus a LoRA adapter, deployed as a small, local, single-purpose model (paper, Figure 15).\"\n/>\n\nTwo practical wins reinforce it: the paper reports the adapters **quantize with no measurable accuracy\nloss**, and the whole thing runs at interactive speed (30 tok/s) on a laptop — so the compiled function\nis genuinely local, not a cloud dependency in disguise.\n\n## The honest caveats\n\nPAW is a genuinely new framing, but it buys its efficiency with real constraints, and the paper is\ncandid about them:\n\n- **The compiler and interpreter are a coupled pair.** An adapter compiled for one frozen interpreter\n  isn't portable to a different one — you commit to an interpreter.\n- **The compiled program isn't interpretable.** Unlike source code, you can't read a LoRA adapter to\n  see what the function *does*; you can only run it. 
\"Program\" is an analogy for the compile-once\n  workflow, not a claim of legibility.\n- **Single-step fuzzy functions.** PAW targets one-shot transformations (classify, repair, rank), not\n  long multi-step agentic control flow.\n- **Trained on synthetic data.** FuzzyBench is model-generated; the compiler's competence is bounded by\n  the distribution of tasks it was synthesized from, and real specs can fall outside it.\n- **The best adapter type is task-dependent.** Text-to-LoRA wins overall, but the paper finds no single\n  parameter-efficient method dominates every task — there's still a choice to make.\n- **\"Matches 32B\" is on FuzzyBench-style tasks.** The parity is measured on the fuzzy-function\n  distribution PAW is built for, not a claim that a 0.6B model equals a 32B model in general.\n\n## The take\n\nWhat I like about PAW is that it's a *systems* idea wearing an ML paper's clothes. The interesting move\nisn't a new architecture — it's noticing that we've quietly turned every fuzzy function into a recurring\nAPI bill, and that the compile/runtime split we use for ordinary code applies here too: pay the big\nmodel once to *build the tool*, then run the tool locally forever. The Text-to-LoRA mapper is the clever\nengineering that makes \"spec → weights\" tractable, but the reframing is the contribution — a foundation\nmodel as a **compiler for behaviors** rather than an always-on oracle. Whether it generalizes past\nsingle-step functions is the open question; as a way to make a hundred small fuzzy tasks local,\nprivate, and nearly free, it's one of the freshest ideas of the season. 
Code, the 4B compiler, and the\n10M-example FuzzyBench are released (CC BY 4.0).\n\n---\n\n*Built on [Program-as-Weights: A Programming Paradigm for Fuzzy Functions](https://arxiv.org/abs/2607.02512)\n(Zhang, Hotsko, Kim, Nie, Shieber, Deng; University of Waterloo, Cornell, Harvard; 2026, CC BY 4.0).\nFigures are reproduced from the paper; benchmark and efficiency figures (0.6B ≈ 32B, ~1/50 memory, 30\ntok/s on an M3) are quoted from it. The interactive playground and cost model are illustrations of the\nmechanism, not runs of the released model.*\n","readingTimeMins":6,"url":"https://ai.thesatyajit.com/articles/program-as-weights","signal":{"interest":5,"helpful":3,"score":8,"level":4,"label":"High"}},{"title":"TabFM: a foundation model that learns tables in-context","description":"Most tabular ML still means fitting a gradient-boosted tree per dataset. TabFM, Google's tabular foundation model, does something stranger: it treats your labeled rows as context and predicts new rows in a single forward pass — no per-dataset training — after being pre-trained on hundreds of millions of synthetic tables. This is a walk through its 3-stage architecture, its TabPFN/TabICL lineage, and an honest read of the TabArena numbers: the license, the class and feature ceilings, and the fact that 'beats GBDTs' means ensemble-vs-ensemble.","date":"2026-07-04","tags":["tabular","foundation-models","in-context-learning","explainer","transformers"],"draft":false,"cover":"/articles/tabular-foundation-model/fig3.png","featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"tabular-foundation-model","body":"For a decade the honest answer to \"what should I use on this spreadsheet?\" has been a\ngradient-boosted tree — XGBoost or LightGBM — fit fresh on each dataset. Deep learning kept\nlosing to it on tabular data. **TabFM**, Google's tabular foundation model, is a bet that the\ntransformer recipe finally transfers, but not by training a network on *your* table. 
Instead it\ndoes **in-context learning**: you hand it your labeled rows as *context*, and it predicts new rows\nin a single forward pass — no gradient descent on your data at all. The knowledge to do this was\nbaked in ahead of time, by pre-training on **hundreds of millions of synthetic tables**.\n\nThe cleanest way to feel what that means is to watch it. Feed the model labeled example rows and\nit answers a test row; add rows and the prediction firms up; take them all away and it can only\nguess. Crucially, **no training happens** at any point below — the \"training set\" is just context:\n\n<IclDemo />\n\nIf that shape looks familiar, it's because TabFM descends from **TabPFN** and **TabICL** — the\nline of *prior-fitted* / in-context tabular transformers that reframed prediction as inference over\na prompt rather than a fit loop. TabFM scales that idea up: a bigger model, a richer synthetic\nprior, and an architecture built specifically for the two things tables have that text doesn't —\ncolumns with no natural order, and rows that are exchangeable examples.\n\n## The synthetic prior: learning to predict, from causal make-believe\n\nTabFM never sees your data in training. It is trained on tables **generated from structural causal\nmodels (SCMs)** — random causal graphs that emit synthetic features and a label with a known\ngenerative structure. Sample hundreds of millions of such tables, each a fresh little prediction\nproblem, and train one transformer to solve all of them *in-context*. What the model learns is not\nany particular dataset but the **algorithm** of tabular prediction: given some labeled rows, infer\nthe rule and apply it to a new row. Meeting your real table at inference time is then just another\ndraw from a distribution it has already seen a hundred million cousins of. 
(If synthetic-data\npriors interest you, the same generate-to-learn logic shows up in [set diffusion](/articles/set-diffusion).)\n\n## The architecture, in three stages\n\nTabFM's body is a pipeline that respects tabular structure in three moves. Step through them:\n\n<PipelineStages />\n\n1. **Column attention.** A **Set Transformer** attends *across the features* of a row. Because a\n   set has no order, the model is **permutation-invariant** over columns — shuffle your feature\n   order and nothing changes. Numeric values, which transformers otherwise handle poorly, are\n   embedded with **Fourier features** (sin/cos of the value at several frequencies) so the network\n   can represent magnitude and periodicity.\n2. **Row compression.** Each row, now a set of attended feature embeddings, is pooled down to a\n   single **CLS token** — one vector per row. **RoPE** positional encoding gives the rows an order\n   to work with, so the sequence of rows is something the next stage can index into.\n3. **In-context transformer.** A **24-block** transformer reads the sequence of row tokens — the\n   labeled context rows *and* the test row — and does the actual in-context learning, emitting the\n   test row's predicted label. This is the stage that \"learns\" your dataset, and it does so with\n   attention, in one pass, weights frozen.\n\nThe paper's own diagram lays out the same three stages — the alternating row/column attention that\nfeeds row compression, then the in-context stack that predicts the missing label:\n\n<Figure\n  src=\"/articles/tabular-foundation-model/fig1.png\"\n  alt=\"TabFM architecture. Left: a table of training rows and one test row, features in columns with a Label column; the test row's label is a green cell marked with a question mark. Middle: 'Alternating Row and Column Attention', drawn as grids of blue nodes with curved attention arrows within rows and down columns. 
Right: three 'Row Embedding' blocks under 'In-Context Learning', connected top to bottom, with the final block emitting 'Predict Missing Label' in green.\"\n  caption=\"Column and row attention build a per-row embedding; a stack of in-context blocks reads the labeled rows plus the test row and predicts its missing label — no gradient training on the table (model card, Figure 1).\"\n/>\n\n## In-context vs fit-a-tree: the shape of the work\n\nIt helps to set TabFM beside the thing it wants to replace. A boosted tree has to *fit* your table\nfirst — a sequential loop of boosting rounds — before it can predict. TabFM has no fit loop for\nyour data at all; the table enters as context and the answer comes out of one forward pass. Drag\nthe rounds and watch the sequential work pile up on one side and vanish on the other:\n\n<IclVsTrain />\n\nThis is a claim about **workflow**, not accuracy. \"No training loop\" is a real ergonomic win — you\nstand up a predictor in one pass — but whether it *predicts better* than a tuned tree is a separate\nquestion, and one worth being careful about.\n\n## The numbers, and what they actually compare\n\nTabFM is evaluated on **TabArena**, a leaderboard that scores tabular methods with a **relative Elo**\nacross a suite of datasets. 
On classification, TabFM lands at **1727** Elo, and a small ensemble of\nTabFM runs (**TabFM-Ensemble**) reaches **1815** — ahead of the strongest AutoGluon configuration\nand the TabPFN/TabICL baselines:\n\n<BenchBars\n  title=\"TabArena · classification Elo (higher = better, relative)\"\n  bars={[\n    { label: \"TabFM-Ensemble\", value: 1815, highlight: true },\n    { label: \"TabFM\", value: 1727, highlight: true },\n    { label: \"AutoGluon 1.5\", value: 1666 },\n    { label: \"TabPFN-3\", value: 1639 },\n    { label: \"TabICLv2\", value: 1576 },\n  ]}\n/>\n\nThe regression picture is wider: TabFM at **1940**, the ensemble at **2125**, clear of the field.\n\n<BenchBars\n  title=\"TabArena · regression Elo (higher = better, relative)\"\n  bars={[\n    { label: \"TabFM-Ensemble\", value: 2125, highlight: true },\n    { label: \"TabFM\", value: 1940, highlight: true },\n    { label: \"TabPFN-3\", value: 1802 },\n    { label: \"AutoGluon 1.5\", value: 1786 },\n    { label: \"TabDPT\", value: 1722 },\n  ]}\n/>\n\nHere is the load-bearing caveat. The banner \"beats GBDTs\" rests on beating **AutoGluon** — an\n*AutoML ensemble* that stacks and tunes many models — not a single tuned XGBoost or LightGBM. In\nthe headline chart there is **no standalone GBDT bar**; the strongest tree-based entries are the\nAutoGluon configurations. So the honest framing is **ensemble-vs-ensemble**: TabFM-Ensemble edges\nAutoGluon, and single TabFM is competitive with it. That is a genuinely strong result — but it is\nnot \"TabFM beats your tuned XGBoost,\" which the chart does not measure. The paper's full TabArena\nchart shows exactly this — AutoGluon as the tree-side comparison, TabFM and its ensemble on top:\n\n<Figure\n  src=\"/articles/tabular-foundation-model/fig2.png\"\n  alt=\"Two horizontal-titled bar charts of TabArena Elo. 
Top, classification: TabFM-Ensemble 1815 and TabFM 1727 lead, then AutoGluon 1.5 (extreme, 4h) 1666, TabPFN-3 1639, AutoGluon 1.4 1623, TabPFN-2.6 1585, TabICLv2 1576, RealTabPFN-2.5 1566, RealMLP 1469, TabM 1440. Bottom, regression: TabFM-Ensemble 2125 and TabFM 1940 lead, then TabPFN-3 1802, AutoGluon 1.5 1786, RealTabPFN-2.5 1752, TabPFN-2.6 1751, TabDPT 1722, AutoGluon 1.4 1688, TabICLv2 1687, RealMLP 1654.\"\n  caption=\"TabArena Elo for classification (top) and regression (bottom). The tree-based comparison is AutoGluon's AutoML ensemble, not a standalone GBDT; TabFM-Ensemble leads and single TabFM is competitive (model card, Figure 2).\"\n/>\n\n<Callout type=\"warn\">\n**Read the fine print before you reach for it.** (1) **License:** the released weights are\n**non-commercial** — fine for research, not for shipping a product. (2) **Ensemble-vs-ensemble:**\nthe \"beats GBDTs\" comparison is against AutoGluon's AutoML ensemble; there is no standalone\nXGBoost/LightGBM bar in the headline chart. (3) **~10-class ceiling:** the classifier handles up to\nabout 10 classes. (4) **~500-feature ceiling:** it does not scale to very wide tables. (5) **Elo is\nrelative** — a ranking on TabArena's specific dataset suite, not an absolute accuracy guarantee on\n*your* data. Validate on a held-out slice of your own table before trusting it.\n</Callout>\n\n## The take\n\nTabFM is the tabular field finally getting its \"just add context\" moment. The mechanism is worth\ninternalizing even if the weights' license keeps you from deploying them: prediction reframed as\n**in-context inference** over a synthetic prior, with an architecture that takes tables seriously —\norder-invariant column attention, Fourier-embedded numerics, row-to-CLS compression, and a\n24-block stack that does the learning at inference time. 
It inherits this from\n[TabPFN and TabICL](/articles/how-transformers-attention-works) and pushes the scale, and the\nTabArena numbers say the bet largely pays off: state-of-the-art *among ensembles*, and a real\nchallenger to the AutoML pipelines that have owned tabular ML.\n\nWhat it is not, yet, is a drop-in XGBoost replacement for production. The non-commercial license,\nthe ~10-class and ~500-feature ceilings, and the ensemble-framed comparison all matter, and the\nElo is a leaderboard ranking, not a promise about your dataset. But as a research artifact it moves\nthe frontier: the fit-a-tree-per-dataset era now has a serious foundation-model rival, and the\ninteresting question is no longer \"can transformers do tables\" but \"how far does one forward pass\ngo before you still need to train.\"\n\n---\n\n*Built on Google's **TabFM** — the [Hugging Face model card](https://huggingface.co/google/tabfm)\nand Google's accompanying [blog post](https://research.google/blog/). There is no primary arXiv\npaper; figures are from the model card, and the interactive diagrams are our own illustrations of\nthe mechanism. TabFM descends from [TabPFN](https://arxiv.org/abs/2207.01848) and TabICL; Elo scores\nare quoted from TabArena.*\n","readingTimeMins":8,"url":"https://ai.thesatyajit.com/articles/tabular-foundation-model","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"Laguna's Model Factory: treating model development as an industrial process","description":"Poolside's Laguna M.1 (225B/23B-active) and XS.2 (33B/3B-active) are competent agentic-coding MoE models — but the tech report is really about the thing that built them. The 'Model Factory' is a tightly-versioned loop of data, training, evaluation, and inference where every run is reproducible code and a win ships to production by flipping a flag. 
It's what let them build XS.2 from scratch in five weeks, and the paper argues that industrialized process — not any architecture trick — is the moat.","date":"2026-07-03","updated":"2026-07-08","tags":["llm","agents","mixture-of-experts","systems","explainer"],"draft":false,"featured":true,"interest":4,"helpful":4,"kind":"articles","slug":"laguna-model-factory","body":"Most model reports are about a model. Poolside's **Laguna** report is unusual: its real subject\nis the *factory*. Yes, it ships two Mixture-of-Experts models for agentic software engineering —\n**Laguna M.1** (225.8B total, 23.4B active) and the open **Laguna XS.2** (33.4B total, 3B active) —\nand they're solid, competitive-in-class coding models. But the argument the report actually makes\nis that the models are downstream of something more valuable: a **Model Factory** that treats\nfoundation-model development as an *industrial process* rather than a craft. The evidence they\noffer is a number — they built XS.2 from inception to release in **five weeks**.\n\n<ModelFactory />\n\n## The two models, briefly\n\nThe models are conventional-but-careful MoE. Both are pre-norm Transformers with RMSNorm and\ntoken-choice routing with a shared expert. The report spells out XS.2's config — **8 of 256\nrouted experts** per token plus one shared expert, the routed output scaled by 2.5 before it\nrejoins the residual (a DeepSeek-V3 / Nemotron-style modulation), a sigmoid router normalized\nafter top-k, and a dense first layer for stability — but gives M.1's totals without its routing\nwidth, so I won't guess it. 
Both use GQA (not MLA), the **Muon** optimizer, and a served context\nof **256K** — XS.2's attention runs 8 KV heads with softplus per-head gating.\n\n<TwoModels />\n\nThe one architectural wrinkle worth noting is XS.2's attention: where M.1 runs **global attention\non every layer**, XS.2 interleaves **sliding-window and global at a 3:1 ratio** with a 512-token\nwindow — the same [hybrid-attention idea as MiMo-V2-Flash](/articles/mimo-v2-flash), tuned for a\nsmaller model that has to be cheap to serve. And the data is unambiguously code-first: XS.2's\npretraining mix is ~30% raw code plus a large synthetic-code share, on top of >30T tokens (from a\npool of ~27T unique). But by the report's own framing, none of this is the point. The point is how\nit was all *made*.\n\nXS.2 is really M.1 with four ablated deltas flipped on: the hybrid attention above, a\nWarmup-Stable-Decay LR schedule instead of cosine, the routed-expert modulation, and one dense\nlayer instead of three. Each was a config change against the M.1 baseline, chosen on a small MoE\nproxy — cheap precisely because the factory made the data pipeline, trainer, and eval harness\ntransfer for free. The peak LR came from a fitted **WSD scaling law**, $\\text{lr}^\\star \\propto\nN^{-0.46}\\,D^{-0.27}$ in active params and tokens, rather than a re-tuned sweep. The stability\nlessons transferred too: M.1's pre-training surfaced expert collapse around **450B tokens** (fixed\nby Moonlight-style LR scaling, so Muon runs at AdamW-scale weight decay) and a logit-drift blowup\ntraced to a BF16 all-reduce on the LM-head input gradient (fixed by forcing that one reduction to\nFP32). 
XS.2 hit none of them — the factory's job is to make the second model boring.\n\n## What \"industrial process\" actually means\n\nThe Model Factory is defined as \"a tightly-integrated stack of versioned data, training, evaluation,\nand inference.\" Three principles hold it together:\n\n- **Experiments as code.** Every run's inputs and config are committed to one repo and get a unique\n  ID; a Dagster DAG is the control plane for what runs and what depends on what. The payoff is\n  *end-to-end lineage* — a single token in a packed training shard traces back through dedup,\n  filtering, and synthesis to its source document, and every checkpoint, eval, and deployment traces\n  to the exact run that produced it. Nothing is a mystery artifact.\n- **One code base for research and production.** A promising research idea isn't re-implemented to\n  ship — it's \"promoted into production by flipping a configuration flag.\" The inference library\n  (Atlas, on vLLM) consumes the trainer's (Titan) model definitions *bit-accurately*, so what you\n  evaluate is what you serve is what generates your RL rollouts.\n- **Reserve human attention for novel decisions.** A custom Kubernetes scheduler places jobs in\n  under a minute, auto-recovers from hardware failure, and only pages a human when recovery itself\n  fails. Cross-replica bit-identical weight-hash checks catch the silent data corruption a defective\n  GPU would otherwise smear through a run.\n\nThe components have names, and the flywheel is literal: **Titan** trains, **Blender** streams the\ndata mix, **Hive** generates synthetic data, **Atlas** serves inference *and* RL rollouts, and a\ncontainerized **Code Execution** platform spanning ~1M repositories provides synthetic tasks,\nevaluation, *and* RL execution rewards — one component doing all three. The RL harness they train\nagainst is the same harness they ship to customers. 
That's the loop the interactive above is\ngesturing at: data → train → eval → infer, where inference feeds the next round of data, and the\nwhole thing is versioned tightly enough that a five-week model is possible.\n\n<Figure\n  src=\"/articles/laguna-model-factory/fig1.png\"\n  alt=\"Left-to-right pipeline of labeled stages: Common Crawl, parsing, language ID, deduplication, quality tagging, conservative filtering, score-and-rank, bucketing, sampling, and final web mix.\"\n  caption=\"One arm of the factory made concrete: the versioned web-data pipeline runs raw Common Crawl through extraction, dedup, quality tagging, conservative filtering, composite scoring, and quota-aware sampling into the training mix — an assembly line with end-to-end lineage (paper, Figure 3).\"\n/>\n\nThat pipeline also flipped a habit. For M.1 they ran a high-precision filter that aggressively\ndropped noisy documents; for XS.2 they went **high-recall** instead — the composite score fully\nrejects only ~25.8% of web samples as pure noise and *recovers ~34%* of documents the old static\nrules had thrown away, then treats quality as a ranking signal and samples from score buckets\nrather than hard-filtering. Under a >30T-token budget, controlling repetition and diversity beats\nmaximizing average quality; over-filtering starves the long tail.\n\n## Choosing the data mix by optimization, not taste\n\nThe web pipeline decides *which documents survive*. A separate problem is *how much of each\nsource to train on* — the mixture weights. Done by hand, that's a few knobs set by taste and a\nhandful of ablations. The report's quietest radical move, **AutoMixer**, turns it into an\noptimization loop, and it's the cleanest single instance of the factory thesis: a decision that\nused to be craft becomes a versioned, automated search.\n\nThe setup is a surrogate-model sweep. 
For each data ablation they train a **swarm of ~60 proxy\nmodels** — each a ~0.5B-parameter MoE on ~60B tokens — from **different mixtures** sampled over\n**50+ heterogeneous dataset groups** (web, curated edu, academic, raw / grounded / synthetic code,\nmath web, conversational and knowledge sets). Every proxy is one labeled example of \"mixture in,\ncapabilities out.\"\n\n<Figure\n  src=\"/articles/laguna-model-factory/fig4.png\"\n  alt=\"AutoMixer pipeline: a column of dataset groups on the left feeds a six-step loop — sample N mixtures over the simplex, train N proxy models, evaluate M capabilities, fit one regressor per capability, then optimize the mixture weights to maximize all capabilities jointly.\"\n  caption=\"The AutoMixer loop: sample mixtures over the dataset-group simplex, train a proxy model on each, evaluate a small set of capabilities, fit a surrogate regressor per capability, then optimize the mixture. The diagram illustrates the loop with 15 example groups; the real sweep spans 50+ (paper, Figure 6).\"\n/>\n\nFormally they learn a surrogate $\\mathcal{M}: x \\to y$ where $x \\in \\Delta^{d}$ is a mixture over\n$d$ dataset groups and $y \\in \\mathbb{R}^{k}$ is a vector of downstream metrics across $k$\ncapability groups — coding, math reasoning, STEM knowledge, commonsense, general knowledge.\nCandidate mixtures are drawn near a hand-designed prior $x_0$ as $x \\sim \\text{Dirichlet}(\\alpha\nx_0)$ subject to $\\lVert x - x_0 \\rVert_1 < \\epsilon$, so the search stays in realistic regions.\nFor each capability $j$ they fit a regressor $f_j(x) \\approx y_j$ — linear in the simplified\npicture, $\\hat{y}_j = \\beta_j^{\\top} x + b_j$, non-linear in practice. 
The mixture is then chosen\nby maximizing a weighted sum of the surrogates over the simplex:\n\n$$\n\\max_{x}\\ \\sum_{j=1}^{k} w_j\\, f_j(x)\n\\quad \\text{s.t.}\\quad \\sum_i x_i = 1,\\ \\ x_i \\ge 0,\\ \\ \\lVert x - x_0 \\rVert_1 < \\epsilon\n$$\n\nwith a $\\lambda\\, D_{\\mathrm{KL}}(x \\Vert x_0)$ penalty keeping the answer from collapsing onto a\nfew dominant sources. The knobs $w_j$ are where intent enters: weight coding and math and the\noptimizer allocates data toward them.\n\n<AutoMixer />\n\nThe learned surrogate recovers relationships you'd expect — synthetic and curated code lift coding\nevals; conversational and knowledge corpora lift commonsense — plus finer cross-effects. On a\n3B-param / 1.5T-token check, the optimized mix posts large gains on the targets (HumanEval+ **+43%**,\nCRUX-I **+54%**, GSM8K **+41%**, MultiPL-E **+27%**) and, encouragingly, **generalizes to held-out\nbenchmarks** it wasn't optimized against (MATH **+25%**, LiveCodeBench **+39%**, BigCodeBench +16%).\nThe cost is stated honestly and it's small: a few commonsense tasks regress (ARC-C **−6.8%**, the\nrest under 1.5%), which is exactly what you sign up for when the objective down-weights them. XS.2's\nfinal mixture — **30.6% raw code, 25.4% synthetic/code-text, 25.2% web, 9% math**, the rest\nknowledge / instruction / academic / books (Table 4) — shifted toward web, synthetic, and math\nrelative to M.1's while keeping the code-heavy spine. That's the thesis in one artifact: a data-mix\ndecision made by an optimizer over a learned model, logged and re-runnable, instead of argued in a\nmeeting.\n\n## Agentic training, from real commits\n\nThe coding ability comes from training on the actual job. Poolside turns **real git commits from\npublic repos into verifiable tasks** — a problem statement, a repo checkout, and a hidden test\npatch, with the gold answer being the commit's own diff. 
A two-sided filter keeps only commits\nwhere the gold diff passes the tests *and* an empty patch fails them (discarding trivial or\nnon-exercising tests), yielding **30–60k tasks from a ~236k-commit pool**. Those tasks feed both\nSFT (as teacher-generated trajectories, sometimes wrapped in synthetic system messages for\ninstruction-following pressure) and the RL pool (where the repo's own test suite is the verifier).\n\nThe RL stage is online: the policy itself drives the **production agent harness** across several\nthousand live containers at a time, and each rollout's reward comes from that container's verifier.\nThe objective is **CISPO** — the importance-ratio-clipping surrogate from MiniMax-M1, not a Poolside\ninvention — paired with a length-weighted leave-one-out group baseline; clipping is asymmetric,\nan effective $[0, 5]$ on the ratio, so it only bites on heavily off-policy tokens. Reward is a\ndeterministic chain of checks: a malformed tool call or template violation is $-0.1$, giving up\nbefore a minimum number of tool calls is $-0.1$, a timeout is $0.0$, and the **only positive reward\nis the binary task verifier** ($1.0$) — unit tests for SWE tasks, bash assertions for terminal\ntasks, exact-match for tool-integrated math. A small $-0.05$ per-token penalty lands on exactly the\ntokens of a failing tool step to sharpen credit assignment; everything else is carried by the\nterminal 1/0. It's the same \"make the process the trainable target\" instinct as [Agents-A1's\nverifier-graded trajectories](/articles/agents-a1), wired straight into the factory — the execution\nenvironment that grades RL is the one that generates data and runs evals.\n\n## The numbers, honestly\n\nHere's where the modest framing matters. The report claims the models are \"competitive with\nstate-of-the-art open models in their respective weight classes,\" and that's accurate — *competitive*,\nnot leading. 
M.1 lands mid-pack among the ~200B-class open models on SWE-bench Verified:\n\n<BenchBars\n  title=\"SWE-bench Verified — ~200B-class open models (%)\"\n  unit=\"%\"\n  bars={[\n    { label: \"DeepSeek-V4-Flash\", value: 79.0 },\n    { label: \"Qwen3.5 (397B-A17B)\", value: 76.2 },\n    { label: \"Laguna M.1 (225B-A23B)\", value: 74.6, highlight: true },\n    { label: \"GLM-4.7 (355B-A32B)\", value: 73.8 },\n    { label: \"Devstral 2 (123B)\", value: 72.2 },\n  ]}\n/>\n\n<Figure\n  src=\"/articles/laguna-model-factory/fig2.png\"\n  alt=\"Grouped bar charts of Laguna M.1 versus Devstral 2, GLM-4.7, DeepSeek-V4-Flash, Qwen3.5 and Claude Sonnet 4.6 across four benchmarks: SWE-bench Verified, Multilingual, Pro, and Terminal-Bench 2.0.\"\n  caption=\"The paper's own headline chart for M.1, across all four agentic benchmarks — not just SWE-bench Verified — versus ~200B-class open and frontier references (paper, Figure 1a).\"\n/>\n\nXS.2 is the more interesting result, because it's competitive in the ~30B class while activating only\n**3B** parameters per token, against rivals that include dense 24–31B models:\n\n<BenchBars\n  title=\"SWE-bench Verified — ~30B-class open models (%)\"\n  unit=\"%\"\n  bars={[\n    { label: \"Qwen3.6 (35B-A3B)\", value: 73.4 },\n    { label: \"Laguna XS.2 (33B-A3B)\", value: 69.9, highlight: true },\n    { label: \"Qwen3.5 (35B-A3B)\", value: 69.2 },\n    { label: \"Devstral Small 2 (24B)\", value: 68.0 },\n    { label: \"Gemma 4 (31B)\", value: 52.0 },\n  ]}\n/>\n\n<Figure\n  src=\"/articles/laguna-model-factory/fig3.png\"\n  alt=\"Grouped bar charts of Laguna XS.2 versus Devstral Small 2, Gemma 4, Qwen3.5, Qwen3.6, Claude Haiku 4.5 and GPT-5.4 Nano across SWE-bench Verified, Multilingual, Pro, and Terminal-Bench 2.0.\"\n  caption=\"The same headline chart for XS.2 — competitive across all four benchmarks against ~30B-class open and frontier references while activating only 3B parameters per token (paper, Figure 1b).\"\n/>\n\nXS.2 beats Devstral Small 2 and 
Gemma 4 and edges Qwen3.5, though Qwen3.6 leads the class. Two honesty\nnotes the report itself makes: the baseline numbers are Poolside-selected published scores (not re-run\nin their harness, so there's provider-config bias), and they patched all four benchmarks to remove\ngit-history leaks before scoring — so these differ slightly from public leaderboards by construction.\nIt's the rare case where the *methodology* disclosure is more reassuring than the raw scores.\n\nOne finding from the quantization work is worth carrying away regardless of the leaderboard: **bad\nquantization hurts agentic benchmarks far more than single-turn ones.** A small per-token error\nthat's invisible on a one-shot question compounds across a hundred-step trajectory, and the report\nsaw exactly that — intermediate schemes that barely moved single-turn scores cratered the agentic\nones. The fix started by looking at *where* the error comes from. Naive INT4 (AWQ, `W4A16`) lost\nquality because outlier activations pile up in the residual stream starting around **layer 30 of\nthe 40-layer network**:\n\n<Figure\n  src=\"/articles/laguna-model-factory/fig5.png\"\n  alt=\"Line chart of residual-stream activation magnitude by layer: the median stays near zero across all 40 layers while the per-layer maximum stays small until about layer 30, then jumps to roughly 90 and stays high through the last layers.\"\n  caption=\"Residual-stream activations by layer for XS.2: the median is flat near zero, but the maximum explodes after layer 30 — the outliers that make a flat low-bit scheme unsafe for the late layers (paper, Figure 7).\"\n/>\n\nSo they went mixed-precision: **`INT4` for the first 30 layers, `INT8` for the last 10** (with a\nSpinQuant rotation as a pre-pass). `NVFP4` needed more — direct post-training quantization lost too\nmuch, so they recovered it with **quantization-aware distillation**, training the quantized student\nto match a higher-precision teacher on a fixed dataset. 
The KV cache goes to `FP8` across the full\n131K context, roughly doubling how many trajectories a replica can hold. The through-line is the\nsame as everything else here: the diverse eval harness is what caught the agentic-only regression\nthat a single-turn benchmark would have waved through.\n\n## The take\n\nI went in expecting an architecture paper and came out thinking about CI. The Laguna models are\ngenuinely fine — a competent ~200B flagship and a genuinely efficient 3B-active open model that holds\nits own in a crowded class — but they're not what the report is selling. It's selling the claim that\n*iteration speed* is the frontier lever: if every run is reproducible code, every win ships by flipping\na flag, and the same execution environment grades your RL, runs your evals, and generates your data,\nthen you can turn out a from-scratch model in five weeks and keep pace with model complexity instead of\ndrowning in it. Whether the \"factory is the moat\" thesis holds as everyone industrializes is the open\nquestion — but as a piece of honest infrastructure writing, with the models presented as *outputs of a\nprocess* rather than heroic artifacts, it's a refreshing shape for a technical report. XS.2's weights\nare open (Apache-2.0); the factory, of course, is not.\n\n---\n\n*Built on the [Laguna M.1/XS.2 Technical Report](https://arxiv.org/abs/2605.27605) (Poolside, 2026) and\nthe [XS.2 model release](https://huggingface.co/collections/poolside/laguna-xs2) (Apache-2.0). 
Benchmark\nfigures are from the report's tables (baselines are Poolside-selected published scores on leak-patched\nbenchmark images); SWE-bench Verified figures use the report/model-card values (74.6 / 69.9), higher\nthan the earlier launch checkpoint.*\n","readingTimeMins":13,"url":"https://ai.thesatyajit.com/articles/laguna-model-factory","signal":{"interest":4,"helpful":4,"score":8,"level":4,"label":"High"}},{"title":"Leanstral: proving theorems by being a code agent, not a prover","description":"Most top Lean systems win with bespoke test-time-scaling machinery — conjecture pools, parallel lemma proving, blueprint stages. Mistral's Leanstral 1.5 (119B total, 6B active) refuses all of it: it proves theorems by running the ordinary Mistral Vibe code-agent loop longer, the same interface a human uses. It saturates miniF2F, solves 587/672 PutnamBench, sets SoTA on FATE-X, and its performance scales smoothly with the per-attempt token budget — for about 1/75th the cost of the prover it edges. A walk through the loop, the scaling, and SafeVerify.","date":"2026-07-03","tags":["llm","agents","reinforcement-learning","formal-methods","explainer"],"draft":false,"featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"leanstral-formal-proofs","body":"Formal theorem proving is the rare corner of ML where correctness is not a vibe: a proof either\ntype-checks against Lean's kernel or it doesn't. The catch is that the strongest Lean systems tend to\nwin with *machinery around the model* — Seed-Prover runs conjecture proposing and lemma pools at a\nbudget of ~10 H20-days **per problem**; Goedel-Architect generates a blueprint and proves lemmas in\nparallel. Powerful, but none of it is the interface a person actually uses to write Lean.\n\nMistral's **Leanstral 1.5** makes the opposite bet. 
It's a **119B-parameter Mixture-of-Experts** that\nactivates just **6B** per token, and it proves theorems by doing exactly what a developer does in\n[Mistral Vibe](https://mistral.ai): edit files, run shell commands, and query the Lean language server\nfor goals, types, and errors — then revise, and keep going. No prover scaffold. Its test-time scaling\nis nothing more exotic than *running that loop longer*. And it works: it **saturates miniF2F** (100% on\nvalidation and test), solves **587/672 PutnamBench**, and sets a new open state-of-the-art of **87 on\nFATE-H** and **34 on FATE-X** — while being Apache-2.0.\n\n## The loop is the whole method\n\nThere's no separate \"prover mode.\" A theorem — or a repository with a missing proof — enters the same\nagent harness used for ordinary software engineering, and the model works it turn by turn:\n\n<AgentLoop />\n\nMistral's own diagram of the loop shows the same thing at the token level: the system prompt hands the\nmodel `lean-lsp-mcp` tools, and from there it's assistant turns interleaving `<think>`, tool calls, and\ntool results — the agent editing a Lean file and checking it exactly as it would edit and test any other\ncodebase, through Vibe's raw filesystem / bash / MCP surface.\n\n<Figure\n  src=\"/articles/leanstral-formal-proofs/fig1.png\"\n  alt=\"Diagram of the Leanstral code-agent loop: a system+user prompt grants lean-lsp-mcp tools; assistant turns alternate think blocks, tool calls (lean-toolchain, lakefile.toml, Main.lean) and tool results; later turns query Mathlib via MCP and write/verify through the Lean LSP; a panel shows Vibe connecting to raw filesystem, raw bash, and arbitrary MCP servers.\"\n  caption=\"Leanstral works Lean through the ordinary Mistral Vibe code-agent interface — edit files, run bash, query the Lean language server over MCP — the same loop used for general software engineering (Mistral, Leanstral 1.5).\"\n/>\n\nWhy insist on the ordinary loop? Two reasons. 
It makes the model **usable** — you point it at a Lean repo\nthe way you'd point a coding assistant at any repo — and, more importantly, it means the model is\n**trained in the same long-horizon interaction pattern it uses at inference**. There's no train/test\ninterface gap to paper over.\n\n## Test-time scaling, without a trick\n\nBecause the model just spends tokens inside that loop, its accuracy is a smooth function of the\n**per-attempt token budget**. This is the headline result, and it's worth playing with — drag the budget\nand watch PutnamBench solved climb:\n\n<TestTimeScaling />\n\nThat monotonic climb — 44 solved at 50k tokens, 244 at 200k, 493 at 1M, 587 at 4M — is Leanstral's whole\nargument in one curve. Rather than giving up when a proof runs long, it keeps reasoning, editing, and\ncompacting context, turning budget directly into solved theorems. Here's Mistral's version of the same\nfigure:\n\n<Figure\n  src=\"/articles/leanstral-formal-proofs/fig3.png\"\n  alt=\"Line chart titled PutnamBench Test-Time Scaling: Pass@8 solved percentage rises smoothly and monotonically as the per-attempt token limit increases from 25k to 4M, annotated with problem counts 44, 126, 244, 396, 493, 573, 587.\"\n  caption=\"Pass@8 on PutnamBench (672 problems) versus per-attempt token budget — performance climbs smoothly from 44 solved at 50k tokens to 587 at 4M, with no plateau trick (Mistral, Leanstral 1.5).\"\n/>\n\nThe economics are the striking part. On PutnamBench, Leanstral edges Seed-Prover 1.5's high setting by 7\nproblems (587 vs 580) at roughly **$4 per problem** against an estimated **$300+** for Seed-Prover, whose\nhigh setting budgets ~10 H20-days per problem. 
The only systems above it run under different rules: they use\nnatural-language proof guidance or, like Aleph Prover at $54–68 per problem, cost far more to run.\n\n## Why you can't fake a proof\n\nTurning compiler feedback into an RL reward is dangerous: if the objective is just \"make Lean stop\ncomplaining,\" the cheapest policies are to *cheat* — leave a `sorry`, call the unsound `native_decide`,\nassume an extra `axiom`, or loosen the checker with `set_option`. Leanstral is graded by a fork of\n**SafeVerify**, which is built to reject every one of those. Full reward requires a proof that compiles,\nuses **only standard Lean axioms** (checked via `#print axioms`), and takes no shortcut. Toggle the cheats\nand watch the verdict flip:\n\n<SafeVerify />\n\nThis adversarial verifier is what makes the reinforcement learning honest. Leanstral trains on two RL\nenvironments through a **CISPO** objective: a **multiturn** environment where it must prove *or disprove*\na theorem, getting Lean compiler feedback between attempts until it succeeds or runs out of budget; and a\n**code-agent** environment where it acts as a developer across a whole repository. 
Mistral's diagram of\nthe multiturn loop is the picture of a verifier-gated reward — the same instinct as\n[Agents-A1's verifier-graded RL](/articles/agents-a1), specialized to Lean:\n\n<Figure\n  src=\"/articles/leanstral-formal-proofs/fig2.png\"\n  alt=\"Diagram of the multiturn Lean verifier training loop: a theorem dataset feeds 'solve this theorem' to the model, which emits a candidate proof to a Verifier; a passing proof yields a CISPO reward, a failing one returns format feedback and loops back, and the loop also terminates on too many tries.\"\n  caption=\"The prove-or-disprove RL loop: the model submits a candidate proof, a verifier accepts it (CISPO reward) or returns feedback to retry — reward flows only through a genuinely checked proof (Mistral, Leanstral 1.5).\"\n/>\n\n## The numbers\n\nOn competition math, Leanstral is the best *open* result across the board — the caveat being that a\ncouple of systems above it on PutnamBench either use natural-language guidance or cost 10–75× more to run:\n\n<BenchBars\n  title=\"PutnamBench — problems solved (of 672)\"\n  unit=\"\"\n  max={672}\n  bars={[\n    { label: \"Aleph Prover ($54–68/problem)\", value: 668 },\n    { label: \"Leanstral 1.5 (open, ~$4/prob)\", value: 587, highlight: true },\n    { label: \"Seed-Prover 1.5 high (~$300/prob)\", value: 580 },\n    { label: \"Goedel-Architect (w/o NL)\", value: 508 },\n    { label: \"AxProverBase\", value: 365 },\n  ]}\n/>\n\nThe result that best captures the \"keep working the problem\" behaviour is **FLTEval** — proof-engineering\ntasks drawn from real pull requests to the Fermat's Last Theorem repository. 
Here Leanstral 1.5 tops even\nfrontier general models, at a fraction of the cost:\n\n<BenchBars\n  title=\"FLTEval — pass@8 (%)\"\n  unit=\"%\"\n  bars={[\n    { label: \"Leanstral 1.5 (open)\", value: 43.2, highlight: true },\n    { label: \"Claude Opus 4.6 (7× the cost)\", value: 39.6 },\n    { label: \"Leanstral 1.0\", value: 31.9 },\n  ]}\n/>\n\nMistral's own charts show the full comparison set — the three-benchmark bar chart (PutnamBench / FATE-H /\nFATE-X) and the FLTEval scaling across pass@1/2/4, where 1.5 pulls clearly ahead of much larger\nopen models:\n\n<Figure\n  src=\"/articles/leanstral-formal-proofs/fig4.png\"\n  alt=\"Grouped bar chart comparing Aleph Prover, Leanstral 1.5, Seed-Prover 1.5 high, Goedel-Architect (w/o NL), AxProverBase and Seed-Prover 1.5 agentic-only across PutnamBench (668/587/580/508/365/359), FATE-H (87/80/66/57) and FATE-X (34/33/24/10); Leanstral 1.5 bars are highlighted as open source.\"\n  caption=\"Leanstral 1.5 (open source, hatched) versus specialized prover systems on PutnamBench, FATE-H, and FATE-X (Mistral, Leanstral 1.5).\"\n/>\n\n<Figure\n  src=\"/articles/leanstral-formal-proofs/fig5.png\"\n  alt=\"Line chart of FLTEval score versus generation budget (pass@1, pass@2, pass@4): Leanstral 1.5 leads at every budget, above Leanstral 1.0, Qwen3.5, Kimi K2.5 and GLM5, reaching about 39% at pass@4.\"\n  caption=\"FLTEval by generation budget — Leanstral 1.5 leads open models 3–10× its size at every pass@k (Mistral, Leanstral 1.5).\"\n/>\n\n## It generalizes past math\n\nBecause the skill it learned is *proof engineering in a repository*, not competition-problem pattern\nmatching, it transfers to code verification:\n\n- **AVL trees.** Leanstral proved the `O(log n)` time-complexity guarantees for a real self-balancing-tree\n  implementation — structural induction mirroring the recursion, unfolding a `TimeM` monad to expose the\n  step counts, exhaustive rebalancing-case analysis. 
It ran **over 2.7 million tokens across 22 context\n  compactions** to close it: the test-time-scaling curve in practice.\n- **Finding real bugs.** In an automated pipeline — Aeneas translates Rust to Lean, Leanstral infers the\n  intended properties and tries to prove each (or disprove it) — across **57 repositories** it flagged 47\n  violated properties, **11 genuine bugs, 5 previously unreported**. One was an integer overflow in\n  `datrs/varinteger`'s zigzag decode: on `U64.MAX`, `value + 1` overflows — a crash in debug, silent\n  corruption in release, exactly the edge case fuzzing tends to miss.\n\n## The take\n\nLeanstral's contribution isn't a clever proof-search algorithm; it's a stance. The formal-proving\nleaderboard has been climbing by wrapping models in ever-more-specialized test-time scaffolds, and\nLeanstral shows that a plain code agent — trained in the loop it's evaluated in, graded by a verifier it\ncan't cheat — matches or beats that machinery at a fraction of the cost, while staying usable by anyone\nwho can drive a coding assistant. The MoE is efficient (6B active), the license is open (Apache-2.0), and\nthe scaling story is the honest kind: no plateau trick, just compute converted into checked proofs. The\nopen question is how far \"just run the agent longer\" goes as problems get genuinely harder than Putnam —\nbut as a demonstration that formal verification can be *practical* infrastructure rather than a\nprover-lab specialty, it's the most encouraging Lean release in a while.\n\n---\n\n*Built on the [Leanstral technical report](https://github.com/mistralai/LeanstralSafeVerify/blob/main/LeanstralReport.pdf)\nand [Mistral's Leanstral 1.5 announcement](https://mistral.ai/news/leanstral-1-5/) (Mistral AI, 2026;\nApache-2.0, weights on Hugging Face). 
Figures are reproduced from Mistral's report and blog post; benchmark\nand cost figures are quoted from those sources.*\n","readingTimeMins":8,"url":"https://ai.thesatyajit.com/articles/leanstral-formal-proofs","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"MiMo-V2-Flash: a 128-token window and one global layer in six","description":"Xiaomi's MiMo-V2-Flash is a 309B / 15B-active MoE that's the strongest open-source software-engineering model — and the interesting part is how it stays fast at 256K context. It interleaves sliding-window attention with a tiny 128-token window and a global-attention layer once every six, cutting the KV cache ~6× while long-context quality holds. Add multi-token prediction for self-speculative decoding and a multi-teacher distillation post-train, and it's a clean study in spending compute where it matters.","date":"2026-07-03","tags":["llm","mixture-of-experts","attention","agents","explainer"],"draft":false,"featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"mimo-v2-flash","body":"Xiaomi's **MiMo-V2-Flash** is, on the numbers, the **strongest open-source model for software\nengineering** — 73.4 on SWE-Bench Verified, and the top score in its comparison table on\nSWE-Bench Multilingual. It's a **309B-parameter Mixture-of-Experts** that activates just **15B**\nper token. But the part worth an article isn't the leaderboard; it's the **attention design** that\nlets a 309B model serve a 256K context quickly. MiMo doesn't use full attention, and it doesn't use\n[linear attention like HydraHead](/articles/hydrahead) either. It interleaves **sliding-window** and\n**global** layers — with a window so small it's almost provocative.\n\n<HybridAttention />\n\n## The 5:1 hybrid, and a 128-token window\n\nThe backbone is 48 layers arranged into **8 hybrid blocks**, each block being **5 sliding-window\nattention (SWA) layers followed by 1 global attention (GA) layer** — a 5:1 ratio. 
Each SWA layer\nattends only to the previous **128 tokens**. That window is tiny on purpose; the report frames it as\nan *inductive bias*: \"smaller windows force the model to focus on local context... mitigat[ing]\noverfitting.\" Two things keep 128 from being crippling:\n\n- **Stacking compounds reach.** Like a stack of small convolutions, five SWA layers see far more than\n  128 tokens — each layer's window sits on top of the last, so information propagates several windows\n  back before you ever hit a global layer.\n- **One global layer per block.** Every sixth layer attends to the *entire* sequence, mixing whatever\n  the local layers couldn't reach directly. Full context is restored periodically, at a sixth of the\n  cost of making every layer global.\n\n<Figure\n  src=\"/articles/mimo-v2-flash/fig1.png\"\n  alt=\"MiMo-V2-Flash architecture: eight hybrid blocks, each stacking five sliding-window-attention (SWA) blocks under one global-attention (GA) block, both with a sparse MoE FFN; the first block uses GA with a dense FFN, and a separate 3-layer MTP module with SWA and a dense FFN feeds tied LM heads to predict several future tokens.\"\n  caption=\"The 5:1 hybrid backbone — five SWA blocks per one GA block, times eight blocks — plus the replicated MTP module for self-speculative decoding (paper, Figure 2).\"\n/>\n\nA few architecture details worth having right: attention is **grouped-query (GQA), not MLA** — SWA\nlayers use 64 query / 8 KV heads, GA layers 64 / 4 — with **partial RoPE** on the first 64 dims and a\n**learnable attention-sink bias** that the report credits with much of the hybrid's stability. The\nMoE is **256 experts, top-8, no shared expert**; only the first block runs a dense FFN.\n\n## Why the window is the point\n\nThe reason to accept a 128-token window is the **KV cache**. In a full-attention model, every layer's\ncache grows with the context length. 
In MiMo, the five SWA layers per block stop growing once the\ncontext passes 128 tokens — only the 1-in-6 global layers keep scaling. So at long context the cache\nis dominated by that one-sixth:\n\n<SwaKvCache />\n\nThe report claims \"nearly a **6× reduction** in KV-cache storage and attention computation for long\ncontexts,\" and the interactive shows exactly where it comes from — the SWA layers flatten, the global\nlayers carry the linear term. This is the same economics as the\n[KV-cache story in inference](/articles/how-llm-inference-works): the cache is what caps concurrency\nand context, so shrinking it 6× is what makes a 309B model cheap to serve at 256K. And crucially,\nlong-context *quality* holds — MiMo posts the best open-source LongBench V2 score and near-perfect\nneedle retrieval (96.7% at 256K), evidence the hybrid isn't paying for its speed with reach.\n\n## Multi-token prediction, for speed\n\nMiMo also ships with **multi-token prediction** built in. It trains one MTP head during pretraining,\nthen replicates it into a **3-layer MTP module** for inference — a built-in draft model for\n**self-speculative decoding**. The reported acceptance length reaches ~3.6 tokens and the measured\nspeedup is up to **2.6× decoding** (2.70× at batch size 96). If you want the mechanism, the\n[multi-token prediction write-up](/articles/multi-token-prediction) covers exactly this\npredict-several-verify-in-one-pass lineage; MiMo is a clean production instance of it.\n\n## How it was trained\n\nPretraining is **27T tokens** in FP8 with the MTP objective, native 32K context, over three stages\n(general → code-heavy with synthetic reasoning → context extension to 256K via RoPE base rescaling,\nnot YaRN). 
The post-training is the notable bit: it uses the **MOPD recipe** — multi-teacher on-policy\ndistillation, the same [MOPD from the recent arXiv digest](/arxiv/2026-06-30) that\n[Agents-A1](/articles/agents-a1) also builds on — in three stages: SFT, domain-specialized RL, then\ndistilling several domain teachers into the student on its own rollouts. It's another data point that\non-policy multi-teacher distillation is becoming the default way to fuse agentic capabilities into one\ndeployable model.\n\n## The numbers\n\nThe headline is agentic coding. On **SWE-Bench Multilingual** — resolving real GitHub issues across\nlanguages — MiMo tops its entire comparison table, closed models included:\n\n<Figure\n  src=\"/articles/mimo-v2-flash/fig2.png\"\n  alt=\"Grouped bar chart comparing MiMo-V2-Flash against DeepSeek-V3.2, K2-Thinking, Claude Sonnet 4.5, GPT-5 (High), and Gemini 3.0 Pro across seven benchmarks: SWE-Bench Verified, SWE-Bench Multilingual, Tau2-Bench, AIME25, GPQA-Diamond, HLE (w/o tool), and Arena-Hard. 
MiMo leads on the two agentic-coding benchmarks and trails the frontier on academic reasoning and general capability.\"\n  caption=\"MiMo-V2-Flash's headline benchmark spread against open and closed frontier models — strongest on agentic coding, competitive elsewhere, behind on HLE and creative writing (paper, Figure 1).\"\n/>\n\n<BenchBars\n  title=\"SWE-Bench Multilingual (%)\"\n  unit=\"%\"\n  bars={[\n    { label: \"MiMo-V2-Flash (open)\", value: 71.7, highlight: true },\n    { label: \"DeepSeek-V3.2-Thinking\", value: 70.2 },\n    { label: \"Claude-Sonnet-4.5\", value: 68.0 },\n    { label: \"Kimi-K2-Thinking\", value: 61.1 },\n    { label: \"GPT-5-High\", value: 55.3 },\n  ]}\n/>\n\nOn **LiveCodeBench-v6** it's the best open model and edges GPT-5-High, trailing only Gemini-3.0-Pro:\n\n<BenchBars\n  title=\"LiveCodeBench-v6 (%)\"\n  unit=\"%\"\n  bars={[\n    { label: \"Gemini-3.0-Pro\", value: 90.7 },\n    { label: \"MiMo-V2-Flash (open)\", value: 85.1, highlight: true },\n    { label: \"GPT-5-High\", value: 84.5 },\n    { label: \"DeepSeek-V3.2-Thinking\", value: 83.3 },\n    { label: \"Kimi-K2-Thinking\", value: 83.1 },\n  ]}\n/>\n\nAnd the one that validates the whole attention bet — **LongBench V2**, where the sliding-window hybrid\ncould have hurt but instead lands best-open, a whisker behind Claude:\n\n<BenchBars\n  title=\"LongBench V2 — long-context (%)\"\n  unit=\"%\"\n  bars={[\n    { label: \"Claude-Sonnet-4.5\", value: 61.8 },\n    { label: \"MiMo-V2-Flash (open)\", value: 60.6, highlight: true },\n    { label: \"DeepSeek-V3.2-Thinking\", value: 58.4 },\n    { label: \"Kimi-K2-Thinking\", value: 48.1 },\n  ]}\n/>\n\nIt's not uniformly frontier — MiMo trails the reasoning leaders on HMMT (84.4), long-context MRCR\n(45.7), and Humanity's Last Exam (22.1), and closed models still edge single-language SWE-Bench\nVerified. 
The fair summary is: **the best open model for agentic software engineering and\nlong-context tasks**, competitive with closed frontier systems on those axes, from a design that's cheaper\nto serve. (The \"150 tokens/second\" and per-token pricing you'll see quoted are from Xiaomi's product\npage, not the technical report; treat them as marketing, not measurements.)\n\n## The take\n\nMiMo-V2-Flash is a satisfying piece of systems thinking more than a scaling flex. The bet is that most\nof what attention does is *local* — so make most layers local and cheap (a 128-token window is an\naggressive way to commit to that), and buy back global reach with one full-attention layer per block,\nwhich acts as a periodic reset. Stack MTP on top for decode speed and MOPD for capability, and you get a 309B\nmodel that serves 256K context at a fraction of a full-attention model's KV cost while topping the\nopen-source agentic-coding board.\n\nThe through-line with [HydraHead](/articles/hydrahead) is worth noticing: both argue you shouldn't pay\nfull-attention cost on every layer; they just cut the budget differently — HydraHead per *head* by\ninterpretability, MiMo per *layer* on a fixed 5:1 schedule. The schedule is blunter, but it's simple,\nit's proven at 309B, and the KV-cache math is undeniable. Open weights (Apache-2.0), MTP included — a\nstrong, honest release.\n\n---\n\n*Built on the [MiMo-V2-Flash Technical Report](https://arxiv.org/abs/2601.02780) (Xiaomi LLM-Core,\n2025) and the [model release](https://github.com/XiaomiMiMo/MiMo-V2-Flash) (Apache-2.0, open weights +\n3-layer MTP). 
Benchmark figures are quoted from the report's tables; throughput and pricing figures\nare from Xiaomi's product page.*\n","readingTimeMins":7,"url":"https://ai.thesatyajit.com/articles/mimo-v2-flash","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"Set Diffusion: one knob from autoregression to diffusion","description":"Autoregressive models are sequential but KV-cacheable; diffusion LMs are parallel and any-order but fixed-length and can't cache. Block diffusion split the difference with fixed left-to-right blocks. Set Diffusion generalizes all three by factorizing generation over flexible-position, flexible-length token *sets* — recovering AR, block diffusion, and order-agnostic diffusion as special cases, while keeping a KV cache that updates every step and the any-order flexibility that enables infilling. A walk through the spectrum and the results.","date":"2026-07-03","tags":["llm","diffusion","language-models","inference-optimization","explainer"],"draft":false,"featured":false,"interest":5,"helpful":3,"kind":"articles","slug":"set-diffusion","body":"There's a spectrum hiding behind two families of language model that usually get treated as\nopposites. **Autoregressive** models generate strictly left-to-right, one token per step —\nsequential, but they support the [KV cache](/articles/how-llm-inference-works) that makes serving\ncheap. **Diffusion** LMs ([like iLLaDA](/articles/illada-diffusion-language-model)) denoise many\ntokens in parallel and in any order — flexible, but fixed-length, and they *can't* cache (every step\nneeds full bidirectional context). 
**Set Diffusion** (Arriola & Kuleshov, Cornell — the block\ndiffusion authors) makes the spectrum explicit and shows you can slide along it with a single idea:\nchange *which token sets you decode together, and in what order.*\n\n<Spectrum />\n\n## The spectrum, precisely\n\nThe prior bridge between these worlds was **block diffusion** (BD3-LM): generate fixed-size\ncontiguous blocks left-to-right, with diffusion inside each block. That buys variable length and a\nper-block KV cache, but the block is rigid — it can only extend left-to-right, so no infilling, and\nthe cache can only update once a whole block finishes decoding (the within-block denoising needs\nbidirectional context until then).\n\n<Figure\n  src=\"/articles/set-diffusion/fig1.png\"\n  alt=\"Two decoding traces over wall-clock time. Left, Block Diffusion fills fixed sequential blocks and only updates the KV cache after each block completes. Right, Set Diffusion reveals flexible-position, position-biased token sets and updates the KV cache after every step, decoding mostly left-to-right but filling later positions when likely.\"\n  caption=\"Set diffusion decodes flexible-position token sets and refreshes the KV cache after every step, while block diffusion is locked to fixed sequential blocks that can only update the cache once a whole block finishes (paper, Figure 1).\"\n/>\n\nSet Diffusion's move is to stop thinking in blocks and think in **sets**. A *set* is an\narbitrary-position, arbitrary-length subset of token positions you decode together. 
Factorize the\nlikelihood over a sequence of disjoint sets that cover the whole sequence, and the familiar models\nfall out as special cases:\n\n- **Autoregression** — every set is a single token, in left-to-right order.\n- **Order-agnostic diffusion** — one set of the whole sequence, decoded in random order.\n- **Block diffusion** — fixed contiguous blocks, left-to-right.\n\nThey're not different architectures; they're different **set schedules** for the same object. That's\nthe whole conceptual payoff — and once you see it, the interesting question is how to pick a schedule\n*between* the corners.\n\n## Two knobs, one window\n\nSet diffusion exposes two knobs the block-size parameter conflates: **set size** (how many tokens you\ncommit per step — parallelism) and **ordering bias** (how left-to-right you stay — quality). In\npractice both are controlled by one schedule parameter, a window width `w`: each position gets an\nactive generation window of width `w`, and the widths determine how much decoding overlaps.\n\n<WindowKnob />\n\nThe paper makes the endpoints rigorous. As `w → 1/L` the windows stop overlapping, tokens generate\none at a time in order, and the training objective *becomes the tight autoregressive ELBO* — best\nperplexity, no parallelism. As `w → 1` every position shares one schedule and you recover\norder-agnostic diffusion — maximally parallel and any-order. Set diffusion lives in between: a sliding\nwindow that decodes a few tokens per step, mostly in order but flexible enough to fill gaps. Smaller\n`w` buys perplexity; larger `w` buys parallelism and any-order decoding. One dial, the whole spectrum.\n\n<Figure\n  src=\"/articles/set-diffusion/fig2.png\"\n  alt=\"Four panels of per-token reveal-time CDFs for a length-4 sequence, plotting the probability a token has been revealed against normalized ordering time. 
AR shows fully staggered step CDFs (strict left-to-right); two sliding-window SetDLM panels show progressively overlapping ramps as the window widens; MDLM shows all tokens sharing one linear schedule (order-agnostic).\"\n  caption=\"Per-token reveal-time CDFs for L=4: the decoding window w slides the ordering bias from strict left-to-right AR, through sliding-window set diffusion, to fully order-agnostic MDLM diffusion (paper, Figure 3).\"\n/>\n\n## Why sets get to keep the KV cache\n\nThe systems win is that generation is **set-causal**: each set attends to itself and to all\n*previously decoded* sets, but not to future ones. Because the ordering across sets is causal,\nfinished sets never need reprocessing — their keys and values are cached and reused, and the cache\n**updates after every inference step**. That's the thing pure diffusion can't do (it needs full\nbidirectional context, so nothing is ever \"final\" enough to cache) and the thing block diffusion does\nonly *per block* (bidirectional context *within* the block blocks earlier caching). The ablation is\nstark: turn the causal mask and KV caching off and GSM8K accuracy collapses from **26.6 to 6.4** while\nthroughput drops too — the causal set structure is buying both.\n\nThe flexibility also gives **infilling** for free. Because sets are flexible-position, the schedule can\nselect gap tokens and condition them on the clean tokens on *both* sides — something block diffusion's\nstrict left-to-right blocks structurally cannot do.\n\n## The numbers\n\nAt GPT-2-small scale (110M params), the headline is that set diffusion beats block diffusion on *both*\naxes at once — accuracy and speed. 
On GSM8K it tops the whole diffusion field:\n\n<BenchBars\n  title=\"GSM8K — 0-shot pass@1 (%)\"\n  unit=\"%\"\n  bars={[\n    { label: \"SW-SetDLM (S≤8)\", value: 66.41, highlight: true },\n    { label: \"BD3-LM (block, S=4)\", value: 63.53 },\n    { label: \"BD3-LM (block, S=8)\", value: 56.94 },\n    { label: \"BD3-LM (block, S=16)\", value: 50.49 },\n    { label: \"MDLM (diffusion)\", value: 6.37 },\n  ]}\n/>\n\n<Figure\n  src=\"/articles/set-diffusion/fig3.png\"\n  alt=\"Speed-accuracy scatter for GSM8K: x-axis decoding throughput in tokens per second, y-axis accuracy. Filled markers for set diffusion (S<=8 and S<=32) sit above and to the right of the open block-diffusion markers (S=4 and S=16) across the frontier, while a single AR star sits highest in accuracy at low throughput. An arrow labels the up-and-right direction as improved tradeoff.\"\n  caption=\"On GSM8K, set diffusion (filled markers) traces a strictly better speed-accuracy frontier than block diffusion (open markers), while a plain autoregressive model still tops accuracy at low throughput (paper, Figure 5).\"\n/>\n\n— and it does so at higher throughput than any of those block-diffusion settings (60.4 vs 55.4\ntok/s). It won't be lost on you that an ordinary AR transformer scores higher still (75.7) and, at this\nsmall scale with full diffusion sampling steps, is even a bit faster; the point of a diffusion LM isn't\nto beat AR on a left-to-right benchmark, it's to keep AR's efficiency *while* offering any-order\ngeneration. Which is exactly where the second result lands — **infilling**, filling a gap given text on\nboth sides, where set diffusion clearly beats block diffusion:\n\n<BenchBars\n  title=\"ROCStories infilling — ROUGE-1 (fill 3 of 5 sentences)\"\n  unit=\"\"\n  bars={[\n    { label: \"SW-SetDLM (S≤32)\", value: 18.1, highlight: true },\n    { label: \"BD3-LM (block, S=16)\", value: 15.8 },\n  ]}\n/>\n\nAt ~25% faster decoding, no less. 
Across the rest of the suite it's the same shape: on OpenWebText\nit matches block diffusion's perplexity at **22% higher throughput** (and runs ~13× faster than\ncacheless MDLM), on LM1B it posts the best diffusion perplexity *and* the highest diffusion throughput,\nand on CNN/DailyMail it's competitive on ROUGE at up to 10% faster. A strictly better speed-quality\nfrontier than block diffusion, plus the infilling block diffusion gives up.\n\n## The take\n\nWhat I like about Set Diffusion is that it's a *reframing* that pays off, not a new mechanism bolted\non. \"Interpolate between AR and diffusion by varying block size\" was already a good idea (that's block\ndiffusion); the insight here is that block size was the wrong knob — the right one is the **order in\nwhich token sets are generated**, and once you factorize over flexible sets instead of rigid blocks you\nget a strictly larger design space that still contains AR, still contains diffusion, and adds a\nKV-cacheable, any-order, infilling-capable middle that block diffusion couldn't reach.\n\nThe honest caveats: everything is at 110M parameters, an AR model still wins the straight\nleft-to-right benchmarks, and the ideal window schedule is currently hand-tuned (learning it is future\nwork). But as a clean statement of *what the AR↔diffusion spectrum actually is*, and a practical model\nthat sits usefully in its middle, it's the most satisfying diffusion-LM paper I've read since block\ndiffusion itself — which makes sense, given it's the same group closing the loop on their own idea.\n\n---\n\n*Built on [Set Diffusion: Interpolating Token Orderings Between Autoregression and Diffusion](https://arxiv.org/abs/2607.01775)\n(Marianne Arriola, Volodymyr Kuleshov; Cornell, ICML 2026), which generalizes the same authors'\n[Block Diffusion](https://arxiv.org/abs/2503.09573). Code and weights are at\n[kuleshov-group/setdlms](https://github.com/kuleshov-group/setdlms). 
Benchmark figures are quoted from\nthe paper's tables (110M-parameter models).*\n","readingTimeMins":7,"url":"https://ai.thesatyajit.com/articles/set-diffusion","signal":{"interest":5,"helpful":3,"score":8,"level":4,"label":"High"}},{"title":"BM25: the ranking function that refuses to die","description":"BM25 is a ~35-year-old bag-of-words ranking function, and it's still the default in Lucene, Elasticsearch, and OpenSearch — still a punishing baseline for billion-dollar neural retrievers, and still the sparse half of most hybrid search. It's TF-IDF with two honest fixes: term-frequency saturation and document-length normalization. A walk through the full formula, term by term, with a live BM25 engine you can query, and ~30 lines of Python that reproduce it.","date":"2026-07-02","tags":["information-retrieval","search","ranking","algorithms","explainer"],"draft":false,"featured":false,"interest":3,"helpful":5,"kind":"articles","slug":"bm25","body":"Here's a fact that should be more embarrassing for machine learning than it is: a ranking\nfunction designed in the early 1990s, with **no learned parameters** and no notion of meaning,\nis still the default text scorer in Lucene, Elasticsearch, and OpenSearch — and still a\nbaseline that dense neural retrievers regularly *fail to beat* out of domain. It's called\n**BM25**, and if you do search, retrieval, or RAG, it's worth understanding exactly, because\nyou're almost certainly running it.\n\n\"BM25\" is *Best Matching 25* — roughly the 25th matching function tried in the Okapi\ninformation-retrieval project at City University London (Robertson, Spärck Jones, and\ncolleagues), grounded in the probabilistic relevance framework. Strip away the theory and it's\na small, honest idea: score a document by summing, over the query terms it contains, how\n*rare* each term is times how *emphatically* the document uses it — with two corrections that\nmake it work in practice. 
Let's build it.\n\n## Start with TF-IDF, and its two flaws\n\nThe classic bag-of-words score is **TF-IDF**: for each query term, multiply its frequency in\nthe document (**term frequency**, TF — \"this document uses the word a lot\") by its rarity\nacross the corpus (**inverse document frequency**, IDF — \"and the word is distinctive\"). Sum\nover query terms. It's a reasonable instinct, and it has two clear problems:\n\n1. **TF grows linearly and forever.** A document that says \"jaguar\" 40 times scores 40× one\n   that says it once. But the difference between 1 and 2 mentions is meaningful; the difference\n   between 39 and 40 is noise. Relevance *saturates*.\n2. **Length is unaccounted for.** A 10,000-word document will rack up term counts just by being\n   long, drowning out a tight 100-word document that's genuinely more on-topic.\n\nBM25 is, almost exactly, **TF-IDF with those two flaws fixed.** Here's the whole thing:\n\n$$\n\\text{score}(D, Q) = \\sum_{q \\in Q} \\text{IDF}(q)\\cdot\\frac{f(q, D)\\,(k_1 + 1)}{f(q, D) + k_1\\left(1 - b + b\\,\\dfrac{|D|}{\\text{avgdl}}\\right)}\n$$\n\nThree pieces do all the work: the **IDF** weight, the **saturating** term-frequency factor\n(the $k_1$ part), and the **length normalization** (the $b$ part). Here they are, color-coded —\nthe map for the rest of the piece:\n\n<FormulaAnatomy />\n\nTake them one at a time.\n\n## Rarity: the IDF term\n\n$f(q,D)$ is just the count of term $q$ in document $D$. The multiplier in front is inverse\ndocument frequency — how much a term's presence should count, based on how rare it is:\n\n$$\n\\text{IDF}(q) = \\ln\\!\\left(1 + \\frac{N - n(q) + 0.5}{n(q) + 0.5}\\right)\n$$\n\n$N$ is the number of documents, $n(q)$ the number containing $q$. A term in every document\n(like \"the\") gets an IDF near zero — matching it tells you nothing. A term in one document out\nof a million gets a large IDF — matching it is almost the whole story. 
This is why BM25 needs\nno stopword list: common words are down-weighted *automatically* because their IDF collapses.\n(The $\\ln(1 + \\cdots)$ form is Lucene's; it keeps IDF non-negative. The original\nRobertson–Spärck-Jones IDF drops the $+1$ and can go slightly negative for terms in more than\nhalf the corpus.)\n\n<IdfRarity />\n\n## Saturation: the k1 term\n\nNow the first fix. Instead of using the raw count $f(q,D)$, BM25 passes it through a saturating\nfunction $\\frac{f\\,(k_1+1)}{f + k_1\\cdot(\\dots)}$ that rises fast for the first few occurrences\nand then flattens toward an asymptote. The parameter $k_1$ controls how fast:\n\n<TfSaturation />\n\nThe first mention of a term is strong evidence; each additional mention adds less. That curve —\nnot the straight line of TF-IDF — is how relevance actually behaves. Set $k_1 = 0$ and it\nbecomes binary (any occurrence counts the same); crank $k_1$ up and it straightens back toward\nlinear TF. Lucene's default is **$k_1 = 1.2$**.\n\n## Length: the b term\n\nThe second fix lives in the denominator: the $k_1$ term is scaled by\n$\\left(1 - b + b\\,\\frac{|D|}{\\text{avgdl}}\\right)$, where $|D|$ is the document's length and\n$\\text{avgdl}$ the average across the corpus. A longer-than-average document gets a bigger\ndenominator, so its term-frequency factor is discounted — the same two mentions count for less\nwhen they're diluted across more text:\n\n<LengthNorm />\n\nThe knob $b \\in [0,1]$ sets how aggressively. At $b = 0$ length is ignored entirely; at $b = 1$\nit's fully normalized. The default **$b = 0.75$** is a compromise that's proven hard to beat.\n\n## Put it together: a live BM25 engine\n\nThat's the entire algorithm. Here it is running on a tiny corpus — edit the query, drag $k_1$\nand $b$, and every document is re-scored with the exact formula above. 
Expand a document to see\neach query term's contribution (its IDF times the saturated, length-normalized factor):\n\n<Bm25Scorer />\n\nNotice the behaviors fall out on their own: rare query terms dominate the ranking, repeating a\ncommon word barely moves the score, and a long document doesn't win just for being long. No\ntraining, no embeddings — just term statistics arranged sensibly.\n\n## Thirty lines of Python\n\nThere's no magic hiding in a library. The whole thing is a couple of counters and the formula:\n\n```python\nimport math\nfrom collections import Counter\n\nclass BM25:\n    def __init__(self, corpus, k1=1.2, b=0.75):\n        self.k1, self.b = k1, b\n        self.docs = [doc.lower().split() for doc in corpus]\n        self.N = len(self.docs)\n        self.avgdl = sum(len(d) for d in self.docs) / self.N\n        self.tf = [Counter(d) for d in self.docs]          # term counts per doc\n        self.df = Counter()                                 # docs containing each term\n        for d in self.docs:\n            for term in set(d):\n                self.df[term] += 1\n\n    def idf(self, term):\n        n = self.df.get(term, 0)\n        return math.log(1 + (self.N - n + 0.5) / (n + 0.5))\n\n    def score(self, query, i):\n        d, tf = self.docs[i], self.tf[i]\n        norm = self.k1 * (1 - self.b + self.b * len(d) / self.avgdl)\n        s = 0.0\n        for term in query.lower().split():\n            f = tf.get(term, 0)\n            if f:\n                s += self.idf(term) * f * (self.k1 + 1) / (f + norm)\n        return s\n\n    def rank(self, query):\n        return sorted(((i, self.score(query, i)) for i in range(self.N)),\n                      key=lambda x: -x[1])\n```\n\nIn production you don't loop over every document — you keep an **inverted index** (term →\npostings list of documents that contain it) and only score documents that share a term with the\nquery. 
That's what makes BM25 fast enough to serve web-scale corpora on commodity hardware, and\nit's the same kind of index that has powered Lucene since its first release in 1999.\n\n<InvertedIndex />\n\n## The variants you'll meet\n\nThe core formula spawned a small family, mostly patching edge cases:\n\n- **BM25+** adds a small constant $\\delta$ (default 1.0) to the term-frequency factor, fixing a\n  subtle bug where very long documents can be over-penalized to the point that a document\n  *containing* a rare term scores below one that doesn't.\n- **BM25F** (\"fielded\") scores structured documents — title, body, anchor text — by combining\n  per-field term frequencies *before* saturation, with a weight per field, so a title match\n  counts more than a body match. It's what real search engines actually run.\n- **BM25L** re-weights to stop long documents from being unfairly buried.\n- Lucene's implementation is BM25 with the non-negative IDF above and per-field length norms\n  quantized into a single byte — the pragmatic engineering version of the equation.\n\n## Why it won't die\n\nNeural retrieval was supposed to make this obsolete years ago. It hasn't, for reasons worth\nnaming:\n\n- **It's a brutal baseline.** On out-of-domain benchmarks (the BEIR suite made this famous),\n  BM25 beats or ties many dense retrievers — because it can match *any* term, including names,\n  codes, and jargon a fixed-vocabulary embedding never saw in training. It never has an\n  \"out-of-distribution\" moment.\n- **It's the sparse half of hybrid search.** The current default in serious systems is to run\n  BM25 *and* a dense retriever and fuse the results (often with reciprocal-rank fusion). Lexical\n  precision plus semantic recall beats either alone, which is why \"BM25 is obsolete\" quietly\n  became \"BM25 is one of your two retrievers.\"\n- **It's cheap and interpretable.** No GPU, no training, no embedding drift. 
When it ranks a\n  document highly you can point at exactly which rare terms did it — which matters when a\n  ranking has to be debugged or defended.\n\nThe lesson I take from BM25 is that a model doesn't have to learn anything to encode real\nknowledge about a problem. Every piece of it is a hypothesis about relevance — rarity matters,\nrepetition saturates, length dilutes — written as arithmetic instead of learned from data. Three\ngood hypotheses, two tunable knobs, and thirty-five years later it's still the thing your search\nbar is probably running.\n\n---\n\n*BM25 originates in the Okapi project (Stephen Robertson, Karen Spärck Jones, et al.) and the\nprobabilistic relevance framework; the formulation and non-negative IDF here follow\n[Lucene's `BM25Similarity`](https://lucene.apache.org/core/9_9_1/core/org/apache/lucene/search/similarities/BM25Similarity.html)\n(defaults $k_1 = 1.2$, $b = 0.75$). BM25+ / BM25L are from Lv & Zhai (2011).*\n","readingTimeMins":7,"url":"https://ai.thesatyajit.com/articles/bm25","signal":{"interest":3,"helpful":5,"score":8,"level":4,"label":"High"}},{"title":"TwoTower: giving a diffusion LM a frozen autoregressive memory","description":"Diffusion language models generate in parallel, but they've all shared one network between two jobs that fight each other — causally representing the clean context, and bidirectionally denoising the noisy block. NVIDIA's Nemotron-Labs-TwoTower splits them: a frozen autoregressive tower holds the context, a trainable denoiser refines each block by cross-attending to it. Built on a 30B Mamba-Transformer MoE, it keeps 98.7% of the autoregressive baseline's quality at 2.42× the generation throughput. 
A walk through the architecture and the honest numbers.","date":"2026-07-02","tags":["llm","diffusion","inference-optimization","architecture","explainer"],"draft":false,"featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"nemotron-twotower","body":"An autoregressive (AR) language model emits one token per forward pass — the sequential axis\nis the entire sequence, which is exactly why decode is the slow, [memory-bound half of\ninference](/articles/how-llm-inference-works). **Diffusion language models** ([like\niLLaDA](/articles/illada-diffusion-language-model)) offer the escape: denoise many tokens per\nstep and refine iteratively, so generation can be *parallel*. The catch is that every diffusion\nLM so far has made one network do two jobs at once — and those jobs pull in opposite directions.\n\nNVIDIA's **Nemotron-Labs-TwoTower** fixes that by refusing to share. It splits the model into\ntwo towers: a **frozen autoregressive context tower** and a **trainable diffusion denoiser\ntower** that reads from it. Built on a 30B Mamba-Transformer MoE, it retains **98.7%** of the\nautoregressive baseline's quality while generating **2.42× faster**.\n\n<TwoTower />\n\n<Figure\n  src=\"/articles/nemotron-twotower/fig1.png\"\n  alt=\"Side-by-side diagram of the two towers. Left: the AR/Context Tower — token embedding feeding stacked Mamba-2, Self-Attn and MoE blocks over the clean prompt tokens, with a greyed-out LM Head. 
Right: the Diffusion/Denoiser Tower — masked [M] tokens fed through matching Mamba-2 and MoE blocks whose attention layer is Self-Attn + Cross-Attn, receiving Mamba states and the KV cache from the context tower and looping ×T to unmask the block.\"\n  caption=\"The two towers: the frozen AR/context tower (left) hands its Mamba states and KV cache to the trainable denoiser tower (right), which cross-attends into them to unmask each block over T diffusion steps (paper, Figure 1a).\"\n/>\n\n## One network, two jobs that fight\n\nHere's the tension existing diffusion LMs live with. At every denoising step, the same decoder\nhas to (1) *represent the clean tokens* already committed — which wants strong **causal**\nprocessing, the thing AR pretraining is great at — and (2) *denoise the corrupted block* —\nwhich wants **bidirectional** attention over the noisy tokens. As the paper puts it, this\n\"entanglement pulls the same set of weights in different directions, limiting their capacity to\nexcel at either.\" A single set of weights forced to be both a causal reader and a bidirectional\ndenoiser ends up mediocre at both.\n\nTwoTower's move is to stop asking one network to be both:\n\n- The **context tower** is the *frozen* pretrained AR model. It causally processes clean tokens\n  and never gets a gradient — so it keeps every bit of the 25T-token backbone's context ability\n  intact. It carries the persistent left-context (KV cache and Mamba states) across blocks.\n- The **denoiser tower** is trained from the diffusion objective and does nothing but refine the\n  current noisy block with bidirectional attention. It reads the context through **layer-aligned\n  cross-attention**: denoiser layer *i* attends to context layer *i*, over both the frozen\n  tower's committed blocks and its own in-block tokens.\n\nThe base is `Nemotron-3-Nano-30B-A3B`, an open hybrid **Mamba-Transformer MoE** — 30B total,\n~3B active, 52 layers (23 Mamba-2, 6 attention, 23 MoE). 
Cross-attending *into* a Mamba-hybrid\nsounds awkward (Mamba is a recurrence, not a KV cache), and the trick is neat: the **Mamba chunk\nsize is matched to the diffusion block size**, so the existing kernel exposes clean recurrent\nstates exactly at block boundaries — right where the denoiser needs them.\n\n## Block-wise autoregressive diffusion\n\nTwoTower isn't fully parallel and isn't fully sequential — it's **autoregressive across blocks,\ndiffusion within a block**. Text is chunked into blocks (size **16** by default); blocks are\ngenerated left-to-right, each conditioned on the finished ones, but *within* a block all tokens\nare denoised together over a few steps:\n\n<BlockDiffusion />\n\nThe diffusion is masked/absorbing-state — the same LLaDA-style \"replace tokens with `[MASK]`\nand predict them back\" as [iLLaDA](/articles/illada-diffusion-language-model), with a linear\nnoise schedule. The number of denoising steps is *adaptive*: a confidence sampler commits any\ntoken whose prediction clears a threshold (γ = 0.8) immediately and lets the uncertain ones wait\nanother step. In practice most tokens of a block resolve in the **first** step, so a block costs\nfar fewer forward passes than its token count — which is the whole source of the speedup. The\nsequential axis is now the number of *blocks*, not the number of *tokens*.\n\nOne sharp constraint falls out of this: you have to **sample with the same block size you\ntrained on**. Sample with blocks *larger* than training and generation collapses — GSM8K drops\nfrom 89.8 to 2.2 at a sampling block of 64. The block size isn't a free inference knob; it's\nbaked in at training time.\n\n## Why decoupling is the whole point\n\nThe paper's central experiment is an ablation that isolates the decoupling. 
Build the model\nthree ways from the same backbone and measure how much quality survives versus the AR baseline:\n\n<Decouple />\n\nThe entangled single tower — one network trained jointly for both roles — loses 21–26% across\ngeneral, code, and math. Continued AR training does better. But freezing the context tower and\ntraining a *separate* denoiser keeps the most, losing only 6–11%. That gap is the argument:\nneither role compromises the other when they don't share weights. It's the same instinct as\n[HydraHead](/articles/hydrahead) — match the mechanism to the job — applied to whole towers\ninstead of individual heads.\n\n## The numbers\n\nThe released checkpoint is a genuinely strong model in absolute terms — this isn't a toy that\ntrades away quality for speed:\n\n<BenchBars\n  title=\"Nemotron-Labs-TwoTower — released checkpoint (S=16), accuracy (%)\"\n  unit=\"%\"\n  bars={[\n    { label: \"GSM8K\", value: 89.84, highlight: true },\n    { label: \"MATH-500\", value: 81.05, highlight: true },\n    { label: \"MMLU\", value: 78.32, highlight: true },\n    { label: \"Multilingual\", value: 77.15, highlight: true },\n    { label: \"HumanEval\", value: 76.4, highlight: true },\n  ]}\n/>\n\nAggregated, that's **98.7%** of the autoregressive baseline's quality — the headline claim.\nAnd the speed lever is the block size: bigger blocks mean more tokens denoised in parallel per\nstep, so higher throughput (the released checkpoint reaches **2.42×**):\n\n<BenchBars\n  title=\"Generation throughput vs AR baseline, by block size (×)\"\n  unit=\"×\"\n  bars={[\n    { label: \"block 32\", value: 2.25, highlight: true },\n    { label: \"block 16\", value: 2.02, highlight: true },\n    { label: \"block 8\", value: 1.71, highlight: true },\n  ]}\n/>\n\n<Figure\n  src=\"/articles/nemotron-twotower/fig2.png\"\n  alt=\"Grouped bar chart comparing the Nemotron-3-Nano autoregressive base (light green) against the TwoTower diffusion model (dark green). 
Left panel, accuracy by category: General Knowledge 70.6 vs 69.6, Code 77.0 vs 74.9, Math 88.4 vs 85.4, Multilingual 77.4 vs 80.4, Commonsense 85.6 vs 85.9. Right panel, relative generative throughput: AR baseline 1.00 vs TwoTower 2.42.\"\n  caption=\"Category-level accuracy is near-parity with the AR baseline (diffusion even edges ahead on multilingual and commonsense) while generative throughput jumps to 2.42× (paper, Figure 2).\"\n/>\n\nA few honest caveats, because this is a fresh preprint and the framing invites them. The\nquality \"retention\" is an aggregate; the per-category drops aren't uniform — **code (−10.5%) and\nmath (−11.3%)** take the biggest hits, exactly the tasks where a single wrong token derails the\nanswer. The paper reports **no comparison to other diffusion LMs** (LLaDA, Dream, external\nblock-diffusion) — only its own AR baseline and internal ablations — so \"best diffusion LM\"\nis not a claim it makes or supports. Throughput is reported only as a relative speedup, with no\nabsolute tokens/sec. The 2.42× figure belongs to the released checkpoint, while the ablation\ntables show 2.02× at the same block size under a different recipe. And running two towers means\nthe frozen context tower's weights sit resident in memory on top of the denoiser.\n\n## The take\n\nThe appeal here is architectural honesty. Diffusion LMs have quietly been asking one network to\nbe a causal historian and a bidirectional editor simultaneously, and TwoTower's contribution is\nmostly the observation that you shouldn't — plus the engineering to make cross-attention into a\nfrozen Mamba-hybrid actually work (the chunk-size-equals-block-size trick is the load-bearing\ndetail). 
Keeping the pretrained AR tower *frozen* is the elegant part: you inherit a 25T-token\nbackbone's context ability for free and spend all your training budget teaching the one thing\nthat's genuinely new, bidirectional block refinement.\n\nWhether this is the design that finally makes diffusion decoding a default is still open — the\ncode/math gap is real, and 2.42× on two H100s with extra resident weights is a solid but not\nseismic win. But \"give the diffusion model a frozen autoregressive memory instead of making it\ngrow its own\" is the kind of clean decomposition that tends to stick, and the weights are out\n(CC BY 4.0) if you want to poke at it.\n\n---\n\n*Built on [Nemotron-Labs-TwoTower: Diffusion Language Modeling with Pretrained Autoregressive\nContext](https://arxiv.org/abs/2606.26493) (Reda, Kamalu, Waleffe, Patwary, Shoeybi, Catanzaro;\nNVIDIA, 2026). Benchmark and throughput figures are quoted from the paper's tables; category-level\nAR-vs-TwoTower comparisons are from its Figure 2.*\n","readingTimeMins":7,"url":"https://ai.thesatyajit.com/articles/nemotron-twotower","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"TurboQuant: rotate first, then quantize the KV cache","description":"The KV cache is the memory that caps how long a context and how big a batch you can serve. You can quantize it, but keys and values have outliers that make naive quantization lossy — and the usual fix, per-channel calibration, isn't available online. TurboQuant's move is to rotate every vector by a random orthogonal matrix first: that spreads the energy into a known Beta distribution, so one data-free optimal quantizer fits every vector, and a 1-bit residual trick keeps the attention scores unbiased. 
Near-optimal distortion, no calibration — and it shows up in vLLM, llama.cpp, and vector search.","date":"2026-07-02","tags":["llm","inference-optimization","quantization","kv-cache","explainer"],"draft":false,"featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"turboquant-kv-cache","body":"The [KV cache runs the economics of LLM serving](/articles/how-llm-inference-works): it grows\nlinearly with context length, per layer, and it's what decides how many requests fit on a GPU.\nThe obvious lever is to **quantize** it — store the cached keys and values in 2–4 bits instead\nof 16. The catch is that key and value vectors have **outlier coordinates**: a few dimensions\ncarry most of the magnitude, so a fixed set of quantization levels either clips the outliers or\nwastes precision on the empty middle. The standard fix — calibrate a per-channel quantizer on\nsample data — doesn't work *online*, where new tokens stream in and you have no calibration set.\n\n**TurboQuant** (Zandieh, Daliri, Hadian, Mirrokni — Google Research, [arXiv 2504.19874](https://arxiv.org/abs/2504.19874))\nsolves this with one idea: **rotate the vector before you quantize it.**\n\n<Rotation />\n\n## Why rotation is the whole trick\n\nMultiply a vector by a random orthogonal matrix and you don't change its length or its inner\nproducts with other (equally rotated) vectors — but you *do* spread its energy evenly across all\ncoordinates. After the rotation, the coordinates of a unit vector provably follow a known\n**Beta distribution**, $f(x) \\propto (1 - x^2)^{(d-3)/2}$ on $[-1, 1]$, which for large $d$\ntightens to a Gaussian $\\mathcal{N}(0, 1/d)$. Every coordinate of every rotated vector looks the\nsame, and there are no outliers left to clip.\n\nThat's what makes the quantizer **data-free**. 
Because you know the post-rotation distribution\nanalytically, you can solve for the optimal scalar quantizer (Lloyd–Max: the levels that minimize\nexpected squared error against that Beta density) *once, ahead of time*, and it's optimal for\nevery vector you'll ever see — no calibration pass, which is exactly what an online KV cache\nneeds. The paper proves this is **near-optimal**: TurboQuant's distortion is within a small\nconstant factor (about **2.7×**) of the information-theoretic lower bound for *any* vector\nquantizer, at every bit-width and dimension.\n\n<Figure\n  src=\"/articles/turboquant-kv-cache/fig1.png\"\n  alt=\"Two log-scale plots of quantization distortion versus bit-width (1 to 5 bits). Left: inner-product error for TurboQuant-prod and TurboQuant-mse sits between a green lower bound and a red upper bound. Right: mean squared error for TurboQuant-mse tracks just above the lower bound and hugs the upper bound.\"\n  caption=\"TurboQuant's inner-product error (left) and MSE (right) stay wedged between the theoretical lower and upper distortion bounds at every bit-width — the near-optimal, data-free result the whole method rests on (paper, Figure 3).\"\n/>\n\n## The bias nobody mentions\n\nThere's a subtlety the paper is unusually careful about. An MSE-optimal quantizer minimizes\nreconstruction error — but attention doesn't reconstruct vectors, it takes **inner products**\n(query · key) and softmaxes them. And an MSE-optimal quantizer is *biased* as an inner-product\nestimator: on average its scores are systematically off, which quietly warps the attention\nweights. TurboQuant fixes this with a two-stage scheme — quantize with the MSE quantizer, then\nstore the **sign of the residual** under a random projection (a 1-bit *Quantized Johnson–\nLindenstrauss* transform, from the same authors' earlier QJL work). 
That 1-bit correction is\nexactly what cancels the bias, giving an **unbiased** inner-product estimate:\n\n<Pipeline />\n\nUnbiasedness is the part that lets quality hold at aggressive bit-widths: the paper reports\n**absolute quality neutrality at 3.5 bits per channel**, and only marginal degradation at 2.5.\n\n<Figure\n  src=\"/articles/turboquant-kv-cache/fig2.png\"\n  alt=\"A 3-by-2 grid of needle-in-a-haystack recall heatmaps (depth percent vs token limit from 4k to 104k) for Llama-3.1-8B. SnapKV (0.858), PyramidKV (0.895) and KIVI (0.981) show scattered non-green cells where recall drops; PolarQuant (0.995), Full-Precision (0.997) and TurboQuant (0.997) are almost entirely green.\"\n  caption=\"Needle-in-a-haystack recall for Llama-3.1-8B under a 0.25 KV-cache budget: despite more than 4x compression, TurboQuant matches the full-precision baseline (0.997) while SnapKV, PyramidKV and KIVI visibly lose recall (paper, Figure 4).\"\n/>\n\n## What it costs, what it buys\n\nAn implementation ([`0xsero/turboquant`](https://github.com/0xsero/turboquant)) wires this into\nvLLM and makes the tradeoff concrete. It allocates bits **asymmetrically** — keys get **3 bits**\nwith the unbiased inner-product quantizer (attention scores are precision-sensitive), values get\n**2 or 4 bits** with simpler group quantization (value aggregation is more forgiving). The\nreconstruction quality splits cleanly along that line:\n\n<BenchBars\n  title=\"Reconstruction cosine similarity vs full precision\"\n  unit=\"\"\n  max={1}\n  bars={[\n    { label: \"keys · 3-bit (unbiased)\", value: 1.0, highlight: true },\n    { label: \"values · 4-bit\", value: 0.997, highlight: true },\n    { label: \"values · 2-bit\", value: 0.94 },\n  ]}\n/>\n\nKeys reconstruct essentially perfectly; 4-bit values are near-lossless; 2-bit values are where\nthe quality actually gives (0.94), which is why the implementation recommends 4-bit values for\nanything sensitive. 
Net, it compresses the full-attention KV cache about **4.4×**, and that turns\nstraight into context length:\n\n<KvMemory />\n\nOn the reported runs, that's **30 GB of KV cache freed** on a 4-GPU RTX 5090 box and a **2.0×**\njump in max context (457K → 914K tokens) for a dense model; a MoE model with linear-attention\nlayers gets less (**1.45×**), because those layers keep a recurrent state that doesn't compress.\nThroughput barely moves (**+5.7%** prefill, **+3.1%** decode) — this is a *memory* win, not a\nspeed one, and the honest read is that the value comes from fitting longer contexts and bigger\nbatches, not faster tokens. The current build also still allocates a full cache during prefill\nand only frees it afterward, and its hybrid decode path dequantizes history to fp32 each step —\nreal limitations the repo names outright.\n\n## The same algorithm, three places\n\nWhat makes TurboQuant worth an article isn't just the KV-cache result — it's that \"rotate, then\nquantize with a data-free optimal quantizer\" is a **general** vector-quantization primitive, and\nit's showing up in very different systems:\n\n- **KV cache** (`0xsero/turboquant`, above): compress the attention cache in vLLM for longer\n  contexts.\n- **Model weights** (`turbo-tan/llama.cpp-tq3`): a `TQ3` quantization type that applies the same\n  rotate-then-quantize idea to *weights* in llama.cpp (see below).\n- **Vector search** ([`turbovec`](/articles/turbovec)): the same rotation + Lloyd–Max +\n  bit-packing, in Rust with SIMD kernels, as a FAISS-competitive similarity index — a 10M-vector\n  corpus in 4 GB instead of 31. 
(Its own write-up is [here](/articles/turbovec).)\n\nOne paper, one primitive — a quantizer whose optimality comes from *reshaping the data into a\nknown distribution first* rather than learning a codebook from samples — and it drops into\ninference caches, weight files, and ANN indexes alike.\n\n### TQ3 in llama.cpp: the same idea, on weights\n\nThe `llama.cpp-tq3` fork adds `TQ3_1S` / `TQ3_4S` — **3-bit weight** quantization types that run the\nTurboQuant pipeline (a Walsh–Hadamard rotation, then Lloyd–Max scalar quantization per block) on\nmodel weights. Worth clearing up a name collision: llama.cpp already ships `TQ1_0` and `TQ2_0`,\nbut those are *ternary* formats unrelated to this paper — the \"TQ\" match is coincidental. TQ3 is\ngenuinely TurboQuant-based, and at ~3.5 bits per weight it hits **Q4-class quality about 10%\nsmaller** (on Qwen3.5-27B, `TQ3_4S` measures a hair *better* perplexity than `Q3_K_S` at ~12.9 GiB),\nwhich is what lets a 27B model run on a 16 GB GPU.\n\nThe speedup is the interesting engineering twist. Because TQ3 blocks are 3-bit-after-rotation, they\nmap cleanly onto **FP4 tensor cores** on Blackwell-class GPUs — the fork fuses the rotation into an\nFP4 activation quantizer and runs the matmul in FP4. Turning that path on roughly **doubles\nprompt-processing throughput**: on an RTX 5090, Gemma-12B goes 737 → **1,819 tok/s** (+147%) and\nSuperGemma-26B 983 → **2,005** (+104%); on a DGX Spark (GB10), a 27B MTP model goes 360 → **920\ntok/s** (+155%). That's a rotate-then-quantize weight format turning a hardware FP4 unit into free\nspeed — the same primitive, paying off a third way. (The exact `llama-quantize` invocation isn't\ndocumented in the repo yet; the types ship as pre-quantized models on Hugging Face.)\n\n## The take\n\nThe elegant part of TurboQuant is that the hard problem (outliers, calibration) is dissolved\nrather than fought. 
Instead of detecting and special-casing outlier channels, you rotate them\naway; instead of calibrating on data, you compute the optimal quantizer against the distribution\nthe rotation guarantees. The QJL residual is the tasteful finish — a one-bit patch that turns a\ngood reconstruction quantizer into an unbiased *inner-product* quantizer, which is the thing\nattention actually needs.\n\nIt's not magic: the KV-cache implementation is a memory win, not a throughput one, 2-bit values\nvisibly degrade, and MoE/linear-attention models compress less. But the underlying result —\nnear-optimal, data-free vector quantization with a formal distortion bound — is the kind of solid\nprimitive that ends up everywhere, which is exactly what's happening.\n\n---\n\n*Built on [TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate](https://arxiv.org/abs/2504.19874)\n(Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni; 2025), building on the authors' earlier\nQJL work. Implementation details and benchmarks are from [`0xsero/turboquant`](https://github.com/0xsero/turboquant)\n(GPLv3) and [`turbo-tan/llama.cpp-tq3`](https://github.com/turbo-tan/llama.cpp-tq3).*\n","readingTimeMins":7,"url":"https://ai.thesatyajit.com/articles/turboquant-kv-cache","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"TurboVec: FAISS-competitive vector search with no training phase","description":"Product quantization made approximate nearest-neighbor search cheap — but it needs a training pass to learn codebooks, and rebuilds as your corpus drifts. TurboVec drops that: it's a Rust vector index built on Google's TurboQuant, which rotates every embedding into a known distribution and quantizes it with a data-free optimal scalar quantizer. No train() step, online ingest, ~16× compression, and it matches or beats FAISS IndexPQ recall while running 12–19% faster on ARM. 
A 10M-vector corpus in 4 GB instead of 31.","date":"2026-07-02","tags":["vector-search","information-retrieval","quantization","rust","explainer"],"draft":false,"featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"turbovec","body":"Vector search is memory-bound. A few million embeddings at float32 are gigabytes of RAM, and\nthe standard fix — **product quantization** (PQ), the workhorse inside FAISS — compresses them by\nlearning a codebook: run k-means over a sample of your data, replace each sub-vector with the\nnearest centroid's index. It works well, but it has a cost that's easy to forget: a **training\nphase**. You need representative data up front, you call `train()` before you can add anything,\nand as your corpus drifts the learned codebook goes stale and wants a rebuild.\n\n**TurboVec** ([`RyanCodrai/turbovec`](https://github.com/RyanCodrai/turbovec)) removes the training\nphase entirely. It's a Rust vector index built on **TurboQuant** — the same\n[rotate-then-quantize algorithm](/articles/turboquant-kv-cache) that compresses LLM KV caches —\nwhose quantizer is *data-free*, so there's nothing to learn. The headline: a 10-million-vector\ncorpus that costs ~31 GB as float32 fits in ~4 GB, and searches faster than FAISS.\n\n<MemoryFit />\n\n## Why there's nothing to train\n\nThe full derivation is in the [TurboQuant write-up](/articles/turboquant-kv-cache), but the short\nversion is the whole reason there's no training step. TurboVec encodes each vector in six moves:\n\n1. **Normalize** to the unit sphere (the length is stored separately for scoring).\n2. **Rotate** by a random orthogonal matrix — built once from a seeded Gaussian via QR, so it's\n   deterministic and, crucially, *data-independent*.\n3. **TQ+ calibration** — a small per-coordinate shift/scale, fit once on the first batch, to snap\n   real embeddings onto the ideal post-rotation marginal.\n4. 
**Lloyd–Max scalar quantization** — the optimal quantization levels, precomputed against the\n   *known* Beta distribution the rotation produces. This is the data-free part: because you know\n   the distribution analytically, the optimal quantizer is a fixed table, not something learned\n   from your vectors.\n5. **Bit-pack** to 2, 3, or 4 bits per coordinate — 16× smaller than float32 at 2-bit.\n6. **Store a per-vector correction scalar** so the quantized dot product is an *unbiased* estimate\n   of the true inner product (a RaBitQ-style length renormalization).\n\nPQ learns its codebook from your data; TurboQuant reshapes your data into a distribution whose\noptimal codebook is already known. That's the trade — and it's why TurboVec can ingest online with\nno `train()`, no parameter tuning, and no rebuilds as the corpus grows.\n\n<Figure\n  src=\"/articles/turbovec/fig1.png\"\n  alt=\"Two log-scale plots of distortion versus bit-width (1 to 5 bits). Left: inner-product error for the prod and mse variants sits between a green lower bound and red upper bound. Right: mean squared error tracks just above the lower bound and hugs the upper bound.\"\n  caption=\"Because the rotation reshapes every vector into a known distribution, the data-free quantizer's inner-product error (left) and MSE (right) stay within the theoretical distortion bounds at every bit-width — the near-optimality that lets TurboVec skip the learned codebook entirely (paper, Figure 3).\"\n/>\n\n## Does it actually beat FAISS?\n\nMostly yes, and the repo is honest about where it doesn't. Across its benchmark configs (100K\nvectors, 1K queries, k=64), here's TurboVec against FAISS `IndexPQ` on recall, latency, and\ncompression — flip through the datasets and bit-widths:\n\n<ConfigExplorer />\n\nOn the OpenAI embedding sets it wins recall@1 outright (up to +1.9 points at 2-bit) *and* runs\n12–19% faster on ARM, at 8–16× compression with no training pass. 
The exceptions are real: on x86\nit trails a few percent at 2-bit (it wins the 4-bit configs), and on low-dimensional GloVe vectors\nat 2-bit FAISS edges it by 0.06 of a point — the rotation has less room to spread energy in only\n200 dimensions. Net, it's genuinely competitive with a mature, heavily-optimized library, which is\na high bar for a quantizer with no learned codebook.\n\n<Figure\n  src=\"/articles/turbovec/fig2.png\"\n  alt=\"Three Recall@1 versus top-k line plots (top-k from 1 to 64) on GloVe d=200, OpenAI d=1536, and OpenAI d=3072. Each compares TurboQuant, PQ, and RaBitQ at 2 and 4 bits; TurboQuant's lines sit at or above the PQ and RaBitQ curves, especially at 4 bits and on the higher-dimensional OpenAI sets.\"\n  caption=\"Recall@1 versus top-k across GloVe (d=200) and two OpenAI embedding sets (d=1536, d=3072): TurboQuant matches or beats PQ (the method inside FAISS IndexPQ) and RaBitQ at both 2 and 4 bits, with the largest margins in high dimensions (paper, Figure 5).\"\n/>\n\n## The systems half\n\nA near-optimal quantizer only matters if the search is fast, and TurboVec is a real systems\nproject, not a reference implementation:\n\n- **Hand-written SIMD kernels** — NEON on ARM, AVX-512BW on x86, with an AVX2 fallback and runtime\n  feature detection. Scoring runs on the *packed codes directly* via table lookups; there's no\n  decompression step.\n- **32-vector blocks**, FAISS FastScan-style, with the query LUT built per search. 
Filtering is a\n  bitmask checked at block granularity — whole blocks with no allowed vectors are skipped, and the\n  filter is applied *inside* the kernel so a restricted search returns the true top-k among allowed\n  items with **no recall penalty and no over-fetch**.\n- **`IdMapIndex`** gives stable `uint64` external IDs with **O(1) removal** (swap-remove, no\n  tombstones) — the vector you delete is replaced by the last one and both ID maps update in\n  constant time.\n- **Online ingest and plain persistence** (`.tv` / `.tvim` files), plus drop-in adapters for\n  LangChain, LlamaIndex, Haystack, and Agno.\n\nThe Python API is what you'd hope for — no training call anywhere:\n\n```python\nfrom turbovec import TurboQuantIndex\n\nindex = TurboQuantIndex(dim=1536, bit_width=4)   # 2, 3, or 4 bits\nindex.add(vectors)                                # float32 (n, dim) — indexed immediately\nscores, ids = index.search(query, k=10)           # searches the packed codes\nindex.write(\"corpus.tv\")\n```\n\n## Where it fits (and where it doesn't)\n\nThe honest scope: TurboVec is a **flat, exhaustive-scan** index — it scores every (packed) vector\nper query. That's exactly the regime where it competes with FAISS's flat PQ scan, and at a hundred\nthousand to a few million vectors it's excellent: no training, tiny memory, strong recall. It is\n*not* a billion-scale graph index — if you need sub-linear search over hundreds of millions of\nvectors you still want an IVF or HNSW structure (and you could quantize *those* with TurboQuant\ntoo). 
Think of it as the compression-and-scoring core done unusually well, not a replacement for\nevery ANN system.\n\n## The take\n\nWhat I like about TurboVec is that it makes the [TurboQuant](/articles/turboquant-kv-cache) thesis\nconcrete in a second domain: the same \"rotate into a known distribution, then quantize with a\ndata-free optimal quantizer\" that shrinks KV caches also shrinks embedding indexes — and here it\nbuys something PQ structurally can't, the elimination of the training phase. Pair it with a lexical\nscorer like [BM25](/articles/bm25) and you've got both halves of hybrid retrieval, each running on\ncommodity hardware with no GPU and no learned index. It won't dethrone HNSW at billion scale, but\nfor the very common case of a few million embeddings that need to fit in RAM and update live, \"as\ngood as FAISS, with nothing to train\" is a genuinely nice place to land.\n\n---\n\n*Built on [`RyanCodrai/turbovec`](https://github.com/RyanCodrai/turbovec) (Rust + Python, MIT),\nwhich implements [TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate](https://arxiv.org/abs/2504.19874)\n(Zandieh, Daliri, Hadian, Mirrokni; ICLR 2026). Benchmark figures are from the repo's published\nresults (100K vectors, k=64; ARM = Apple M3 Max, x86 = Xeon Sapphire Rapids).*\n","readingTimeMins":6,"url":"https://ai.thesatyajit.com/articles/turbovec","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"HydraHead: hybrid attention at the head, not the layer","description":"Full attention is quadratic; linear attention is cheap but forgets. Everyone mixes them — but per layer, in blunt all-or-nothing blocks. HydraHead's finding is that the specialization lives at the head level: inside one layer, a few heads do long-range retrieval (which needs full attention) and the rest do local work (which linear attention handles fine). 
So it hybridizes along the head axis, keeps full attention only for the retrieval-critical heads it identifies by causal analysis, and holds long-context accuracy where other hybrids collapse.","date":"2026-07-01","tags":["llm","attention","long-context","interpretability","explainer"],"draft":false,"featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"hydrahead","body":"The attention that made transformers work is also what makes them expensive: every token\nattends to every other, so cost grows with the **square** of the context length. That\nquadratic term is fine at 4K tokens and ruinous at 512K — exactly the regime long-context\nmodels are pushing into. **Linear attention** (LA) fixes the scaling by keeping a\nfixed-size recurrent state instead of a full attention matrix, so cost grows linearly. But\nthat fixed state is lossy: it can't do exact, long-range token retrieval — the \"find the\none sentence 400K tokens ago\" trick that full attention nails.\n\n<Complexity />\n\nSo the field mixes them — **hybrid attention**. Almost everyone does it *per layer*:\ninterleave whole full-attention (FA) layers with whole linear-attention layers at some\nfixed ratio (3:1, 7:1), sometimes searched with NAS (as in [GLM-5](/articles/glm-5-2)).\n**HydraHead**, from Alibaba, argues that the layer is the wrong unit — and backs it with an\ninterpretability result.\n\n## The finding: layers are smooth, heads are not\n\nHydraHead's authors probe a pretrained dense model (Qwen3-1.7B) and look at two things:\n\n- **Across layers**, the outputs vary *smoothly* — the layer-to-layer output-similarity\n  matrix is one gradual block, with no crisp boundary that says \"put full attention here,\n  linear attention there.\" Layer-wise hybridization is cutting where there's no clean seam.\n- **Within a layer**, the heads are sharply *heterogeneous*. Reading the same input, they do\n  different jobs — and only a few do the long-range retrieval that genuinely needs FA. 
The\n  per-layer Gini coefficient of head importance averages **0.62**: importance is concentrated\n  in a handful of heads. Across all 448 query heads, only about **6.5% are essential** for\n  retrieval; ~91% can be swapped to linear attention with negligible loss. And the critical\n  ones are *scattered* — almost every layer mixes a couple of retrieval heads with a dozen\n  replaceable ones.\n\nYou can see the heterogeneity directly — heads in one layer specialize, and only some need\nto reach far back:\n\n<RetrievalHeads />\n\nThat's the whole argument in one observation. If retrieval capability lives in a sparse,\nscattered set of *heads*, then the head — not the layer — is the natural granularity for\ndeciding where to spend full attention.\n\n<HeadVsLayer />\n\n## Picking the retrieval-critical heads\n\nHydraHead keeps FA for about **25% of heads** by default and runs the rest as **Gated\nDeltaNet** (GDN, the linear-attention variant it builds on). The question is *which* 25%.\n\n<Figure\n  src=\"/articles/hydrahead/fig1.png\"\n  alt=\"Side-by-side diagram: standard full attention (left) sends all heads through one softmax operation; HydraHead's head-wise hybrid attention (right) splits heads into a full-attention branch and a linear-attention (GDN) branch, recombining them through a Norm & Scale block before the shared output projection.\"\n  caption=\"Standard full attention vs. 
HydraHead's head-wise hybrid: a subset of heads keeps the full-attention branch while the rest run linear attention (GDN), fused by per-head norm-and-scale before the output projection (paper, Figure 3).\"\n/>\nThe selection is a causal interpretability procedure, not a guess:\n\n- Build **counterfactual pairs** from RULER needle-in-a-haystack probes — swap the needle's\n  value for a same-length distractor while holding the rest of the context fixed, so\n  activations stay in-distribution.\n- Run **activation patching** (for heads that *receive* the retrieved information) and\n  **path patching** (for heads that *send* it), scoring each head by how much restoring it\n  recovers the correct-answer logit.\n- Fuse the per-capability scores, rank all heads, and keep the top-K as FA.\n\nIt's cheap — the ranking stabilizes from roughly **six calibration samples** — and it's\n*faithful*: knock out just the top ~1% of heads by this score and needle-retrieval accuracy\ncollapses, while ablating random heads barely moves it. Crucially, the ablation confirms the\nselection **beats fixed or random head assignment** — the interpretability signal is doing\nreal work.\n\n## Reconciling two kinds of output\n\nYou can't just concatenate FA and GDN heads and project them — their outputs live on\ndifferent scales. Softmax attention is **query-magnitude-modulated**: it produces sharp,\nlow-entropy distributions peaked on a few tokens. Linear attention cancels that magnitude\nout, giving smoother, higher-entropy, more uniform outputs. Splice the two naively and the\nmodel degrades badly.\n\nHydraHead's **scale-normalized fusion** handles it with two moves: an **independent per-head\nRMSNorm** on every head's output, then a **learnable per-head scalar** $\\gamma_h$ that\nre-weights each head before the shared output projection. It's a small module, but it's\nload-bearing — remove the normalization and RULER's extended-context score drops from\n**87.5 to 71.4**. 
A learnable *scale* also beats a learnable *gate* by ~20 points, so the\nmodel keeps every head's contribution and just rescales it, rather than gating heads off.\n\n## Building it cheaply\n\nYou don't train HydraHead from scratch — you *convert* a pretrained FA model in a three-stage\ntransfer pipeline that reuses as much as possible:\n\n1. **Parameter migration + alignment** — the FA heads keep their pretrained weights; the new\n   GDN heads reuse the base model's Q/K/V projections (repeated channel-wise to bridge the\n   GQA→multi-head shape gap), so nothing starts from random. A per-layer MSE loss aligns the\n   hybrid's hidden states to the original.\n2. **Logit distillation** — unfreeze the whole model and match its output distribution to the\n   original FA teacher with a KL objective.\n3. **Long-context fine-tuning** — ordinary next-token prediction at 16K context.\n\nThe controlled conversion runs on **~2.3B tokens**; the scaled-up model in the paper uses\n**~15B**. Either way it's a rounding error next to pretraining — the capability is inherited,\nnot learned fresh.\n\n## The payoff: long context that doesn't collapse\n\nThe headline result is retention. On RULER single-needle retrieval, most hybrid models — and\nthe base model itself — fall to **near zero** by 256K. 
HydraHead holds:\n\n<BenchBars\n  title=\"RULER single-needle retrieval @ 256K context (%)\"\n  unit=\"%\"\n  bars={[\n    { label: \"HydraHead (hybrid)\", value: 94.53, highlight: true },\n    { label: \"HypeNet-2B (hybrid)\", value: 68.93 },\n    { label: \"Qwen3-1.7B + YaRN\", value: 40.2 },\n    { label: \"Jet-Nemotron-2B\", value: 1.07 },\n    { label: \"Qwen3-1.7B (base)\", value: 0.0 },\n  ]}\n/>\n\nThe harder multi-key retrieval shows the same shape — everything else craters, HydraHead\ndegrades gracefully:\n\n<BenchBars\n  title=\"RULER multi-key retrieval @ 256K context (%)\"\n  unit=\"%\"\n  bars={[\n    { label: \"HydraHead (hybrid)\", value: 52.7, highlight: true },\n    { label: \"Qwen3-1.7B + YaRN\", value: 14.2 },\n    { label: \"Jet-Nemotron-2B\", value: 0.0 },\n    { label: \"Qwen3-1.7B (base)\", value: 0.0 },\n  ]}\n/>\n\nAcross the full context sweep the retention gap only widens: HydraHead tracks the baselines\nup to 64K, then holds as they collapse — reaching **+86.6** points on single-needle and\n**+69.2** on multi-key at 512K, closing on Qwen3.5-2B-Base (which ships native 256K support):\n\n<Figure\n  src=\"/articles/hydrahead/fig2.png\"\n  alt=\"Two grouped bar charts of RULER needle-in-a-haystack recall from 16K to 512K context. Left: Single NIAH. Right: Multi-Key NIAH. Qwen3-1.7B and its YaRN variant fall toward zero past 128K, while HydraHead stays high, with red annotations marking +54.3/+86.6 (single) and +58.4/+69.2 (multi-key) gains at 256K and 512K.\"\n  caption=\"RULER needle-in-a-haystack recall from 16K to 512K: HydraHead retains accuracy where Qwen3-1.7B and its YaRN variant collapse, approaching Qwen3.5-2B-Base (paper, Figure 1). The 512K and Qwen3.5 numbers come only from this figure, not a table.\"\n/>\n\nAnd it buys that without wrecking short-context ability — the usual tax on linear-attention\nconversions. 
On general reasoning it lands within ~3.4 points of the full-attention base\nmodel, and on MMLU it essentially matches it:\n\n<BenchBars\n  title=\"General reasoning — average of MMLU, BBH, MBPP, GSM8k (%)\"\n  unit=\"%\"\n  bars={[\n    { label: \"Qwen3-1.7B (FA base)\", value: 54.02 },\n    { label: \"HydraHead\", value: 50.62, highlight: true },\n    { label: \"Qwen3-1.7B + YaRN\", value: 50.37 },\n    { label: \"Gemma-3n-E2B\", value: 39.29 },\n  ]}\n/>\n\nThe efficiency claim is the one to internalize: at a **7:1** GDN-to-FA head ratio — only one\nhead in eight keeping full attention — HydraHead matches a **3:1 layer-wise** hybrid's\nlong-context average, while doing *better* on hard reasoning. Same quality, far less full\nattention, which means a smaller KV cache (HydraHead's is ~0.35× a full-attention model's).\nIf you've read the [inference write-up](/articles/how-llm-inference-works), that cache number\nis the whole game at long context — it's what decides how many requests fit on the GPU.\n\n## The take\n\nWhat I like here is that the architecture change *follows from* an interpretability result\ninstead of being reverse-justified by one. The claim \"retrieval lives in a sparse, scattered\nset of heads\" is measured with causal patching, the ablations show fixed/random selection is\nworse, and the fix — hybridize per head, keep FA where the retrieval heads are — falls\nstraight out of the measurement. It's a clean example of interpretability paying rent.\n\nTwo honest caveats. The flashiest numbers — *\"69% improvement at 512K\"* and *\"approaching\nQwen3.5\"* — come only from a figure, with no supporting table; the tabulated results stop at\n256K, so treat the 512K story as directional. And this is all at the **1.7B** scale on\nretrieval-style benchmarks; whether head-level hybridization holds its edge at 30B+ and on\nmessier long-context reasoning is the open question. 
But the core idea — that the *head* is\nthe right unit for spending your quadratic-attention budget — is the kind of insight that\ntends to generalize.\n\n---\n\n*Built on [HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention\nHybridization](https://arxiv.org/abs/2606.20097) (Tan, Chen, Shen, Liu, Shen, Wu, Ye;\nAlibaba Group, 2026). Benchmark figures are quoted from the paper's tables; the 512K and\nQwen3.5 comparisons are from its Figure 1.*\n","readingTimeMins":8,"url":"https://ai.thesatyajit.com/articles/hydrahead","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"Tapered Language Models: spend your width where the work is","description":"Every transformer since 2017 stacks identical layers — same width top to bottom. Tapered Language Models question that default: under a fixed parameter budget, pour MLP width into the early layers and thin out the late ones with a cosine schedule, and perplexity improves at no extra params or FLOPs. The reverse allocation hurts. A walk through the one-line change, the residual-stream evidence behind it, and how consistently it holds across four architectures.","date":"2026-07-01","tags":["llm","transformers","architecture","scaling","explainer"],"draft":false,"featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"tapered-language-models","body":"Here's an assumption baked into every transformer, recurrent, and memory-based language\nmodel since 2017: the layers are **identical**. Same residual dimension, same attention\nshape, same MLP width, stacked N deep. It's a default inherited from the original\ntransformer and almost never questioned — parameters are spread *uniformly* across depth.\n\n**Tapered Language Models** (TLMs), from Bayat, Behrouz, and Courville, ask what happens if\nyou don't. 
Their answer is a one-line change with a free-lunch flavor: under a *fixed*\nparameter budget, give the early layers more MLP width and the late layers less, on a smooth\ncosine schedule. Same parameters, same FLOPs — better model.\n\n<TaperSchedule />\n\n## The default no one questions\n\nWhy would non-uniform be better? Because a growing pile of evidence says layers *don't*\ncontribute equally. The paper points at three findings:\n\n- **Early-exit** methods show the residual stream often converges to its final prediction\n  well before the last layer.\n- **Layer-skipping** shows later layers can be bypassed at inference with minimal damage.\n- Interpretability work finds lower layers capture shallow syntactic patterns while upper\n  layers encode semantics — different jobs, not equal ones.\n\nThe unifying picture: **later layers refine the residual stream rather than transform it.**\nAn early layer rewrites the representation; a late layer nudges it. If you measure how much\neach layer actually changes things, the work is front-loaded:\n\n<WhyEarly />\n\nAnd if late layers are only refining, giving them the *same* width as the layers doing real\ntransformation is wasted capacity — capacity you could have spent where it matters.\n\n## The taper, precisely\n\nTLMs taper exactly one thing: the **MLP intermediate width** $d_{ff}$. Attention — residual\ndimension, head count, key/value dims — is left identical to the baseline. That's a\ndeliberate choice: MLPs dominate the parameter count in every modern LM family, and width is\na single clean axis to vary. The schedule is a cosine over depth:\n\n$$\nd_{ff}(l) = d_{end} + \\frac{d_{start} - d_{end}}{2}\\left(1 + \\cos\\frac{\\pi l}{L-1}\\right)\n$$\n\nLayer 0 is widest ($d_{ff} = d_{start}$), the last layer narrowest ($d_{ff} = d_{end}$),\nmonotonically decreasing in between. 
The endpoints are multipliers of the baseline width, and\nthe best config is **$1.5\\times$ at the front, $0.5\\times$ at the back** — a 3:1 taper. The\ncrucial property: the per-layer widths **average to the baseline**, so\n\n$$\n\\frac{1}{L}\\sum_l d_{ff}(l) = d_{ff}^{\\text{baseline}}\n$$\n\nEarly layers get wider than baseline, late layers narrower, and the integral is preserved.\nTotal parameters don't change. Total FLOPs don't change. Tapering only shifts *where* the\ncompute is spent, not how much. That's what makes it a redistribution rather than a bigger\nmodel — and why the comparison to the uniform baseline is honest.\n\n<Figure\n  src=\"/articles/tapered-language-models/fig1.png\"\n  alt=\"Left: per-layer MLP intermediate width for a uniform baseline and three tapered schedules (step-wise, linear, cosine) on a 440M Transformer, drawn as stacked bar widths that shrink toward the top for the tapered variants. Right: validation perplexity versus taper strength, with cosine dropping from the 16.28 uniform baseline to a 14.44 minimum at the 1.5→0.5 range while linear stays near baseline.\"\n  caption=\"All schedules share the same total parameter count and FLOPs; the cosine taper reaches the lowest perplexity, 14.44 vs the 16.28 uniform baseline (paper, Figure 1).\"\n/>\n\n## Direction is everything\n\nThe foundational experiment splits the stack into three blocks and moves the extra capacity\naround — early, middle, or late — at fixed budget. The result is unambiguous:\n\n<TaperDirection />\n\nFront-loading helps. Centering the capacity is *worse* than uniform. 
And back-loading —\nwider late layers — is the worst option of all, **+1.01 perplexity** over just doing nothing.\nThis is the control that makes the whole paper: the gain isn't from \"more capacity somewhere,\"\nit's specifically from putting it where the representational work happens.\n\n<Figure\n  src=\"/articles/tapered-language-models/fig2.png\"\n  alt=\"A 440M Transformer's layers split into three equal blocks (early, middle, late), each block's MLP width scaled while total parameters stay fixed. Wider-early gives 15.96 perplexity (best, green), wider-late 17.29 (worst), wider-middle 16.61, all compared against the 16.28 uniform baseline.\"\n  caption=\"Moving the same extra MLP capacity to the early layers helps (15.96), while back-loading it hurts most (17.29) — the control that isolates direction from raw capacity (paper, Figure 2).\"\n/>\n\nThe *shape* matters too. Sweeping cosine against linear and sigmoid schedules, cosine wins in\nevery setting; sigmoid is often worse than the uniform baseline. And taper strength is\nU-shaped — too gentle leaves gains on the table, too aggressive (a 7:1 ratio) starves the late\nlayers and regresses. The 1.5→0.5 cosine is the bottom of that U.\n\n## How consistently it holds\n\nThe tuned config — cosine 1.5→0.5, found once on a 440M Transformer — is then transferred\n*unchanged* to three scales (440M/30B tokens, 760M/50B, 1.3B/100B) and four architectures:\nplain Transformer, Gated Attention, and Behrouz's own **HOPE** and **Titans** memory models.\nIt keeps improving almost everywhere. 
At 760M, the perplexity reduction from the exact same\nparameter budget:\n\n<BenchBars\n  title=\"WikiText perplexity reduction from tapering — 760M (higher = bigger drop)\"\n  unit=\"\"\n  bars={[\n    { label: \"Titans\", value: 0.81, highlight: true },\n    { label: \"Gated Attention\", value: 0.76, highlight: true },\n    { label: \"Transformer\", value: 0.44, highlight: true },\n    { label: \"Hope-attention\", value: 0.12, highlight: true },\n  ]}\n/>\n\nAnd it carries through to downstream accuracy — the average over eight commonsense benchmarks\n(LAMBADA, PIQA, HellaSwag, WinoGrande, ARC-easy/challenge, SIQA, BoolQ) rises for every\narchitecture at 760M:\n\n<BenchBars\n  title=\"Commonsense accuracy gain from tapering — 760M (percentage points)\"\n  unit=\"pp\"\n  bars={[\n    { label: \"Titans\", value: 0.99, highlight: true },\n    { label: \"Transformer\", value: 0.59, highlight: true },\n    { label: \"Hope-attention\", value: 0.36, highlight: true },\n    { label: \"Gated Attention\", value: 0.27, highlight: true },\n  ]}\n/>\n\nThe full picture, uniform → tapered, is small-but-consistent rather than dramatic:\n\n| scale | architecture | WikiText ppl | LAMBADA ppl | commonsense avg |\n|---|---|---|---|---|\n| 760M | Transformer | 21.86 → **21.42** | 22.29 → **21.25** | 52.25 → **52.84** |\n| 760M | Gated Attention | 20.74 → **19.98** | 21.85 → **21.44** | 52.61 → **52.88** |\n| 760M | Titans | 21.58 → **20.77** | 23.09 → **22.92** | 52.30 → **53.29** |\n| 1.3B | Transformer | 17.39 → **17.17** | 17.62 → **16.93** | 56.05 → **56.38** |\n| 1.3B | Titans | 16.05 → **15.76** | 14.19 → **14.04** | 56.73 → **57.08** |\n\nThe honest read: at scale the gains are typically 0.1–1.0 perplexity and a few tenths of a\npoint of accuracy — improving in ~15 of 16 measured cells (the lone regression is 1.3B\nHOPE's WikiText, off by 0.03). It's not a step change. 
It's a **free, universal nudge in the\nright direction** from a config that was never even tuned at these scales.\n\n## Why it works\n\nThe mechanism check makes the story tight. Measure the cosine similarity between each MLP's\noutput and the residual stream it writes into, and it *rises* with depth — later MLPs produce\nupdates increasingly *aligned* with the residual (Pearson r ≈ 0.49–0.71 vs. layer index). An\nupdate aligned with the residual is a refinement; an orthogonal one is a transformation. So\nlate MLPs, writing residual-aligned updates, aren't using their extra width — the hidden\ndimension is spent nudging in a direction the stream already points. Tapering removes width\nexactly where that alignment is highest and moves it to the early layers, which write the\northogonal, representation-defining updates that actually need the capacity. The\n[MLP is where the parameters live](/articles/mixture-of-experts-from-scratch); this is just\nallocating them by how hard each layer is working.\n\n<Figure\n  src=\"/articles/tapered-language-models/fig3.png\"\n  alt=\"Two line plots of cosine similarity versus relative layer depth across the GPT-2 family (124M to 1.5B). Left: block updates versus the residual stream. Right: MLP output versus the residual stream. Both curves rise toward 1.0 at greater depth, showing later-layer updates increasingly align with the residual they write into.\"\n  caption=\"Later layers' updates grow more aligned (higher cosine similarity) with the residual stream they write into, marking them as refinements rather than transformations (paper, Figure 4).\"\n/>\n\n## The take\n\nI like this paper for the same reason I liked [HydraHead](/articles/hydrahead): the\narchitectural change is *derived from* a measurement, not reverse-justified. 
\"Later layers\nrefine, not transform\" is an old observation; TLMs turn it into a concrete lever — cosine-taper\nthe MLP width — and then run the control (reverse it, and it hurts) that proves the direction\nis what's doing the work.\n\nThe caveats are real and the authors state them. The gains at scale are modest, the single\n1.5→0.5 cosine schedule was tuned only on a 440M Transformer and transferred without\nre-tuning (so there's likely more on the table), and there's no code release. But the appeal\nis that it costs *nothing* — same parameters, same FLOPs, one function applied to the MLP\nwidths. For a default that's gone unquestioned for eight years, \"uniform depth was leaving\nfree perplexity on the floor\" is a satisfying result, and an easy one to try.\n\n---\n\n*Built on [Tapered Language Models](https://arxiv.org/abs/2606.23670) (Reza Bayat, Ali\nBehrouz, Aaron Courville, 2026). Perplexity, accuracy, and ablation figures are quoted from\nthe paper's tables; the residual-alignment correlation is from its Figure 4.*\n","readingTimeMins":8,"url":"https://ai.thesatyajit.com/articles/tapered-language-models","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"Agents-A1: scaling the agent horizon, not the parameter count","description":"InternScience's Agents-A1 is a 35B Mixture-of-Experts model with ~3B active parameters that reaches the agentic-benchmark band of trillion-parameter frontier models — not by growing, but by scaling the horizon: 45K-token trajectories across six domains, a knowledge-action graph that turns agent traces into verifiable training targets, and a three-stage recipe that distills six specialist teachers into one student. 
A walk through the method and the numbers.","date":"2026-06-30","tags":["llm","agents","reinforcement-learning","distillation","explainer"],"draft":false,"featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"agents-a1","body":"Most of the frontier's agentic gains over the last year came from one move: make the model\nbigger. Kimi-K2.6 and DeepSeek-V4-pro are *trillion*-parameter systems, and they top the\nhard agent benchmarks. **Agents-A1**, from InternScience, makes the opposite bet. It's a\n**35B Mixture-of-Experts** model with only **~3B active parameters per token** (it's\ninitialized from `Qwen3.5-35B-A3B`), and it lands in the same benchmark band as those 1T\nmodels. Its thesis, verbatim from the tech report: *scaling the horizon, not the\nparameters.*\n\n<HorizonScaling />\n\nThe chart is the whole argument. Agents-A1 and `Qwen3.5-35B-A3B` are the **same model** —\nsame 35B size, same ~3B active params. The ~18-point vertical gap between them is *entirely*\nthe training recipe, and it lifts the 35B student into the band held by models 30× its size.\nThe question this article answers is: what is \"the horizon,\" and how do you scale it instead\nof the parameter count?\n\n## Two axes of horizon\n\n\"Agent horizon\" splits into two things you can grow independently, and Agents-A1 pushes both:\n\n1. **Long-horizon trajectories** — how *far* a single agent run goes. Agents-A1 is trained on\n   agentic trajectories averaging **~45K tokens** (deep-research runs average 44K, coding 48K,\n   scientific reasoning 37K, general agentic 39K; only short instruction-following tasks pull\n   the mean down at ~3K). That's hundreds of think→act→observe steps per task, not a single\n   question-answer turn.\n2. **Heterogeneous abilities** — how *many kinds* of agent the one model can be. 
Agents-A1\n   unifies **six domains**: long-horizon search, engineering, scientific research, instruction\n   following, general agentic tasks, and scientific agentic tasks.\n\nA single long-horizon run looks like this — a chain of tool calls, each with an observation\nand a **verifier outcome** that says whether the step actually worked:\n\n<AgentLoop />\n\nThe verifier is the load-bearing part. A raw transcript of an agent flailing is not training\ndata; a transcript where every step is *checked* — did the code converge, does the answer\nmatch the cited evidence — is. That check is what turns a 45K-token trace into a sequence of\ntrainable targets. Which is the next problem: where do verified 45K-token trajectories at\nscale come from?\n\n## The knowledge-action graph\n\nYou can't hand-write 100K agent trajectories. Agents-A1's answer is a **knowledge-action\ngraph** (KAG) per domain — a typed four-tuple\n\n$$\n\\mathcal{G}_d = (\\mathcal{C}_d,\\ \\mathcal{A}_d,\\ \\mathcal{O}_d,\\ \\mathcal{V}_d)\n$$\n\n| symbol | what it holds |\n|---|---|\n| $\\mathcal{C}_d$ — **corpus** | evidence chunks, entities, facts, constraints — the domain's grounded knowledge |\n| $\\mathcal{A}_d$ — **actions** | tool calls, retrieval queries, code edits and executions, reasoning steps |\n| $\\mathcal{O}_d$ — **observations** | tool returns, retrieved evidence, execution states, intermediate artifacts |\n| $\\mathcal{V}_d$ — **verifiers** | automatic checks over correctness, evidence support, constraint satisfaction, goal completion |\n\nThe graph is populated by a **proposer–solver–verifier self-play game**: a proposer policy\n$\\pi_P$ samples regions of the graph to pose constrained tasks, a solver $\\pi_S$ attacks them\nwith retrieval and tools, and a verifier $\\pi_V$ checks the answer, the evidence, the\nexecution result, and the trajectory for shortcut-taking. 
A candidate task is kept **only if**\nit's verifiable, valid, process-informative, evidence-covering, and unambiguously specified.\nEach accepted step is logged as a record $(s_t, a_t, o_t, v_t)$ — prior state, action,\nobservation, verifier outcome — and *that* tuple is the trainable target. The data engine and\nthe agent are the same machinery: the graph that grounds the agent's actions is the graph that\ngenerates its training data.\n\n<Figure\n  src=\"/articles/agents-a1/fig1.png\"\n  alt=\"Knowledge-action infrastructure of Agents-A1: heterogeneous training corpora on the left are decomposed into atomic abilities, organized into a knowledge-action graph recording actions, observations, and verifier outcomes with true/wrong targets, and expanded by a self-play graph search into domain-specific sub-KAGs (coding, agentic, instruction, MLE, scientific, mid-train) gated by a judge and verifier.\"\n  caption=\"The knowledge-action graph turns corpora into atomic abilities, then a self-play loop expands verified sub-KAGs into domain-specific tasks (paper, Figure 3).\"\n/>\n\n## The three-stage recipe\n\nWith verified trajectories in hand, the model is built in three stages — broaden, specialize,\nthen re-unify:\n\n<ThreeStage />\n\n<Figure\n  src=\"/articles/agents-a1/fig2.png\"\n  alt=\"Overview of the Agents-A1 three-stage training pipeline: multi-domain data (search, science, engineering, agent tasks, instruction following) flows through the KAG pipeline into full-domain SFT, then domain-level teacher training (search, science, instruction, tools teachers via SFT and RL with a correctness judge), and finally multi-teacher on-policy distillation matching the student's token distribution to the routed teacher via a reverse-KL loss to produce one unified model.\"\n  caption=\"The full three-stage pipeline: full-domain SFT, domain-specialist teachers, then multi-teacher on-policy distillation into a single unified 35B model (paper, Figure 2).\"\n/>\n\nStages 1 and 2 
are familiar: a **full-domain SFT** pass aligns the base model with broad agent\nbehavior across all domains (~100K trajectories, response-token cross-entropy, one epoch at up\nto 131K sequence length), then a set of **domain-level teachers** is trained, each with its own\nrecipe — the search teacher with SFT then GRPO over web-search/read/code tools; the science\nteacher with reasoning-enhanced then tool-augmented SFT; the instruction-following and\ntool-calling teachers with their own GRPO setups and reward shaping. Each teacher goes deep\nwhere a single generalist would be pulled thin.\n\nStage 3 is the interesting one, and it's the same idea three separate papers landed on this\nsame week ([MOPD and DOPD](/arxiv/2026-06-30)): **on-policy distillation** as the way to fuse\ncapabilities. The full name is a mouthful — *multi-teacher domain-routed on-policy distillation\nwith salient vocabulary alignment* — so here's what each piece means:\n\n<DistillNetwork />\n\n- **On-policy.** The *student* generates the rollout, and the teacher supervises the student's\n  own tokens — not a fixed teacher transcript. This kills exposure bias: the student learns to\n  recover from the states it actually visits, not the ones a teacher would have.\n- **Domain-routed.** Routing is hard and per-sample: each trajectory carries a domain label, and\n  it's supervised *only* by that domain's teacher ($\\theta_{t,i} = \\theta_t^{d_i}$). No learned\n  per-token gate — the task picks the specialist.\n- **Salient vocabulary alignment (SVA).** At each position, the distillation loss is computed\n  *only over the teacher's top-$k$ vocabulary* — the handful of tokens the teacher actually puts\n  probability on. Both distributions are renormalized onto that support and matched with a\n  forward-KL term. The long low-probability tail, which carries no decision information, is\n  dropped. 
You align where the capability lives.\n- **Heterogeneity-aware.** Losses are averaged *within* a domain first, then *across* domains, so\n  a high-volume domain can't drown out a small one — each active domain gets comparable influence\n  on the update.\n\nThe result is one deployable 35B student that inherits all six specialists, with **no teacher\nshipped at inference**. If you've read the [inference write-up](/articles/how-llm-inference-works),\nthis is the training-time mirror of the serving-time story: the whole game is getting frontier\nbehavior out of a model small enough to actually run.\n\n## The numbers\n\nThe payoff is parity with — and on several benchmarks, victory over — models ~30× larger. The\nsharpest case is **FrontierScience-Research**, where the gap to the trillion-parameter field is\nnot subtle:\n\n<BenchBars\n  title=\"FrontierScience-Research (%)\"\n  unit=\"%\"\n  bars={[\n    { label: \"Agents-A1 (35B)\", value: 40.0, highlight: true },\n    { label: \"GPT-5.5\", value: 26.7 },\n    { label: \"Kimi-K2.6 (~1T)\", value: 17.9 },\n    { label: \"DeepSeek-V4-pro (~1T)\", value: 13.3 },\n    { label: \"Qwen3.5-35B (base)\", value: 2.5 },\n  ]}\n/>\n\nThe base model scores **2.5**; the trained 35B scores **40.0** — above GPT-5.5's 26.7 and more\nthan double the trillion-parameter Kimi and DeepSeek. 
On long-horizon search, it takes overall\nSOTA on **Seal-0**, edging out frontier systems that are far larger:\n\n<BenchBars\n  title=\"Seal-0 — long-horizon search (%)\"\n  unit=\"%\"\n  bars={[\n    { label: \"Agents-A1 (35B)\", value: 56.36, highlight: true },\n    { label: \"DeepSeek-V4-pro\", value: 54.95 },\n    { label: \"Kimi-K2.6\", value: 50.45 },\n    { label: \"GPT-5.5\", value: 42.34 },\n    { label: \"Qwen3.5-35B (base)\", value: 41.4 },\n  ]}\n/>\n\nAnd on instruction following it leads outright, which matters because it's the capability most\nlikely to *degrade* when you fuse many domains into one model:\n\n<BenchBars\n  title=\"IFBench — instruction following (%)\"\n  unit=\"%\"\n  bars={[\n    { label: \"Agents-A1 (35B)\", value: 80.61, highlight: true },\n    { label: \"GPT-5.5\", value: 75.9 },\n    { label: \"DeepSeek-V4-pro\", value: 73.47 },\n    { label: \"Kimi-K2.6\", value: 71.77 },\n    { label: \"Qwen3.5-35B (base)\", value: 70.2 },\n  ]}\n/>\n\nAcross the full suite, Agents-A1 takes overall SOTA on **six** benchmarks (Seal-0, HiPhO 46.4,\nFrontierScience-Olympiad 79.0, FrontierScience-Research 40.0, IFBench 80.6, IFEval 94.8) and is\nthe best ~35B-class model on most of the rest (BrowseComp 75.5, GAIA 96.0, XBench-DS 86.0,\nSciCode 44.3, HLE-w/-tools 47.6, MolBench-Bind 56.8). It does **not** win everywhere — GPT-5.5\nstill leads pure web search (BrowseComp 84.4) and engineering (SciCode 56.1, MLE-Lite 72.7), and\nDeepSeek-V4-pro tops GAIA (98.1) and the general-agentic τ²-Bench. 
The honest summary is parity\nin the frontier band, with clear leads in science and instruction following — from a model you\ncan serve on a fraction of the hardware.\n\n<Figure\n  src=\"/articles/agents-a1/fig3.png\"\n  alt=\"Grid of twelve grouped bar charts comparing Agents-A1 (35B, hatched blue) against Qwen3.6-35B-A3B, Step-3.5-Flash, Kimi-K2.6, DeepSeek-V4-pro, and gpt-5.5 across HLE, HiPhO, FrontierScience-Olympiad, FrontierScience-Research, BrowseComp, XBench, SEAL-0, GAIA, IFBench, IFEval, SciCode, and MolBench-Bind; Agents-A1's bar and score are highlighted in each panel.\"\n  caption=\"Agents-A1 (35B) versus 35B-class and trillion-parameter models across twelve agentic benchmarks (paper, Figure 1).\"\n/>\n\n## Running it\n\nAgents-A1 is **Apache-2.0** and runs on the standard stack — Hugging Face Transformers, vLLM, or\nSGLang with OpenAI-compatible endpoints, at a served context of **262K tokens**. The release\nrecommends specific sampling for the long-horizon behavior to hold up:\n\n```python\n# vLLM / SGLang OpenAI-compatible call\nsampling = dict(\n    temperature=0.85,\n    top_p=0.95,\n    top_k=20,\n    min_p=0.0,\n    presence_penalty=1.1,   # discourages the repetitive loops long agents fall into\n)\n```\n\nThe `presence_penalty` is the non-obvious one: long agent rollouts are prone to getting stuck\nrepeating a failing action, and a mild penalty keeps the trajectory exploring.\n\n## The take\n\nWhat I like about Agents-A1 is that it's an honest systems argument, not a parameter flex. The\nrecipe is the contribution: a knowledge-action graph that makes verified long-horizon data a\nrenewable resource, and an on-policy distillation stage that folds many specialists into one\nsmall model without the capability erosion you'd expect. 
It converges with a clear 2026 theme —\n[on-policy distillation](/arxiv/2026-06-30) is becoming the default way to *integrate*\ncapabilities rather than trade them off, and [horizon, not size](/articles/how-llm-inference-works),\nis where the agentic gains are now coming from.\n\nThe caveats are the usual ones for a benchmark-led release: the expert count and MoE routing\naren't disclosed, the \"trillion-parameter performance\" framing rests on benchmark parity rather\nthan a fitted scaling law, and benchmark SOTA is not the same as robustness in a messy\nproduction loop. But the direction is the point. If a 35B model with 3B active parameters can be\ntrained to sit in the frontier's agentic band, the interesting frontier stops being *how big*\nand becomes *how far* — how long the horizon, how many the domains, how good the verifiers.\n\n---\n\n*Built on [Agents-A1: Reaching Trillion-Parameter Performance with a 35B Agent](https://arxiv.org/abs/2606.30616)\n(InternScience, 2026), the [project page](https://internscience.github.io/Agents-A1/), and the\n[model release](https://huggingface.co/InternScience/Agents-A1) (Apache-2.0). Benchmark figures\nare quoted from the tech report and model card.*\n","readingTimeMins":9,"url":"https://ai.thesatyajit.com/articles/agents-a1","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"FAST-LIO2 from scratch: LiDAR-inertial odometry you can actually reproduce","description":"A direct, tightly-coupled LiDAR-inertial SLAM system built on an iterated error-state Kalman filter and an incremental k-d tree. 
A full walk through the math and the real code — IMU propagation, scan deskewing, the point-to-plane iterated update with its reformulated Kalman gain, and the ikd-Tree map — with C++ from a clean reimplementation and simplified Python, so you can rebuild it without ROS and understand SLAM.","date":"2026-06-28","tags":["slam","lidar","state-estimation","point-cloud","explainer"],"draft":false,"featured":false,"interest":3,"helpful":4,"kind":"articles","slug":"fast-lio2-lidar-inertial-odometry","body":"FAST-LIO2 is the LiDAR-inertial odometry I keep coming back to: it's accurate, it runs at\n100 Hz on a laptop, it survives 1000 deg/s rotations, and the whole thing is one tight loop\naround a Kalman filter. But the official code is a maze of templates, and the cleanest\nannotated fork is in Chinese. So this is the article I wanted: what FAST-LIO2 actually\n*does*, derived from first principles, with real code — and a path to rebuild the core\nwithout ROS, because once you can do that, you understand SLAM.\n\nIf the Kalman filter isn't fresh, read [my Kalman piece](/articles/kalman-filter) first —\nFAST-LIO2 is exactly the \"iterated, error-state, on-manifold\" filter that article ends on,\nfed by a high-rate IMU and corrected by thousands of LiDAR points per scan.\n\n## The whole system in one loop\n\nThe problem: a LiDAR gives you ~100k 3D points per scan at 10 Hz, but a scan takes ~100 ms\nduring which the sensor moves, so the points are distorted; and LiDAR alone is slow to\nregister and fragile under fast motion. An IMU gives you 200–1000 Hz acceleration and\nangular velocity — great for short-term motion, but it drifts. 
Fuse them tightly and each\nfixes the other's weakness.\n\n<Figure\n  src=\"/articles/fastlio2/x1.png\"\n  alt=\"FAST-LIO2 system overview: IMU and LiDAR inputs feed forward propagation, points accumulation, backward propagation, residual computation, and an iterated state update that loops until converged; on the right an ikd-Tree map handles point-wise insert, box-wise delete, and kNN search.\"\n  caption=\"FAST-LIO2's system overview (paper, Figure 1). Left: the state-estimation loop — IMU forward propagation, point accumulation, backward-propagation deskew, point-to-plane residual, iterated state update, repeat until converged. Right: the ikd-Tree map — incremental insert with on-tree downsampling, box-wise delete as the map window moves, and kNN search feeding the residual.\"\n/>\n\nBefore the math, here's the cycle as a sequence — what each stage does and, just as\nimportant, *why it has to be there*. It runs once per LiDAR scan; the IMU drives the\nprediction in between:\n\n<LioFlow />\n\nTwo contributions set FAST-LIO2 apart from its predecessor:\n\n1. **Direct registration.** No edge/plane feature extraction — it registers *raw* points to\n   the map by point-to-plane residuals. Less to tune, and it uses all the geometry.\n2. **The ikd-Tree.** An incremental k-d tree that inserts, deletes, downsamples, and\n   re-balances in place, so the map updates in real time instead of being rebuilt.\n\nUnderneath both is a **tightly-coupled iterated error-state Kalman filter on a manifold**.\nLet's build it piece by piece. I'll quote C++ from\n[`zlwang7/S-FAST_LIO`](https://github.com/zlwang7/S-FAST_LIO) — a clean reimplementation\nthat writes the filter out explicitly instead of hiding it in template magic — and give\nsimplified NumPy alongside.\n\n## The state lives on a manifold\n\nYou can't store an orientation as a 3-vector and add to it — rotations live on the manifold\n$SO(3)$. 
FAST-LIO2 tracks a **24-dimensional nominal state** but a **23-dimensional error\nstate** (and covariance). The one-dimension gap comes entirely from gravity: it is stored as\na 3-vector but constrained to the sphere $S^2$, so it has only 2 degrees of freedom. Each\n$SO(3)$ rotation counts as 3 in both tallies (its tangent dimension, not the 4 components of\na quaternion). The state is:\n\n$$\n\\mathbf{x} = \\big[\\ \\mathbf{p},\\ \\mathbf{R},\\ \\mathbf{R}_{L}^{I},\\ \\mathbf{t}_{L}^{I},\\\n\\mathbf{v},\\ \\mathbf{b}_g,\\ \\mathbf{b}_a,\\ \\mathbf{g}\\ \\big]\n$$\n\nposition, attitude, LiDAR→IMU extrinsic rotation and translation, velocity, gyro bias,\naccel bias, and gravity. In the clean code that's one manifold declaration:\n\n```cpp\n// include/use-ikfom.hpp  — 24-D nominal, 23-D tangent\nMTK_BUILD_MANIFOLD(state_ikfom,\n  ((vect3, pos))      ((SO3, rot))\n  ((SO3, offset_R_L_I)) ((vect3, offset_T_L_I))\n  ((vect3, vel))      ((vect3, bg)) ((vect3, ba))\n  ((S2, grav)));\n```\n\nWe update on the manifold with $\\boxplus$ (retraction) and measure differences with\n$\\boxminus$ — for the $SO(3)$ part, $\\mathbf{R}\\boxplus\\delta = \\mathbf{R}\\,\\mathrm{Exp}(\\delta)$\nand $\\mathbf{R}_1\\boxminus\\mathbf{R}_2 = \\mathrm{Log}(\\mathbf{R}_2^{\\!\\top}\\mathbf{R}_1)$.\nEverything else is ordinary vector $+/-$, apart from gravity, which gets its own $S^2$ retraction.\n\n## Forward propagation: ride the IMU\n\nBetween LiDAR scans, the IMU drives the state forward. The continuous kinematics are the\nstandard strapdown model — position integrates velocity, attitude integrates de-biased\nangular velocity, velocity integrates de-biased, gravity-corrected acceleration:\n\n$$\n\\dot{\\mathbf{p}} = \\mathbf{v}, \\qquad\n\\dot{\\mathbf{R}} = \\mathbf{R}\\,(\\boldsymbol{\\omega}_m - \\mathbf{b}_g)^\\wedge, \\qquad\n\\dot{\\mathbf{v}} = \\mathbf{R}(\\mathbf{a}_m - \\mathbf{b}_a) + \\mathbf{g},\n$$\n\nwith the biases doing a slow random walk. 
That's `get_f` verbatim:\n\n```cpp\n// f(x,u): continuous-time kinematics\nvect3 omega      = in.gyro - s.bg;            // ω = ω_m − b_g\nvect3 a_inertial = s.rot * (in.acc - s.ba);   // R(a_m − b_a)\nres(i)      = s.vel[i];                        // ṗ = v\nres(i + 3)  = omega[i];                        // Ṙ ← ω\nres(i + 12) = a_inertial[i] + s.grav[i];       // v̇ = R(a_m−b_a) + g\n```\n\nThe predict step pushes the *mean* forward by $\\mathbf{x} \\boxplus (\\Delta t\\,\\mathbf{f})$ and\nthe *covariance* forward with the error-state Jacobians $\\mathbf{F_x},\\mathbf{F_w}$:\n\n$$\n\\hat{\\mathbf{x}} = \\mathbf{x}\\boxplus(\\Delta t\\,\\mathbf{f}), \\qquad\n\\hat{\\mathbf{P}} = \\mathbf{F_x}\\,\\mathbf{P}\\,\\mathbf{F_x}^{\\!\\top} +\n(\\Delta t\\,\\mathbf{F_w})\\,\\mathbf{Q}\\,(\\Delta t\\,\\mathbf{F_w})^{\\!\\top}.\n$$\n\n```cpp\nvoid predict(double &dt, Matrix<double,12,12> &Q, const input_ikfom &i_in) {\n    Matrix<double,24,1>  f_     = get_f(x_, i_in);    // 24×1\n    Matrix<double,24,23> df_dx_ = df_dx(x_, i_in);    // ∂f/∂x\n    Matrix<double,24,12> df_dw_ = df_dw(x_, i_in);    // ∂f/∂w\n    x_ = x_.plus(f_, dt);                              // x ⊞ (dt·f)\n    // F_x = I + dt·A·df_dx ,  F_w = dt·df_dw  (assembled via the boxplus Jacobian)\n    P_ = F_x1 * P_ * F_x1.transpose() + (dt*F_w1) * Q * (dt*F_w1).transpose();\n}\n```\n\nIn NumPy the shape of it is just:\n\n```python\ndef predict(x, P, imu, dt, Q):\n    f      = get_f(x, imu)              # kinematics above\n    F_x, F_w = jacobians(x, imu, dt)   # error-state transition + noise maps\n    x = boxplus(x, f * dt)             # advance the mean on the manifold\n    P = F_x @ P @ F_x.T + F_w @ Q @ F_w.T\n    return x, P\n```\n\nThis runs once per IMU sample, and the per-sample poses are cached — we need them next.\n\n## Backward propagation: deskew the scan\n\nBecause the scan sweeps over time while the platform moves, every point was measured from a\nslightly different pose. 
Stack them naively and a flat wall comes out sheared:\n\n<Deskew />\n\nFAST-LIO2 fixes this with **backward propagation**: walk the cached IMU poses from the\nscan-end time backward, and transform each point from the pose it was *actually* sampled at\ninto the scan-end frame. For a point sampled at time $\\rho_j$ with the IMU pose\n$(\\mathbf{R}_j,\\mathbf{t}_j)$ relative to the scan-end pose $(\\mathbf{R}_e,\\mathbf{t}_e)$:\n\n$$\n\\mathbf{p}^{\\text{end}}_j = \\mathbf{R}_{L}^{I\\,\\top}\\!\\Big(\\mathbf{R}_e^{\\!\\top}\\big(\\mathbf{R}_j(\\mathbf{R}_{L}^{I}\\mathbf{p}_j + \\mathbf{t}_{L}^{I}) + (\\mathbf{t}_j - \\mathbf{t}_e)\\big) - \\mathbf{t}_{L}^{I}\\Big)\n$$\n\nwhich is exactly the compensation in `UndistortPcl`:\n\n```cpp\nM3D R_i(R_imu * Exp(angvel_avr, dt));        // attitude at this point's sample time\nV3D T_ei(pos_imu + vel_imu*dt + 0.5*acc_imu*dt*dt - imu_state.pos);\nV3D P_compensate = imu_state.offset_R_L_I.conjugate() *\n    (imu_state.rot.conjugate() * (R_i * (imu_state.offset_R_L_I * P_i\n     + imu_state.offset_T_L_I) + T_ei) - imu_state.offset_T_L_I);\n```\n\nNow every point lives in one consistent frame and it's safe to register against the map.\n\n## The measurement: point-to-plane\n\nFAST-LIO2 doesn't extract features. For each deskewed point it transforms it into the world\nwith the current state, finds its 5 nearest map points via the ikd-Tree, fits a plane to\nthem, and the **residual is the point-to-plane distance** — zero when the point sits exactly\non the surface:\n\n<Figure\n  src=\"/articles/fastlio2/x2.png\"\n  alt=\"The point-to-plane measurement model: a scan point (red) and the corresponding plane fit from nearby map points (blue), with the plane normal u_j; the residual is the signed distance from the point to the plane.\"\n  caption=\"The measurement model (paper, Figure 2): a scan point (red) is matched to the local plane fit from its nearest map points (blue). 
The residual is the signed point-to-plane distance along the normal uⱼ — zero when the estimated pose lands the point on the surface.\"\n/>\n\nFor a scan point $\\mathbf{p}_j$ in the LiDAR body frame, transformed to world\n$\\mathbf{p}^W_j = \\mathbf{R}(\\mathbf{R}_L^I\\mathbf{p}_j+\\mathbf{t}_L^I)+\\mathbf{p}$ (the trailing $\\mathbf{p}$ is the state's position, not the point), a plane with\nunit normal $\\mathbf{u}_j$ and offset $d_j$ gives residual $z_j = \\mathbf{u}_j^{\\!\\top}\\mathbf{p}^W_j + d_j$.\nThat's `h_share_model`:\n\n```cpp\nV3D p_global(s.rot * (s.offset_R_L_I * p_body + s.offset_T_L_I) + s.pos);  // to world\nikdtree.Nearest_Search(point_world, NUM_MATCH_POINTS, points_near, sqDis); // 5 nearest\nif (esti_plane(pabcd, points_near, 0.1f)) {                                 // fit plane (a,b,c,d)\n    float pd2 = pabcd(0)*x + pabcd(1)*y + pabcd(2)*z + pabcd(3);            // point-to-plane dist\n    ...\n}\n// Jacobian row (w.r.t. attitude θ and extrinsic), residual = −distance\nV3D C(s.rot.conjugate() * norm_vec);          // Rᵀu\nV3D A(point_I_crossmat * C);                  // (R_L^I p + t_L^I)^∧ Rᵀu\nekfom_data.h_x.block<1,12>(i,0) << norm_p.x, norm_p.y, norm_p.z, A, ...;   // [ u | A | … ]\nekfom_data.h(i) = -norm_p.intensity;          // the residual\n```\n\nThe crucial detail: the Jacobian $\\mathbf{H}$ is $m\\times 12$ — $m$ runs into the thousands of points,\nbut there are only **12 columns** (6 for pose, 6 for the extrinsic), because a single LiDAR scan can't\nobserve velocity, biases, or gravity directly. Hold that thought; it's why the next step is\nfast. 
In NumPy:\n\n```python\ndef build_H_z(points_body, x, ikdtree, map_pts, R_LI, t_LI):\n    H, z = [], []\n    for p in points_body:\n        pw = x.R @ (R_LI @ p + t_LI) + x.p          # body → world\n        nn = ikdtree.nearest(pw, k=5)               # 5 nearest map points\n        n, d = fit_plane(map_pts[nn])               # unit normal, offset\n        r = n @ pw + d                              # point-to-plane distance\n        if abs(r) < 0.1:                            # keep confident matches\n            pI = R_LI @ p + t_LI\n            A  = skew(pI) @ (x.R.T @ n)             # ∂r/∂θ  (attitude block)\n            H.append(np.concatenate([n, A]))        # [ normal | attitude ]\n            z.append(-r)\n    return np.array(H), np.array(z)                 # H: m×6 (here), z: m\n```\n\n## The iterated update, and the gain that makes it cheap\n\nA single EKF update would linearize the very-nonlinear point-to-plane fit once, at a\npossibly-wrong pose, and be off. So FAST-LIO2 **iterates**: re-associate, rebuild $\\mathbf{H}$\nat the latest estimate, take one Kalman step, repeat until the correction is tiny. Watch the\nscan snap onto the map:\n\n<IEKFRegister />\n\nEach iteration is\n\n$$\n\\delta\\mathbf{x} = \\mathbf{K}\\,\\mathbf{z} - (\\mathbf{I}-\\mathbf{K}\\mathbf{H})(\\mathbf{x}^\\kappa \\boxminus \\hat{\\mathbf{x}}),\n\\qquad \\mathbf{x}^{\\kappa+1} = \\mathbf{x}^\\kappa \\boxplus \\delta\\mathbf{x},\n$$\n\nwhere $\\mathbf{z}$ stacks the *negated* residuals (the code stores `-distance`) and the second term pulls the\nestimate back toward the IMU prediction $\\hat{\\mathbf{x}}$, iterating $\\kappa$ until every component of $\\delta\\mathbf{x}$ drops below $10^{-3}$. The\npiece that makes FAST-LIO *fast* is the **reformulated Kalman gain**. The textbook form,\n\n$$\n\\mathbf{K} = \\hat{\\mathbf{P}}\\mathbf{H}^{\\!\\top}(\\mathbf{H}\\hat{\\mathbf{P}}\\mathbf{H}^{\\!\\top}+\\mathbf{R})^{-1},\n$$\n\ninverts an $m\\times m$ matrix — and $m$ is *thousands* of points. 
FAST-LIO uses the\ninformation-form identity to rewrite it as\n\n$$\n\\mathbf{K} = (\\mathbf{H}^{\\!\\top}\\mathbf{R}^{-1}\\mathbf{H} + \\hat{\\mathbf{P}}^{-1})^{-1}\\mathbf{H}^{\\!\\top}\\mathbf{R}^{-1},\n$$\n\nwhich inverts a $23\\times 23$ matrix — the **state** dimension — no matter how many points\nthere are. That's the whole trick, and in clean code it's one block:\n\n```cpp\n// R is a scalar (LASER_POINT_COV = 0.001), so R⁻¹ = 1/R\nauto K_front = (HTH / R + P_.inverse()).inverse();      // (HᵀR⁻¹H + P⁻¹)⁻¹  — 23×23\nK = K_front.block<23,12>(0,0) * H.transpose() / R;      // … Hᵀ R⁻¹\nMatrix<double,23,1> dx_ = K * dyn_share.h               // K z\n       + (Matrix<double,23,23>::Identity() - K*H) * dx_new;  // (I−KH)(x ⊟ x̂)\nx_ = x_.boxplus(dx_);\n// convergence: every |dx_[j]| < epsi (0.001); then update covariance\nP_ = (Matrix<double,23,23>::Identity() - K*H) * P_;\n```\n\nThe same loop in NumPy, with the cheap gain spelled out:\n\n```python\ndef update_iterated(x, P, points_body, ikdtree, map_pts,\n                    R_LI=np.eye(3), t_LI=np.zeros(3), R=1e-3, max_iter=4, eps=1e-3):\n    x_prior = x.copy()\n    n = P.shape[0]                                   # 23 (error-state dim)\n    for _ in range(max_iter):\n        H6, z = build_H_z(points_body, x, ikdtree, map_pts, R_LI, t_LI)  # relinearize\n        # build_H_z returns only the position + attitude columns (m×6); the other\n        # error states are unobserved by a scan, so pad them with zero columns\n        H = np.zeros((len(z), n)); H[:, :6] = H6.reshape(-1, 6)\n        # information-form gain: invert (state × state), independent of len(z)\n        S = H.T @ H / R + np.linalg.inv(P)           # n×n\n        K = np.linalg.solve(S, H.T) / R              # K = S⁻¹ Hᵀ R⁻¹\n        dx = K @ z + (np.eye(n) - K @ H) @ (-boxminus(x, x_prior))\n        x  = boxplus(x, dx)\n        if np.max(np.abs(dx)) < eps:\n            break\n    P = (np.eye(n) - K @ H) @ P\n    return x, P\n```\n\nThat's the engine. The converged $\\mathbf{x}$ is your odometry output, published at LiDAR\nrate.\n\n## The map: an incremental k-d tree\n\nThe nearest-neighbor search in the measurement step is the hot path, and the map is\n*growing and moving*. A static k-d tree would be rebuilt every scan — fatal. 
The **ikd-Tree**\ninstead inserts points in place, downsamples on the tree, deletes whole regions with one\nbox-wise delete as the local map window slides with the sensor, and lazily re-balances only\nthe subtrees that get lopsided:\n\n<IkdMap />\n\n<Figure\n  src=\"/articles/fastlio2/x3.png\"\n  alt=\"2D illustration of ikd-Tree map region management: a local map cube around the sensor that slides as the platform moves, with regions added at the leading edge and removed at the trailing edge.\"\n  caption=\"Map region management (paper, Figure 3): the ikd-Tree keeps a local map window around the sensor. As the platform moves, new regions are inserted and far regions are removed with box-wise deletes — keeping the active map bounded and the kNN query fast.\"\n/>\n\nIn code it's a handful of calls:\n\n```cpp\nikdtree.Build(feats_down_world->points);          // first scan\nikdtree.Add_Points(PointToAdd, true);             // incremental insert + on-tree downsample\nikdtree.Delete_Point_Boxes(cub_needrm);           // box-wise delete (window slid)\nikdtree.Nearest_Search(point_world, 5, near, d);  // kNN, inside the measurement step\n```\n\nThe payoff is real: on the authors' benchmarks FAST-LIO2 spends *less* time per scan than\nFAST-LIO while holding a *larger* map, on both Intel and Arm.\n\n<Figure\n  src=\"/articles/fastlio2/x6.png\"\n  alt=\"Processing time per LiDAR scan over time for FAST-LIO and FAST-LIO2 on Intel and Arm CPUs (log scale, top), and the number of map points held over time (bottom); FAST-LIO2 is consistently faster while keeping more map points.\"\n  caption=\"Per-scan processing time (paper, Figure 8): FAST-LIO2 (cyan/red) stays below FAST-LIO (green/purple) on both Intel and Arm — while the bottom panel shows it maintaining a larger map. 
The ikd-Tree is why the direct, all-points approach is still real-time.\"\n/>\n\n## Putting it together — and dropping ROS\n\nHere's the entire main loop, which is shorter than you'd expect:\n\n```cpp\nwhile (running) {\n  if (sync_packages(Measures)) {                  // group IMU + one LiDAR scan by time\n    p_imu->Process(Measures, kf, feats_undistort); // forward-propagate + deskew\n    downSizeFilterSurf.filter(*feats_down_body);   // voxel-downsample the scan\n    kf.update_iterated_dyn_share_modified(         // the iterated point-to-plane EKF\n        LASER_POINT_COV, feats_down_body, ikdtree, Nearest_Points,\n        NUM_MAX_ITERATIONS, extrinsic_est_en);\n    state_point = kf.get_x();                      // odometry output\n    map_incremental();                             // ikdtree.Add_Points(...)\n  }\n}\n```\n\nNotice what's *not* algorithm here: `sync_packages` is just time-aligning two streams,\n`publish_odometry`/`publish_frame_world` are ROS topics, and `tf` is bookkeeping. 
**None of\nthat is the filter.** To reproduce FAST-LIO2 without ROS you only need:\n\n| You need | You don't need |\n|---|---|\n| read IMU samples (t, ω, a) from a file/array | ROS subscribers / message types |\n| read LiDAR points (x, y, z, per-point time) | rosbag, nodelets |\n| forward-propagate + deskew (the IMU code) | tf tree |\n| a k-d tree over the map (ikd-Tree, or even scipy `cKDTree` rebuilt per scan to start) | rviz, publishers |\n| the iterated point-to-plane update | the IKFoM template layer |\n\nA no-ROS skeleton is just the loop, fed from arrays (the helpers are placeholders for the pieces above):\n\n```python\ndef run_lio(lidar_scans, Q):                   # generator: yields one pose per scan\n    x, P = init_state(), init_cov()\n    ikdtree = KDMap(voxel=0.5)                 # or scipy cKDTree to begin with\n    prev_t = float('-inf')                     # first batch: all IMU before scan 0\n    for scan in lidar_scans:                   # each: points + per-point timestamps\n        imu_poses = []                         # cached per-sample states for the deskew\n        for imu in imu_between(prev_t, scan.t_end):    # 1) forward propagation\n            x, P = predict(x, P, imu, imu.dt, Q)\n            imu_poses.append((imu.t, x.copy()))\n        pts = deskew(scan.points, imu_poses, x)        # 2) backward propagation\n        pts = voxel_downsample(pts, 0.5)               # 3) downsample\n        x, P = update_iterated(x, P, pts, ikdtree, ikdtree.points)  # 4) iterated EKF\n        ikdtree.add(transform_to_world(pts, x))        # 5) grow the map\n        prev_t = scan.t_end\n        yield x.p, x.R                                 # pose = odometry\n```\n\nStart with a `cKDTree` rebuilt each scan to get the algorithm working end to end, then swap\nin a true incremental tree once you care about speed. 
That ordering — correctness first,\nthen the ikd-Tree — is exactly how to learn it.\n\nTo anchor yourself in the real repo, here's the file → concept map for the clean version:\n\n| File | What it owns |\n|---|---|\n| `use-ikfom.hpp` | the state manifold, `get_f`, `df_dx`, `df_dw` |\n| `esekfom.hpp` | the explicit ESEKF: `predict`, `h_share_model`, `update_iterated_dyn_share_modified`, the reformulated gain |\n| `IMU_Processing.hpp` | IMU init, forward propagation, `UndistortPcl` (deskew) |\n| `ikd_Tree.cpp` | `Build`, `Add_Points`, `Delete_Point_Boxes`, `Nearest_Search` |\n| `laserMapping.cpp` | the ROS glue + main loop (the part you replace) |\n\n## A complete, runnable implementation\n\nI wrote the whole thing as one dependency-light file —\n[`fastlio2_mini.py`](/articles/fastlio2/fastlio2_mini.py) (≈390 lines, `numpy` +\n`scipy` + `rosbags`). It's a faithful teaching implementation: the on-manifold state,\nforward/backward propagation, the iterated point-to-plane update with the reformulated\ngain — all the code blocks above, assembled and tested. 
It takes a real LiDAR→IMU\nextrinsic and does per-scan voxel downsampling; the remaining simplifications, called out\nhonestly, are gravity fixed after init and a `scipy` cKDTree rebuilt per scan instead of a\ntrue ikd-Tree.\n\nThe driver is the no-ROS loop, fed from plain arrays — read IMU, propagate (caching\nposes), deskew into the IMU frame, downsample, iterated-update, grow the map:\n\n```python\ndef run_offline(imu_stream, lidar_scans, voxel=0.4, scan_voxel=0.5,\n                T_LI=None, R_LI=None, acc_cov=1e-2, gyr_cov=1e-2,\n                bacc_cov=1e-4, bgyr_cov=1e-4, init_secs=0.5):\n    Q = np.diag([gyr_cov]*3 + [acc_cov]*3 + [bgyr_cov]*3 + [bacc_cov]*3)\n    R_LI = np.eye(3) if R_LI is None else np.asarray(R_LI, float)\n    T_LI = np.zeros(3) if T_LI is None else np.asarray(T_LI, float)\n    to_imu = lambda p: (R_LI @ p.T).T + T_LI                # LiDAR points -> IMU frame\n    g, bg = imu_init([s for s in imu_stream if s[0] < imu_stream[0][0] + init_secs])\n    kf = ESEKF(g); kf.x.bg = bg\n    lmap = LocalMap(voxel); traj = []; imu_i = 0; bootstrapped = False\n    for scan in lidar_scans:\n        poses = []\n        while imu_i < len(imu_stream) and imu_stream[imu_i][0] <= scan['t_end']:\n            t, acc, gyro = imu_stream[imu_i]\n            dt = t - (imu_stream[imu_i-1][0] if imu_i > 0 else t)\n            if dt > 0: kf.predict(acc, gyro, dt, Q)        # 1. forward propagation\n            poses.append((t, kf.x.R.copy(), kf.x.p.copy()))\n            imu_i += 1\n        body = to_imu(scan['points'])                                   # extrinsic\n        pts = deskew(body, scan['dts'], poses, scan['t_end'])           # 2. backward deskew\n        pts = voxel_downsample(pts, scan_voxel)                         # sparse, even set\n        if not bootstrapped:\n            lmap.add((kf.x.R @ pts.T).T + kf.x.p); bootstrapped = True   # seed the map\n        else:\n            kf.update(pts, lmap)                            # 3. 
iterated point-to-plane EKF\n            lmap.add((kf.x.R @ pts.T).T + kf.x.p)           # 4. grow the map\n        traj.append((scan['t_end'], kf.x.p.copy(), kf.x.R.copy()))\n    return traj, lmap\n```\n\nIt ships with a synthetic world (a robot looping through a 10×10×3 m room) so you can run\nit with **no dataset at all** — and that's how I validated it:\n\n```\n$ python fastlio2_mini.py\nin-memory  : ATE rmse = 0.037 m   final = 0.019 m\nvia .bag   : ATE rmse = 0.069 m   final = 0.057 m\n```\n\nThe second line is the important one: the file also writes the simulated data to a real\nROS1 `.bag` (`sensor_msgs/Imu` + `PointCloud2`), reads it back through `read_bag()`, and\nre-runs — exercising the exact bag-parsing path you'd use on real hardware, end to end, to\n**4–7 cm** of absolute trajectory error. The math is correct.\n\n## Reading a real `.bag` — including Livox\n\n`read_bag()` uses the pure-python `rosbags` (no ROS install) and handles both standard\n`sensor_msgs/PointCloud2` (Velodyne/Ouster) and Livox's custom `CustomMsg`. 
Livox is the\ncatch with FAST-LIO data — its bags aren't PointCloud2, they're a custom message, so you\nregister the type definition and parse it yourself:\n\n```python\nfrom rosbags.typesys import Stores, get_typestore\nfrom rosbags.typesys.msg import get_types_from_msg\nts = get_typestore(Stores.ROS1_NOETIC)\nts.register(get_types_from_msg(                              # the Livox point struct\n    \"uint32 offset_time\\nfloat32 x\\nfloat32 y\\nfloat32 z\\n\"\n    \"uint8 reflectivity\\nuint8 tag\\nuint8 line\\n\", 'livox_ros_driver/msg/CustomPoint'))\nts.register(get_types_from_msg(                              # the Livox scan message\n    \"std_msgs/Header header\\nuint64 timebase\\nuint32 point_num\\nuint8 lidar_id\\n\"\n    \"uint8[3] rsvd\\nlivox_ros_driver/CustomPoint[] points\\n\", 'livox_ros_driver/msg/CustomMsg'))\n# then: msg.points -> (x,y,z, offset_time);  offset_time is per-point time for the deskew\n```\n\nSo fetching and running an actual HKU dataset is two steps:\n\n```bash\npip install gdown\ngdown 1YqxHuDKzWUcda80QKBV61lXI86TXsGjP -O avia.bag    # a Livox Avia indoor bag\npython fastlio2_mini.py avia.bag\n```\n\n## What happens on real data — and the bug that taught me the most\n\nI ran exactly that on the HKU Avia \"quick-shack\" bag (49 s, 9953 IMU + 491 Livox scans).\nIt tracks. 
The sensor is waved roughly in place — **47.4 rad of cumulative rotation** over\n**38 m of path** in a small room — and the filter stays locked the whole way, returning to\n**within ~0.6 m of its start** and reconstructing a crisp room with single, sharp walls:\n\n<Figure\n  src=\"/articles/fastlio2/avia-trajectory.png\"\n  alt=\"Two plots from running fastlio2_mini on the HKU Livox Avia bag: a top-down view showing the estimated path (colored by time) tracing through a reconstructed room point cloud, and a side x–z view; the path stays bounded within the room rather than collapsing or diverging.\"\n  caption=\"fastlio2_mini on the real HKU Livox Avia bag (491 scans, 49 s). Left: top-down path (colored by scan time) over the reconstructed map — the room's walls come out single and crisp, and the trajectory returns to within ~0.6 m of where it started after 38 m of path (≈1.6% drift). Right: the x–z side view stays a flat room slab. This is the minimal Python filter — no ROS, one ~390-line file — tracking a real Livox bag end to end. (A handful of stray points are cropped from the wide view.)\"\n/>\n\nBut it did **not** track on the first try. Getting from \"reads the bag\" to the figure above\ntook three separate fixes, and each one is a lesson worth more than the result — because\nnone of them was the filter *math*. Here they are in the order I hit them.\n\n### Bug 1 — a sensor-clock mismatch: the filter never moved\n\nThe first run collapsed exactly the way a broken LIO does: the trajectory froze within a few\ncentimetres of the origin while the IMU clearly showed the sensor swinging through ~1 rad/s\nof rotation. I almost wrote it off as \"the teaching filter isn't robust enough.\" It wasn't\nthat. I instrumented the scan timestamps and found them\nlanding at `t ≈ -1.6e9` *relative to the IMU* — an impossible 50-year gap. 
The Livox\n`CustomMsg` header stamps each scan on the **sensor's own clock** (seconds since the LiDAR\nbooted, ~361 s into this bag), while the `/livox/imu` messages are stamped on the **bag's\nrecord clock** (Unix time, ~1.6 billion). My propagate-up-to-scan-end loop compares the two:\n\n```python\nwhile imu_stream[imu_i][0] <= scan['t_end']:   # IMU time vs LiDAR header time\n    kf.predict(...)                            # ...never true → never runs\n```\n\nBecause every IMU timestamp (1.6e9) was vastly larger than every scan's header time (361),\nthat condition was *never* true. **The IMU never propagated.** The filter sat at its\ninitial pose, the update snapped each scan onto the origin-seeded map, and the whole thing\nlooked like a plausible \"data-association collapse\" — when really it was a unit/epoch bug\ntwo layers down. The fix is to ignore the Livox header entirely and timestamp each scan\nwith the **bag record time** (which `rosbags` gives you for every message, on one\nconsistent clock):\n\n```python\nfor conn, t, raw in reader.messages(connections=conns):   # t = bag record time (ns)\n    ...\n    lidar_scans.append({'t_end': t * 1e-9, 'points': pts, 'dts': dts})  # not msg.header!\n```\n\nThat one change is the difference between a frozen origin and a moving trajectory. The\nlesson: in sensor fusion, **check your clocks first.** A mismatched epoch or a\nnanosecond-vs-second unit error masquerades perfectly as a modelling failure, and you can\nwaste a day tuning covariances that were never the problem. (While here, I also wired in the\ncalibration any real deployment needs: the **LiDAR→IMU extrinsic** from `avia.yaml`,\nthe **real IMU noise** `acc_cov = gyr_cov = 0.1` — my synthetic `Q` was 100× too small — and\nper-scan voxel downsampling.)\n\n### Bug 2 — a stub deskew: the map smeared\n\nNow it moved, but the reconstructed map came out with **doubled, smeared walls** — the same\nphysical wall drawn twice, slightly rotated. 
That's within-scan distortion. My first `deskew`\nwas a stub: it dropped each point into the *nearest* cached IMU pose with no compensation for\nthe motion *across* the sweep. But a Livox sweep takes ~100 ms, and at ~1 rad/s and walking\nspeed the sensor rotates and translates meaningfully in that window — so every point has to\nbe carried from the pose it was **actually sampled at** to the scan-end frame. That's the\nbackward propagation from [earlier](#backward-propagation-deskew-the-scan): I interpolate the\nIMU-propagated trajectory (rotation on $SO(3)$, position linearly) to each point's capture\ntime before registering. Single-line idea, big effect — the per-plane thickness of the\nreconstructed map drops to **~4 cm** and the doubled walls collapse into one.\n\n### Bug 3 — an outlier gate that starved the update\n\nWith the deskew fixed it tracked cleanly for ~150 scans and then **diverged** — a clean\nstraight ramp off into space, the unmistakable signature of the LiDAR constraint dropping out\nand the IMU dead-reckoning. The cause was a gate I'd copied *too* faithfully. FAST-LIO accepts\na point-to-plane match with a range-normalized test (`s = 1 − 0.9|d|/√range`); on its dense\nclouds that's fine. On my sparse, voxel-downsampled scans, the moment the prediction was\nslightly off it rejected **every** correspondence, the update was skipped, and with nothing\nto correct it the pose ran away. A gentler metric gate (tolerance scaled mildly with range)\nkeeps hundreds of inliers per scan, and the filter stays locked for the whole bag. The\nlesson: **an outlier gate that's correct in a dense reference can starve a sparse\nreimplementation** — watch the inlier *count*, not just the residual.\n\nAll of these are wired into `run_offline`, and the CLI uses the Avia values by default, so\n`python fastlio2_mini.py avia.bag` reproduces the figure above. 
The honest caveats that\nremain are the ones this is a *teaching* filter for: a `cKDTree` rebuilt per scan (so a full\nbag is a few minutes offline, not sensor-rate), gravity fixed at init rather than estimated\non $S^2$, no loop closure, and a handful of stray points where fast rotation meets the narrow\n~70° FOV. Real FAST-LIO2's ikd-Tree, in-state gravity, and tighter handling close those gaps.\nBut the spine — the five steps, the manifold state, the reformulated gain — is exactly what's\nrunning here, and it's enough to track a real Livox bag and rebuild the room.\n\n## The whole file, end to end\n\nEverything above — the SO(3) helpers, the manifold state, forward/backward propagation,\nthe iterated point-to-plane update with the reformulated gain, the map, and the no-ROS\nbag reader — is one self-contained file. It's deliberately unoptimized for readability\n(a `cKDTree` rebuilt per scan, plain Python loops), but it runs and it tracks the real\nAvia bag. Here it is in full (393 lines) — expand to read or copy the whole thing,\nor [download it](/articles/fastlio2/fastlio2_mini.py):\n\n<CodeCollapse label=\"fastlio2_mini.py — the whole file\" collapsedHeight={460}>\n\n```python\n\"\"\"\nfastlio2_mini.py — a minimal, ROS-free FAST-LIO2-style LiDAR-inertial odometry.\n\nA teaching reimplementation of the FAST-LIO2 core: an iterated error-state Kalman\nfilter on SO(3), fed a high-rate IMU and corrected by raw point-to-plane LiDAR\nresiduals over an incremental k-d-tree map. It supports a LiDAR->IMU extrinsic and\nper-scan voxel downsampling; the simplifications vs. the paper (called out where\nthey matter) are gravity fixed after init and a scipy cKDTree rebuilt per scan\ninstead of a true ikd-Tree. 
Everything else — the manifold state, forward/backward\npropagation, the reformulated Kalman gain — is faithful.\n\n    pip install numpy scipy rosbags\n    python fastlio2_mini.py                 # runs a synthetic demo (no bag needed)\n    python fastlio2_mini.py avia.bag        # runs a real Livox Avia bag (calibrated)\n    # or: from fastlio2_mini import read_bag, run_offline\n\nValidated: ~4 cm ATE on an 8 s synthetic trajectory, and it tracks the real HKU\nLivox Avia bag (491 scans, ~50 s) — see run_offline's calibration arguments.\n\"\"\"\nimport numpy as np\nfrom scipy.spatial import cKDTree\n\n# ============================================================ SO(3) utilities\ndef hat(w):                                   # vector -> skew-symmetric matrix\n    return np.array([[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0]])\n\ndef Exp(w):                                   # so(3) -> SO(3)  (Rodrigues)\n    th = np.linalg.norm(w)\n    if th < 1e-9:\n        return np.eye(3) + hat(w)\n    K = hat(w / th)\n    return np.eye(3) + np.sin(th) * K + (1 - np.cos(th)) * K @ K\n\ndef Log(R):                                   # SO(3) -> so(3)\n    c = np.clip((np.trace(R) - 1) / 2, -1, 1)\n    th = np.arccos(c)\n    v = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])\n    return 0.5 * v if th < 1e-9 else (th / (2 * np.sin(th))) * v\n\n# ============================================================ state on the manifold\n# error-state layout (15): [ p(0:3) th(3:6) v(6:9) bg(9:12) ba(12:15) ]\nclass State:\n    def __init__(s):\n        s.p = np.zeros(3); s.R = np.eye(3); s.v = np.zeros(3)\n        s.bg = np.zeros(3); s.ba = np.zeros(3)\n    def copy(s):\n        t = State()\n        t.p, t.R, t.v, t.bg, t.ba = s.p.copy(), s.R.copy(), s.v.copy(), s.bg.copy(), s.ba.copy()\n        return t\n\ndef boxplus(x, d):                            # x ⊞ d  (retract onto the manifold)\n    y = x.copy()\n    y.p += d[0:3]; y.R = x.R @ Exp(d[3:6]); y.v += 
d[6:9]\n    y.bg += d[9:12]; y.ba += d[12:15]\n    return y\n\ndef boxminus(a, b):                           # a ⊟ b  (tangent so that a = b ⊞ d)\n    d = np.zeros(15)\n    d[0:3] = a.p - b.p; d[3:6] = Log(b.R.T @ a.R); d[6:9] = a.v - b.v\n    d[9:12] = a.bg - b.bg; d[12:15] = a.ba - b.ba\n    return d\n\n# ============================================================ the filter\nclass ESEKF:\n    def __init__(s, g):\n        s.x = State(); s.P = np.eye(15) * 1e-2; s.g = g.copy()\n\n    def predict(s, am, wm, dt, Q):\n        \"\"\"Forward propagation: integrate one IMU sample, inflate covariance.\"\"\"\n        x = s.x\n        w = wm - x.bg                          # de-biased angular velocity\n        a = x.R @ (am - x.ba) + s.g            # de-biased, gravity-corrected accel (world)\n        # --- nominal mean ---\n        x.p = x.p + x.v * dt + 0.5 * a * dt * dt\n        Rn = x.R @ Exp(w * dt)\n        x.v = x.v + a * dt\n        x.R = Rn\n        # --- error-state transition F_x and noise map F_w (paper Eq. 
7/8) ---\n        A = np.zeros((15, 15))\n        A[0:3, 6:9] = np.eye(3)                # dp/dv\n        A[3:6, 3:6] = -hat(w); A[3:6, 9:12] = -np.eye(3)      # dth/dth, dth/dbg\n        A[6:9, 3:6] = -x.R @ hat(am - x.ba); A[6:9, 12:15] = -x.R   # dv/dth, dv/dba\n        Fx = np.eye(15) + A * dt\n        Fw = np.zeros((15, 12))\n        Fw[3:6, 0:3] = -np.eye(3); Fw[6:9, 3:6] = -x.R\n        Fw[9:12, 6:9] = np.eye(3); Fw[12:15, 9:12] = np.eye(3)\n        s.P = Fx @ s.P @ Fx.T + (Fw * dt) @ Q @ (Fw * dt).T\n\n    def update(s, pts_body, lmap, R=1e-3, max_iter=4, eps=1e-3):\n        \"\"\"Iterated point-to-plane update with the reformulated Kalman gain.\"\"\"\n        x_prior = s.x.copy()\n        n = 15; K = None; Hfull = None\n        for _ in range(max_iter):\n            x = s.x\n            pw = (x.R @ pts_body.T).T + x.p     # body -> world at current estimate\n            H_rows, z = [], []\n            for i in range(len(pts_body)):\n                nrm, off, ok = lmap.fit_plane(pw[i])    # nearest-5 plane via kd-tree\n                if not ok:\n                    continue\n                r = nrm @ pw[i] + off          # point-to-plane distance\n                # FAST-LIO weights acceptance by range (`s = 1 - 0.9|d|/sqrt(range)`),\n                # but on a sparse, voxel-downsampled scan that gate can starve a slightly-\n                # off prediction of *all* correspondences — the update is then skipped and\n                # the pose dead-reckons away. 
We keep a plain metric gate (correct deskew\n                # already removes the smear a tight gate was meant to fight) with a mild\n                # range allowance so far points must still fit reasonably.\n                rng = np.linalg.norm(pts_body[i])\n                if abs(r) > 0.3 + 0.05 * rng:\n                    continue\n                Hr = np.zeros(15)\n                Hr[0:3] = nrm\n                Hr[3:6] = hat(pts_body[i]) @ (x.R.T @ nrm)   # d(residual)/d(theta)\n                H_rows.append(Hr); z.append(r)\n            if len(H_rows) < 10:\n                break\n            H = np.array(H_rows); z = np.array(z)\n            dx_prior = boxminus(s.x, x_prior)\n            # reformulated gain: invert a 15x15 (state), NOT an mxm (measurements)\n            S = H.T @ H / R + np.linalg.inv(s.P)\n            K = np.linalg.solve(S, H.T) / R    # K = (H'R^-1 H + P^-1)^-1 H' R^-1\n            Hfull = H\n            dx = -K @ z - (np.eye(n) - K @ H) @ dx_prior\n            s.x = boxplus(s.x, dx)\n            if np.max(np.abs(dx)) < eps:\n                break\n        if K is not None:\n            s.P = (np.eye(n) - K @ Hfull) @ s.P\n\n# ============================================================ map (stand-in for ikd-Tree)\nclass LocalMap:\n    def __init__(s, voxel=0.4, cap=60000):\n        s.voxel = voxel; s.cap = cap; s.pts = None; s.tree = None\n    def add(s, world_pts):\n        s.pts = world_pts if s.pts is None else np.vstack([s.pts, world_pts])\n        if len(s.pts) > s.cap:\n            s.pts = s.pts[-s.cap:]\n        s.tree = cKDTree(s.pts)                # a real ikd-Tree updates in place instead\n    def fit_plane(s, p, k=5, max_d=1.0, thick=0.1):\n        d, idx = s.tree.query(p, k=k)\n        if d[-1] > max_d:\n            return None, None, False\n        near = s.pts[idx]; c = near.mean(0)\n        _, _, Vt = np.linalg.svd(near - c)     # smallest singular vector = normal\n        nrm = Vt[2]\n        if np.max(np.abs((near - c) 
@ nrm)) > thick:\n            return None, None, False           # neighbours aren't planar enough\n        return nrm, -nrm @ c, True\n\n# ============================================================ deskew (backward propagation)\ndef deskew(points, point_dts, imu_poses, t_end):\n    \"\"\"Backward propagation: transform each point from the pose it was *sampled* at\n    into the single scan-end frame, undoing the shear a moving sensor bakes into a\n    sweep. imu_poses: list of (t, R, p) propagated across the sweep; point_dts:\n    per-point time before scan end. We interpolate the propagated trajectory to each\n    point's capture time — SO(3) for rotation, linear for position — so the\n    within-sweep *rotation and velocity* are both compensated (using only the nearest\n    pose, as a naive version does, leaves fast scans warped and smears the map).\"\"\"\n    ts = np.array([q[0] for q in imu_poses])\n    R_end, p_end = imu_poses[-1][1], imu_poses[-1][2]\n    n = len(imu_poses)\n    out = np.empty_like(points)\n    for i, pb in enumerate(points):\n        t = t_end - point_dts[i]\n        j = min(max(np.searchsorted(ts, t) - 1, 0), n - 2) if n >= 2 else 0\n        if n >= 2:\n            t0, R0, p0 = imu_poses[j][0], imu_poses[j][1], imu_poses[j][2]\n            t1, R1, p1 = imu_poses[j + 1][0], imu_poses[j + 1][1], imu_poses[j + 1][2]\n            a = 0.0 if t1 == t0 else min(max((t - t0) / (t1 - t0), 0.0), 1.0)\n            R_c = R0 @ Exp(a * Log(R0.T @ R1))     # interpolate rotation on SO(3)\n            p_c = p0 + a * (p1 - p0)               # interpolate position (carries velocity)\n        else:\n            R_c, p_c = imu_poses[0][1], imu_poses[0][2]\n        wpt = R_c @ pb + p_c                       # point in world at its capture pose\n        out[i] = R_end.T @ (wpt - p_end)           # back into the scan-end frame\n    return out\n\n# ============================================================ downsample (voxel grid)\ndef voxel_downsample(pts, 
voxel=0.5):\n    \"\"\"One representative point per occupied voxel — FAST-LIO's per-scan downsample.\n    100k raw points per scan is overkill; a sparse, even set keeps the update real-time.\"\"\"\n    if len(pts) == 0:\n        return pts\n    keys = np.floor(pts / voxel).astype(np.int64)\n    _, idx = np.unique(keys, axis=0, return_index=True)\n    return pts[np.sort(idx)]\n\n# ============================================================ offline driver\ndef imu_init(imu_samples, g_mag=9.81):\n    \"\"\"Estimate gravity direction and gyro bias from a short static window.\"\"\"\n    a = np.mean([s[1] for s in imu_samples], 0)   # mean specific force\n    w = np.mean([s[2] for s in imu_samples], 0)   # mean angular velocity = gyro bias\n    g = -a / np.linalg.norm(a) * g_mag            # gravity opposes measured accel\n    return g, w\n\ndef run_offline(imu_stream, lidar_scans, voxel=0.4, scan_voxel=0.5,\n                T_LI=None, R_LI=None, acc_cov=1e-2, gyr_cov=1e-2,\n                bacc_cov=1e-4, bgyr_cov=1e-4, init_secs=0.5):\n    \"\"\"imu_stream: list of (t, acc[3], gyro[3]); lidar_scans: list of dict with\n       't_end', 'points'(N,3 body), 'dts'(N per-point time before scan end).\n\n    T_LI / R_LI: LiDAR->IMU extrinsic (point in IMU frame = R_LI @ p_lidar + T_LI).\n    acc_cov/gyr_cov: IMU noise densities (Avia's avia.yaml uses 0.1; the synthetic\n    demo is quieter). 
The process-noise Q is built from these — too small and the\n    filter trusts a stale prediction and refuses to move, too large and it's jumpy.\"\"\"\n    Q = np.diag([gyr_cov]*3 + [acc_cov]*3 + [bgyr_cov]*3 + [bacc_cov]*3)\n    R_LI = np.eye(3) if R_LI is None else np.asarray(R_LI, float)\n    T_LI = np.zeros(3) if T_LI is None else np.asarray(T_LI, float)\n    to_imu = lambda p: (R_LI @ p.T).T + T_LI         # LiDAR points -> IMU body frame\n    # --- init gravity + gyro bias from the first static window of IMU ---\n    t0 = imu_stream[0][0]\n    static = [s for s in imu_stream if s[0] < t0 + init_secs]\n    g, bg = imu_init(static)\n    kf = ESEKF(g); kf.x.bg = bg\n    lmap = LocalMap(voxel)\n    traj = []; imu_i = 0; bootstrapped = False\n    for scan in lidar_scans:\n        poses = []\n        # forward-propagate every IMU sample up to scan end, caching poses for deskew\n        while imu_i < len(imu_stream) and imu_stream[imu_i][0] <= scan['t_end']:\n            t, acc, gyro = imu_stream[imu_i]\n            dt = t - (imu_stream[imu_i-1][0] if imu_i > 0 else t)\n            if dt > 0:\n                kf.predict(acc, gyro, dt, Q)\n            poses.append((t, kf.x.R.copy(), kf.x.p.copy()))\n            imu_i += 1\n        if not poses:\n            poses = [(scan['t_end'], kf.x.R.copy(), kf.x.p.copy())]\n        body = to_imu(scan['points'])                       # into the IMU body frame\n        pts = deskew(body, scan['dts'], poses, scan['t_end'])\n        pts = voxel_downsample(pts, scan_voxel)             # sparse, even set for the update\n        if not bootstrapped:                    # seed the map from the first scan\n            lmap.add((kf.x.R @ pts.T).T + kf.x.p); bootstrapped = True\n        else:\n            kf.update(pts, lmap)                # the iterated EKF correction\n            lmap.add((kf.x.R @ pts.T).T + kf.x.p)\n        traj.append((scan['t_end'], kf.x.p.copy(), kf.x.R.copy()))\n    return traj, lmap\n\n# 
============================================================ read a .bag without ROS\n_LIVOX_DEFS = (\n    \"uint32 offset_time\\nfloat32 x\\nfloat32 y\\nfloat32 z\\nuint8 reflectivity\\nuint8 tag\\nuint8 line\\n\",\n    \"std_msgs/Header header\\nuint64 timebase\\nuint32 point_num\\nuint8 lidar_id\\nuint8[3] rsvd\\n\"\n    \"livox_ros_driver/CustomPoint[] points\\n\",\n)\n\ndef read_bag(path, imu_topic='/livox/imu', lidar_topic='/livox/lidar', g_mag=9.81):\n    \"\"\"Read IMU + LiDAR from a ROS1 bag with the pure-python `rosbags` (no ROS install).\n       Handles sensor_msgs/PointCloud2 (Velodyne/Ouster) AND livox_ros_driver/CustomMsg\n       (Livox Avia/Horizon). Livox accel (reported in g) is auto-scaled to m/s^2.\"\"\"\n    from pathlib import Path\n    from rosbags.rosbag1 import Reader\n    from rosbags.typesys import Stores, get_typestore\n    from rosbags.typesys.msg import get_types_from_msg\n    ts = get_typestore(Stores.ROS1_NOETIC)\n    ts.register(get_types_from_msg(_LIVOX_DEFS[0], 'livox_ros_driver/msg/CustomPoint'))\n    ts.register(get_types_from_msg(_LIVOX_DEFS[1], 'livox_ros_driver/msg/CustomMsg'))\n    imu_stream, lidar_scans = [], []\n    with Reader(Path(path)) as reader:\n        conns = [c for c in reader.connections if c.topic in (imu_topic, lidar_topic)]\n        for conn, t, raw in reader.messages(connections=conns):\n            msg = ts.deserialize_ros1(raw, conn.msgtype)\n            if conn.topic == imu_topic:\n                a, w = msg.linear_acceleration, msg.angular_velocity\n                imu_stream.append((t * 1e-9, np.array([a.x, a.y, a.z]), np.array([w.x, w.y, w.z])))\n            elif 'CustomMsg' in conn.msgtype:                      # Livox\n                pts, dts = parse_livox(msg)\n                lidar_scans.append({'t_end': t * 1e-9, 'points': pts, 'dts': dts})\n            else:                                                   # PointCloud2\n                pts, dts = parse_pointcloud2(msg)\n                
lidar_scans.append({'t_end': t * 1e-9, 'points': pts, 'dts': dts})\n    if imu_stream and np.mean([np.linalg.norm(s[1]) for s in imu_stream[:50]]) < 2.0:\n        imu_stream = [(t, a * g_mag, w) for t, a, w in imu_stream]   # g -> m/s^2\n    return imu_stream, lidar_scans\n\ndef parse_livox(msg):\n    \"\"\"Decode a livox_ros_driver/CustomMsg into (N,3) xyz and per-point dt-before-scan-end.\n\n    Note: we deliberately ignore msg.header.stamp here. On real Avia bags the Livox\n    header runs on the sensor's own clock (seconds-since-boot), while the IMU is stamped\n    with the bag's record clock (Unix time). Mixing them silently breaks IMU/LiDAR sync,\n    so read_bag uses the bag record time `t` for every scan's t_end and only uses the\n    per-point offsets here for deskew.\"\"\"\n    P = msg.points\n    xyz = np.array([[p.x, p.y, p.z] for p in P], float)\n    off = np.array([p.offset_time for p in P], float) * 1e-9        # ns -> s from scan start\n    keep = np.linalg.norm(xyz, axis=1) > 0.5\n    xyz, off = xyz[keep], off[keep]\n    return xyz, (off.max() - off if len(off) else off)             # dt before scan end\n\ndef parse_pointcloud2(msg):\n    \"\"\"Decode a sensor_msgs/PointCloud2 into (N,3) xyz + per-point time offset.\"\"\"\n    dtype = np.dtype({'names': [f.name for f in msg.fields],\n                      'formats': [_PF[f.datatype] for f in msg.fields],\n                      'offsets': [f.offset for f in msg.fields],\n                      'itemsize': msg.point_step})\n    arr = np.frombuffer(msg.data, dtype=dtype, count=msg.width * msg.height)\n    xyz = np.stack([arr['x'], arr['y'], arr['z']], -1).astype(float)\n    # the per-point time field is named 'time'/'t'/'offset_time' depending on driver\n    tcol = next((n for n in ('time', 't', 'offset_time', 'timestamp') if n in arr.dtype.names), None)\n    dts = (arr[tcol].astype(float) if tcol else np.zeros(len(xyz)))\n    if dts.max() > 1.0:                         # ns/us -> s heuristics\n        
dts = dts * (1e-9 if dts.max() > 1e6 else 1e-3)\n    return xyz, dts\n_PF = {1: 'i1', 2: 'u1', 3: 'i2', 4: 'u2', 5: 'i4', 6: 'u4', 7: 'f4', 8: 'f8'}\n\n# ============================================================ synthetic world (no dataset)\ndef simulate_room(seconds=8, seed=0):\n    \"\"\"A robot looping through a 10x10x3 m room. Returns (imu_stream, lidar_scans,\n    truth) in exactly the format run_offline / a real bag would give you.\"\"\"\n    rng = np.random.default_rng(seed)\n    Rz = lambda a: np.array([[np.cos(a), -np.sin(a), 0], [np.sin(a), np.cos(a), 0], [0, 0, 1]])\n    truth = lambda t: (np.array([2*np.sin(0.4*t), 1.5*(1-np.cos(0.4*t)), 0.0]), Rz(0.3*np.sin(0.5*t)))\n    acc = lambda t: np.array([-0.32*np.sin(0.4*t), 0.24*np.cos(0.4*t), 0.0])\n    yawrate = lambda t: 0.15*np.cos(0.5*t)\n    g = np.array([0, 0, -9.81])\n    wall = []\n    for _ in range(4000):\n        f = rng.integers(0, 5); u, v = rng.uniform(-5, 5), rng.uniform(0, 3)\n        wall.append([[-5, u, v], [5, u, v], [u, -5, v], [u, 5, v], [u, rng.uniform(-5, 5), 0]][f])\n    wall = np.array(wall, float)\n    bg_t, ba_t = np.array([2e-3, -1e-3, 1.5e-3]), np.array([2e-2, -3e-2, 1e-2])\n    imu_stream = []                                       # 200 Hz IMU\n    for k in range(int(seconds * 200)):\n        t = k / 200; _, R = truth(t)\n        am = R.T @ (acc(t) - g) + ba_t + rng.normal(0, 0.01, 3)   # specific force in body\n        wm = np.array([0, 0, yawrate(t)]) + bg_t + rng.normal(0, 1e-3, 3)\n        imu_stream.append((t, am, wm))\n    scans = []                                            # 10 Hz LiDAR, skewed over the sweep\n    for s in range(1, int(seconds * 10)):\n        tc = s / 10; p, _ = truth(tc)\n        vis = wall[np.linalg.norm(wall - p, axis=1) < 8]\n        idx = rng.choice(len(vis), size=min(400, len(vis)), replace=False)\n        pts, dts = [], []\n        for j, kk in enumerate(idx):\n            tau = tc - 0.1 + (j / len(idx)) * 0.1; pp, RR = truth(tau)\n           
 pts.append(RR.T @ (vis[kk] - pp) + rng.normal(0, 0.01, 3)); dts.append(tc - tau)\n        scans.append({'t_end': tc, 'points': np.array(pts), 'dts': np.array(dts)})\n    return imu_stream, scans, truth\n\ndef write_demo_bag(path, seconds=6):\n    \"\"\"Write the simulated room to a real ROS1 .bag (sensor_msgs/Imu + PointCloud2),\n    so you can exercise the read_bag() path without downloading a dataset.\"\"\"\n    import struct\n    from rosbags.rosbag1 import Writer\n    from rosbags.typesys import Stores, get_typestore\n    ts = get_typestore(Stores.ROS1_NOETIC)\n    Imu, PC2, PF = (ts.types[f'sensor_msgs/msg/{n}'] for n in ('Imu', 'PointCloud2', 'PointField'))\n    Header, Time = ts.types['std_msgs/msg/Header'], ts.types['builtin_interfaces/msg/Time']\n    Quat, Vec3 = ts.types['geometry_msgs/msg/Quaternion'], ts.types['geometry_msgs/msg/Vector3']\n    H = lambda t, f: Header(seq=0, stamp=Time(sec=int(t), nanosec=int((t % 1) * 1e9)), frame_id=f)\n    imu_stream, scans, _ = simulate_room(seconds)\n    with Writer(path) as w:\n        ci = w.add_connection('/imu', Imu.__msgtype__, typestore=ts)\n        cp = w.add_connection('/velodyne_points', PC2.__msgtype__, typestore=ts)\n        for t, a, wv in imu_stream:\n            m = Imu(header=H(t, 'imu'), orientation=Quat(x=0., y=0., z=0., w=1.),\n                    orientation_covariance=np.zeros(9),\n                    angular_velocity=Vec3(x=wv[0], y=wv[1], z=wv[2]), angular_velocity_covariance=np.zeros(9),\n                    linear_acceleration=Vec3(x=a[0], y=a[1], z=a[2]), linear_acceleration_covariance=np.zeros(9))\n            w.write(ci, int(t * 1e9), ts.serialize_ros1(m, Imu.__msgtype__))\n        flds = [PF(name=n, offset=o, datatype=7, count=1) for n, o in (('x', 0), ('y', 4), ('z', 8), ('time', 12))]\n        for sc in scans:\n            blob = b''.join(struct.pack('ffff', *p, d) for p, d in zip(sc['points'], sc['dts']))\n            m = PC2(header=H(sc['t_end'], 'lidar'), height=1, 
width=len(sc['points']), fields=flds,\n                    is_bigendian=False, point_step=16, row_step=16 * len(sc['points']),\n                    data=np.frombuffer(blob, np.uint8).copy(), is_dense=True)\n            w.write(cp, int(sc['t_end'] * 1e9), ts.serialize_ros1(m, PC2.__msgtype__))\n\ndef _ate(traj, truth):\n    err = [np.linalg.norm(p - truth(t)[0]) for t, p, _ in traj]\n    return float(np.sqrt(np.mean(np.square(err)))), float(err[-1])\n\nif __name__ == '__main__':\n    import sys\n    if len(sys.argv) > 1:                       # python fastlio2_mini.py path/to/real.bag\n        # defaults target the HKU Livox Avia bag: its avia.yaml extrinsic + noise\n        imu, scans = read_bag(sys.argv[1], imu_topic='/livox/imu', lidar_topic='/livox/lidar')\n        traj, lmap = run_offline(imu, scans, T_LI=[0.04165, 0.02326, -0.0284],\n                                 acc_cov=0.1, gyr_cov=0.1, scan_voxel=0.5, init_secs=1.5)\n        ps = np.array([p for _, p, _ in traj])\n        plen = float(np.sum(np.linalg.norm(np.diff(ps, axis=0), axis=1)))\n        print(f\"{len(traj)} poses, map={len(lmap.pts)} pts, path={plen:.2f} m, \"\n              f\"final={np.round(ps[-1], 2)}\")\n    else:                                       # self-contained demo: in-memory + a real .bag\n        imu, scans, truth = simulate_room()\n        traj, _ = run_offline(imu, scans)\n        print(\"in-memory  : ATE rmse = %.3f m   final = %.3f m\" % _ate(traj, truth))\n        write_demo_bag('/tmp/fastlio2_demo.bag')\n        imu_b, scans_b = read_bag('/tmp/fastlio2_demo.bag',\n                                  imu_topic='/imu', lidar_topic='/velodyne_points')\n        traj_b, _ = run_offline(imu_b, scans_b)\n        print(\"via .bag   : ATE rmse = %.3f m   final = %.3f m\" % _ate(traj_b, truth))\n```\n\n</CodeCollapse>\n\n## Honest notes\n\n- **It needs a decent IMU and initialization.** The filter assumes a high-rate IMU\n  (200 Hz+) and a short static period at start to estimate the 
gravity direction and biases.\n  Garbage init, garbage trajectory.\n- **Geometric degeneracy is the real failure mode.** Point-to-plane constraints vanish in a\n  long featureless tunnel or an open field — the update becomes unobservable along the\n  degenerate direction and the IMU drift takes over. This is fundamental to LiDAR odometry,\n  not a bug.\n- **It's odometry, not loop-closing SLAM.** FAST-LIO2 drifts slowly but has no global loop\n  closure; pair it with a pose-graph backend (e.g. a FAST-LIO-SLAM setup) if you need\n  globally consistent maps.\n- **The \"100 Hz\" is real but hardware-shaped.** The headline rates assume the ikd-Tree and a\n  reasonable CPU; the per-scan cost grows with map density and point count, which is exactly\n  what the downsampling and the moving window are there to bound.\n\nThe thing I'd take away: FAST-LIO2 is not a pile of heuristics — it's *one* iterated\nerror-state Kalman filter, fed a deskewed point cloud, corrected by point-to-plane residuals,\nover an incremental map, with a gain rewritten so thousands of measurements cost the same as\na few. Understand those five steps and you can rebuild it, and you understand the spine of\nmodern LiDAR SLAM.\n\n---\n\n*Built on [FAST-LIO2: Fast Direct LiDAR-Inertial Odometry](https://arxiv.org/abs/2107.06829)\n(Xu, Cai, Bai, Zhang, 2021), the original [FAST-LIO](https://arxiv.org/abs/2010.08196) and\n[ikd-Tree](https://arxiv.org/abs/2102.10808) papers, and the\n[HKU-MARS/FAST_LIO](https://github.com/hku-mars/FAST_LIO) source. 
C++ snippets are from the\nclean reimplementation [zlwang7/S-FAST_LIO](https://github.com/zlwang7/S-FAST_LIO); Python is\nsimplified for teaching.*\n","readingTimeMins":36,"url":"https://ai.thesatyajit.com/articles/fast-lio2-lidar-inertial-odometry","signal":{"interest":3,"helpful":4,"score":7,"level":3,"label":"Notable"}},{"title":"How LLM inference works: prefill, decode, and where the time goes","description":"Every generate() call runs two phases with opposite bottlenecks — a compute-bound prefill and a memory-bound decode — and almost every inference optimization targets one or the other. A walk through the full path from tokens to streamed output, the KV cache that dominates the economics, and how to tell which phase is actually slow.","date":"2026-06-28","tags":["llm","inference-optimization","systems","kv-cache","explainer"],"draft":false,"featured":false,"interest":3,"helpful":5,"kind":"articles","slug":"how-llm-inference-works","body":"When a model feels slow in production, the first question I ask is *which phase* is\nslow. Because a single `generate()` call isn't one workload — it's two, with opposite\nbottlenecks running on the same GPU:\n\n- **prefill** processes the prompt and is **compute-bound**,\n- **decode** generates tokens one at a time and is **memory-bound**.\n\nAlmost every inference optimization you'll read about targets one of these two phases.\nSo before reaching for a fix, you have to know which one is hurting. Here's the whole\npipeline, and where the time actually goes.\n\n<PrefillDecode />\n\n## From text to vectors\n\nBefore either phase, the text becomes numbers. A tokenizer — usually byte-pair encoding\n(BPE) — splits the string into integer IDs from a vocabulary of roughly 50,000 entries.\nEach ID indexes a row of the embedding table, a learned `[vocab_size, hidden_dim]`\nmatrix, so for `hidden_dim = 4096` every token becomes a 4096-dimensional vector.\n\nPosition is injected here. 
Modern models use rotary position embeddings (RoPE), which\nencode position by *rotating* each query/key vector by an angle proportional to its\nindex, rather than adding a separate positional vector. It's cheap and it's what lets\nthe same weights generalize across lengths.\n\n## Inside a layer\n\nThe embedded sequence flows through a stack of transformer layers — 32 for a 7B, 80+ for\nthe big ones. Each layer is two operations:\n\n1. **Self-attention** projects every token into a query `Q`, key `K`, and value `V`.\n   Each token's query is scored against every token's key; scale, softmax, and the scores\n   become weights that mix the values. This is the only place information moves *between*\n   positions.\n2. **Feed-forward network (FFN)** — a two-layer MLP applied to each token independently.\n   Attention routes information across positions; the FFN transforms it in place.\n\nAfter the last layer, the final position's hidden state is projected back to vocabulary\nsize, softmaxed, and sampled — that's one output token. How that projection-and-sample\ngets *driven* is exactly what differs between the two phases.\n\n## Prefill: compute-bound\n\nPrefill processes the entire prompt at once. `Q`, `K`, `V` are computed for every prompt\ntoken in parallel, and attention is a big **matrix-matrix multiply**. That's dense\narithmetic, and it saturates the GPU's math units — utilization runs near 100%. The\nmetric that captures this phase is **Time To First Token (TTFT)**: how long before the\nfirst output appears.\n\nPrefill also populates the **KV cache** — the `K` and `V` tensors for every layer get\nwritten to GPU memory so they never have to be recomputed. That cache is what makes the\nnext phase cheap, and also what makes it expensive.\n\n## Decode: memory-bound\n\nOnce the first token exists, generation switches to one token per step. 
For each new\ntoken the model computes `Q`, `K`, `V` for *that token only*; the keys and values for\neverything before it are already cached. So the attention is one query vector against a\ncached key matrix — a **matrix-vector** multiply, almost no arithmetic.\n\nAnd yet decode is the slow part per token, because the GPU still has to **stream every\nweight matrix and the entire KV cache out of memory** to do that tiny computation. The\nbottleneck flips from arithmetic to **memory bandwidth**. The metric here is **Inter-Token\nLatency (ITL)** — the gap between consecutive tokens, which is what makes a stream feel\nfast or sluggish. GPU utilization during decode can sit at 30% on a fully loaded server,\nbecause the math units are starved waiting on memory.\n\n| | prefill | decode |\n|---|---|---|\n| Work | whole prompt, parallel | one token at a time |\n| Attention shape | matrix × matrix | matrix × vector |\n| Bottleneck | compute (arithmetic) | memory bandwidth |\n| Metric | TTFT | ITL |\n| GPU util | ~95% | ~30% |\n| Optimize by | more FLOPs, better kernels | smaller cache, faster memory, batching |\n\n## The KV cache runs the economics\n\nThe cache is the single most important object in LLM serving. Prefill writes one entry per\nprompt token in a single pass; then each decode step appends exactly *one* entry and reuses\neverything already there, recomputing nothing. Watch it accumulate — bright is written this\nstep, faded is reused:\n\n<CacheGrow />\n\nThat reuse is the whole point. Without the cache, every decode step would recompute `K` and\n`V` for the whole growing sequence, so a 1000-token response costs quadratic projection\nwork. With it, each step projects only the one new token, so that work is linear; attention\nstill reads every cached entry, which is the memory traffic that makes decode\nbandwidth-bound. Toggle it and watch the per-step cost, then drag the\ncontext length to see what the cache costs in memory:\n\n<KVCache />\n\nThe trade is brutal and unavoidable: the cache grows linearly with sequence length, *per\nlayer*. 
For a 13B model it's roughly **1 MB per token**, so a 4K context is ~4 GB of VRAM\nspent on cache alone — before a single weight. And that memory competes directly with\nbatch size: every gigabyte on cache is a gigabyte not serving another request. Long\ncontexts are expensive not because of compute, but because they evict concurrency.\n\nThe standard mitigations all attack the cache from different angles:\n\n- **Quantize it** to INT8 or INT4 (it's just tensors).\n- **Sliding-window attention** — drop tokens outside a fixed window.\n- **Grouped-query attention (GQA)** — share `K`/`V` across attention heads so there are\n  fewer cached tensors. (This is exactly the change [iLLaDA](/articles/illada-diffusion-language-model)\n  and most modern models make.)\n- **PagedAttention** — the trick behind vLLM: page the cache in fixed-size blocks like an\n  OS pages virtual memory, killing fragmentation and packing in more concurrent requests.\n\n## Redesigning attention around the cache\n\nThe deeper move is to make the cache structurally smaller from the start, by changing\nattention itself. DeepSeek's V4 series does this with a hybrid of two compressed\nmechanisms: **Compressed Sparse Attention** (compress KV ~4× with softmax-gated pooling,\nthen attend sparsely) and **Heavily Compressed Attention** (consolidate KV across 128\ntokens into one entry, attend densely over those). At a 1M-token context, V4-Pro needs\nabout **27% of the single-token inference FLOPs and 10% of the KV cache** of its\npredecessor — in absolute terms, ~9.62 GiB of cache per sequence in bf16 versus an\nestimated ~83.9 GiB for the older design, and fp4/fp8 halves it again. (I went deeper on\nV4's drafter in the [DSpark write-up](/articles/deepseek-dspark).) The cache has become\nthe constraint the architecture is being designed *around*.\n\n## Quantization\n\nTraining needs FP32/BF16 for gradient stability. Inference doesn't. Dropping bit-width\nsaves memory linearly, and quality barely moves. 
Pick a size and precision:\n\n<Quantization />\n\nINT4 is the reason a 7B model runs on a 4–6 GB laptop GPU at all. Methods like GPTQ and\nAWQ use per-channel scaling to keep the lossy compression within 1–2 points of full\nprecision on standard benchmarks. And going FP16 → INT8 often roughly halves latency with\nnegligible quality loss — which makes quantization the highest-leverage single change for\nmost deployments.\n\n## The serving layer\n\nOn top of the prefill/decode loop sits the infrastructure that makes a GPU economical:\n\n- **Continuous batching** interleaves tokens from many requests on the same GPU step. This\n  is the big one: decode leaves most of the arithmetic idle, so you fill that idle\n  capacity with other requests' tokens. It's why one GPU serves dozens of users at once.\n- **Speculative decoding** drafts several tokens with a cheap model and verifies them in\n  one pass of the big model — turning sequential decode steps into one parallel\n  verification when acceptance is high. (Two whole articles' worth:\n  [DSpark](/articles/deepseek-dspark) and\n  [multi-token prediction](/articles/multi-token-prediction).)\n- **PagedAttention** for the cache memory, as above.\n\nFrameworks like vLLM, TensorRT-LLM, and TGI combine all of this. The throughput they get\ncomes mostly from the fact that decode is memory-bound, so there's spare arithmetic lying\naround for batching to soak up.\n\n## The full path\n\n1. **Tokenize** — text → integer IDs via BPE.\n2. **Embed** — IDs → vectors; RoPE rotates in position.\n3. **Prefill** — all prompt tokens through every layer in parallel; compute-bound; KV\n   cache populated; first token emitted (TTFT).\n4. **Decode loop** — one token per step: project `Q`, attend over cached `K`/`V`, run FFN,\n   sample, append to cache; memory-bound (ITL).\n5. **Detokenize** — IDs → text, streamed out.\n\n## How to actually use this\n\nThe whole point of splitting it this way is diagnosis. 
When something is slow:\n\n- **Slow to start** → you're prefill-bound. Long prompts dominate TTFT; optimize the\n  prompt path (caching, chunked prefill, more compute).\n- **Slow to stream** → you're decode-bound. Long outputs dominate ITL; the fix is *not*\n  more compute — it's a smaller cache, faster memory, or better batching.\n- **Context length is never free.** It bloats the KV cache and directly cuts how many\n  requests fit on the GPU, so it shows up as reduced throughput long before it shows up as\n  an out-of-memory error.\n\nThat last instinct is the one I'd internalize: during decode the arithmetic units are\nmostly idle, so when a decode-bound server is slow, throwing a bigger compute budget at it\ndoes nothing. The bottleneck is the memory bus. Optimize the thing that's actually full.\n","readingTimeMins":8,"url":"https://ai.thesatyajit.com/articles/how-llm-inference-works","signal":{"interest":3,"helpful":5,"score":8,"level":4,"label":"High"}},{"title":"iLLaDA: how far a masked-diffusion language model scales","description":"Almost every LLM is autoregressive — causal, left-to-right, one token per pass. iLLaDA is an 8B masked diffusion model with fully bidirectional attention, trained from scratch on 12T tokens. A walk through how masked diffusion LMs work, what iLLaDA changes over LLaDA, and the honest result: base-model parity with Qwen2.5, but a real gap that remains on instruction tuning.","date":"2026-06-28","tags":["llm","diffusion","language-models","architecture","explainer"],"draft":false,"featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"illada-diffusion-language-model","body":"Almost every language model you use is **autoregressive**: it factorizes text\nleft-to-right, $p(x) = \\prod_i p(x_i \\mid x_{<i})$, with a causal attention mask, and\ngenerates one token per forward pass. It works so well that the alternatives barely get\nairtime.\n\n**iLLaDA** is one of the alternatives, scaled up until it's hard to ignore. 
It's an 8B\n**masked diffusion** language model with *fully bidirectional* attention, trained from\nscratch on 12 trillion tokens by a team from Renmin University and ByteDance Seed — the\ndirect successor to [LLaDA](https://arxiv.org/abs/2502.09992). No causal mask, no\nleft-to-right factorization. The question it's built to answer: can a bidirectional\ndiffusion model, trained from scratch, actually keep up with a strong autoregressive\nmodel? The honest answer turns out to be *yes for base models, not yet for instruct* —\nand the path to that answer is worth understanding.\n\n## How a masked diffusion language model works\n\nForget Gaussian noise. The \"diffusion\" here is **masking** — a discrete, absorbing-state\nprocess over tokens.\n\n<Figure\n  src=\"/articles/illada-diffusion-language-model/fig1.png\"\n  alt=\"Three-panel schematic: (a) pre-training masks all tokens of a sequence independently at ratio t and a mask predictor recovers them; (b) SFT masks only response tokens; (c) sampling starts from a fully masked response at t=1 and iteratively predicts then re-masks tokens over intermediate steps down to t=0.\"\n  caption=\"The whole model on one page: (a) pre-training, (b) supervised fine-tuning, and (c) sampling all use the same mask-predictor, differing only in what gets masked (LLaDA paper, Figure 2).\"\n/>\n\n### The forward process: corrupt by masking\n\nPick a masking ratio $t \\sim \\mathcal{U}[0,1]$. Each token is independently replaced by a\nspecial `[MASK]` token with probability $t$. At $t=0$ the sequence is clean; at $t=1$\nit's fully masked. Drag $t$ and watch the corruption — and the loss weighting — change:\n\n<MaskingProcess />\n\nThe model $p_\\theta$ sees the corrupted sequence $x_t$ and is trained to predict the\n*original* tokens at every masked position at once. 
The objective is a masked\ncross-entropy, computed only on masked positions and reweighted by $1/t$:\n\n$$\n\\mathcal{L}(\\theta) \\;=\\; -\\,\\mathbb{E}_{t,\\,x_0,\\,x_t}\\!\\left[\\frac{1}{t}\\sum_{i=1}^{L}\n\\mathbf{1}\\!\\left[x_t^{i} = \\mathrm{M}\\right]\\,\\log p_\\theta\\!\\left(x_0^{i} \\mid x_t\\right)\\right]\n$$\n\nThe indicator $\\mathbf{1}[x_t^i = \\mathrm{M}]$ restricts the loss to masked positions;\nthe $1/t$ factor re-normalizes so heavily- and lightly-masked samples both contribute\ncorrectly. Averaged over $t$, this is a Monte-Carlo **upper bound on the negative\nlog-likelihood** — a principled training objective, not a heuristic. iLLaDA keeps this\n*same* objective through pre-training **and** supervised fine-tuning.\n\n### Bidirectional attention comes for free\n\nAn autoregressive model *must* hide the future — if token $i$ could attend to token\n$i{+}1$, it would just read the answer it's supposed to predict. A masked diffusion model\npredicts *masked* positions anywhere in the sequence, not \"the next\" one, so there's\nnothing to hide. Every position attends to every other, left and right:\n\n<AttentionModes />\n\nThat full context is the structural argument for diffusion LMs: on infilling and tasks\nwhere later text disambiguates earlier text, seeing both sides at every layer should\nhelp.\n\n### Generation: unmask in parallel, over a few steps\n\nGeneration runs the process backward. Start from a block of all-`[MASK]` tokens. Each\ndenoising step, the model predicts every masked position, **commits its most confident\npredictions**, and **re-masks the low-confidence ones** to try again next step. A whole\nblock resolves over a handful of steps, in confidence order — not reading order. Flip\nbetween the two paradigms:\n\n<Unmasking />\n\nThis is the crux. 
Autoregression spends one forward pass per output token, in series.\nDiffusion spends a fixed, smaller number of denoising passes over the whole block — the\npromise being fewer sequential steps, at the cost of needing enough steps for quality.\n\n<Diagram caption=\"The two directions of the same model. Training: mask a fraction t of the tokens and predict the originals (loss on masked positions only). Generation: start fully masked and iteratively unmask the confident predictions, re-masking the rest, until the block resolves.\">\n  <svg viewBox=\"0 0 640 220\" role=\"img\" aria-label=\"Training corrupts a clean sequence by masking and predicts the originals; generation starts fully masked and iteratively unmasks confident tokens.\" style={{ width: \"100%\", height: \"auto\" }}>\n    {/* training row */}\n    <text x=\"16\" y=\"34\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"var(--muted-foreground)\">training</text>\n    <rect x=\"16\" y=\"44\" width=\"120\" height=\"34\" rx=\"6\" fill=\"var(--background)\" stroke=\"var(--border)\" />\n    <text x=\"76\" y=\"65\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"var(--foreground)\">clean x₀</text>\n    <text x=\"150\" y=\"65\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"var(--muted-foreground)\">→ mask (t) →</text>\n    <rect x=\"216\" y=\"44\" width=\"120\" height=\"34\" rx=\"6\" fill=\"oklch(0.72 0.13 60)\" opacity=\"0.3\" stroke=\"var(--border)\" />\n    <text x=\"276\" y=\"65\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"var(--foreground)\">corrupted xₜ</text>\n    <text x=\"356\" y=\"65\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"var(--muted-foreground)\">→ predict →</text>\n    <rect x=\"430\" y=\"44\" width=\"130\" height=\"34\" rx=\"6\" fill=\"oklch(0.72 0.13 150)\" opacity=\"0.3\" stroke=\"var(--border)\" />\n    <text x=\"495\" y=\"65\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" 
fill=\"var(--foreground)\">x̂₀ on masked</text>\n    {/* generation row */}\n    <text x=\"16\" y=\"134\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"var(--muted-foreground)\">generation</text>\n    <rect x=\"16\" y=\"144\" width=\"120\" height=\"34\" rx=\"6\" fill=\"oklch(0.72 0.13 60)\" opacity=\"0.45\" stroke=\"var(--border)\" />\n    <text x=\"76\" y=\"165\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"var(--foreground)\">all [MASK]</text>\n    <text x=\"172\" y=\"165\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"var(--muted-foreground)\">→ unmask conf. →</text>\n    <rect x=\"246\" y=\"144\" width=\"120\" height=\"34\" rx=\"6\" fill=\"oklch(0.72 0.14 150)\" opacity=\"0.55\" stroke=\"var(--border)\" />\n    <text x=\"306\" y=\"165\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"var(--foreground)\">partly filled</text>\n    <text x=\"402\" y=\"165\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"var(--muted-foreground)\">→ repeat →</text>\n    <rect x=\"466\" y=\"144\" width=\"120\" height=\"34\" rx=\"6\" fill=\"oklch(0.72 0.15 150)\" opacity=\"0.85\" stroke=\"var(--border)\" />\n    <text x=\"526\" y=\"165\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"oklch(0.2 0 0)\">complete</text>\n    {/* loop arrow */}\n    <path d=\"M 306 178 q 0 26 -120 26 q -120 0 -120 -26\" fill=\"none\" stroke=\"var(--muted-foreground)\" strokeWidth=\"1\" strokeDasharray=\"3 3\" />\n    <text x=\"186\" y=\"214\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"var(--muted-foreground)\">re-mask low-confidence positions</text>\n  </svg>\n</Diagram>\n\n## What iLLaDA changes over LLaDA\n\niLLaDA is, more than anything, a careful **scale-up** of LLaDA — proof that the recipe\nkeeps paying off with more tokens and a better post-training pass.\n\n### A bigger, leaner backbone\n\nThe architecture is a standard dense Transformer (RMSNorm, SwiGLU, 
RoPE, no biases), but\nre-tuned for cheaper inference:\n\n| | iLLaDA | LLaDA |\n|---|---|---|\n| Attention heads | 32 | 32 |\n| Key/Value heads | **8 (GQA)** | 32 (MHA) |\n| FFN dim | 14,336 | 12,288 |\n| Vocabulary | 155,136 | 126,464 |\n| Max sequence length | **8192** | 4096 |\n| Embedding / LM head | **tied** | untied |\n| Total parameters | 7.62B | 8.02B |\n\nThe load-bearing change is **grouped-query attention** (8 KV heads instead of 32),\nadopted to shrink the cached key/value footprint at inference — plus a larger vocab,\ndoubled context, and tied embeddings.\n\n### The headline spend: 12T tokens\n\n- **Pre-training: 12T tokens**, up ~5.2× from LLaDA's 2.3T. AdamW, weight decay 0.1, LR\n  warmed to $2\\times10^{-4}$, held, then cosine-decayed to $5\\times10^{-6}$.\n- **SFT: a 25B-token instruction corpus for 12 epochs.** The new wrinkle: SFT now applies\n  the *same* masking as pre-training across the entire sequence (prompt, response, EOS),\n  rather than keeping the prompt fully visible — a more consistent objective end to end.\n\nAnd the fine-tuning clearly hadn't saturated. The SFT-epoch ablation is still climbing at\nepoch 12, with only a small GSM8K dip at epoch 6 (they stopped on compute, not convergence):\n\n<Diagram caption=\"SFT-epoch ablation (from the paper, Figure 1), redrawn. 
Accuracy on GSM8K, MATH, and MMLU-Pro keeps climbing through 12 epochs of fine-tuning — the curve had not flattened where they stopped.\">\n  <svg viewBox=\"0 0 560 240\" role=\"img\" aria-label=\"Three rising curves of accuracy versus SFT epoch for GSM8K, MATH, and MMLU-Pro, all increasing through 12 epochs.\" style={{ width: \"100%\", height: \"auto\" }}>\n    {/* axes */}\n    <line x1=\"48\" y1=\"200\" x2=\"520\" y2=\"200\" stroke=\"var(--border)\" strokeWidth=\"1\" />\n    <line x1=\"48\" y1=\"20\" x2=\"48\" y2=\"200\" stroke=\"var(--border)\" strokeWidth=\"1\" />\n    {/* y ticks 45..90 mapped 200..20 */}\n    {[45,60,75,90].map((v) => (\n      <g key={v}>\n        <text x=\"40\" y={200 - ((v-45)/45)*180 + 3} textAnchor=\"end\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"var(--muted-foreground)\">{v}</text>\n        <line x1=\"48\" y1={200 - ((v-45)/45)*180} x2=\"520\" y2={200 - ((v-45)/45)*180} stroke=\"var(--border)\" strokeOpacity=\"0.3\" strokeWidth=\"1\" />\n      </g>\n    ))}\n    {/* x ticks epochs 3,6,9,12 mapped 48..520 */}\n    {[3,6,9,12].map((e) => (\n      <text key={e} x={48 + ((e-3)/9)*460} y=\"216\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"var(--muted-foreground)\">{e}</text>\n    ))}\n    <text x=\"284\" y=\"234\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"var(--muted-foreground)\">SFT epoch</text>\n    {/* helper: x(e)=48+((e-3)/9)*460 ; y(v)=200-((v-45)/45)*180 */}\n    {/* GSM8K 86.7,84.9,88.4,89.0 */}\n    <polyline points=\"48,33.2 201,40.4 355,26.4 508,24.0\" fill=\"none\" stroke=\"oklch(0.72 0.15 150)\" strokeWidth=\"2\" />\n    <text x=\"512\" y=\"27\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"oklch(0.72 0.15 150)\">GSM8K</text>\n    {/* MATH 49.6,51.9,55.6,56.3 */}\n    <polyline points=\"48,181.6 201,172.4 355,157.6 508,154.8\" fill=\"none\" stroke=\"oklch(0.72 0.15 250)\" strokeWidth=\"2\" />\n    <text x=\"512\" y=\"158\" fontFamily=\"monospace\" fontSize=\"9\" 
fill=\"oklch(0.72 0.15 250)\">MATH</text>\n    {/* MMLU-Pro 48.4,51.5,51.8,52.2 */}\n    <polyline points=\"48,186.4 201,174.0 355,172.8 508,171.2\" fill=\"none\" stroke=\"oklch(0.72 0.15 40)\" strokeWidth=\"2\" />\n    <text x=\"512\" y=\"174\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"oklch(0.72 0.15 40)\">MMLU-Pro</text>\n  </svg>\n</Diagram>\n\n### Two inference-side tricks\n\n- **Variable-length generation.** Instead of committing to a fixed output block and\n  denoising all of it, iLLaDA appends a mask block, runs the sampler, commits confident\n  tokens, and continues until termination — so it only denoises as many positions as the\n  answer needs, rather than padding to a worst case. (The paper argues the efficiency,\n  but — notably — does not report latency or step-count numbers for it.)\n- **Confidence-based multiple-choice scoring.** Rather than a likelihood estimate, they\n  score a candidate by revealing its tokens one at a time, each step unmasking the\n  highest-confidence position, and summing the log-probs:\n  $$\n  S_{\\text{conf}}(y \\mid p) \\;=\\; \\sum_k \\log p_\\theta\\!\\left(y^{i_k} \\mid p,\\, \\tilde{y}_{k-1}\\right),\n  \\quad i_k = \\arg\\max_{i \\in \\mathcal{M}_{k-1}} p_\\theta\\!\\left(y^i \\mid p,\\, \\tilde{y}_{k-1}\\right)\n  $$\n  The authors are upfront that this is \"not a likelihood estimate\" but a task-specific\n  surrogate. 
Its ablation is modest: +1.3 PIQA, +0.6 ARC-C, +2.3 HellaSwag over\n  likelihood scoring.\n\n## The results, honestly\n\nTwo stories live in these tables, and they point in different directions.\n\n### Base models: genuine parity with Qwen2.5\n\nAs a base model, iLLaDA improves broadly over LLaDA and lands **even with Qwen2.5-7B** on\naverage — winning several benchmarks outright:\n\n<BenchBars\n  title=\"Base models — average over 8 benchmarks (%)\"\n  unit=\"\"\n  bars={[\n    { label: \"iLLaDA 8B\", value: 63.9, highlight: true },\n    { label: \"Qwen2.5 7B\", value: 63.3 },\n    { label: \"Dream 7B\", value: 61.4 },\n    { label: \"LLaDA 8B\", value: 51.1 },\n  ]}\n/>\n\nThe original LLaDA paper made the same point with its headline figure — its 8B base\nmodel tracing out a wide envelope over the LLaMA baselines across general, math, code, and\nChinese tasks:\n\n<Figure\n  src=\"/articles/illada-diffusion-language-model/fig2.png\"\n  alt=\"Radar chart over twelve benchmarks (MMLU, ARC-C, TruthfulQA, C-Eval, CMMLU, MBPP, HumanEval, Math, GSM8K and others) comparing LLaDA 8B Base against LLaMA3 8B Base and LLaMA2 7B Base; LLaDA's shaded region matches or exceeds LLaMA3 on most axes and clearly dominates LLaMA2.\"\n  caption=\"LLaDA's own headline result: the 8B base model is competitive with LLaMA3 8B and well ahead of LLaMA2 7B across zero/few-shot benchmarks (LLaDA paper, Figure 1). 
iLLaDA extends this parity story to Qwen2.5.\"\n/>\n\nThe per-benchmark picture, with the gains over LLaDA that the abstract leads on:\n\n| Base | iLLaDA | LLaDA | Qwen2.5 | Δ vs LLaDA |\n|---|---|---|---|---|\n| MMLU | 74.8 | 65.9 | 71.9 | +8.9 |\n| BBH | 71.3 | 49.7 | 63.9 | **+21.6** |\n| ARC-C | 60.8 | 45.9 | 51.5 | **+14.9** |\n| HellaSwag | 76.6 | 70.5 | 79.0 | +6.1 |\n| GSM8K | 81.9 | 70.3 | 78.9 | +11.6 |\n| MATH | 38.4 | 31.4 | 41.1 | +7.0 |\n| HumanEval | 50.0 | 35.4 | 56.7 | +14.6 |\n| MBPP | 57.8 | 40.0 | 63.6 | +17.8 |\n\niLLaDA-Base beats Qwen2.5-Base on MMLU, BBH, ARC-C, and GSM8K; Qwen still wins on\nHellaSwag, MATH, and code. But the average edges ahead — and *that's the real result*:\na from-scratch bidirectional diffusion model matching a strong autoregressive base.\n\n### Instruct models: the gap that's left\n\nThis is the part the abstract's \"competitive on several benchmarks\" softens. After\ninstruction tuning, iLLaDA **trails Qwen2.5 by ~10 points on average**, with double-digit\ngaps on the hard reasoning and coding tasks:\n\n<BenchBars\n  title=\"Instruct models — average over 7 benchmarks (%)\"\n  unit=\"\"\n  bars={[\n    { label: \"Qwen2.5 7B\", value: 77.1 },\n    { label: \"iLLaDA 8B\", value: 67.1, highlight: true },\n    { label: \"Dream 7B\", value: 60.2 },\n    { label: \"LLaDA 8B\", value: 54.5 },\n  ]}\n/>\n\n| Instruct | iLLaDA | LLaDA | Qwen2.5 | Δ vs LLaDA |\n|---|---|---|---|---|\n| MMLU | 71.6 | 65.5 | 76.6 | +6.1 |\n| MMLU-Pro | 52.3 | 37.0 | 56.3 | +15.3 |\n| GSM8K | 89.0 | 77.5 | 91.6 | +11.5 |\n| MATH | 56.7 | 42.2 | 75.5 | **+14.5** |\n| HumanEval | 65.9 | 49.4 | 84.8 | **+16.5** |\n| MBPP | 58.0 | 41.0 | 79.2 | +17.0 |\n\nThe improvement *over LLaDA* is huge and real (+12.6 average). 
The gap *to Qwen* on\nMATH (56.7 vs 75.5), HumanEval (65.9 vs 84.8), and MBPP (58.0 vs 79.2) is also real, and\nthe authors don't hide it — they point to the lack of RL alignment as part of the cause.\n\n## What I make of it\n\n- **The base-model result is the one that matters, and it's solid.** A bidirectional\n  masked diffusion LM, trained from scratch, reaching autoregressive base parity is a\n  genuine data point: diffusion LMs *scale* like AR LMs. The paradigm is viable, not a\n  curiosity.\n- **\"Competitive\" is doing some work in the abstract.** On instruction-tuned reasoning\n  and code, AR still wins by 10–20 points. Read the instruct tables before repeating the\n  headline.\n- **The efficiency case is asserted, not measured.** GQA and variable-length generation\n  are motivated by cost, but the paper reports no sampling-step counts, no latency, no\n  tokens/sec — and the number of denoising passes is *exactly* diffusion's central\n  liability. \"More efficient\" is a design argument here, not a demonstrated result.\n- **Parity wasn't cheap.** 12T tokens at 8B is a frontier-scale data spend, ~5× LLaDA and\n  on par with what strong AR models of this size consume. iLLaDA shows diffusion can reach\n  AR base parity — by paying full AR-scale training cost, and still trailing after\n  post-training. There's also an honest failure mode noted: the sampler can fall into\n  repetitive reasoning loops that need inference-time mitigation.\n\nThe fair summary: bidirectional masked diffusion is now a **scalable paradigm at parity\nwith autoregressive base models** — no longer something you can wave off — but not yet a\nproven win on post-trained quality, and not yet a proven efficiency advantage. 
That's a\nmeaningful place to have gotten to, stated without the gloss.\n\n---\n\n*Built on [Improved Large Language Diffusion Models](https://arxiv.org/abs/2606.25331)\n(Nie et al., Renmin University & ByteDance Seed, 2026) and its predecessor\n[LLaDA](https://arxiv.org/abs/2502.09992). All numbers are from the paper's Tables 1–3;\nthe SFT-epoch curves are redrawn from its Figure 1.*\n","readingTimeMins":12,"url":"https://ai.thesatyajit.com/articles/illada-diffusion-language-model","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"The Kalman filter from first principles","description":"A noisy sensor and a model of how the world moves, fused optimally into one estimate that knows how unsure it is. A first-principles walk through the predict-update cycle, the gain, and the full matrix equations — with runnable Python — building up to the EKF, the iterated EKF, and the on-manifold error-state filter that powers modern LiDAR-inertial SLAM.","date":"2026-06-28","tags":["state-estimation","kalman-filter","slam","robotics","explainer"],"draft":false,"featured":false,"interest":4,"helpful":5,"kind":"articles","slug":"kalman-filter","body":"I spend a lot of time on odometry and SLAM, and underneath almost all of it is the same\nidea: you have a noisy sensor, and a model of how the thing you're tracking moves, and you\nwant to fuse them into a single best estimate that also tells you how much to trust itself.\nThat's the Kalman filter. It's the optimal recursive estimator for a linear system with\nGaussian noise, and once it clicks, you see it everywhere — GPS, IMUs, radar tracking, and\nthe LiDAR-inertial filters I'll build on in the next article.\n\nLet me derive it the way it actually makes sense: as fusing two Gaussians.\n\n## Two sources of information\n\nYou're tracking a state $\\mathbf{x}$ — say the position and velocity of something. You have\ntwo things:\n\n1. A **process model**: how the state evolves on its own. 
For constant velocity,\n   $p_{t} = p_{t-1} + v\\,\\Delta t$. You can *predict* forward, but the prediction drifts —\n   it accumulates uncertainty.\n2. A **measurement model**: a sensor that observes some function of the state, noisily.\n   A position sensor gives you $z = p + \\text{noise}$.\n\nNeither is enough alone. The prediction drifts; the sensor is noisy and jittery. The\nKalman filter combines them, and the key fact is that **combining two Gaussian estimates\nyields a third Gaussian that is sharper than either input**.\n\n<KalmanFuse />\n\nThat's the whole update step. If the prediction is $\\mathcal{N}(\\mu_1, \\sigma_1^2)$ and the\nmeasurement is $\\mathcal{N}(\\mu_2, \\sigma_2^2)$, their normalized product has\n\n$$\n\\mu = \\mu_1 + K(\\mu_2 - \\mu_1), \\qquad \\sigma^2 = (1-K)\\,\\sigma_1^2,\n\\qquad K = \\frac{\\sigma_1^2}{\\sigma_1^2 + \\sigma_2^2}.\n$$\n\n$K$ is the **Kalman gain** — the fraction of the way you move from the prediction toward\nthe measurement, set by which one is more certain. Trust the sensor ($\\sigma_2 \\to 0$) and\n$K \\to 1$; trust the prediction ($\\sigma_1 \\to 0$) and $K \\to 0$. Everything else is this,\ngeneralized to vectors.\n\n## The predict-update cycle\n\nIn general the state is a vector $\\mathbf{x}$ with covariance $\\mathbf{P}$, and the models\nare linear maps with Gaussian noise:\n\n$$\n\\mathbf{x}_t = \\mathbf{F}\\mathbf{x}_{t-1} + \\mathbf{B}\\mathbf{u}_t + \\mathbf{w},\n\\quad \\mathbf{w}\\sim\\mathcal{N}(0,\\mathbf{Q});\n\\qquad\n\\mathbf{z}_t = \\mathbf{H}\\mathbf{x}_t + \\mathbf{v}, \\quad \\mathbf{v}\\sim\\mathcal{N}(0,\\mathbf{R}).\n$$\n\n$\\mathbf{F}$ is the state-transition, $\\mathbf{H}$ maps state to measurement, $\\mathbf{Q}$\nis process noise (how much the model drifts), $\\mathbf{R}$ is measurement noise. 
The filter\nalternates two steps forever:\n\n**Predict** — push the state and its uncertainty forward through the model:\n\n$$\n\\hat{\\mathbf{x}} = \\mathbf{F}\\mathbf{x} + \\mathbf{B}\\mathbf{u}, \\qquad\n\\hat{\\mathbf{P}} = \\mathbf{F}\\mathbf{P}\\mathbf{F}^{\\!\\top} + \\mathbf{Q}.\n$$\n\n**Update** — correct with the measurement, by the matrix Kalman gain:\n\n$$\n\\mathbf{y} = \\mathbf{z} - \\mathbf{H}\\hat{\\mathbf{x}}\n\\quad(\\text{innovation}), \\qquad\n\\mathbf{S} = \\mathbf{H}\\hat{\\mathbf{P}}\\mathbf{H}^{\\!\\top} + \\mathbf{R}\n\\quad(\\text{innovation covariance}),\n$$\n$$\n\\mathbf{K} = \\hat{\\mathbf{P}}\\mathbf{H}^{\\!\\top}\\mathbf{S}^{-1}, \\qquad\n\\mathbf{x} = \\hat{\\mathbf{x}} + \\mathbf{K}\\mathbf{y}, \\qquad\n\\mathbf{P} = (\\mathbf{I} - \\mathbf{K}\\mathbf{H})\\hat{\\mathbf{P}}.\n$$\n\n<Diagram caption=\"The Kalman loop: predict pushes the estimate and its covariance forward through the motion model (uncertainty grows by Q); update pulls it back toward the measurement by the gain K (uncertainty shrinks). 
The two steps alternate for every timestep.\">\n  <svg viewBox=\"0 0 620 170\" role=\"img\" aria-label=\"The predict-update cycle: predict grows covariance via the motion model, update shrinks it via the measurement and Kalman gain.\" style={{ width: \"100%\", height: \"auto\" }}>\n    <rect x=\"40\" y=\"56\" width=\"170\" height=\"58\" rx=\"10\" fill=\"oklch(0.72 0.13 60)\" opacity=\"0.25\" stroke=\"var(--border)\" />\n    <text x=\"125\" y=\"80\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"12\" fill=\"var(--foreground)\">predict</text>\n    <text x=\"125\" y=\"98\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"var(--muted-foreground)\">x̂=Fx · P̂=FPFᵀ+Q</text>\n    <rect x=\"410\" y=\"56\" width=\"170\" height=\"58\" rx=\"10\" fill=\"oklch(0.72 0.13 195)\" opacity=\"0.25\" stroke=\"var(--border)\" />\n    <text x=\"495\" y=\"80\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"12\" fill=\"var(--foreground)\">update</text>\n    <text x=\"495\" y=\"98\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"var(--muted-foreground)\">x=x̂+K(z−Hx̂)</text>\n    {/* arrows */}\n    <path d=\"M 210 70 q 100 -34 200 0\" fill=\"none\" stroke=\"var(--muted-foreground)\" strokeWidth=\"1.3\" markerEnd=\"url(#ka)\" />\n    <text x=\"310\" y=\"36\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"var(--muted-foreground)\">prior x̂, P̂</text>\n    <path d=\"M 410 100 q -100 34 -200 0\" fill=\"none\" stroke=\"var(--muted-foreground)\" strokeWidth=\"1.3\" markerEnd=\"url(#ka)\" />\n    <text x=\"310\" y=\"146\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"var(--muted-foreground)\">posterior x, P → next step</text>\n    <text x=\"495\" y=\"34\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"var(--muted-foreground)\">measurement z</text>\n    <line x1=\"495\" y1=\"38\" x2=\"495\" y2=\"54\" stroke=\"var(--muted-foreground)\" strokeWidth=\"1.3\" 
markerEnd=\"url(#ka)\" />\n    <defs><marker id=\"ka\" markerWidth=\"7\" markerHeight=\"7\" refX=\"6\" refY=\"3.5\" orient=\"auto\"><path d=\"M0,0 L7,3.5 L0,7 Z\" fill=\"var(--muted-foreground)\" /></marker></defs>\n  </svg>\n</Diagram>\n\nWatch it run on a constant-velocity tracker. The state is $[\\text{position}, \\text{velocity}]$,\nthe sensor sees only position, and the filter has to infer velocity and smooth the noise.\nDrag $R$ and $Q$ to feel the trust trade-off the gain encodes:\n\n<KalmanTrack />\n\n## In code\n\nThe whole thing is a few lines of linear algebra. Here's the constant-velocity tracker\nabove, in NumPy — no framework, runnable as-is:\n\n```python\nimport numpy as np\n\ndt = 1.0\nF = np.array([[1, dt],      # constant-velocity transition\n              [0, 1]])\nH = np.array([[1.0, 0.0]])  # observe position only\nQ = 0.5 * np.array([[dt**3/3, dt**2/2],   # process noise (drift)\n                    [dt**2/2, dt]])\nR = np.array([[49.0]])      # measurement noise (sensor variance)\n\nx = np.array([[0.0], [0.0]])      # initial state\nP = np.eye(2) * 50.0              # initial uncertainty\n\ndef step(x, P, z):\n    # predict\n    x = F @ x\n    P = F @ P @ F.T + Q\n    # update\n    y = z - H @ x                 # innovation\n    S = H @ P @ H.T + R           # innovation covariance\n    K = P @ H.T @ np.linalg.inv(S)   # Kalman gain\n    x = x + K @ y\n    P = (np.eye(2) - K @ H) @ P\n    return x, P\n\n# synthetic demo data (an assumption for illustration, not real sensor output):\n# true velocity 2 per step, sensor std 7 so its variance matches R = 49\nrng = np.random.default_rng(0)\nmeasurements = 2.0 * np.arange(50) + rng.normal(0.0, 7.0, 50)\n\nfor z in measurements:            # stream of noisy position readings\n    x, P = step(x, P, np.array([[z]]))\n    # x[0] is the smoothed position estimate, P[0,0] its variance\n```\n\nTwo knobs do all the tuning. `R` says how noisy the sensor is; `Q` says how much you let\nthe model drift. Get their ratio right and the filter is optimal; get it wrong and it either\nlags reality (`Q` too small) or chases noise (`R` too small). 
In practice you start from the\nsensor's datasheet for `R` and tune `Q` until the innovation $\\mathbf{y}$ looks like white\nnoise.\n\n<Callout type=\"note\">\nFor numerical robustness in real systems, use the **Joseph form** of the covariance update,\n$\\mathbf{P} = (\\mathbf{I}-\\mathbf{KH})\\hat{\\mathbf{P}}(\\mathbf{I}-\\mathbf{KH})^{\\!\\top} +\n\\mathbf{KRK}^{\\!\\top}$, which stays symmetric positive-definite even with floating-point\nerror, where the compact $(\\mathbf{I}-\\mathbf{KH})\\hat{\\mathbf{P}}$ can drift and diverge.\n</Callout>\n\n## When the world isn't linear\n\nThe plain Kalman filter assumes $\\mathbf{F}$ and $\\mathbf{H}$ are linear. Robotics is full\nof rotations and projections that aren't. Three extensions matter, and the last one is the\nbridge to LiDAR-inertial SLAM:\n\n- **Extended KF (EKF).** The dynamics $f$ and measurement $h$ are nonlinear, so you\n  *linearize* them at the current estimate: use the Jacobians\n  $\\mathbf{F} = \\left.\\frac{\\partial f}{\\partial \\mathbf{x}}\\right|_{\\hat{\\mathbf{x}}}$ and\n  $\\mathbf{H} = \\left.\\frac{\\partial h}{\\partial \\mathbf{x}}\\right|_{\\hat{\\mathbf{x}}}$ in\n  place of the matrices, run the same equations. Cheap, and it works when the nonlinearity\n  is mild over one step.\n- **Iterated EKF (iEKF).** One linearization point can be bad if the prior is far from the\n  truth. So *relinearize*: after the update, recompute $\\mathbf{H}$ at the new estimate and\n  redo the update, iterating until it converges. Each iteration is a Gauss-Newton step on\n  the maximum-a-posteriori objective. This is what FAST-LIO uses — it matters because the\n  point-to-plane LiDAR residual is very nonlinear in the pose.\n- **Error-state / on-manifold KF.** You can't add a vector to a rotation\n  $\\mathbf{R}\\in SO(3)$ and stay on the manifold. 
So you track the state on the manifold but\n  the *error* (and its covariance) in the tangent space, fusing with the $\\boxplus/\\boxminus$\n  operators instead of $+/-$. This keeps rotations valid and the covariance minimal\n  (3 numbers for orientation, not 9).\n\nThat last combination — an **iterated, error-state Kalman filter on a manifold** — is\nexactly the engine inside FAST-LIO2, fusing a high-rate IMU prediction with thousands of raw\nLiDAR points per scan. The Kalman gain there gets one more clever rewrite to handle those\nthousands of measurements cheaply, which is where I'll pick up next.\n\n## The one-paragraph summary\n\nA Kalman filter holds a Gaussian belief over a state. **Predict** moves the belief through a\nmotion model and inflates its uncertainty; **update** multiplies it by the measurement's\nGaussian, which sharpens it and pulls the mean toward the data by the gain $\\mathbf{K}$ —\noptimally weighted by which source is more certain. Linearize for nonlinear systems (EKF),\nrelinearize for hard ones (iEKF), and track the error in the tangent space for rotations.\nThat's the whole toolkit, and it runs real robots.\n","readingTimeMins":7,"url":"https://ai.thesatyajit.com/articles/kalman-filter","signal":{"interest":4,"helpful":5,"score":9,"level":5,"label":"Essential"}},{"title":"MegaTrain: training a 120B model on one GPU by inverting where memory lives","description":"Most large-model training is bottlenecked by GPU memory. MegaTrain flips the layout — host RAM holds every parameter, gradient, and optimizer moment, and the GPU is a transient compute engine that streams one layer at a time. 
A walk through the memory-centric design, the double-buffered CUDA-stream pipeline that hides PCIe, and the honest scope of training 120B params at full precision on a single H200.","date":"2026-06-28","tags":["llm","training","systems","cuda","explainer"],"draft":false,"featured":false,"interest":4,"helpful":4,"kind":"articles","slug":"megatrain-single-gpu-training","body":"The thing that stops most people from training a big model isn't FLOPs — it's GPU\nmemory. A 70B model's parameters, gradients, and Adam moments don't fit in an 80GB card,\nso you reach for tensor/pipeline parallelism across a cluster you may not have, or for\noffloading systems that thrash and OOM as the model grows.\n\n**MegaTrain** takes the other path: invert the memory hierarchy. Host CPU memory becomes\nthe authoritative store for *all* persistent state — parameters, gradients, and optimizer\nmoments — and the GPU is demoted to a **transient compute engine** that holds only the\nlayer it's working on right now. On a single H200 with 1.5 TB of host RAM, that's enough\nto train models up to **120B parameters at full precision** — no quantization, no second\nGPU. It's a systems paper, and a good one, so let's read it as systems.\n\n## The inversion\n\nStart with the accounting. Mixed-precision Adam costs about **12 bytes per parameter**: 2\nfor the BF16 weight, 2 for the BF16 gradient, and 8 for the FP32 first and second moments.\nA GPU-centric trainer keeps all of that in HBM, so the moment $12 \\times \\text{params}$\nexceeds the card, you're done. Move that state to host memory and stream one layer at a\ntime, and the device footprint goes flat while the host's terabytes set the ceiling. 
Drag\nthe model size and watch where it OOMs:\n\n<MemoryPlacement />\n\nThat's the whole idea in one slider: a single H200's 141 GB HBM caps a GPU-centric run\naround 11–12B parameters, but with the persistent state in 1.5 TB of host RAM and only a\nstreamed layer resident, the same card reaches past 120B. The paper's own architecture\nmakes the split concrete — everything lives in the CPU domain; the GPU domain is scratch:\n\n<Figure\n  src=\"/articles/megatrain/fig2.png\"\n  alt=\"MegaTrain architecture: the CPU domain holds the parameter store, optimizer states, and CPU Adam in pinned memory; the GPU domain holds transient layer templates, a weight buffer, gradient slabs, and activation checkpoints, connected over PCIe/NVLink-C2C with double buffers.\"\n  caption=\"MegaTrain's architecture (paper, Figure 2): the CPU domain is the authoritative store — layer parameters, optimizer moments (m, v), CPU Adam — in pinned memory. The GPU domain is transient: stateless layer templates, a weight buffer, gradient slabs, and an activation-checkpoint workspace, fed over PCIe / NVLink-C2C through two alternating buffers.\"\n/>\n\n## The bottleneck, and the step\n\nInverting the layout creates one obvious problem: **PCIe bandwidth**. Every layer's\nweights now have to cross the bus into the GPU, and its gradients have to cross back —\n~128 GB/s on H200's PCIe Gen5, versus 4.8 TB/s of on-card HBM. If you did this naively the\nGPU would spend most of its life waiting on DMA.\n\nThe training step is three phases built to keep that traffic off the critical path:\n\n1. **Streaming forward** — stream each layer's weights in (H2D), compute, checkpoint\n   activations every $K$ layers, release the weights immediately.\n2. **Block-wise backward** — recompute activations from the nearest checkpoint, stream the\n   layer's weights back in, compute gradients in reverse, offload them (D2H), release.\n3. 
**Optimizer update** — run Adam **entirely on the CPU** (AVX-512), so the freshly\n   computed gradients and the moments never make another round trip to the device.\n\nBlock-wise recomputation bounds activation memory at $O(N \\cdot A_{\\max} \\cdot L/K)$ —\nindependent of total depth — which is what lets depth scale without the activation memory\nexploding.\n\n## The double-buffered pipeline\n\nThis is the optimization that makes it fast instead of merely possible. MegaTrain runs\n**three CUDA streams** concurrently — one for compute, one for H2D weight transfer, one\nfor D2H gradient evacuation — and double-buffers the weights so that while the compute\nstream works on layer $i$ out of one buffer, the next layer's weights prefetch into the\nother. Flip between naive serialization and the double-buffered schedule:\n\n<PipelineStreams />\n\nThe coordination is three events — *weights-ready*, *backward-done*, *buffer-free* — and\nthe payoff is a compute lane with no gaps: the GPU never stalls on PCIe. The ablation\nmakes the importance unambiguous. Remove double-buffering and throughput drops **31.3%**\n(266 → 183 TFLOPS at 14B) — by far the largest single contributor, more than the gradient\nslab pool (−3.3%) or tighter checkpointing. 
Here's the paper's own timeline of the overlap:\n\n<Figure\n  src=\"/articles/megatrain/fig3.png\"\n  alt=\"MegaTrain's end-to-end pipelined execution timeline across three CUDA streams, showing weight transfer, forward/backward compute, and gradient offload overlapping in a double-buffered schedule.\"\n  caption=\"The pipelined execution timeline (paper, Figure 3): weight transfer, compute, and gradient offload overlap across three CUDA streams in a double-buffered schedule, with the synchronization events that keep the buffers from colliding.\"\n/>\n\nA few more systems details earn their keep:\n\n- **Stateless layer templates.** A persistent autograd graph assumes weights stay\n  resident — incompatible with streaming and eviction. MegaTrain uses kernel templates\n  with no baked-in weight pointers and a `Bind` primitive that maps streamed buffer views\n  into the template's input slots, so device memory never exceeds a single layer.\n- **Layer-contiguous tiling.** BF16 weights, BF16 grads, and FP32 moments for a layer are\n  packed into one 4 KB-aligned block, so a layer moves as a single large-burst DMA that\n  saturates PCIe instead of many fragmented transfers.\n- **Pinned slab pool.** A fixed pool of pinned staging slabs (default 12), each sized to\n  the *largest layer* rather than the whole model, JIT-packed by a CPU worker — you get\n  pinned-memory transfer speed without pinning the entire model.\n\n## What it delivers\n\nThe headline is a capability, and it's the most convincing part: **120B parameters on one\nH200**, and a **512K-token context on a single GH200**. 
These are regimes where the\noffload baselines simply OOM — so the comparison is binary, which is the strongest kind.\n\n<Figure\n  src=\"/articles/megatrain/fig1.png\"\n  alt=\"Sustained TFLOPS versus model scale from 7B to 120B; MegaTrain stays high and stable while DeepSpeed ZeRO-3, ZeRO-Infinity, and PyTorch degrade and then fail to run.\"\n  caption=\"Throughput vs scale (paper, Figure 1): MegaTrain holds sustained throughput from 7B to 120B, where ZeRO-3 / ZeRO-Infinity / PyTorch degrade and then fall off entirely (they can't fit).\"\n/>\n\nWhere the baselines *can* run but are memory-starved — a 14B model on a PCIe A100 — the\nmargin is large:\n\n<BenchBars\n  title=\"14B on a single A100 PCIe — throughput (TFLOPS)\"\n  unit=\"\"\n  bars={[\n    { label: \"MegaTrain\", value: 122, highlight: true },\n    { label: \"Gemini\", value: 15 },\n    { label: \"ZeRO-3 Offload\", value: 10 },\n  ]}\n/>\n\nThat's 8.1× over Gemini and 12.2× over ZeRO-3 — and on a 48 GB A6000 or a 24 GB RTX 3090,\nMegaTrain trains 14B at all (56.8 and 30.2 TFLOPS) while ZeRO-3 OOMs. Crucially, accuracy\ndoesn't move — full precision means no drift:\n\n| MetaMathQA accuracy | MegaTrain | ZeRO-3 | ZeRO-Infinity | PyTorch |\n|---|---|---|---|---|\n| 7B | 88.99 | 88.93 | 88.97 | 88.91 |\n| 14B | 92.52 | 92.41 | — | — |\n\nOn depth, it's the only system that keeps going: ZeRO-3 OOMs by 132 layers and FSDP by 84,\nwhile MegaTrain runs the whole range and is **6.14× faster than FSDP at 56 layers**. On\nwidth, both baselines OOM at 4.0× while MegaTrain alone reaches 5.0×.\n\n## What I make of it\n\n- **The capability claims are real and well-supported.** Training 120B at full precision\n  on one GPU, 512K context on one GH200, and 14B on a 3090 — in each case the baseline\n  *cannot run*. 
\"Only system that works here\" is the most honest result a systems paper\n  can have, and the accuracy-parity table backs the full-precision claim.\n- **Double-buffering is the load-bearing idea, and the ablation proves it.** −31.3% without\n  it is a clean, isolated attribution. The rest — contiguous tiling, stateless templates,\n  CPU Adam — are the supporting cast that make the streaming viable.\n- **Read the throughput claims with the regime attached.** This is single-GPU only — no\n  multi-node scaling, and the metric is TFLOPS, not MFU, which a recomputation-heavy design\n  inflates (you do extra FLOPs re-deriving activations). At small or unconstrained sizes\n  the baselines are actually *faster* (FSDP 501 vs MegaTrain 406 TFLOPS at 1.0× width);\n  MegaTrain wins specifically once the model is large enough that offload systems are\n  thrashing or out of memory. The 1.84× and 6–12× numbers live near that memory cliff, not\n  everywhere.\n- **The best numbers lean on expensive hardware.** GH200's 900 GB/s NVLink-C2C and H200's\n  1.5 TB host RAM do a lot of work; on a plain PCIe Gen4 box the absolute throughput is far\n  lower (122 vs 266 TFLOPS). So it democratizes *what fits*, more than it democratizes\n  *speed*.\n\nThe honest summary: MegaTrain redefines the memory ceiling for single-GPU training —\nprovably, at full precision, with public code — and the double-buffered pipeline is a\ngenuinely nice piece of CUDA-stream engineering. Just don't read \"1.84× faster\" as a\ngeneral speedup; read it as \"it runs, fast enough, where nothing else runs at all.\"\n\n---\n\n*Built on [MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a\nSingle GPU](https://arxiv.org/abs/2604.05091) (Yuan et al., 2026;\n[code](https://github.com/DLYuanGod/MegaTrain)). 
All numbers are from the paper's tables\nand figures.*\n","readingTimeMins":8,"url":"https://ai.thesatyajit.com/articles/megatrain-single-gpu-training","signal":{"interest":4,"helpful":4,"score":8,"level":4,"label":"High"}},{"title":"DeepSeek DSpark: making speculative decoding draft better and verify smarter","description":"DSpark isn't a new DeepSeek model — it's a speculative-decoding module bolted onto DeepSeek-V4 that combines a semi-autoregressive drafter with a confidence-scheduled, load-aware verifier. A walk through how it lifts accepted length 16–31% over prior drafters and shifts the serving Pareto frontier, losslessly.","date":"2026-06-27","tags":["llm","inference-optimization","speculative-decoding","deepseek","explainer"],"draft":false,"featured":false,"interest":3,"helpful":3,"kind":"articles","slug":"deepseek-dspark","body":"The first thing to get straight: **DSpark is not a new model.** The Hugging Face\ncards say it plainly — `DeepSeek-V4-Flash-DSpark` \"is not a new model. It is the\nsame checkpoint with an additional speculative decoding module attached.\" DSpark is\nan *inference accelerator* for the existing DeepSeek-V4 weights, shipped alongside an\nopen training repo called [DeepSpec](https://github.com/deepseek-ai/DeepSpec). It\nmakes generation faster without changing a single output token.\n\nThat last clause is the whole reason to care. Speculative decoding is **lossless** by\nconstruction: a cheap draft model proposes a block of tokens, the full target model\nverifies the whole block in one forward pass, and the acceptance rule keeps exactly\nthe prefix the target would have produced anyway, plus one free \"bonus\" token. Output\nis bit-identical to plain decoding. You only buy latency.\n\nSo the entire game is the draft-and-verify loop, and DSpark improves both halves of\nit. 
Recall the per-token latency of speculative decoding:\n\n$$\nt_{\\text{token}} \\;\\approx\\; \\frac{t_{\\text{draft}} + t_{\\text{verify}}}{g}\n$$\n\nwhere $g$ is the **accepted length** — how many real tokens one expensive target\nforward bought. You win three ways: draft faster, draft better (raise $g$), or verify\nsmarter. Prior work chased the first two. DSpark is the first to seriously attack the\nthird.\n\n## What's being accelerated: DeepSeek-V4\n\nDSpark exists to serve [DeepSeek-V4](https://arxiv.org/abs/2606.19348) — two MoE models,\n**V4-Pro** (1.6T params, 49B activated) and **V4-Flash** (284B, 13B activated), both with\n**1M-token context**. V4 is already aggressively efficiency-engineered: a hybrid of\nCompressed Sparse Attention and Heavily Compressed Attention, Manifold-Constrained\nHyper-Connections (mHC), and the Muon optimizer, trained on >32T tokens.\n\n<Figure\n  src=\"/articles/deepseek-dspark/v4-fig1.png\"\n  alt=\"DeepSeek-V4 benchmark bars and a comparison of inference FLOPs and KV-cache size versus DeepSeek-V3.2, showing large reductions at million-token context.\"\n  caption=\"DeepSeek-V4 (paper, Figure 1): at 1M-token context, V4-Pro uses ~27% of the single-token inference FLOPs and ~10% of the KV cache of V3.2. The architecture is already squeezed hard — which is why the remaining latency win has to come from the decoding loop.\"\n/>\n\nThat context matters: when the model itself is this optimized, the decoding loop is where\nthe last big latency wins live, and speculative decoding is the lever. DSpark is the\ndrafter that lever needed.\n\n## The loop, one round at a time\n\nFirst, watch the loop run end to end. 
The drafter proposes a block (faint), the target\nverifies it in one forward, the matching prefix locks in and a mismatch is corrected for\nfree — and the meters track the mean accepted length $g$, which *is* the speedup over\nvanilla one-token-at-a-time decoding:\n\n<DecodeStream />\n\nNow the same thing in slow motion, one round at a time, so the accept/reject/bonus rule\nis unambiguous — step through and watch $g$ change round to round:\n\n<DraftVerify />\n\nThe mismatch case is the one to internalize. When the target disagrees at position\n$k$, everything after $k$ is thrown away — but because the target *also* tells you its\nown token at $k$, the round still nets $k+1$ accepted tokens. You never lose ground,\nand you never change the answer. The only question is how big $g$ gets on average.\n\n## Why long draft blocks were a trap\n\nEarly drafters were **autoregressive** — each draft token conditions on the previous\none (EAGLE-style). Quality is high, but drafting latency grows linearly with block\nsize, so you're forced into short, shallow blocks.\n\n**Parallel drafters** (DFlash, Medusa) flipped this: produce all draft logits in one\nforward pass, so drafting latency is nearly independent of block size. In principle\nyou can now draft long blocks cheaply. In practice two things break:\n\n- **Quality.** Each position is predicted independently, so it can't condition on the\n  tokens actually sampled elsewhere in the block. Given a context with two plausible\n  continuations — \"of course\" and \"no problem\" — a parallel drafter happily emits\n  \"of problem\" or \"no course\", because each slot marginalizes over all predecessors\n  instead of committing to one. Acceptance decays fast down the block.\n- **System efficiency.** Even when long blocks *are* good, indiscriminately verifying\n  all of them wastes target-model batch capacity. 
Under high concurrency that\n  capacity is the bottleneck, and verifying tokens that will be rejected is pure loss.\n\nDSpark's two components map one-to-one onto these two failures.\n\n## Component 1: semi-autoregressive drafting\n\nThe fix for the quality problem is to put a *little* sequentiality back, cheaply. A\nheavy **parallel backbone** (DeepSeek uses DFlash here) runs one forward pass over the\nwhole block and emits per-position hidden states $h_1,\\dots,h_\\gamma$ and base logits.\nThen a **lightweight sequential head** runs over those, injecting intra-block\ndependencies so position $j$ can finally see the token sampled at $j-1$.\n\n<Figure\n  src=\"/articles/deepseek-dspark/dspark-architecture.png\"\n  alt=\"DSpark architecture and decoding cycle: the target emits anchor D, a parallel block plus sequential block draft EFGH with confidence scores, a hardware-aware prefix scheduler keeps EFG and drops H, and the target verifies — accepting E and F, rejecting G, and emitting a corrected G*.\"\n  caption=\"The DSpark decoding cycle, from the paper. (1) The target emits anchor token D. (2) A parallel block drafts EFGH in one pass; a sequential block adds intra-block dependencies and a confidence head scores each position c₁–c₄; the hardware-aware scheduler keeps the confident prefix EFG and drops H. (3) The target verifies in parallel — E, F accepted, G rejected — and emits a corrected G* for free.\"\n/>\n\nThe released config keeps the head tiny: a draft network of **three MoE layers** with\nmHC and sliding-window attention of 128, max block size $W=5$. The sequential head\ncomes in two flavors:\n\n- **Markov head** — first-order, memoryless: position $j$ conditions only on the\n  immediately preceding sampled token. Cheap, and scales to large vocabularies. 
Once\n  position 1 samples \"of\", the Markov head boosts \"course\" and suppresses \"problem\"\n  at position 2 — exactly the collision the parallel drafter couldn't avoid.\n- **RNN head** — carries more history than the memoryless Markov variant, at a little\n  more cost.\n\nThe shipped drafter, \"DSpark-5\", uses the Markov head. It keeps almost all of the\nparallel drafter's speed — drafting latency is still nearly flat in $W$ — while\nrecovering the acceptance rate a fully-parallel block throws away.\n\n<Diagram caption=\"Semi-autoregressive drafting: a heavy parallel backbone produces all W hidden states and base logits in one pass; a tiny sequential head then threads first-order dependencies through them so each position conditions on the token actually sampled before it.\">\n  <svg viewBox=\"0 0 640 250\" role=\"img\" aria-label=\"A parallel backbone emits W base logits in one pass; a lightweight Markov head re-scores each position conditioned on the previously sampled token.\" style={{ width: \"100%\", height: \"auto\" }}>\n    {/* anchor */}\n    <rect x=\"16\" y=\"106\" width=\"70\" height=\"38\" rx=\"8\" fill=\"oklch(0.8 0.12 85)\" opacity=\"0.55\" stroke=\"var(--border)\" />\n    <text x=\"51\" y=\"129\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"12\" fill=\"var(--foreground)\">D</text>\n    <text x=\"51\" y=\"160\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"var(--muted-foreground)\">anchor</text>\n    <line x1=\"86\" y1=\"125\" x2=\"118\" y2=\"125\" stroke=\"var(--muted-foreground)\" strokeWidth=\"1.3\" />\n    {/* parallel backbone */}\n    <rect x=\"118\" y=\"92\" width=\"150\" height=\"66\" rx=\"8\" fill=\"oklch(0.72 0.1 150)\" opacity=\"0.3\" stroke=\"var(--border)\" />\n    <text x=\"193\" y=\"120\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"var(--foreground)\">parallel block</text>\n    <text x=\"193\" y=\"136\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" 
fill=\"var(--muted-foreground)\">1 pass · all W</text>\n    {/* base logits row */}\n    {[0,1,2,3].map((i) => (\n      <g key={i}>\n        <line x1={268} y1={125} x2={300} y2={70 + i*40} stroke=\"var(--border)\" strokeWidth=\"1\" />\n        <rect x={300} y={54 + i*40} width=\"80\" height=\"30\" rx=\"6\" fill=\"var(--background)\" stroke=\"var(--border)\" />\n        <text x={340} y={73 + i*40} textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"var(--muted-foreground)\">U{i+1}</text>\n      </g>\n    ))}\n    {/* sequential head */}\n    <rect x=\"410\" y=\"44\" width=\"60\" height=\"170\" rx=\"8\" fill=\"oklch(0.72 0.13 150)\" opacity=\"0.85\" />\n    <text x=\"440\" y=\"124\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"oklch(0.2 0 0)\" transform=\"rotate(90 440 124)\">Markov head</text>\n    {[0,1,2,3].map((i) => (\n      <line key={i} x1={380} y1={69 + i*40} x2={410} y2={69 + i*40} stroke=\"var(--muted-foreground)\" strokeWidth=\"1\" />\n    ))}\n    {/* dependency arrows between outputs */}\n    {[0,1,2].map((i) => (\n      <path key={i} d={`M ${510} ${74 + i*40} q 22 20 0 40`} fill=\"none\" stroke=\"var(--muted-foreground)\" strokeWidth=\"1.1\" strokeDasharray=\"3 3\" markerEnd=\"url(#ar)\" />\n    ))}\n    <defs>\n      <marker id=\"ar\" markerWidth=\"6\" markerHeight=\"6\" refX=\"5\" refY=\"3\" orient=\"auto\"><path d=\"M0,0 L6,3 L0,6 Z\" fill=\"var(--muted-foreground)\" /></marker>\n    </defs>\n    {/* final draft tokens */}\n    {[\"E\",\"F\",\"G\",\"H\"].map((t,i) => (\n      <g key={t}>\n        <line x1={470} y1={69 + i*40} x2={500} y2={69 + i*40} stroke=\"var(--border)\" strokeWidth=\"1\" />\n        <rect x={500} y={54 + i*40} width=\"48\" height=\"30\" rx=\"6\" fill=\"oklch(0.7 0.08 300)\" opacity=\"0.5\" stroke=\"var(--border)\" />\n        <text x={524} y={73 + i*40} textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"var(--foreground)\">{t}</text>\n      </g>\n    ))}\n    <text 
x=\"524\" y=\"234\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"var(--muted-foreground)\">draft block</text>\n  </svg>\n</Diagram>\n\nThe parallel backbone is [DFlash](https://arxiv.org/abs/2602.06036) (ICML 2026), which\nfuses the target model's context features into the draft model's KV cache so a single\nforward pass can predict the whole block:\n\n<Figure\n  src=\"/articles/deepseek-dspark/dflash-fig2.png\"\n  alt=\"DFlash inference design: target-model context features are fused into the draft model's KV cache, letting all draft positions be produced in one forward pass.\"\n  caption=\"The DFlash backbone DSpark builds on (DFlash paper, Figure 2): target context features feed the draft KV cache, so the parallel block drafts all positions at once. DSpark's only change is to feed the anchor and treat the block as semi-autoregressive rather than predicting masked positions independently.\"\n/>\n\nOn the offline metric that isolates draft quality — macro-average accepted length per\nround, target models Qwen3-4B/8B/14B at temperature 1.0 across the DeepSpec eval suite\n— DSpark's semi-autoregressive drafter beats both the autoregressive and the\nfully-parallel baselines:\n\n<BenchBars\n  title=\"Accepted length per round — gain over baseline drafters (%)\"\n  unit=\"%\"\n  bars={[\n    { label: \"vs EAGLE-3 (4B)\", value: 30.9, highlight: true },\n    { label: \"vs EAGLE-3 (8B)\", value: 26.7, highlight: true },\n    { label: \"vs EAGLE-3 (14B)\", value: 30.0, highlight: true },\n    { label: \"vs DFlash (4B)\", value: 16.3 },\n    { label: \"vs DFlash (8B)\", value: 18.4 },\n    { label: \"vs DFlash (14B)\", value: 18.3 },\n  ]}\n/>\n\nRoughly **+27–31% over the autoregressive EAGLE-3** and **+16–18% over the parallel\nDFlash** it's built on — the semi-autoregressive head recovers most of what pure\nparallelism gave up, without paying EAGLE-3's per-token drafting cost.\n\n## Component 2: confidence-scheduled, load-aware 
verification\n\nThis is the genuinely new lever. Bolt a **confidence head** onto the drafter, trained\nend-to-end and then *post-hoc calibrated* — the paper cares about calibration error\n(ECE), not just ranking, because the scores have to mean something. The head estimates\nper-position prefix-survival probabilities. Then a **hardware-aware scheduler** reads\nlive engine throughput and chooses, per request, how much of the draft block to bother\nverifying.\n\nThe intuition: verification consumes target-model batch capacity, which is the scarce\nresource under concurrency. Spending it on a low-confidence tail token the target will\nreject is pure waste. So trim the block to its confident head when the system is busy,\nand verify everything when it's idle.\n\n<ConfidenceScheduler />\n\nThere's a real systems subtlety underneath the slider. To avoid GPU pipeline stalls —\nyou'd need the *next* step's capacity estimate before the current step finishes — the\nscheduler approximates upcoming capacity using confidence-head outputs from **two\nsteps prior**, while still sorting candidate tokens by up-to-date cumulative\nconfidence. The two-steps-stale signal only sets the dynamic truncation length; the\nacceptance itself is always exact, so the lossless guarantee holds.\n\nCalibration — not just ranking — is the reason this works. The paper's reliability\ndiagram shows the raw confidence estimator already *discriminates* well (it ranks\nsurvivors above doomed tokens) but is poorly *calibrated* — a raw score of 0.8 doesn't\nmean an 80% survival chance. A scheduler that truncates on a probability threshold needs\nthe second property, not just the first, so DSpark calibrates the head post-hoc and\nmeasures ECE. 
Once it's calibrated, the threshold means what it says: as it tightens, the\nacceptance rate among verified tokens climbs from roughly 76.9% / 67.6% / 45.7% to about\n92.5% / 92.0% / 95.7% on Math / Code / Chat respectively — the scheduler is keeping the\ntokens that actually survive.\n\nDSpark also studied how deep to make the drafter and how long to draft. Deeper drafters\nhelp up to a point (the released config uses three MoE layers), and accepted length keeps\nrising with proposal length $W$ where a fully-parallel DFlash block would have decayed —\nwhich is the whole argument for the semi-autoregressive design, and why $W=5$ is a\nsensible default rather than a hard ceiling.\n\n## What it does in production\n\nDSpark-5 replaced the previous production setup (a static MTP-1 single-token drafter)\non DeepSeek's own V4 serving engines. MTP-1 was the incumbent precisely *because*\nnaively deploying a static multi-token drafter degrades aggregate throughput under\nhigh concurrency — the exact problem the scheduler exists to solve.\n\n<Figure\n  src=\"/articles/deepseek-dspark/dspark-pareto.png\"\n  alt=\"Throughput versus per-user TPS Pareto frontiers for DeepSeek-V4-Flash and V4-Pro, comparing MTP (blue) against DSpark (green), with annotated operating points showing +51% and +661% throughput and +60% to +85% TPS.\"\n  caption=\"The serving Pareto frontier (paper, Figure 7): aggregate throughput vs per-request speed (tok/s/user) under live traffic. DSpark (green) sits above and to the right of the MTP-1 baseline (blue) on both V4-Flash and V4-Pro — it extends the feasible interactivity frontier.\"\n/>\n\nThe honest reading of those annotations matters. At matched, practical throughput,\nDSpark accelerates per-user generation by **60–85% on V4-Flash** and **57–78% on\nV4-Pro**. The eye-popping \"+661% throughput\" point is a *specific operating regime* —\na strict 120 tok/s/user SLA where the single-token baseline is already pinned at its\noperational boundary. 
The paper itself flags it as evidence of \"extending the feasible\ninteractivity frontier,\" not a representative multiplicative speedup. Don't quote +661%\nas a generic number; quote the 57–85% per-user range.\n\n<Callout type=\"note\">\nThe gains concentrate where GPUs are *under*-utilized — low batch, strict latency,\nRL-style long-tail decoding. When the system is already compute-saturated, smarter\nverification has less slack to recover, and the benefit shrinks. That's the tradeoff:\nDSpark buys interactivity, and interactivity is worth most exactly when you have spare\ncompute to spend on it.\n</Callout>\n\n## Where DSpark sits\n\n| | Autoregressive (EAGLE-3) | Parallel (DFlash) | DSpark |\n|---|---|---|---|\n| Draft cost vs block size | grows linearly | ~flat | ~flat |\n| Intra-block dependency | full | none | first-order (Markov head) |\n| Acceptance decay | low | rapid | low |\n| Verification length | fixed | fixed | scheduled per request |\n| Lossless | yes | yes | yes |\n\nEAGLE-3 drafts well but slowly; DFlash drafts fast but loosely; DSpark keeps DFlash's\nparallel speed, threads just enough sequential dependency back in to fix acceptance,\nand then adds the verification scheduler nobody else had. It's built directly on\nDFlash (the parallel backbone) and DeepSeek-V4 (the target), and ships open under MIT.\n\n## What I make of it\n\n- **The framing is right.** Speculative decoding's latency formula has three levers,\n  and \"verify smarter\" was the neglected one. A calibrated confidence head plus a\n  load-aware scheduler is a clean, principled way to pull it — and because acceptance\n  stays exact, it costs zero quality.\n- **The semi-autoregressive head is the quiet win.** A first-order Markov head is\n  almost free and recovers most of the acceptance a parallel block throws away. 
That's\n  a better engineering trade than going back to a slow autoregressive drafter.\n- **Read the numbers carefully.** The +16–31% accepted-length gains are clean\n  apples-to-apples and should reproduce via DeepSpec. The production speedups are real\n  but regime-dependent; the headline ratio is a boundary artifact, not a uniform\n  multiplier. DSpark shifts the Pareto frontier — it doesn't move every point on it by\n  6×.\n\n---\n\n*Built on DeepSeek's [DSpark: Confidence-Scheduled Speculative Decoding with\nSemi-Autoregressive Generation](https://github.com/deepseek-ai/DeepSpec) (paper in the\nDeepSpec repo), the [DFlash](https://arxiv.org/abs/2602.06036) parallel drafter it\nextends (ICML 2026), and [DeepSeek-V4](https://arxiv.org/abs/2606.19348), the target\nit accelerates. Weights: [V4-Flash-DSpark](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-DSpark)\nand [V4-Pro-DSpark](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro-DSpark), MIT-licensed.*\n","readingTimeMins":13,"url":"https://ai.thesatyajit.com/articles/deepseek-dspark","signal":{"interest":3,"helpful":3,"score":6,"level":2,"label":"Solid"}},{"title":"Multi-token prediction: training a model to see further than one step","description":"Predict the next n tokens instead of just one, and you get two things: a better-trained model and a built-in draft model for ~3x faster inference. A first-principles walk through MTP — Meta's parallel heads, DeepSeek-V3's sequential modules, the predict-verify-accept lineage from Google Brain, and Google's 2026 Gemma 4 and Gemini Nano deployments.","date":"2026-06-27","tags":["llm","multi-token-prediction","inference-optimization","speculative-decoding","explainer"],"draft":false,"featured":false,"interest":4,"helpful":4,"kind":"articles","slug":"multi-token-prediction","body":"Standard language models are trained on a deceptively narrow task: given everything so\nfar, predict the *single* next token. 
The objective is one cross-entropy term per\nposition,\n\n$$\nL_{\\text{next}} \\;=\\; -\\sum_t \\log P_\\theta\\big(x_{t+1}\\mid x_{\\le t}\\big).\n$$\n\n**Multi-token prediction (MTP)** changes one thing: from each position, predict the\nnext $n$ tokens at once. That small change buys two unrelated-looking wins — a model\nthat *trains* better, and a model that *decodes* faster — and the second one is why\nit's now in shipping products from DeepSeek to Google.\n\nA provenance note up front, because the \"Google multi-token prediction\" framing gets\nthe credit wrong. The modern MTP training objective was defined by **Meta (FAIR)** in\n[Gloeckle et al., 2024](https://arxiv.org/abs/2404.19737). The *predict-several-then-\nverify* decoding idea it reuses traces to **Google Brain's** 2018 [blockwise parallel\ndecoding](https://arxiv.org/abs/1811.03115). **DeepSeek-V3** productionized a sequential\nvariant at pretraining scale. And **Google's** genuine 2026 contribution is applied:\nMTP-based speculative decoding in Gemma 4 and on-device Gemini Nano. I'll attribute as\nI go.\n\n## The objective: predict n futures\n\nKeep one **shared transformer trunk** that turns the context into a latent $z_t$. Attach\n$n$ output heads. The loss sums cross-entropy over the next $n$ positions:\n\n$$\nL_{\\text{MTP}} \\;=\\; -\\sum_t \\sum_{i=1}^{n} \\log P_\\theta\\big(x_{t+i}\\mid z_t\\big).\n$$\n\nFrom position $t$, head $i$ predicts $x_{t+i}$. Pick $n$ and a flavor and read the loss\noff directly:\n\n<MTPHeads />\n\nThe obvious worry is memory: materializing $n$ full vocabulary-sized logit tensors at\nonce is brutal. 
The fix is mundane and important — compute each head's forward/backward\n*sequentially* and accumulate gradients at the trunk, so peak memory stays flat in $n$.\nYou pay a little time, not a lot of VRAM.\n\n## Two flavors: parallel heads vs sequential modules\n\nThe flavor toggle above is the real architectural fork.\n\n- **Meta / Medusa — parallel independent heads.** All heads hang off the same trunk and\n  predict in parallel; head 2 does *not* see head 1's token. Cheap, and it composes with\n  a clean inference trick: at deployment you can **discard the extra heads** and recover\n  an ordinary next-token model with zero overhead, or **keep them** to self-speculate.\n\n<Figure\n  src=\"/articles/multi-token-prediction/meta-mtp-arch.png\"\n  alt=\"Meta's multi-token prediction architecture: a shared trunk feeds four parallel output heads, each predicting one of the next four tokens; at inference the next-token head is used for generation and the others for speculative speedup.\"\n  caption=\"Meta's MTP architecture (Gloeckle et al., 2024, Figure 1): a shared trunk with n parallel output heads sharing one unembedding. At training, all heads predict; at inference, the extra heads draft for self-speculative decoding.\"\n/>\n\n- **DeepSeek-V3 — sequential modules that keep the causal chain.** Module $k$ takes the\n  previous depth's hidden state, concatenates the embedding of the (already-known) token,\n  RMS-norms both, projects, and runs its own transformer block. 
So depth-2's prediction\n  is conditioned on depth-1's token — the drafts are internally coherent, at more cost\n  than parallel heads.\n\n$$\nh'^{\\,k}_i \\;=\\; M_k\\big[\\operatorname{RMSNorm}(h^{\\,k-1}_i)\\,;\\ \\operatorname{RMSNorm}(\\operatorname{Emb}(t_{i+k}))\\big]\n$$\n\n<Figure\n  src=\"/articles/multi-token-prediction/deepseek-mtp.png\"\n  alt=\"DeepSeek-V3's sequential multi-token prediction: the main model plus MTP modules each predict a further token, with shared embedding and output head, passing hidden states forward to preserve the causal chain.\"\n  caption=\"DeepSeek-V3's MTP (Figure 3): sequential MTP modules that keep the complete causal chain — each module conditions on the previous depth's prediction, unlike Meta's independent heads.\"\n/>\n\nThe trade is exactly what you'd guess: parallel heads are cheaper and discardable;\nsequential modules draft more coherent blocks because each step sees the last.\n\n## Win one: it trains a better model\n\nThe training objective is the whole reason for the quality gain. Slide the window: from\neach position the model predicts the next $n$ tokens, so $n$ loss terms fire where an\nordinary model gets one. Flip $n$ between 1 and 4 to feel the supervision get denser:\n\n<MTPTraining />\n\nPredicting further is a denser, more demanding signal, and at scale it produces a model\nthat's better even when you *throw the extra heads away*. 
Meta's 13B model, trained with\nMTP, solves materially more coding problems than the matched next-token model on the\nsame data and compute:\n\n<BenchBars\n  title=\"13B MTP vs matched next-token model — relative gain (%)\"\n  unit=\"%\"\n  bars={[\n    { label: \"MBPP (more solved)\", value: 17, highlight: true },\n    { label: \"HumanEval (more solved)\", value: 12, highlight: true },\n  ]}\n/>\n\n<Figure\n  src=\"/articles/multi-token-prediction/meta-mtp-scaling.png\"\n  alt=\"Scaling plots of MBPP and HumanEval performance across six model sizes from 300M to 13B, showing multi-token prediction overtaking next-token prediction as model size grows.\"\n  caption=\"Scaling (Gloeckle et al., Figure 3): the MTP advantage on code grows with model size — small models barely benefit, but by multi-billion scale MTP clearly overtakes next-token training.\"\n/>\n\nTwo caveats keep this honest. The gains **scale with model size** — small models barely\nbenefit, and very large $n$ erodes quality ($n=4$ is the sweet spot for ~7B on code).\nAnd whether the *quality* gain transfers beyond pretraining is genuinely contested:\n[\"Multi-Token Prediction Needs Registers\"](https://arxiv.org/abs/2505.10518) (MuToR,\nNeurIPS 2025) exists precisely because the benefit hasn't consistently generalized to\nfine-tuning without help. MTP is not a free quality lunch in every regime.\n\n## Win two: it decodes ~3x faster\n\nThe extra heads are a *built-in draft model*. They cheaply propose the next $n-1$ tokens\nin one pass; the model then verifies all of them in a single batched forward and accepts\nthe longest correct prefix — **self-speculative decoding**. 
This is lossless: rejected\ndrafts fall back to the true next-token distribution, so output is unchanged.\n\nThat verify-and-accept scheme is the part with Google's fingerprints — Stern, Shazeer &\nUszkoreit's 2018 **blockwise parallel decoding** at Google Brain is the ancestor:\npredict several future positions with auxiliary heads, then accept the longest correct\nprefix. MTP simply folds those auxiliary heads into the training objective.\n\nHow much you gain depends entirely on the **acceptance rate** — how often a drafted\ntoken matches what the target would have produced. Because acceptance compounds along\nthe block, the marginal value of the $i$-th drafted token decays, and the whole scheme\nhits a ceiling no matter how far you draft. That's why a *more coherent* drafter is worth\nmore than a *longer* one — and why DeepSeek's sequential modules (higher acceptance) and\nDSpark's semi-autoregressive head exist at all. Drag the acceptance rate and block size:\n\n<MTPSpeedup />\n\nThe speedups, across the lineage:\n\n<BenchBars\n  title=\"Self-speculative decoding speedup (×, lossless unless noted)\"\n  unit=\"×\"\n  bars={[\n    { label: \"Meta MTP, n=4 (code)\", value: 3.0, highlight: true },\n    { label: \"Meta, 8-byte model\", value: 6.4 },\n    { label: \"DeepSeek-V3 (D=1)\", value: 1.8 },\n    { label: \"Google Brain 2018 (lossless)\", value: 4.0 },\n  ]}\n/>\n\nDeepSeek-V3 is the cleanest production data point: with MTP depth $D=1$ (predict two\ntokens total), the **second token is accepted 85–90%** of the time, and repurposing the\nmodule for speculative decoding gives **~1.8× tokens/sec**. 
The training loss weight was\nannealed — $\\lambda = 0.3$ for the first 10T tokens, then $0.1$ for the remaining 4.8T —\na detail worth noting because MTP at pretraining is a *secondary* objective, not the\nmain one.\n\n## Google's actual 2026 role: applied MTP\n\nWhere Google genuinely shows up is deployment, not the objective:\n\n- **Gemma 4 MTP drafters** (April 2026): MTP-style draft heads for lossless speculative\n  decoding, with a runtime heuristic that adapts how many tokens to draft. Google's\n  release claims **up to ~3x faster inference, no quality loss**; treat that as a\n  vendor figure, since the official docs only say \"significant speedups\".\n- **Gemini Nano frozen MTP** (June 2026): a *frozen-backbone* MTP head for on-device\n  speculative decoding on Pixel. It correctly predicts ~2 extra tokens per pass, gives a\n  **50%+ speedup on Pixel 9** over a standalone drafter, lifts token acceptance ~55% on\n  structured text, and, the on-device kicker, **saves 130MB per instance** by sharing\n  the KV cache zero-copy instead of running a separate draft model.\n\nThe on-device framing is the interesting one: a separate draft model is a non-starter\nwhen you're counting megabytes on a phone, so folding the drafter into the main model as\na frozen head is exactly the right move.\n\n## The open questions\n\nMTP isn't a closed book — two recent threads are worth knowing because they bound where\nthe simple story breaks.\n\n- **Does the quality gain survive fine-tuning?** Meta's gains are a *pretraining*\n  phenomenon, and they don't reliably transfer when you only have a fine-tuning budget.\n  [\"Multi-Token Prediction Needs Registers\"](https://arxiv.org/abs/2505.10518) (MuToR,\n  NeurIPS 2025) addresses this by interleaving learnable **register tokens** into the\n  sequence, each responsible for predicting a future token — adding almost no parameters\n  and no architectural surgery, so MTP's benefit shows up in the fine-tuning regime where\n  plain MTP 
heads underdeliver.\n- **Can you extract more drafts from a model that already exists?** Apple's [\"Your LLM\n  Knows the Future\"](https://arxiv.org/abs/2507.11851) argues a standard model already\n  encodes multi-token information, and unlocks it with **masked-input MTP** plus a gated\n  LoRA and a learnable sampler — reporting roughly **5× on code/math** and **2.5× on\n  general chat**, lossless. The framing is telling: MTP capability may be latent in\n  next-token models, waiting for the right decoding head.\n\nBoth reinforce the same lesson the speedup curve shows: the value is in *acceptance and\ncoherence*, and the active research is about getting more of both without paying a full\nretrain.\n\n## Who did what\n\n| Work | Org | Contribution |\n|---|---|---|\n| Blockwise parallel decoding (2018) | Google Brain | predict-several-then-verify/accept — the decoding ancestor |\n| Better & Faster LLMs via MTP (2024) | Meta / FAIR | the canonical MTP training objective (n parallel heads) |\n| Medusa (2024) | academic | multiple decoding heads + tree attention (not Google) |\n| DeepSeek-V3 MTP (2024) | DeepSeek-AI | sequential MTP modules at pretraining; ~1.8× TPS |\n| MuToR — \"MTP needs registers\" (2025) | academic | register tokens so MTP helps in fine-tuning |\n| Gemma 4 / Gemini Nano MTP (2026) | Google | applied MTP speculative decoding, incl. on-device |\n\n## What I make of it\n\n- **One change, two payoffs.** Predicting $n$ futures is a denser training signal *and*\n  a free draft model. That two-for-one is why MTP spread so fast from a 2024 paper to\n  2026 phones.\n- **The flavors matter.** Parallel heads are cheap and discardable; sequential modules\n  draft coherent blocks. 
Pick by whether you care more about training overhead or draft\n  acceptance.\n- **Keep the credit straight.** Meta defined the objective, Google Brain seeded the\n  verify/accept decoding, DeepSeek productionized it, and Google's 2026 work is applied\n  speculative decoding — strongest exactly where a separate draft model can't fit, like\n  on-device.\n- **Mind the caveats.** Quality gains scale with size and don't automatically survive\n  fine-tuning; the headline speedups are real but partly vendor-reported. The part to\n  trust unconditionally is that the speed win is *lossless*: unchanged output is guaranteed\n  by the acceptance rule, not a benchmark, while the size of the speedup still depends on\n  the acceptance rate.\n\n---\n\n*Built on Meta's [Better & Faster Large Language Models via Multi-token\nPrediction](https://arxiv.org/abs/2404.19737), the [DeepSeek-V3 Technical\nReport](https://arxiv.org/abs/2412.19437) (§2.2), Google Brain's [Blockwise Parallel\nDecoding](https://arxiv.org/abs/1811.03115), [MuToR](https://arxiv.org/abs/2505.10518),\nand Google's 2026 [Gemini Nano frozen-MTP\nwork](https://research.google/blog/accelerating-gemini-nano-models-on-pixel-with-frozen-multi-token-prediction/).*\n","readingTimeMins":9,"url":"https://ai.thesatyajit.com/articles/multi-token-prediction","signal":{"interest":4,"helpful":4,"score":8,"level":4,"label":"High"}},{"title":"Nous Hermes and Mixture-of-Agents: when models confer before they answer","description":"Mixture-of-Agents stacks layers of LLMs that read each other's drafts and synthesize — and beats GPT-4 Omni on AlpacaEval using only open models. 
A first-principles walk through the MoA mechanism, why 'collaborativeness' works, and how Nous Research actually wired it onto the open-weight Hermes line via the Forge Reasoning API.","date":"2026-06-27","tags":["llm","multi-agent","mixture-of-agents","nous-research","open-weights","explainer"],"draft":false,"featured":false,"interest":3,"helpful":3,"kind":"articles","slug":"nous-hermes-moa","body":"\"Hermes MoA\" isn't one thing, so let me separate the threads before building anything,\nbecause the difference between them is the difference between a real result and a\nmarketing claim.\n\n- **Mixture-of-Agents (MoA)** is a real, well-cited technique from Together AI / Duke /\n  Stanford — [arXiv:2406.04692](https://arxiv.org/abs/2406.04692). It's the foundation\n  everyone means by \"MoA\".\n- **Nous Hermes** is a real open-weight model family. The current\n  [Hermes 4](https://arxiv.org/abs/2508.18255) (14B / 70B / 405B) is a single\n  hybrid-reasoning model — it does *not* itself use MoA.\n- The genuine bridge between them is Nous's **Forge Reasoning API** (Nov 2024), which\n  really did run MoA — plus Monte Carlo Tree Search and Chain of Code — on top of\n  Hermes 70B.\n- There is also a body of **2026 \"Hermes Agent MoA\" claims** that I'll get to at the\n  end, and explicitly flag as unverified.\n\nSo the honest construction is: here's the MoA mechanism, here's the open-weight Hermes\nline and why it's a natural host for it, and here's how Nous actually shipped the two\ntogether. Let's build it.\n\n## The core idea: proposers and aggregators\n\nA single LLM gets one shot at your prompt. MoA's bet is that several models, allowed to\n*read each other's drafts and synthesize*, beat any one of them — even when the\nindividual drafts are mediocre.\n\nThe structure is a stack of **L layers**, each with **n agents**. Agents play two\nroles:\n\n- **Proposers** generate candidate responses. 
Diversity matters more here than any\n  single proposer's quality — you want different mistakes, not the same answer four\n  times.\n- **Aggregators** take the candidates and synthesize a single, better response.\n\nThe data flow is the load-bearing part: every agent in layer $i$ receives **all\noutputs from layer $i-1$**, concatenated into an *Aggregate-and-Synthesize* prompt that\ntells the model to critically evaluate the candidates and fuse them. The final layer's\naggregator emits the answer. Watch one round play out — four diverse proposers, each\nright about a different piece, fused into an answer that beats all of them:\n\n<MoARoundtable />\n\nStack that into layers and the synthesized answer sharpens further. Add depth and watch\nthe quality climb:\n\n<MoANetwork />\n\n<Figure\n  src=\"/articles/nous-hermes-moa/moa-fig2.png\"\n  alt=\"The Mixture-of-Agents architecture: four layers of agents, each layer's proposers feeding all of their outputs into every agent of the next layer, culminating in a final aggregated answer.\"\n  caption=\"The MoA architecture (paper, Figure 2): L layers × n agents. Each agent reads all outputs of the previous layer through an Aggregate-and-Synthesize prompt; the final aggregator returns the answer. 
The paper's default is 3 layers of 6 proposers, with Qwen1.5-110B as the final aggregator.\"\n/>\n\n## Why conferring helps: collaborativeness\n\nThe empirical observation that motivates the whole thing: an LLM produces a *better*\nanswer when shown other models' responses — **even when those responses are\nindividually weaker than what it would have written alone.** The paper calls this\ncollaborativeness, and it's the reason MoA isn't just \"best-of-n with extra steps\".\n\n<Figure\n  src=\"/articles/nous-hermes-moa/moa-fig1.png\"\n  alt=\"Bar chart showing AlpacaEval 2.0 LC win rates increasing for several models when they are provided other models' responses as context, versus answering alone.\"\n  caption=\"Collaborativeness (paper, Figure 1): AlpacaEval 2.0 win rate rises across models when each is shown peers' answers as auxiliary context — the effect MoA is built to exploit.\"\n/>\n\nThe mechanism is concrete. The aggregator isn't voting; it's reading. A proposer that\nnailed the units, another that caught an edge case, a third that structured the\nexplanation — the aggregate-and-synthesize prompt lets one model keep the part each\nproposer got right and drop the rest.\n\n<Diagram caption=\"The Aggregate-and-Synthesize prompt: the layer-i aggregator receives the original query plus every layer-(i−1) proposer's full response as auxiliary context, and is instructed to critically fuse them into one improved answer — not to pick a winner.\">\n  <svg viewBox=\"0 0 640 230\" role=\"img\" aria-label=\"Multiple proposer responses plus the original query feed an aggregate-and-synthesize prompt, which the aggregator turns into one fused answer.\" style={{ width: \"100%\", height: \"auto\" }}>\n    {/* query */}\n    <rect x=\"14\" y=\"100\" width=\"96\" height=\"34\" rx=\"8\" fill=\"oklch(0.8 0.1 250)\" opacity=\"0.4\" stroke=\"var(--border)\" />\n    <text x=\"62\" y=\"121\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" 
fill=\"var(--foreground)\">query</text>\n    {/* proposers */}\n    {[\"proposer 1\",\"proposer 2\",\"proposer 3\"].map((p,i) => (\n      <g key={p}>\n        <rect x=\"14\" y={14 + i*62} width=\"120\" height=\"34\" rx=\"8\" fill=\"var(--background)\" stroke=\"var(--border)\" />\n        <text x=\"74\" y={35 + i*62} textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"var(--muted-foreground)\">{p}</text>\n        <line x1=\"134\" y1={31 + i*62} x2=\"250\" y2=\"115\" stroke=\"var(--muted-foreground)\" strokeWidth=\"1\" strokeDasharray=\"3 3\" />\n      </g>\n    ))}\n    <line x1=\"110\" y1=\"117\" x2=\"250\" y2=\"117\" stroke=\"var(--muted-foreground)\" strokeWidth=\"1\" strokeDasharray=\"3 3\" />\n    {/* aggregate prompt */}\n    <rect x=\"250\" y=\"78\" width=\"170\" height=\"78\" rx=\"10\" fill=\"oklch(0.72 0.13 150)\" opacity=\"0.25\" stroke=\"var(--border)\" />\n    <text x=\"335\" y=\"108\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"var(--foreground)\">aggregate &amp;</text>\n    <text x=\"335\" y=\"123\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"var(--foreground)\">synthesize</text>\n    <text x=\"335\" y=\"142\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"8\" fill=\"var(--muted-foreground)\">critically fuse, don't vote</text>\n    <line x1=\"420\" y1=\"117\" x2=\"468\" y2=\"117\" stroke=\"var(--muted-foreground)\" strokeWidth=\"1.3\" markerEnd=\"url(#aa)\" />\n    <defs><marker id=\"aa\" markerWidth=\"7\" markerHeight=\"7\" refX=\"6\" refY=\"3.5\" orient=\"auto\"><path d=\"M0,0 L7,3.5 L0,7 Z\" fill=\"var(--muted-foreground)\" /></marker></defs>\n    {/* aggregator */}\n    <rect x=\"468\" y=\"92\" width=\"158\" height=\"50\" rx=\"10\" fill=\"oklch(0.72 0.14 150)\" opacity=\"0.85\" />\n    <text x=\"547\" y=\"113\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"oklch(0.2 0 0)\">aggregator</text>\n    <text x=\"547\" y=\"129\" textAnchor=\"middle\" 
fontFamily=\"monospace\" fontSize=\"9\" fill=\"oklch(0.2 0 0)\">→ one answer</text>\n  </svg>\n</Diagram>\n\n## What MoA scores\n\nThe headline result is that a stack of **open-source** models, conferring, beats a\nsingle frontier model. On AlpacaEval 2.0 (length-controlled win rate):\n\n<BenchBars\n  title=\"AlpacaEval 2.0 — LC win rate (%)\"\n  unit=\"%\"\n  bars={[\n    { label: \"MoA w/ GPT-4o\", value: 65.7, highlight: true },\n    { label: \"MoA (open only)\", value: 65.1, highlight: true },\n    { label: \"GPT-4 Omni\", value: 57.5 },\n  ]}\n/>\n\nOpen-only MoA hits **65.1%** against GPT-4 Omni's **57.5%** — a +7.6-point margin with\nno closed model in the loop. On MT-Bench it scores **9.25** (9.40 with GPT-4o added)\nversus GPT-4 Omni's 9.19, and on FLASK it leads on robustness, correctness, factuality,\nand completeness against the strongest single proposer.\n\n<Figure\n  src=\"/articles/nous-hermes-moa/moa-fig6.png\"\n  alt=\"Performance-versus-cost Pareto frontier showing MoA configurations achieving higher win rate per dollar than single-model baselines.\"\n  caption=\"Cost-performance (paper, Figure 5a): MoA configurations sit on a better win-rate-per-dollar frontier — the gain isn't free (you pay for n proposers × L layers of calls), but it's competitive on cost, not just quality.\"\n/>\n\nThe cost is exactly what you'd expect: $n$ proposers across $L$ layers means many model\ncalls per answer, and latency stacks with depth. MoA buys quality with compute. Whether\nthat trade is worth it depends entirely on how much you value the marginal correctness.\n\n<MoACost />\n\n## Design choices that actually move the needle\n\nMoA has three knobs, and they don't all behave the way intuition says:\n\n- **Width (proposers) over depth (layers).** The bulk of the gain comes from having\n  several *diverse* proposers in the first layer; stacking more layers helps less and\n  costs latency linearly. 
The toy above saturates fast in $L$ for exactly this reason.\n- **Diversity beats raw strength.** Proposers that fail differently give the aggregator\n  more to work with than several copies of the strongest model. The paper's pool is\n  deliberately heterogeneous — Qwen, WizardLM, Llama-3, Mixtral, dbrx — not six clones.\n- **The aggregator is a real choice.** Not every strong model is a good *synthesizer*;\n  the aggregator has to read several candidate answers and fuse them faithfully rather\n  than just re-emit its own. The paper uses Qwen1.5-110B-Chat as the final aggregator,\n  and the role-suitability of a model as aggregator vs proposer is measured separately.\n\nYou can see the effect dimension-by-dimension on FLASK, which scores along twelve skill\naxes rather than one number:\n\n<Figure\n  src=\"/articles/nous-hermes-moa/moa-fig3.png\"\n  alt=\"FLASK evaluation across twelve skill dimensions, showing Mixture-of-Agents improving over the strongest single proposer on robustness, correctness, factuality, and completeness.\"\n  caption=\"FLASK (paper, Figure 3): MoA's gains aren't uniform — it pulls ahead most on robustness, correctness, factuality, and completeness, the dimensions where cross-checking multiple drafts helps most.\"\n/>\n\nThe shape of that result is the tell: MoA helps exactly where having several independent\nattempts to cross-check is valuable, and barely moves dimensions that a single competent\nmodel already nails.\n\n## Where Hermes comes in\n\nNous Research builds the **Hermes** line — open-weight models post-trained with a\ndeliberate *neutral alignment* philosophy: minimal gratuitous refusals, maximal user\nsteerability. Hermes 4 (70B and 405B on Llama-3.1 bases, 14B on a Qwen3 base) adds\n**hybrid reasoning** — a single checkpoint with a toggleable `<think>…</think>` block,\nso you get reasoning and instruct behavior from one model — plus strong function\ncalling and JSON-schema structured output. 
It was trained on ~60B tokens (~5M samples)\nbuilt with Nous's DataForge and the Atropos RL environment, rejection-sampled against\nroughly a thousand task-specific verifiers, on 192× B200 GPUs.\n\nOn capability it's competitive with the open frontier (Hermes 4 405B, reasoning mode):\n\n| Benchmark | Hermes 4 405B (reasoning) | Hermes 4 405B (non-reasoning) |\n|---|---|---|\n| MATH-500 | 96.3 | 73.8 |\n| AIME'24 | 81.9 | 11.4 |\n| AIME'25 | 78.1 | 10.6 |\n| GPQA Diamond | 70.5 | 39.4 |\n| LiveCodeBench v6 | 61.3 | 28.1 |\n| MMLU | 87.2 | 73.6 |\n\nBut the number that captures the *philosophy* is RefusalBench — Nous's own measure of\nhow often a model refuses across 32 categories of typically-refused requests (higher =\nfewer refusals, except for a few inverted safety categories scored the other way):\n\n<BenchBars\n  title=\"RefusalBench — higher means fewer refusals (avg of 5 runs)\"\n  unit=\"\"\n  bars={[\n    { label: \"Hermes 4 (reasoning)\", value: 57.1, highlight: true },\n    { label: \"Grok 4\", value: 51.3 },\n    { label: \"Hermes 4 (non-reasoning)\", value: 43.2 },\n    { label: \"DeepSeek V3\", value: 28.1 },\n    { label: \"Gemini 2.5 Pro\", value: 24.2 },\n    { label: \"GPT-4o\", value: 17.7 },\n    { label: \"Opus 4.1\", value: 15.4 },\n    { label: \"GPT-5\", value: 11.3 },\n  ]}\n/>\n\nThat steerability is what makes Hermes a natural MoA citizen. Open weights mean you can\nrun a whole proposer pool yourself; neutral alignment means the aggregator won't refuse\nto synthesize half its inputs. Hermes is built to be *driven*, which is exactly what a\nmulti-agent harness does to it.\n\n## The real bridge: Forge\n\nThe genuine \"Nous ran MoA on Hermes\" artifact is the **Forge Reasoning API** (beta,\nNov 2024). Forge combined three inference-time techniques on top of Hermes 70B:\nMixture-of-Agents, Monte Carlo Tree Search, and Chain of Code. 
The MoA piece is exactly\nthe mechanism above — \"models respond, confer, and synthesize new answers\" — applied to\na Hermes-centric pool. If you want a concrete instance of MoA on the Hermes line that\nactually shipped, Forge is it.\n\nThose three techniques compose cleanly because they attack\ndifferent failure modes:\n\n- **Mixture-of-Agents** — breadth. Several models propose and an aggregator synthesizes,\n  the mechanism above.\n- **Monte Carlo Tree Search** — depth. Instead of one greedy chain, explore a tree of\n  reasoning continuations and back up value estimates, spending more search on promising\n  branches. This is the \"think longer on hard problems\" axis.\n- **Chain of Code** — grounding. Offload the steps that are better *executed* than\n  *reasoned about* (arithmetic, string manipulation, logic) into code that actually\n  runs, so the model isn't bluffing its way through a calculation.\n\nBreadth, depth, and grounding are orthogonal, which is why bolting all three onto a\nfixed Hermes backbone bought more than any one alone.\n\nA practical, Nous-style MoA instantiation also collapses the textbook diagram into\nsomething cheap: a small **reference** model runs first *without* tool schemas (avoiding\nrefusals and saving tokens), its output is appended as private context, and the\n**aggregator** — the real Hermes agent — does the actual tool-calling loop with the\nreference draft in hand. One layer, two roles, most of the benefit. It's a reminder that\n\"MoA\" in production rarely looks like the 3×6 textbook diagram; it's whatever\nproposer/aggregator split pays for itself.\n\n<Callout type=\"warning\">\nThere is a wave of **June 2026 \"Hermes Agent MoA 2.0\"** content claiming MoA presets\nthat beat \"Claude Opus 4.8\" and \"GPT-5.5\" on an unpublished \"HermesBench\" (e.g. a\nquoted 0.8202 vs 0.7607/0.7412 for the individual models). 
I could not verify any of\nit: the cited models aren't confirmably released, the benchmark has no published\nleaderboard, and the supporting sources are a crypto-news post (which hedges with\n\"claiming\") and social posts. Treat the *mechanism* as faithful MoA, but treat the\n*numbers* as marketing-stage and unverified — not established fact.\n</Callout>\n\n## What I make of it\n\n- **The result is real and a little counterintuitive.** Open models that read each\n  other's drafts beat a single frontier model on AlpacaEval, and the lift comes from\n  collaborativeness — synthesis from diverse, even weaker, drafts. That's a genuine,\n  reproducible finding with public code.\n- **Hermes is the right host, not the inventor.** MoA is Together AI's; Hermes is\n  Nous's open-weight, neutral-alignment line; Forge is where Nous actually combined\n  them. Keep the attribution straight and the story is clean.\n- **The cost is the catch, as always.** $n \\times L$ model calls per answer and latency\n  that grows with depth. 
MoA is for when correctness is worth real compute — agentic\n  pipelines, hard reasoning — not for chat you need back in 200ms.\n- **Be skeptical of the 2026 leaderboard claims.** The mechanism is sound; the\n  benchmark numbers floating around are not yet something I'd cite.\n\n---\n\n*Built on Together AI's [Mixture-of-Agents Enhances Large Language Model\nCapabilities](https://arxiv.org/abs/2406.04692) (Wang et al., 2024;\n[code](https://github.com/togethercomputer/moa)), the [Hermes 4 Technical\nReport](https://arxiv.org/abs/2508.18255) (Nous Research, 2025), and Nous's [Forge\nReasoning API](https://nousresearch.com/introducing-the-forge-reasoning-api-beta-and-nous-chat-an-evolution-in-llm-inference).*\n","readingTimeMins":11,"url":"https://ai.thesatyajit.com/articles/nous-hermes-moa","signal":{"interest":3,"helpful":3,"score":6,"level":2,"label":"Solid"}},{"title":"Unconventional AI's Un-0: generating images with coupled oscillators","description":"Un-0 replaces neural-network layers and the diffusion schedule with a population of coupled Kuramoto oscillators — letting the physics of a dynamical system be the computation. A walk through the math, the generation pipeline, the FID numbers, and the honest gap between a working simulation and the unbuilt analog chip it's a stand-in for.","date":"2026-06-27","tags":["generative-models","neuromorphic","physics","image-generation","explainer"],"draft":false,"featured":false,"interest":5,"helpful":2,"kind":"articles","slug":"unconventional-un-0","body":"Every image model you know is built from the same parts: neural-network layers, a lot\nof matrix multiplies, and — for the generative step — either a diffusion schedule or an\nadversary. **Un-0** throws all of that out. 
Its computational core is a population of\n**coupled oscillators**, and the generative step is just *letting them settle*.\n\nThis is the first release from **Unconventional AI**, the company Naveen Rao (ex-Databricks\nAI head, founder of Nervana and MosaicML) started with Michael Carbin and Sara Achour,\non a $475M seed. The thesis is one sentence: **physics as a computational primitive.**\nInstead of simulating a dynamical system on a von Neumann machine, run the dynamical\nsystem directly in analog silicon, and let the chip's physics *be* the computation —\nchasing brain-like (~20 W) efficiency.\n\nUn-0 is explicitly the \"hello world\" of that program: a proof, in software simulation,\nthat the math produces real images. The chip doesn't exist yet. Keep that line bright;\nI'll come back to it.\n\nHere is the thing itself: a field of oscillators, each pulled toward its neighbours.\nRaise the coupling and watch incoherent speckle organise into travelling waves. That\nself-organisation — not a matrix multiply — is the computation Un-0 runs.\n\n<KuramotoField />\n\n## The primitive: Kuramoto oscillators\n\nAn oscillator is just a phase $\\theta_i \\in [0, 2\\pi)$ turning at its own natural\nfrequency $\\omega_i$. Couple a population of them and each one also feels a pull toward\nits neighbours' phases. That's the **Kuramoto model**:\n\n$$\n\\dot{\\theta}_i \\;=\\; \\omega_i \\;+\\; \\frac{K}{N}\\sum_{j=1}^{N} \\sin(\\theta_j - \\theta_i)\n$$\n\n$K$ is the coupling strength. The behaviour has a sharp phase transition. Below a\ncritical $K$, everyone runs at their own frequency and the phases scatter — incoherent.\nAbove it, the population spontaneously **synchronizes** into one travelling cluster. The\nstandard measure is the order parameter\n\n$$\nr\\,e^{i\\psi} \\;=\\; \\frac{1}{N}\\sum_{j=1}^{N} e^{i\\theta_j},\n$$\n\nwhere $r \\to 0$ is total incoherence and $r \\to 1$ is full lock. 
You've seen this in the\nphysical world — pendulum metronomes started out of step on a shared, freely-moving base\npull each other into perfect synchrony:\n\n<Video\n  src=\"/articles/unconventional-un-0/metronomes\"\n  poster=\"/articles/unconventional-un-0/metronomes-poster.png\"\n  alt=\"Several pendulum metronomes started at different phases on a common moving platform gradually synchronizing into lockstep.\"\n  caption=\"Coupled metronomes on a shared base (Unconventional AI): the same Kuramoto physics in hardware — independent oscillators, weakly coupled through the platform, spontaneously phase-lock.\"\n/>\n\nEach oscillator can be drawn as its own dial. Drag $K$ through the transition and watch\nthe hands go from smeared to locked, and $r$ climb:\n\n<PhaseDials />\n\nThe point for Un-0: that transition, and the rich partially-synchronized regime around\nit, is a programmable dynamical system. If you can *shape* the coupling and the\nfrequencies, the settled phase pattern can encode something — like an image.\n\n## The pipeline: condition, evolve, read out\n\nUn-0's main class is a `ConditionalImplicitKuramotoGenerator`. There's no denoising\nschedule, no adversary, no iterative refinement loop in the diffusion sense — just an\nODE you integrate forward once.\n\n<Diagram caption=\"Un-0's generation pipeline: random initial phases, conditioned by a separate class-oscillator array through one-directional coupling, are evolved through the Kuramoto ODE for a fixed time T (explicit Euler). The settled phases are read out via sin/cos and a small conventional decoder (≤15% of parameters) renders pixels. 
No diffusion schedule.\">\n  <svg viewBox=\"0 0 660 170\" role=\"img\" aria-label=\"Random phases plus class-conditioning oscillators feed a coupled Kuramoto ODE evolved for time T; the phase readout passes through a small decoder to produce an image.\" style={{ width: \"100%\", height: \"auto\" }}>\n    {/* random init */}\n    <rect x=\"12\" y=\"58\" width=\"92\" height=\"52\" rx=\"8\" fill=\"var(--background)\" stroke=\"var(--border)\" />\n    <text x=\"58\" y=\"80\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"var(--foreground)\">random</text>\n    <text x=\"58\" y=\"95\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"var(--foreground)\">phases θ(0)</text>\n    {/* class conditioning */}\n    <rect x=\"12\" y=\"6\" width=\"92\" height=\"40\" rx=\"8\" fill=\"oklch(0.8 0.1 250)\" opacity=\"0.4\" stroke=\"var(--border)\" />\n    <text x=\"58\" y=\"23\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"var(--foreground)\">class label</text>\n    <text x=\"58\" y=\"36\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"8\" fill=\"var(--muted-foreground)\">→ cond. 
oscillators</text>\n    <line x1=\"104\" y1=\"84\" x2=\"150\" y2=\"84\" stroke=\"var(--muted-foreground)\" strokeWidth=\"1.3\" />\n    <line x1=\"104\" y1=\"26\" x2=\"175\" y2=\"64\" stroke=\"var(--muted-foreground)\" strokeWidth=\"1\" strokeDasharray=\"3 3\" />\n    <text x=\"150\" y=\"46\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"8\" fill=\"var(--muted-foreground)\">1-way coupling</text>\n    {/* ODE evolve */}\n    <rect x=\"150\" y=\"52\" width=\"150\" height=\"64\" rx=\"8\" fill=\"oklch(0.72 0.13 150)\" opacity=\"0.28\" stroke=\"var(--border)\" />\n    <text x=\"225\" y=\"78\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"var(--foreground)\">Kuramoto ODE</text>\n    <text x=\"225\" y=\"93\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"var(--muted-foreground)\">Euler, evolve to T</text>\n    <text x=\"225\" y=\"106\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"8\" fill=\"var(--muted-foreground)\">learn K, ω</text>\n    <line x1=\"300\" y1=\"84\" x2=\"346\" y2=\"84\" stroke=\"var(--muted-foreground)\" strokeWidth=\"1.3\" />\n    {/* readout */}\n    <rect x=\"346\" y=\"58\" width=\"98\" height=\"52\" rx=\"8\" fill=\"var(--background)\" stroke=\"var(--border)\" />\n    <text x=\"395\" y=\"80\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"var(--foreground)\">readout</text>\n    <text x=\"395\" y=\"95\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"var(--muted-foreground)\">sin θ, cos θ</text>\n    <line x1=\"444\" y1=\"84\" x2=\"490\" y2=\"84\" stroke=\"var(--muted-foreground)\" strokeWidth=\"1.3\" />\n    {/* decoder */}\n    <rect x=\"490\" y=\"58\" width=\"92\" height=\"52\" rx=\"8\" fill=\"oklch(0.72 0.14 150)\" opacity=\"0.85\" />\n    <text x=\"536\" y=\"80\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"oklch(0.2 0 0)\">decoder</text>\n    <text x=\"536\" y=\"95\" textAnchor=\"middle\" 
fontFamily=\"monospace\" fontSize=\"8\" fill=\"oklch(0.2 0 0)\">≤15% params</text>\n    <line x1=\"582\" y1=\"84\" x2=\"620\" y2=\"84\" stroke=\"var(--muted-foreground)\" strokeWidth=\"1.3\" />\n    <rect x=\"620\" y=\"62\" width=\"30\" height=\"44\" rx=\"4\" fill=\"var(--muted)\" stroke=\"var(--border)\" />\n    <text x=\"635\" y=\"120\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"var(--muted-foreground)\">image</text>\n  </svg>\n</Diagram>\n\nStep by step:\n\n1. **Initialize** every oscillator's phase randomly.\n2. **Condition** on the class label through a *separate* oscillator array that couples\n   one-directionally into the main population — the label bends the dynamics without\n   being bent back.\n3. **Evolve** the coupled ODE forward for a fixed time $T$ with explicit Euler\n   integration. This is the entire \"generation\" — no schedule, no sampler loop.\n4. **Read out** the settled phases via $\\sin\\theta, \\cos\\theta$.\n5. **Decode** with a small conventional network — capped at ≤15% of total parameters —\n   to produce pixels.\n\nTraining learns the coupling matrix $K$, the natural frequencies $\\omega$, and the\ndecoder weights, via a \"drifting loss\" that uses a frozen DINOv2 feature extractor, with\nAdamW. So the *learning* is conventional gradient descent; what's unconventional is that\nthe thing being learned is the physics of a dynamical system, not a stack of attention\nlayers.\n\n## Training: differentiating through the dynamics\n\nThe subtle part is how you get gradients into an ODE. The forward pass *is* the Euler\nintegration of the Kuramoto system — a long chain of $\\sin$-coupled updates — and the\ndecoder reads the final state. Because every step is differentiable, you can\nbackpropagate through the unrolled trajectory and update $K$, $\\omega$, and the decoder\nend-to-end. 
The \"drifting loss\" supervises in a perceptual feature space (a frozen\nDINOv2 encoder) rather than raw pixels, which is what lets a tiny decoder — capped at\n≤15% of parameters — get away with so little work: the oscillator field is doing the\nheavy lifting, and the loss only has to match high-level features, not paint exact RGB.\n\nTwo things fall out of this design that are worth stating plainly:\n\n- **Capacity lives in the coupling.** Almost all the model's parameters are the coupling\n  matrix $K$ (it's $O(n^2)$ in the oscillator count $n$), which is exactly why FID\n  improves monotonically as you scale $n$ — you're literally adding interaction terms to\n  the dynamical system.\n- **The natural frequencies $\\omega$ are learned, not fixed.** The model gets to choose\n  each oscillator's intrinsic rhythm, so it can place itself wherever in the\n  synchronize/desynchronize landscape is most useful for a given class.\n\n## Oscillators vs diffusion\n\nIt's tempting to file Un-0 under \"another iterative generator,\" but the comparison is\ninstructive precisely because of how it *differs*:\n\n| | Diffusion model | Un-0 (oscillators) |\n|---|---|---|\n| Generative step | reverse a noising schedule, T denoising passes | integrate one coupled ODE to time T |\n| Core compute | matrix multiplies in NN layers | sin-coupled phase updates |\n| Conditioning | cross-attention / adaLN on the class | one-directional coupling from class oscillators |\n| Stochasticity | injected noise at each step | random initial phases only |\n| Why it might be efficient | — | the *physics* can run in analog silicon |\n\nA diffusion model spends its compute pushing tensors through learned layers many times.\nUn-0 spends its compute letting a physical system relax. 
On a GPU that's a wash at best\n— more on that below — but the bet is that the relaxation is free when the substrate is\nthe right kind of analog hardware.\n\n## Does it actually generate images?\n\nYes — and the honest version is \"yes, modestly\". Un-0 is class-conditional and low-res,\nand the company is upfront that it underperforms state-of-the-art generators like EDM.\nThe headline is FID **6.74** on ImageNet 64×64, which they frame as matching early\nconventional generators.\n\n<Figure\n  src=\"/articles/unconventional-un-0/mosaic.png\"\n  alt=\"A mosaic of small images generated by Un-0's coupled-oscillator model — recognizable class-conditional samples at low resolution.\"\n  caption=\"Samples from Un-0 (Unconventional AI). Class-conditional, low-resolution, generated by integrating a coupled-oscillator ODE and decoding the settled phases — no diffusion schedule involved.\"\n/>\n\nAnd here is the generation *happening* — a row of samples resolving out of the\noscillator field as the ODE integrates forward in time. 
There's no denoising loop; this\nis the population relaxing toward its conditioned attractor and the decoder reading it\nout frame by frame:\n\n<Video\n  src=\"/articles/unconventional-un-0/un0-samples\"\n  poster=\"/articles/unconventional-un-0/un0-samples-poster.png\"\n  alt=\"A row of Un-0 generated images sharpening from blur into recognizable class-conditional samples as the oscillator ODE integrates over time.\"\n  caption=\"Un-0 generation over integration time (Unconventional AI): each tile resolves from noise into a recognizable image as the coupled oscillators settle — the forward pass is the dynamics relaxing, not a sampler loop.\"\n/>\n\nFID scales the way you'd hope with oscillator count $n$ (more oscillators, lower FID):\n\n| Dataset | config | params | FID (↓) |\n|---|---|---|---|\n| CIFAR-10 32×32 | n1024 | 1.3M | ~11.0 |\n| CIFAR-10 32×32 | n2048 | 4.9M | ~9.3 |\n| CIFAR-10 32×32 | n4096 | 19.4M | ~8.8 |\n| ImageNet 64×64 | n6656 | 57M | ~8.4 |\n| ImageNet 64×64 | n10240 | 130M | ~8.0 |\n| ImageNet 64×64 | n16384 | 322M | **6.74** |\n\n<Figure\n  src=\"/articles/unconventional-un-0/imagenet64-pareto.png\"\n  alt=\"Parameter-count versus FID Pareto curve for Un-0 on ImageNet 64x64, FID dropping as oscillator count and parameters grow.\"\n  caption=\"Params-vs-FID frontier on ImageNet 64×64 (Unconventional AI): FID falls monotonically as the oscillator population grows, reaching 6.74 at n16384 / 322M params.\"\n/>\n\nThe same monotone scaling holds on CIFAR-10, where even a 1.3M-parameter field already\nreaches a usable FID:\n\n<Figure\n  src=\"/articles/unconventional-un-0/cifar10-pareto.png\"\n  alt=\"Parameter-count versus FID Pareto curve for Un-0 on CIFAR-10, FID dropping from about 11 to about 8.8 as the oscillator population grows.\"\n  caption=\"Params-vs-FID frontier on CIFAR-10 32×32 (Unconventional AI): from ~11.0 at 1.3M params (n1024) down to ~8.8 at 19.4M (n4096).\"\n/>\n\nNote the FID values wobble slightly between the blog and the 
repo README (e.g. 8.41 vs\n8.36) — these are self-reported, not third-party-reproduced, so treat them as\napproximate. The compute is non-trivial too: the largest ImageNet run is reported around\n640 B200-GPU-hours — *simulating* the oscillators on conventional GPUs is the expensive\npart, which is exactly the cost the proposed chip is meant to erase.\n\n## The hardware bet\n\nThis is where the whole thing either pays off or doesn't. Today's accelerators are von\nNeumann machines: weights live in memory, you stream them to compute units, multiply,\nand write back. That shuffle — not the arithmetic — is where most of the energy goes.\n\nUnconventional AI's proposal is to build the oscillators in physical silicon (CMOS ring\noscillators are the usual candidate), so that the coupled dynamics *happen* rather than\nbeing computed. There's no weight streaming because the coupling is the wiring; the\nsystem's settling to a synchronized state is the forward pass. The aspiration is\nbrain-like efficiency — order-of-tens-of-watts, against data-center GPUs — and the\n\"1000×\" figure is a projection of what that substrate could do relative to simulating\nthe same ODE on a GPU.\n\nIt's a real idea with real lineage — analog and neuromorphic computing has chased this\nfor decades — and the team (Naveen Rao, plus Michael Carbin from MIT and Sara Achour\nfrom Stanford on the hardware/compiler side) is credible. But it is, today, a\n*proposal*. The repo says chip schematics are \"coming soon\".\n\n## The part to keep straight\n\n<Callout type=\"warning\">\nUn-0 runs on a **software simulation of hardware that does not yet exist.** No oscillator\nchip has been built, and no chip schematics had been released at launch. The headline\n**\"1000× lower energy\" is a projection by the founders, not a measured result** — there\nis no analog silicon to measure. Press lines claiming Un-0 \"matches Stable Diffusion\"\noverstate what are class-conditional, 32×32/64×64 benchmarks behind SOTA. 
And there is\n**no peer-reviewed or arXiv paper** — the release is a company technical blog plus an\nMIT-licensed [GitHub repo](https://github.com/unconv-ai/Un-0). (Two real Kuramoto arXiv\npapers surface in searches — \"Artificial Kuramoto Oscillatory Neurons\" and \"Kuramoto\nOrientation Diffusion Models\" — but they are *unaffiliated* prior art, not Un-0.)\n</Callout>\n\nSo separate two claims cleanly. **Demonstrated today, in simulation:** a coupled-oscillator\nODE, conditioned and evolved once, decodes into recognizable class-conditional images at\nFID 6.74 (ImageNet-64). **Proposed, not yet built:** the analog oscillator chip whose\nphysics would run that ODE for ~1000× less energy. The first is a real, open, checkable\nresult. The second is a hardware vision — credible given the team and funding, but\nunbuilt and unverified.\n\n## What I make of it\n\n- **The idea is genuinely different, not a reskin.** Replacing layers + a diffusion\n  schedule with \"set up a dynamical system and let it settle\" is a real departure. The\n  generative step is an ODE integration, and the learned object is the physics itself.\n- **The demo is honest and modest.** FID 6.74 on ImageNet-64 is a proof-of-concept that\n  the math closes, deliberately framed as a \"hello world\", explicitly behind SOTA. That\n  honesty is worth more than a cherry-picked headline.\n- **The whole bet lives in the hardware that isn't here.** On a GPU, simulating\n  oscillators is *slower and costlier* than just running a normal generator — the entire\n  payoff is conditional on the analog chip materializing and delivering the projected\n  efficiency. 
Until silicon exists, \"1000×\" is a hypothesis, and the right way to read\n  Un-0 is as a credible research demonstration of physics-based generative computing —\n  not a shipping efficiency win.\n\n---\n\n*Built on Unconventional AI's [Un-0 technical\nwriteup](https://unconv.ai/blog/introducing-un-0-generating-images-with-coupled-oscillators/)\nand the MIT-licensed [Un-0 code](https://github.com/unconv-ai/Un-0). Benchmarks are\nself-reported; the analog-hardware efficiency claim is a founder projection, not a\nmeasured result.*\n","readingTimeMins":12,"url":"https://ai.thesatyajit.com/articles/unconventional-un-0","signal":{"interest":5,"helpful":2,"score":7,"level":3,"label":"Notable"}},{"title":"GLM 5.2: long-horizon coding at a million tokens","description":"Z.ai's GLM 5.2 is a 744B/40B-active open-weights MoE with a real 1M-token context, built for long-horizon agentic coding. How IndexShare makes that context cheap, what changed in training, and where it lands against the frontier — with the benchmarks.","date":"2026-06-23","tags":["llm","glm","long-context","agentic-coding","explainer"],"draft":false,"featured":false,"interest":3,"helpful":3,"kind":"articles","slug":"glm-5-2","body":"GLM 5.2, from Z.ai (Zhipu AI), is the flagship of the GLM-5 line: a 744-billion-\nparameter mixture-of-experts with 40B active per token, MIT-licensed open weights,\nand — the headline — a genuine **1-million-token context**. It is tuned for one thing\nin particular: long-horizon agentic coding, the sessions that run hundreds of rounds\nand thousands of tool calls without losing the thread.\n\nThere is no standalone GLM 5.2 paper. It builds on the GLM-5 technical report\n([arXiv 2602.15763](https://arxiv.org/abs/2602.15763)) and, for the context trick at\nits center, a method paper — IndexCache / IndexShare\n([arXiv 2603.12201](https://arxiv.org/abs/2603.12201)). 
This pulls from both, plus the\n[release blog](https://z.ai/blog/glm-5.2).\n\n## What changed from 5.1\n\nGLM-5 → 5.1 → 5.2 share the same 744B/40B backbone. What 5.2 adds:\n\n- a real **1M-token context**, up from 200K;\n- **IndexShare**, the architecture change that makes that context affordable;\n- a shift to **critic-based PPO** for very long RL rollouts;\n- faster speculative decoding (**+20% acceptance length**);\n- a **thinking-effort** dial (High / Max).\n\nThe first two are the load-bearing pair: the long context, and the trick that keeps\nit cheap.\n\n## The model\n\n744B total parameters, 40B active per token — a mixture-of-experts on an 80-layer,\n256-expert backbone. Attention is **DeepSeek Sparse Attention (DSA)**: Multi-head\nLatent Attention plus a lightweight *indexer* that, for each query, selects the\ntop-$k$ tokens worth attending to instead of the whole sequence. That sparsity is\nwhat makes a million-token context tractable at all.\n\n<Figure\n  src=\"/articles/glm-5-2/glm52-architecture-1m.png\"\n  alt=\"GLM 5.2 architecture for 1M context — DeepSeek Sparse Attention with the IndexShare layout.\"\n  caption=\"GLM 5.2's architecture for 1M context: sparse attention with a shared indexer (from the Z.ai release).\"\n/>\n\n## IndexShare: making 1M context cheap\n\nDSA has a catch. The indexer runs at *every layer*, and as the context grows toward\n1M tokens, that per-query top-$k$ search becomes the dominant cost. The IndexCache\npaper's observation is the whole insight: **adjacent DSA layers select almost the same\ntokens** — 70–100% of their top-$k$ overlap.\n\n<Figure\n  src=\"/articles/glm-5-2/indexcache-fig4-overlap-heatmap.png\"\n  alt=\"Heatmap of top-k token-selection overlap between every pair of layers, mostly 70-100%.\"\n  caption=\"Pairwise overlap of each layer's selected tokens. 
Neighbouring layers pick nearly identical sets — so recomputing the indexer for each is wasted work.\"\n/>\n\nSo compute the indexer once per group of layers and reuse its selection for the rest.\nGLM 5.2 shares one indexer across every 4 layers — skipping it in 3 of every 4:\n\n<IndexShare />\n\nIf the indexer's cost per layer scales with selecting top-$k$ over $L$ tokens, then\nsharing it across a group of $g$ layers amortizes that cost to $O(L/g)$ per layer.\nWith $g = 4$ and the rest of each layer unchanged, GLM 5.2 reports **2.9× lower\nper-token FLOPs at a 1M-token context**, with quality essentially intact.\n\n<Figure\n  src=\"/articles/glm-5-2/indexcache-fig2-architecture.png\"\n  alt=\"IndexCache inference loop: F-layers compute and cache indices, S-layers reuse them.\"\n  caption=\"The mechanism: an F-layer computes the indices and caches them; the following S-layers reuse the cache, skipping the indexer entirely.\"\n/>\n\nThe honest tradeoff: push reuse too far — share across 8 layers instead of 4 — and\nlong-context fidelity starts to degrade. One indexer per four layers is the sweet\nspot the paper settles on.\n\n## Faster decoding: MTP and KVShare\n\nGLM 5.2 also sharpens its multi-token-prediction layer (speculative decoding). With\nIndexShare, KVShare, and end-to-end training, the average **acceptance length rises\n~20% — from 4.56 to 5.47 tokens** per verification pass. More accepted tokens per\npass means faster generation, which matters most when you are streaming long agent\ntraces.\n\n<Figure\n  src=\"/articles/glm-5-2/glm52-mtp-indexshare-kvshare.png\"\n  alt=\"Two-step MTP inference with IndexShare and KVShare keeping train/infer KV consistent.\"\n  caption=\"Speculative decoding with IndexShare + KVShare — keeping the draft and verify passes consistent.\"\n/>\n\n## Training for the long horizon\n\nPretraining scaled to **28.5T tokens** (up from GLM-4.5's 23T). But the interesting\nchange in 5.2 is the agentic post-training. 
It moves from group-relative RL to a\n**critic-based PPO** that estimates token-level advantages from individual rollouts —\nwhich accommodates *trajectory compaction* without capping how long a trace can get.\nThat is exactly what you need when a single agent run is thousands of tool calls long\nand won't fit in one rollout.\n\nIt also adds an **anti-reward-hacking module**: a rule-based filter first catches\nlikely hacks (tuned for recall), then an LLM judge checks intent; on a detected hack\nthe system blocks the call and returns dummy information so the rollout continues\ninstead of being thrown away. All of it runs on Zhipu's open asynchronous RL\nframework, **slime**.\n\n## Benchmarks\n\nThe headline result: GLM 5.2 is the **strongest open-weights model on standard and\nlong-horizon coding**, closing much of the gap to Claude Opus 4.8 and GPT-5.5.\n\n<Figure\n  src=\"/articles/glm-5-2/glm52-coding-bench.png\"\n  alt=\"GLM 5.2 standard coding benchmark chart vs competitors.\"\n  caption=\"Standard coding benchmarks — GLM 5.2 as the strongest open model (Z.ai).\"\n/>\n\n<BenchBars\n  title=\"SWE-Bench Pro (%)\"\n  unit=\"\"\n  bars={[\n    { label: \"Claude Opus 4.8\", value: 69.2 },\n    { label: \"GLM 5.2\", value: 62.1, highlight: true },\n    { label: \"Qwen3.7-Max\", value: 60.6 },\n    { label: \"GPT-5.5\", value: 58.6 },\n    { label: \"GLM 5.1\", value: 58.4 },\n    { label: \"DeepSeek-V4-Pro\", value: 55.4 },\n  ]}\n/>\n\nWhere it stands out most is *long-horizon* coding — runs that have to stay coherent\nover many rounds — where it nearly catches Opus 4.8 and leaves the rest behind:\n\n<Figure\n  src=\"/articles/glm-5-2/glm52-longhorizon-bench.png\"\n  alt=\"Long-horizon coding benchmarks: FrontierSWE, PostTrainBench, SWE-Marathon.\"\n  caption=\"Long-horizon benchmarks (FrontierSWE, PostTrainBench, SWE-Marathon) — the gap to the frontier is small.\"\n/>\n\n<BenchBars\n  title=\"FrontierSWE — long-horizon dominance (%)\"\n  unit=\"\"\n  bars={[\n    { 
label: \"Claude Opus 4.8\", value: 75.1 },\n    { label: \"GLM 5.2\", value: 74.4, highlight: true },\n    { label: \"GPT-5.5\", value: 72.6 },\n    { label: \"Gemini 3.1 Pro\", value: 39.6 },\n    { label: \"GLM 5.1\", value: 30.5 },\n  ]}\n/>\n\nReasoning is strong — a near-perfect AIME — though it trails the very top closed\nmodels on the hardest knowledge benchmarks (GPQA, HLE):\n\n<BenchBars\n  title=\"AIME 2026 (%)\"\n  unit=\"\"\n  bars={[\n    { label: \"GLM 5.2\", value: 99.2, highlight: true },\n    { label: \"GPT-5.5\", value: 98.3 },\n    { label: \"Gemini 3.1 Pro\", value: 98.2 },\n    { label: \"Claude Opus 4.8\", value: 95.7 },\n    { label: \"GLM 5.1\", value: 95.3 },\n  ]}\n/>\n\n## Thinking effort, and what 1M costs to serve\n\nGLM 5.2 exposes two reasoning-effort levels — `high` for everyday speed and `max` for\nhard multi-step coding — and Z.ai positions its capability between Claude Opus 4.7 and\n4.8 at similar token spend.\n\n<Figure\n  src=\"/articles/glm-5-2/glm52-effort-tokenbudget.png\"\n  alt=\"Agentic coding performance vs token budget at High and Max effort levels.\"\n  caption=\"Effort vs token budget — Max trades more tokens for more capability on hard tasks.\"\n/>\n\nThe 1M context is not free to serve. 
The bottleneck moves from raw compute to\n**KV-cache capacity, long-context kernels, and CPU-side overhead**; the throughput\nadvantage grows with context length, but you need 8×H100-class hardware and ~1.5 TB\nfor the weights, and the API meters at 3× during peak hours.\n\n<Figure\n  src=\"/articles/glm-5-2/glm52-1m-throughput.png\"\n  alt=\"Serving throughput vs context length — GLM 5.2's advantage grows as context grows.\"\n  caption=\"The IndexShare payoff at serving time: the throughput edge widens as context approaches 1M tokens.\"\n/>\n\n## What I make of it\n\n- **The genuinely new bit is IndexShare** — a clean, well-motivated systems trick\n  (reuse what's nearly identical instead of recomputing it), with a paper that shows\n  *why* it's almost lossless. That's what turns \"1M context\" from a spec-sheet number\n  into something you can actually serve.\n- **It's the strongest open-weights model for long-horizon agentic coding**, and it's\n  MIT-licensed. That combination matters more than the benchmark deltas — you can run\n  and fine-tune it yourself.\n- **It still trails the best closed frontier models** on most hard coding and\n  reasoning axes (SWE-Bench Pro 62.1 vs Opus 4.8's 69.2), and it is heavy to\n  self-host. 
The bet was never \"beat Opus 4.8 everywhere\" — it's \"match the frontier on\n  long-horizon work, in the open, at a million tokens.\" On that, it largely delivers.\n\n---\n\n*Sources: the [GLM 5.2 release blog](https://z.ai/blog/glm-5.2), the GLM-5 technical\nreport ([arXiv 2602.15763](https://arxiv.org/abs/2602.15763)), and the IndexCache\nmethod paper behind IndexShare ([arXiv 2603.12201](https://arxiv.org/abs/2603.12201)).\nBenchmark figures are from Z.ai; numbers quoted as reported.*\n","readingTimeMins":7,"url":"https://ai.thesatyajit.com/articles/glm-5-2","signal":{"interest":3,"helpful":3,"score":6,"level":2,"label":"Solid"}},{"title":"Sakana Fugu: a multi-agent system as a model","description":"Sakana AI turned LLM orchestration into a single model. A walk through the two ICLR 2026 papers behind Fugu — TRINITY, an evolved sub-20K-parameter coordinator, and the Conductor, a 7B reinforcement-learned orchestrator — and how routing a pool of frontier models beats any one of them.","date":"2026-06-23","tags":["llm","multi-agent","orchestration","reinforcement-learning","explainer"],"draft":false,"featured":false,"interest":4,"helpful":3,"kind":"articles","slug":"sakana-fugu","body":"No single LLM wins everywhere. One model leads on competition math, another on\nagentic coding, a third on multilingual work, and open models win on cost. The\nusual response is to pick one and absorb its weak spots. Sakana AI's bet is the\nother one: don't pick a model — *orchestrate a pool of them*, and make the\norchestration itself the model.\n\nThat product is **Sakana Fugu** (and a heavier tier, Fugu Ultra), shipped behind a\nsingle API. Underneath are two ICLR 2026 papers that attack the same problem from\nopposite ends: [TRINITY](https://arxiv.org/abs/2512.04695) *evolves* a tiny\ncoordinator over frozen models, and the\n[Conductor](https://arxiv.org/abs/2512.04388) *reinforcement-learns* a 7B model to\nwrite orchestration plans in natural language. 
This is a walk through both, and what\nthey add up to.\n\n## A multi-agent system as a model\n\nFugu's framing is the whole pitch: one OpenAI-compatible endpoint. You send a\nrequest to `model: fugu`; behind it a learned coordinator assembles a team from a\npool of frontier and open models, runs them over several turns, and returns one\nanswer. You never see the routing.\n\n<FuguPool />\n\n<Figure\n  src=\"/articles/sakana-fugu/fugu-architecture.png\"\n  alt=\"Sakana Fugu over a pool of closed and open models, with Fugu itself as one of the workers.\"\n  caption=\"Sakana's own framing of the idea: one Fugu endpoint coordinating a pool of closed and open models — and Fugu can even call itself as a worker (the recursive node on the right).\"\n/>\n\nThe pool is swappable — you can opt a model out for compliance and the coordinator\nroutes around it — and billing is a single top-tier rate rather than stacked\nper-model fees. There's even an export-controls angle: because Fugu can hit\nfrontier-level quality by coordinating open and semi-open models, you get the\ncapability without hard dependence on any one restricted vendor.\n\nBut the API is the boring part. The interesting part is that the coordinator is\n*learned*, not hand-written. There are two ways to learn it.\n\n## TRINITY: evolve a tiny coordinator\n\nTRINITY's constraint shapes everything: you cannot fine-tune GPT-5's weights, and\nmerging models with incompatible architectures doesn't work. 
So freeze every model\nin the pool, and learn only a tiny thing on top that decides who does what.\n\n<Figure\n  src=\"/articles/sakana-fugu/trinity-architecture.png\"\n  alt=\"TRINITY's coordination architecture: a coordinator selects an agent and a role each turn, looping Thinker, Worker, Verifier, with a worked example.\"\n  caption=\"TRINITY's coordination loop, from the paper: the coordinator picks an agent and a role each turn, with a worked Thinker → Worker → Verifier example on the right.\"\n/>\n\n### The coordinator is under 20,000 parameters\n\nA small model — Qwen3-0.6B — reads the current problem state and produces a hidden\nvector; a linear head turns that into a choice of *agent* and *role*. Given the\npenultimate-token hidden state $h(s)\\in\\mathbb{R}^{d}$ from the small model, a head\n$f_\\theta$ of roughly 10K parameters emits logits over $L$ agents plus 3 roles, and\nthe coordinator samples its action $a$ from\n\n$$\n\\pi_\\theta(a \\mid s) \\;\\propto\\; \\exp\\!\\big(f_\\theta(h(s))_a\\big),\n\\qquad a \\in \\{1,\\dots,L\\}\\cup\\{\\mathrm{T},\\mathrm{W},\\mathrm{V}\\}\n$$\n\nwhere $s$ is the running transcript, $\\mathrm{T},\\mathrm{W},\\mathrm{V}$ are the three\nroles below, and $\\theta$ is everything that gets trained. On top of the head, TRINITY\nadds *singular-value fine-tuning*: take an SVD of one or two of the small model's\nweight matrices and learn only the singular-value scales, keeping the orthogonal\nfactors fixed. That's a few thousand more numbers. Total trainable: **under 20K\nparameters.** The 0.6B backbone and all seven frontier and open models stay frozen.\n\n<Diagram caption=\"The entire trainable surface of TRINITY: a hidden state, a ~10K linear head, and a categorical choice over agents and roles. 
Everything below the head is frozen.\">\n  <svg viewBox=\"0 0 640 200\" role=\"img\" aria-label=\"The TRINITY coordinator: the small model maps the problem state to a hidden vector; a tiny linear head turns it into logits over agents and roles.\" style={{ width: \"100%\", height: \"auto\" }}>\n    <rect x=\"16\" y=\"74\" width=\"104\" height=\"44\" rx=\"8\" fill=\"var(--background)\" stroke=\"var(--border)\" />\n    <text x=\"68\" y=\"92\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"var(--foreground)\">problem</text>\n    <text x=\"68\" y=\"108\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"var(--foreground)\">state s</text>\n    <line x1=\"120\" y1=\"96\" x2=\"156\" y2=\"96\" stroke=\"var(--muted-foreground)\" strokeWidth=\"1.3\" />\n    <rect x=\"156\" y=\"66\" width=\"120\" height=\"60\" rx=\"8\" fill=\"oklch(0.72 0.05 260)\" opacity=\"0.25\" stroke=\"var(--border)\" />\n    <text x=\"216\" y=\"90\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"var(--foreground)\">Qwen3-0.6B</text>\n    <text x=\"216\" y=\"106\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"var(--muted-foreground)\">frozen · SLM</text>\n    <line x1=\"276\" y1=\"96\" x2=\"312\" y2=\"96\" stroke=\"var(--muted-foreground)\" strokeWidth=\"1.3\" />\n    <text x=\"294\" y=\"88\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"var(--muted-foreground)\">h(s)</text>\n    <rect x=\"312\" y=\"72\" width=\"96\" height=\"48\" rx=\"8\" fill=\"oklch(0.72 0.15 150)\" opacity=\"0.85\" />\n    <text x=\"360\" y=\"92\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"oklch(0.2 0 0)\">head fθ</text>\n    <text x=\"360\" y=\"107\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"oklch(0.2 0 0)\">~10K params</text>\n    <line x1=\"408\" y1=\"96\" x2=\"444\" y2=\"96\" stroke=\"var(--muted-foreground)\" strokeWidth=\"1.3\" />\n    <rect x=\"444\" y=\"40\" 
width=\"180\" height=\"50\" rx=\"8\" fill=\"var(--background)\" stroke=\"var(--border)\" />\n    <text x=\"534\" y=\"60\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"var(--foreground)\">L agent logits</text>\n    <text x=\"534\" y=\"76\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"var(--muted-foreground)\">GPT-5 · Claude · Gemini · …</text>\n    <rect x=\"444\" y=\"102\" width=\"180\" height=\"50\" rx=\"8\" fill=\"var(--background)\" stroke=\"var(--border)\" />\n    <text x=\"534\" y=\"122\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"var(--foreground)\">3 role logits</text>\n    <text x=\"534\" y=\"138\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"var(--muted-foreground)\">Thinker · Worker · Verifier</text>\n  </svg>\n</Diagram>\n\n<Figure\n  src=\"/articles/sakana-fugu/trinity-hidden-state-separability.png\"\n  alt=\"The small model's hidden states are linearly separable by task type (SVM) and form clear task clusters in a t-SNE plot.\"\n  caption=\"Why a ~10K linear head is enough: the small model's hidden states already separate by task type — a linear SVM classifies them almost perfectly (left), and t-SNE shows clean task clusters (right).\"\n/>\n\n### Three roles, looped until accepted\n\nEach turn, the coordinator gives the chosen agent one of three roles:\n\n- **Thinker** — plan, decompose, or critique; no direct work.\n- **Worker** — do the work: derive, compute, write code.\n- **Verifier** — check the current answer and return `ACCEPT` or `REVISE`.\n\nIt loops, accumulating a transcript, and halts the moment a Verifier accepts (or a\nfixed turn budget $K$ is exhausted):\n\n$$\n\\tau \\;=\\; \\min\\{\\, k \\le K \\;:\\; R_k = \\mathrm{V} \\ \\text{and}\\ u_k = \\mathrm{ACCEPT} \\,\\}\n$$\n\nwhere $R_k$ is the role at turn $k$ and $u_k$ is the verifier's verdict. 
Step through\none problem — watch a wrong answer get caught and revised before it's accepted:\n\n<TrinityLoop />\n\n### Trained by evolution, not gradients\n\nWhy not just RL the head? Because the reward is binary — the final answer is right or\nwrong — and the head is tiny, so the per-parameter gradient signal is buried in\nnoise. TRINITY instead optimizes the coordinator with a *derivative-free* evolution\nstrategy, maximizing expected terminal reward:\n\n$$\nJ(\\theta) \\;=\\; \\mathbb{E}_{\\tau \\sim \\pi_\\theta}\\big[\\, R(\\tau) \\,\\big],\n\\qquad R(\\tau) \\in \\{0, 1\\}\n$$\n\nThe optimizer is separable CMA-ES: it keeps a diagonal Gaussian over the ~10K\nparameters, samples a small population each generation —\n$\\lambda = \\lceil 4 + 3\\ln n \\rceil \\approx 32$ for $n \\approx 10{,}000$ — evaluates\neach candidate's fitness by actually running rollouts, and shifts the distribution\ntoward the winners. The paper shows the coordination objective is nearly\nblock-separable, which is exactly the regime where a diagonal evolution strategy\nbeats both random search and gradient RL under a tight evaluation budget. The honest\ncost: no gradients means you pay in *environment evaluations*, and each one is a full\nmulti-turn rollout against real model APIs.\n\n### It beats every model in its pool\n\nThis is the result that matters. Transferred zero-shot to four held-out tasks, the\nevolved coordinator outscored every individual model in its pool — including GPT-5,\nGemini-2.5-Pro, and Claude-4-Sonnet. On LiveCodeBench it set a record at the time of\nsubmission:\n\n<BenchBars\n  title=\"LiveCodeBench v6 — pass@1 (%)\"\n  unit=\"\"\n  bars={[\n    { label: \"TRINITY\", value: 86.2, highlight: true },\n    { label: \"GPT-5\", value: 83.8 },\n    { label: \"Gemini-2.5-Pro\", value: 67.2 },\n    { label: \"Claude-4-Sonnet\", value: 46.5 },\n  ]}\n/>\n\nAnd the multi-turn loop earns its keep: accuracy climbs from 0.823 at two turns to\n0.863 at six. 
One cheap evolved head, a frozen pool, and the ensemble beats its best\nmember.\n\n<Figure\n  src=\"/articles/sakana-fugu/trinity-livecodebench.png\"\n  alt=\"TRINITY's LiveCodeBench result and its accuracy rising with the turn budget.\"\n  caption=\"TRINITY's own result: on LiveCodeBench it reaches 0.862 pass@1, above GPT-5 (0.838), Gemini-2.5-Pro (0.672), and Claude-4-Sonnet (0.465) — and accuracy keeps climbing with the turn budget (bottom).\"\n/>\n\n## Conductor: orchestration written in natural language\n\nThe Conductor attacks the same problem with a bigger hammer: a 7B model (Qwen2.5-7B)\ntrained with RL to *write the entire workflow itself*, in natural language.\n\n### Three lists are a workflow\n\nFor each problem the Conductor emits three synchronized lists:\n\n- `model_id` — which agent runs each step.\n- `subtasks` — a natural-language instruction for each step.\n- `access_list` — which earlier outputs each step is allowed to read.\n\nThose three lists *are* a directed graph. The `access_list` is the load-bearing\nidea: `[]` means the step sees only the original question, `[\"all\"]` means it sees\neverything produced so far, and `[0, 2]` means it sees steps 0 and 2. By choosing\naccess lists, the Conductor designs the communication topology — a chain, parallel\nbranches, a verify-and-merge — *per problem*, not from a fixed template. Flip between\nthe topologies it learns to produce:\n\n<ConductorWorkflow />\n\n### Trained with GRPO\n\nThe Conductor is trained end-to-end with GRPO. For each question it samples a group\nof $G = 64$ candidate workflows, scores each, and pushes the policy toward the\nabove-average ones using the group-normalized advantage\n\n$$\nA_i \\;=\\; \\frac{r_i - \\operatorname{mean}(r_1, \\dots, r_G)}{\\operatorname{std}(r_1, \\dots, r_G)}\n$$\n\nThe reward $r_i$ is blunt on purpose: $0$ if the three lists don't parse, $1$ if the\nfinal workflow output is correct, and $0.5$ otherwise — with no KL penalty\n($\\beta = 0$). 
The whole thing trains on just 960 problems for 200 iterations on two\nH100s. To make one Conductor work over *any* pool, they then fine-tune it with\nrandomly sampled $k$-model subsets per question, so it adapts to whatever agents you\nhand it.\n\n<Figure\n  src=\"/articles/sakana-fugu/conductor-training-emergence.png\"\n  alt=\"Conductor accuracy climbing over 200 GRPO iterations for out-of-distribution, in-distribution, and mixed agent pools.\"\n  caption=\"Coordination strategy emerging during training: accuracy climbs over 200 GRPO iterations as the Conductor learns to design better workflows — fastest when its few-shot examples are held out-of-distribution.\"\n/>\n\n### It can call itself\n\nThe Conductor may name *itself* as a worker. That spawns a fresh sub-workflow on its\nown draft — a recursive topology that turns inference depth into a tunable compute\naxis, what Sakana calls dynamic test-time scaling. Recursion buys a point or two on\nthe hardest benchmarks for under 2× the agent calls.\n\n### Results\n\nA 7B model orchestrating frontier workers beats the frontier workers. 
In a\ncontrolled run over the same pool:\n\n<BenchBars\n  title=\"LiveCodeBench — controlled, shared worker pool (%)\"\n  unit=\"\"\n  bars={[\n    { label: \"Conductor (7B)\", value: 64.3, highlight: true },\n    { label: \"GPT-5\", value: 57.5 },\n    { label: \"Gemini-2.5-Pro\", value: 40.1 },\n    { label: \"Claude-4\", value: 38.0 },\n    { label: \"MoA\", value: 38.6 },\n  ]}\n/>\n\nUnconstrained, the headline numbers were each a new high at publication and each\nabove the best single worker: **83.9% on LiveCodeBench, 87.5% on GPQA-Diamond, 93.3%\non AIME25** — reached with about 3 agent calls per question, versus 5–8 for prior\nmulti-agent methods.\n\n<Figure\n  src=\"/articles/sakana-fugu/conductor-leaderboard.png\"\n  alt=\"Conductor leading both GPQA-Diamond and LiveCodeBench against every individual worker model.\"\n  caption=\"The Conductor (highlighted) tops both GPQA-Diamond and LiveCodeBench against every individual worker in its pool — GPT-5, Gemini-2.5-Pro, DeepSeek-R1, and Claude Opus 4.\"\n/>\n\n<Figure\n  src=\"/articles/sakana-fugu/conductor-efficiency.png\"\n  alt=\"Scatter of average performance versus average number of agent calls: the Conductor is high-performance at about 3 calls, versus MoA at 8 calls.\"\n  caption=\"Performance versus cost: the Conductor sits top-left — higher accuracy than every multi-agent baseline at roughly 3 agent calls, where MoA needs 8.\"\n/>\n\n## Two routes to the same place\n\nTRINITY and the Conductor are the same idea — a learned layer that coordinates a\npool — built at opposite scales:\n\n| | TRINITY | Conductor |\n|---|---|---|\n| Learnable size | < 20K params (evolved head) | 7B params (RL-trained model) |\n| Training | derivative-free sep-CMA-ES | GRPO (reinforcement learning) |\n| Output per step | (agent, role) | a full natural-language workflow |\n| Coordination | fixed Thinker/Worker/Verifier loop | a topology it designs per problem |\n| Reads the task via | the small model's hidden state | 
reasoning in language |\n| Adapts to new pools | re-evolve (cheap) | randomized-pool fine-tune |\n\nTRINITY is the minimal, almost-free coordinator; the Conductor is the expressive one\nthat designs bespoke pipelines. Fugu uses both as its engine.\n\n## What ships: Fugu and Fugu Ultra\n\nTwo tiers. Base **Fugu** balances quality and latency over a lean pool. **Fugu\nUltra** coordinates a deeper pool over more turns for hard, high-stakes problems, and\ntakes longer for it. On Sakana's reported numbers, both match or beat the frontier:\n\n<BenchBars\n  title=\"SWE-Bench Pro (%)\"\n  unit=\"\"\n  bars={[\n    { label: \"Fugu Ultra\", value: 73.7, highlight: true },\n    { label: \"Claude Opus 4.8\", value: 69.2 },\n  ]}\n/>\n\nFugu Ultra also posts **50.0 on Humanity's Last Exam**, against baselines in the\n41–50 range. It's an OpenAI-compatible endpoint — change the base URL and key, no SDK\nmigration — and it bills at a single top-tier rate. (Not available in the EU yet,\npending GDPR; the exact routing decisions are kept proprietary.)\n\n<Figure\n  src=\"/articles/sakana-fugu/fugu-benchmarks.png\"\n  alt=\"Fugu and Fugu Ultra versus Fable 5, Gemini 3.1 Pro, GPT-5.5, and Claude Opus 4.8 across eight benchmarks.\"\n  caption=\"Fugu and Fugu Ultra (red) against Fable 5, Gemini 3.1 Pro, GPT-5.5, and Claude Opus 4.8 across eight benchmarks (Sakana). Fugu Ultra leads on SWE-Bench Pro (73.7 vs 69.2), GPQA-D, LiveCodeBench, and Humanity's Last Exam.\"\n/>\n\n## What I make of it\n\nThe honest read:\n\n- **The win is real.** An orchestration layer that beats every model it coordinates —\n  and generalizes zero-shot to unseen tasks — is a genuine result. \"Coordination\" is\n  now a trainable layer that sits *above* frontier models rather than inside one.\n- **The costs are real too.** Every model in the pool has to be available at\n  inference; you trade single-model simplicity for a fleet, and latency rises with\n  the extra turns. 
The biggest gains concentrate on long-tail reasoning and coding\n  benchmarks — on easy tasks the lift is small — and leaning on GPT-5/Claude/Gemini as\n  workers inherits their cost.\n- **The framing is the interesting part.** TRINITY argues the coordinator can be\n  almost free: under 20K evolved parameters over frozen models. The Conductor argues\n  coordination is itself a reasoning skill worth a 7B model and a full RL run. Both\n  point the same way — as individual models plateau, the next axis is how you make\n  several of them work together, and that orchestration is learnable.\n\n---\n\n*Built on Sakana AI's [TRINITY: An Evolved LLM Coordinator](https://arxiv.org/abs/2512.04695)\nand [Learning to Orchestrate Agents in Natural Language with the Conductor](https://arxiv.org/abs/2512.04388),\nboth ICLR 2026. Product: [Sakana Fugu](https://sakana.ai/fugu/).*\n","readingTimeMins":12,"url":"https://ai.thesatyajit.com/articles/sakana-fugu","signal":{"interest":4,"helpful":3,"score":7,"level":3,"label":"Notable"}},{"title":"Mixture of Experts, from scratch","description":"Why MoE lets a model carry billions of parameters but only pay for a slice of them per token — built up from one MLP, a router, and a sparse forward pass, with the gating, dispatch, and load-balancing made visible.","date":"2026-06-10","tags":["deep-learning","transformers","mixture-of-experts","explainer"],"draft":false,"featured":false,"interest":3,"helpful":5,"kind":"articles","slug":"mixture-of-experts-from-scratch","body":"Scaling a transformer the dense way is a bad trade. Every parameter you add runs\non every token. Double the width of the feed-forward layers and you double both\nthe model's capacity *and* the FLOPs it burns per token — capacity and compute are\nwelded together. You pay for the whole network on every single token, whether that\ntoken needs it or not.\n\nMixture of Experts breaks the weld. 
The idea is **conditional computation**: keep a\nlarge pile of parameters around, but for any given token, only run a small slice of\nthem. A tiny router looks at each token and picks a couple of sub-networks — the\n*experts* — to handle it. The rest sit idle for that token. You get the capacity of\na big model at the compute of a small one.\n\nHere is the whole model we'll build, end to end. The only thing that makes it a MoE\nis one swapped line — tap the **sparse MoE** block to see it:\n\n<MoeArchitecture />\n\nEverything except that one block is a standard decoder-only transformer: token and\nposition embeddings, a stack of blocks, a final norm, an LM head. Attention is\nuntouched. MoE is a surgical replacement for the feed-forward layer inside each\nblock, and nothing else. So the whole thing reduces to three questions: what is an\nexpert, who decides which experts run, and how do you run only the chosen few.\n\n## An expert is just an MLP\n\nStart with the thing we're replacing. In a normal transformer block, after\nattention, every token goes through the same two-layer MLP — expand to `4 * n_embed`,\nnonlinearity, project back. That's the feed-forward network.\n\nAn expert is exactly that MLP. Nothing more.\n\n```python\nclass Expert(nn.Module):\n    def __init__(self, n_embed, dropout=0.1):\n        super().__init__()\n        self.net = nn.Sequential(\n            nn.Linear(n_embed, 4 * n_embed),\n            nn.ReLU(),\n            nn.Linear(4 * n_embed, n_embed),\n            nn.Dropout(dropout),\n        )\n\n    def forward(self, x):\n        return self.net(x)\n```\n\nThe move is to keep `num_experts` copies of this MLP instead of one. With 8 experts\nyou have 8× the feed-forward parameters. If every token went through all 8, you'd\nhave spent 8× the compute and gained nothing but a slow, fat FFN. The whole game is\nto run only `top_k` of them — say 2 — per token. 
So you carry 8 experts' worth of\nparameters and pay for 2.\n\nThe piece that makes that decision is the router.\n\n## Who decides? The router\n\nThe router's job: look at a token's vector $x$ and produce a weight for each expert,\nmostly zero, so that only a few experts actually contribute. Build it up in three\nsteps, because the naive versions teach you why the real one looks the way it does.\n\n**Attempt 1 — send every token to every expert, weighted.** A linear layer maps the\ntoken to one logit per expert, softmax over them, take a weighted sum of all expert\noutputs:\n\n$$\ng(x) = \\mathrm{softmax}(x W_g), \\qquad y = \\sum_{i=1}^{N} g(x)_i \\, E_i(x)\n$$\n\nHere $W_g$ is the router's weight matrix (`n_embed × num_experts`) and $E_i$ is the\n$i$-th expert. This is differentiable and trains fine — but it's *dense*. Every\nexpert runs on every token. We've built an expensive ensemble, not a sparse model.\n\n**Attempt 2 — hard pick the single best expert.** Take $\\arg\\max$ of the logits, run\nonly that expert. Now it's sparse and cheap. But $\\arg\\max$ is piecewise constant, so\nno gradient reaches the router: the chosen expert still trains, but the router never\ngets a signal that a different expert would have done better. Routing freezes. Dead end.\n\n**Attempt 3 — top-$k$ softmax.** Keep the largest $k$ logits, set the rest to\n$-\\infty$, *then* softmax. The $-\\infty$ entries become exactly 0, so only $k$ experts\ncontribute — sparse like attempt 2 — but the softmax over the survivors is smooth, so\ngradients flow to all $k$ chosen experts. This is the real router:\n\n$$\ng(x) = \\mathrm{softmax}\\big(\\mathrm{KeepTopK}(x W_g,\\, k)\\big), \\qquad\n\\mathrm{KeepTopK}(v, k)_i = \\begin{cases} v_i & v_i \\text{ in top } k \\\\ -\\infty & \\text{otherwise} \\end{cases}\n$$\n\nWith $k = 2$ and $N = 8$, six of the eight gate weights are zero for every token, and\nthe two survivors sum to 1. 
Watch one token go through it — logits, keep the top two,\nsoftmax to gates, combine:\n\n<MoeRouter />\n\nThat stepper is the entire routing mechanism. The bars are the per-expert logits;\ntop-2 keeps two; softmax turns them into weights; the output is just those two\nexperts' outputs scaled by their gates and added.\n\n## Why the noise\n\nThere's one addition that the bare top-$k$ router needs in practice: noise. Before\npicking the top $k$, add a learned, per-expert amount of Gaussian noise to the\nlogits:\n\n$$\nH(x)_i = (x W_g)_i + \\varepsilon_i \\cdot \\mathrm{softplus}\\big((x W_{\\text{noise}})_i\\big), \\qquad \\varepsilon_i \\sim \\mathcal{N}(0, 1)\n$$\n\nThe noise scale is itself learned (a second linear layer $W_{\\text{noise}}$, passed\nthrough `softplus` to keep it positive). Why bother? Because early in training the\nrouter is random, and whichever experts happen to win first get all the gradient and\npull ahead — a rich-get-richer collapse. The noise jitters the top-$k$ selection so\nborderline experts occasionally win, get some tokens, and get a chance to become\nuseful. It's exploration, baked into the forward pass. 
Hit *resample noise* in the\nwidget above and you can watch the pair of winning experts change.\n\nIn code the router is four lines of real work:\n\n```python\nclass NoisyTopKRouter(nn.Module):\n    def __init__(self, n_embed, num_experts, top_k):\n        super().__init__()\n        self.top_k = top_k\n        self.route = nn.Linear(n_embed, num_experts)   # gate logits\n        self.noise = nn.Linear(n_embed, num_experts)   # per-expert noise scale\n\n    def forward(self, x):\n        logits = self.route(x)\n        noisy = logits + torch.randn_like(logits) * F.softplus(self.noise(x))\n\n        top_logits, idx = noisy.topk(self.top_k, dim=-1)     # the chosen experts\n        sparse = torch.full_like(noisy, float(\"-inf\"))\n        sparse.scatter_(-1, idx, top_logits)                 # keep top-k, rest -inf\n        return F.softmax(sparse, dim=-1), idx\n```\n\n`scatter_` is the one trick worth pausing on: it writes the kept logits back into a\ntensor of `-inf`, at the indices the `topk` chose. After the softmax those `-inf`\nslots are 0. The router returns the gate weights and the chosen indices — the\nindices tell the next stage which experts to actually run.\n\n## The sparse forward pass\n\nNow the part that earns the word *sparse*. We have gate weights and, for each token,\nthe indices of its top-$k$ experts. 
We want to run each expert on only the tokens\nrouted to it, scale by the gate, and add the result back.\n\nThe straightforward way: loop over experts, and for each one, build a mask that\nselects the tokens that picked it.\n\n```python\nclass SparseMoE(nn.Module):\n    def __init__(self, n_embed, num_experts, top_k):\n        super().__init__()\n        self.router = NoisyTopKRouter(n_embed, num_experts, top_k)\n        self.experts = nn.ModuleList([Expert(n_embed) for _ in range(num_experts)])\n\n    def forward(self, x):\n        gates, idx = self.router(x)            # (B,T,N), (B,T,k)\n        out = torch.zeros_like(x)\n\n        flat_x = x.view(-1, x.size(-1))        # (B*T, C)\n        flat_gates = gates.view(-1, gates.size(-1))\n        flat_out = out.view(-1, x.size(-1))\n\n        for i, expert in enumerate(self.experts):\n            mask = (idx == i).any(dim=-1).view(-1)   # tokens routed to expert i\n            if mask.any():\n                y = expert(flat_x[mask])             # run on its tokens only\n                flat_out[mask] += flat_gates[mask, i:i+1] * y\n        return out\n```\n\nThe `mask = (idx == i).any(dim=-1)` line is the dispatch: it's true for exactly the\ntokens that have expert `i` somewhere in their top-$k$. We gather those tokens, run\nthe expert once on the batch of them, scale each by its gate weight, and scatter-add\nback into the output. A token routed to experts 2 and 5 gets contributions from both\nloop iterations, summed — which is exactly $\\sum_i g(x)_i E_i(x)$ with all but $k$\nterms zero.\n\nPicture the dispatch over a short sequence. Each token connects to just two of the\neight experts, so most of the grid stays dark — that darkness is the compute you're\n*not* spending:\n\n<MoeRouting />\n\nThe bars underneath are the per-expert load: how many tokens each expert handled.\nNotice it's already uneven — some experts attract more traffic than others. 
Hold that\nthought; it's the central problem with MoE.\n\n<Callout type=\"note\">\n  This masked loop is the *teaching* implementation. It's correct but it runs every\n  expert as a separate kernel and materialises a mask per expert. Production MoE\n  instead sorts/permutes tokens by expert and does one grouped matmul, and in the\n  distributed case each expert lives on a different GPU and tokens are shipped to\n  them (expert parallelism). Same math, very different plumbing.\n</Callout>\n\n## The one line that changes\n\nWith the experts and the router in hand, dropping MoE into a transformer block is\nanticlimactic — which is the point. A standard block is `attention → FFN`, each\nwrapped in a layer-norm and a residual. MoE swaps the FFN for the `SparseMoE` module\nand touches nothing else:\n\n```python\nclass Block(nn.Module):\n    def __init__(self, n_embed, n_head, num_experts, top_k, block_size):\n        super().__init__()\n        self.sa = MultiHeadAttention(n_head, n_embed, block_size)\n        self.smoe = SparseMoE(n_embed, num_experts, top_k)   # was: FeedForward(n_embed)\n        self.ln1 = nn.LayerNorm(n_embed)\n        self.ln2 = nn.LayerNorm(n_embed)\n\n    def forward(self, x):\n        x = x + self.sa(self.ln1(x))      # attention — unchanged\n        x = x + self.smoe(self.ln2(x))    # MoE replaces the feed-forward layer\n        return x\n```\n\nThat's the whole architectural delta. One `FeedForward` becomes one `SparseMoE`:\n\n<Diagram caption=\"Same slot in the block. 
Dense runs one MLP on every token; sparse runs a router plus the two chosen experts.\">\n  <svg viewBox=\"0 0 640 250\" role=\"img\" aria-label=\"A dense feed-forward layer applies one MLP to every token; the sparse MoE layer routes each token to two of eight experts.\" style={{ width: \"100%\", height: \"auto\" }}>\n    <defs>\n      <marker id=\"moe-arrow\" viewBox=\"0 0 10 10\" refX=\"8\" refY=\"5\" markerWidth=\"6\" markerHeight=\"6\" orient=\"auto-start-reverse\">\n        <path d=\"M0,0 L10,5 L0,10 z\" fill=\"var(--muted-foreground)\" />\n      </marker>\n    </defs>\n\n    {/* dense side */}\n    <text x=\"150\" y=\"24\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"13\" fill=\"var(--foreground)\">dense FFN</text>\n    <rect x=\"110\" y=\"44\" width=\"80\" height=\"22\" rx=\"5\" fill=\"var(--background)\" stroke=\"var(--border)\" />\n    <text x=\"150\" y=\"59\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"var(--foreground)\">tokens</text>\n    <line x1=\"150\" y1=\"66\" x2=\"150\" y2=\"92\" stroke=\"var(--muted-foreground)\" strokeWidth=\"1.2\" markerEnd=\"url(#moe-arrow)\" />\n    <rect x=\"95\" y=\"94\" width=\"110\" height=\"50\" rx=\"8\" fill=\"oklch(0.72 0.13 250)\" opacity=\"0.9\" />\n    <text x=\"150\" y=\"116\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"oklch(0.2 0 0)\">one MLP</text>\n    <text x=\"150\" y=\"132\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"oklch(0.2 0 0)\">every token</text>\n    <line x1=\"150\" y1=\"144\" x2=\"150\" y2=\"170\" stroke=\"var(--muted-foreground)\" strokeWidth=\"1.2\" markerEnd=\"url(#moe-arrow)\" />\n    <text x=\"150\" y=\"190\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"var(--muted-foreground)\">100% of params, every token</text>\n\n    {/* divider */}\n    <line x1=\"320\" y1=\"30\" x2=\"320\" y2=\"210\" stroke=\"var(--border)\" strokeDasharray=\"3 4\" />\n\n    {/* sparse side */}\n    <text 
x=\"490\" y=\"24\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"13\" fill=\"var(--foreground)\">sparse MoE</text>\n    <rect x=\"450\" y=\"44\" width=\"80\" height=\"22\" rx=\"5\" fill=\"var(--background)\" stroke=\"var(--border)\" />\n    <text x=\"490\" y=\"59\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"var(--foreground)\">tokens</text>\n    <line x1=\"490\" y1=\"66\" x2=\"490\" y2=\"84\" stroke=\"var(--muted-foreground)\" strokeWidth=\"1.2\" markerEnd=\"url(#moe-arrow)\" />\n    <rect x=\"448\" y=\"86\" width=\"84\" height=\"20\" rx=\"5\" fill=\"var(--background)\" stroke=\"var(--foreground)\" strokeOpacity=\"0.4\" />\n    <text x=\"490\" y=\"100\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"var(--foreground)\">router</text>\n\n    {/* 8 experts, 2 lit */}\n    {[0,1,2,3,4,5,6,7].map((i) => {\n      const x = 372 + i * 30\n      const lit = i === 2 || i === 5\n      return (\n        <g key={i}>\n          <line x1=\"490\" y1=\"106\" x2={x + 11} y2=\"150\" stroke={`oklch(0.72 0.13 ${(i*45)%360})`} strokeWidth={lit ? 2 : 1} opacity={lit ? 0.9 : 0.12} />\n          <rect x={x} y=\"150\" width=\"22\" height=\"34\" rx=\"4\" fill={`oklch(0.72 0.13 ${(i*45)%360})`} opacity={lit ? 1 : 0.16} />\n        </g>\n      )\n    })}\n    <text x=\"490\" y=\"204\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"var(--muted-foreground)\">8 experts stored · 2 run</text>\n  </svg>\n</Diagram>\n\nStack eight of these blocks, add embeddings and an LM head, and you have the model\nfrom the top of the page. Train it exactly like a dense transformer — cross-entropy\non next-token prediction. The router learns its weights from the same gradient as\neverything else. 
No special routing supervision; it figures out a useful assignment\non its own.\n\n## Run it yourself\n\nHere is everything above assembled into one file — a char-level model that trains on\ntiny Shakespeare in about 200 lines, with no dependency past PyTorch. The `Expert`,\n`NoisyTopKRouter`, and `SparseMoE` are exactly the pieces we just built; the rest is\nthe smallest transformer that can hold them. Copy it, run `python tinymoe.py`, and\nwatch the loss come down.\n\n```python\n\"\"\"\ntinymoe — a tiny Mixture-of-Experts language model in one file.\nChar-level, trains on tiny Shakespeare. ~4.5M params, ~1.4M active per token.\nRuns on CPU; much faster on a GPU.\n\n    python tinymoe.py        # download data, train, then sample\n\nIt's a small decoder-only transformer where the feed-forward layer of every\nblock is replaced by a sparse mixture of experts with noisy top-k routing.\n\"\"\"\nimport os\nimport urllib.request\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import functional as F\n\n# --------------------------------------------------------------------- config\nbatch_size = 32          # sequences per step\nblock_size = 128         # context length (chars)\nn_embed = 128            # embedding / residual width\nn_head = 4               # attention heads\nn_layer = 4              # transformer blocks\nnum_experts = 8          # experts per MoE layer\ntop_k = 2                # experts actually run per token\ndropout = 0.1\nlearning_rate = 3e-4\nmax_iters = 5000\neval_interval = 500\neval_iters = 100\ndevice = \"cuda\" if torch.cuda.is_available() else \"cpu\"\ntorch.manual_seed(1337)\n\n# ----------------------------------------------------------- data (shakespeare)\nif not os.path.exists(\"input.txt\"):\n    url = (\"https://raw.githubusercontent.com/karpathy/char-rnn/\"\n           \"master/data/tinyshakespeare/input.txt\")\n    urllib.request.urlretrieve(url, \"input.txt\")\ntext = open(\"input.txt\", encoding=\"utf-8\").read()\n\nchars = 
sorted(set(text))\nvocab_size = len(chars)\nstoi = {c: i for i, c in enumerate(chars)}\nitos = {i: c for i, c in enumerate(chars)}\nencode = lambda s: [stoi[c] for c in s]\ndecode = lambda t: \"\".join(itos[i] for i in t)\n\ndata = torch.tensor(encode(text), dtype=torch.long)\nn = int(0.9 * len(data))\ntrain_data, val_data = data[:n], data[n:]\n\n\ndef get_batch(split):\n    d = train_data if split == \"train\" else val_data\n    ix = torch.randint(len(d) - block_size, (batch_size,))\n    x = torch.stack([d[i:i + block_size] for i in ix])\n    y = torch.stack([d[i + 1:i + block_size + 1] for i in ix])\n    return x.to(device), y.to(device)\n\n\n# ------------------------------------------------------------------- attention\nclass Head(nn.Module):\n    def __init__(self, head_size):\n        super().__init__()\n        self.key = nn.Linear(n_embed, head_size, bias=False)\n        self.query = nn.Linear(n_embed, head_size, bias=False)\n        self.value = nn.Linear(n_embed, head_size, bias=False)\n        self.register_buffer(\"tril\", torch.tril(torch.ones(block_size, block_size)))\n        self.drop = nn.Dropout(dropout)\n\n    def forward(self, x):\n        B, T, C = x.shape\n        k, q = self.key(x), self.query(x)\n        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5\n        wei = wei.masked_fill(self.tril[:T, :T] == 0, float(\"-inf\"))\n        wei = self.drop(F.softmax(wei, dim=-1))\n        return wei @ self.value(x)\n\n\nclass MultiHeadAttention(nn.Module):\n    def __init__(self, n_head, head_size):\n        super().__init__()\n        self.heads = nn.ModuleList([Head(head_size) for _ in range(n_head)])\n        self.proj = nn.Linear(n_embed, n_embed)\n        self.drop = nn.Dropout(dropout)\n\n    def forward(self, x):\n        out = torch.cat([h(x) for h in self.heads], dim=-1)\n        return self.drop(self.proj(out))\n\n\n# --------------------------------------------------------- mixture of experts\nclass Expert(nn.Module):\n    \"\"\"One 
expert = one MLP. Same shape as a normal transformer FFN.\"\"\"\n\n    def __init__(self):\n        super().__init__()\n        self.net = nn.Sequential(\n            nn.Linear(n_embed, 4 * n_embed), nn.ReLU(),\n            nn.Linear(4 * n_embed, n_embed), nn.Dropout(dropout),\n        )\n\n    def forward(self, x):\n        return self.net(x)\n\n\nclass NoisyTopKRouter(nn.Module):\n    \"\"\"Score experts per token, add learned noise, keep top-k, softmax.\"\"\"\n\n    def __init__(self):\n        super().__init__()\n        self.route = nn.Linear(n_embed, num_experts)\n        self.noise = nn.Linear(n_embed, num_experts)\n\n    def forward(self, x):\n        logits = self.route(x)\n        noisy = logits + torch.randn_like(logits) * F.softplus(self.noise(x))\n        top_logits, idx = noisy.topk(top_k, dim=-1)\n        sparse = torch.full_like(noisy, float(\"-inf\")).scatter(-1, idx, top_logits)\n        return F.softmax(sparse, dim=-1), idx\n\n\nclass SparseMoE(nn.Module):\n    \"\"\"Run only the top-k experts per token; combine them by gate weight.\"\"\"\n\n    def __init__(self):\n        super().__init__()\n        self.router = NoisyTopKRouter()\n        self.experts = nn.ModuleList([Expert() for _ in range(num_experts)])\n\n    def forward(self, x):\n        gates, idx = self.router(x)                  # (B,T,E), (B,T,k)\n        out = torch.zeros_like(x)\n        flat_x = x.reshape(-1, x.size(-1))\n        flat_gates = gates.reshape(-1, gates.size(-1))\n        flat_out = out.reshape(-1, x.size(-1))\n        for i, expert in enumerate(self.experts):\n            mask = (idx == i).any(dim=-1).reshape(-1)  # tokens routed to expert i\n            if mask.any():\n                flat_out[mask] += flat_gates[mask, i:i + 1] * expert(flat_x[mask])\n        return out\n\n\n# ------------------------------------------------------------- block + model\nclass Block(nn.Module):\n    def __init__(self):\n        super().__init__()\n        self.sa = 
MultiHeadAttention(n_head, n_embed // n_head)\n        self.smoe = SparseMoE()                      # <- replaces the FFN\n        self.ln1 = nn.LayerNorm(n_embed)\n        self.ln2 = nn.LayerNorm(n_embed)\n\n    def forward(self, x):\n        x = x + self.sa(self.ln1(x))\n        x = x + self.smoe(self.ln2(x))\n        return x\n\n\nclass MoELanguageModel(nn.Module):\n    def __init__(self):\n        super().__init__()\n        self.tok_emb = nn.Embedding(vocab_size, n_embed)\n        self.pos_emb = nn.Embedding(block_size, n_embed)\n        self.blocks = nn.Sequential(*[Block() for _ in range(n_layer)])\n        self.ln_f = nn.LayerNorm(n_embed)\n        self.head = nn.Linear(n_embed, vocab_size)\n\n    def forward(self, idx, targets=None):\n        B, T = idx.shape\n        x = self.tok_emb(idx) + self.pos_emb(torch.arange(T, device=idx.device))\n        x = self.ln_f(self.blocks(x))\n        logits = self.head(x)\n        loss = None\n        if targets is not None:\n            loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))\n        return logits, loss\n\n    @torch.no_grad()\n    def generate(self, idx, max_new_tokens):\n        for _ in range(max_new_tokens):\n            logits, _ = self(idx[:, -block_size:])\n            probs = F.softmax(logits[:, -1, :], dim=-1)\n            idx = torch.cat([idx, torch.multinomial(probs, 1)], dim=1)\n        return idx\n\n\n# --------------------------------------------------------------------- train\n@torch.no_grad()\ndef estimate_loss(model):\n    out = {}\n    model.eval()\n    for split in (\"train\", \"val\"):\n        losses = torch.zeros(eval_iters)\n        for k in range(eval_iters):\n            x, y = get_batch(split)\n            _, losses[k] = model(x, y)\n        out[split] = losses.mean().item()\n    model.train()\n    return out\n\n\nmodel = MoELanguageModel().to(device)\ntotal = sum(p.numel() for p in model.parameters())\nprint(f\"{total / 1e6:.2f}M params on {device}\")\nopt = 
torch.optim.AdamW(model.parameters(), lr=learning_rate)\n\nfor it in range(max_iters):\n    if it % eval_interval == 0:\n        l = estimate_loss(model)\n        print(f\"step {it:5d} | train {l['train']:.3f} | val {l['val']:.3f}\")\n    x, y = get_batch(\"train\")\n    _, loss = model(x, y)\n    opt.zero_grad(set_to_none=True)\n    loss.backward()\n    opt.step()\n\n# -------------------------------------------------------------------- sample\nctx = torch.zeros((1, 1), dtype=torch.long, device=device)\nprint(decode(model.generate(ctx, 500)[0].tolist()))\n```\n\nAt the default size it prints `4.52M params` — but only **~1.4M of them run on any\ngiven token**, because 6 of every 8 experts sit out. That's the parameter-vs-compute\nsplit in miniature. Raise `num_experts` and the total climbs while the active count\nbarely moves; lower `top_k` to 1 and it gets sparser still. The same lever Mixtral\npulls, in a model you can train on a laptop.\n\nOne honesty note: this minimal version relies entirely on the routing noise to keep\nexperts balanced — there's no auxiliary loss. At toy scale it trains fine. Scale it\nup and a few experts quietly take over, which is the next problem.\n\n## The catch: load balancing\n\nMoE has one failure mode that dominates everything else, and you saw it forming in the\ndispatch map: **expert collapse**. Routing is a positive feedback loop. An expert that\nwins a few tokens early gets gradient, improves, and so becomes the router's favourite\nfor even more tokens. Meanwhile the experts that lost early get no tokens, no\ngradient, and never improve. Left alone, a handful of experts end up doing all the\nwork and the rest are dead weight — you're paying to store 8 experts and effectively\nrunning 2 or 3.\n\n<MoeLoadBalance />\n\nThe noise we added earlier is the first defense — it keeps the routing from hardening\ntoo fast. 
The second, used in every serious MoE, is an **auxiliary load-balancing\nloss**: a term added to the training objective that measures how lopsided the routing\nis across a batch and penalises imbalance, nudging the router toward spreading tokens\nevenly. It's a soft constraint — you're not forcing exactly equal load, just paying a\ncost for collapse. Tuning its weight is part of the unglamorous reality of training a\nMoE: too little and experts collapse, too much and you fight the router's ability to\nactually specialise.\n\nThis is the honest tradeoff. A dense FFN has no routing, no balance to maintain, no\nextra loss to tune. MoE buys you cheap capacity and hands you a load-balancing problem\nin return.\n\n## What the experts actually learn\n\nIt's tempting to picture expert 3 as \"the Python expert\" and expert 5 as \"the French\nexpert.\" That's mostly not what happens. When the Mixtral authors inspected their\nrouter, they found no clean topic or domain specialization — experts don't map to\nsubjects. What the router learns is lower-level and more syntactic: routing is\nstrongly correlated across consecutive tokens, and individual experts lean toward\nthings like indentation, punctuation, or particular token shapes. The specialization\nis real, but it's structural, not semantic, and not especially interpretable.\n\"Experts\" is a useful name, not a promise that each one becomes a tidy domain\nspecialist.\n\n## Beyond the basic router\n\nThe router we built is *token-choice*: each token picks its experts. 
Three variations\nare worth knowing, because they're all different answers to the same load-balancing\nproblem:\n\n- **Expert-choice routing** flips the selection — each expert picks its top tokens.\n  Load is balanced by construction (every expert takes a fixed budget), at the cost of\n  some tokens getting chosen by many experts and others by none.\n- **Shared experts** (as in DeepSeek-MoE) keep one or two experts always on for every\n  token, so the routed experts don't burn capacity re-learning common patterns and can\n  specialize at the margin.\n- **Capacity and token dropping** — in batched or distributed training each expert gets\n  a fixed number of slots per batch; tokens that overflow their chosen expert are\n  dropped and pass through on the residual alone. A blunt cap that keeps the per-expert\n  matmuls a fixed, rectangular shape.\n\nSame tradeoff surface — cheap capacity versus keeping every expert fed — approached\nfrom different sides.\n\n## What you actually buy\n\nWhy put up with the routing machinery? Because the parameter-vs-compute decoupling is\nreal and large. Mixtral 8×7B is the clean reference: 8 experts per layer, top-2\nrouting — the exact configuration we just built. It holds **47B parameters total**,\nbut because only 2 of 8 experts run per token, a forward pass touches **about 13B\nactive parameters**. It runs at the speed and memory-bandwidth cost of a ~13B dense\nmodel while matching or beating a 70B dense one across benchmarks.\n\nThat's the pitch in one line: **capacity you don't pay for on every token.** The\nparameters are the model's knowledge; the active fraction is what each token can\nafford to consult.\n\nThere's a cost on the other side of the ledger, and it's worth stating plainly. MoE\ntrades **compute for memory**. 
Only $k$ experts run, but *all* of them have to be\nresident — you still hold 47B parameters in memory even though each token uses 13B.\nAnd at batch scale the router scatters tokens across all experts, so the bandwidth and\nthe all-to-all communication of shipping tokens to the right expert (across GPUs)\nbecomes the real bottleneck, not the matmuls. MoE doesn't make models free. It moves\nthe cost from FLOPs, which you pay per token, to memory and bandwidth, which you pay\nonce. For inference-bound serving at scale, that's usually the trade you want.\n\n## The whole thing, in one breath\n\nStrip away the engineering and MoE is small: an expert is the FFN you already had;\nkeep several of them; a one-layer router scores them per token; keep the top two,\nsoftmax for weights, run only those two, add a little noise so routing explores and a\nbalancing loss so it doesn't collapse. One line in the transformer block changes. In\nreturn, the model's parameter count and its per-token compute stop being the same\nnumber — and that decoupling is the entire reason the largest models you can name are\nbuilt this way.\n","readingTimeMins":18,"url":"https://ai.thesatyajit.com/articles/mixture-of-experts-from-scratch","signal":{"interest":3,"helpful":5,"score":8,"level":4,"label":"High"}},{"title":"Coroutines in C, intuitively","description":"How to pause a function in the middle and resume it later — using nothing but a switch statement and __LINE__. An intuitive tour of Simon Tatham's classic trick, with a step-through animation.","date":"2026-06-09","tags":["c","coroutines","systems","explainer"],"draft":false,"featured":false,"interest":4,"helpful":5,"kind":"articles","slug":"coroutines-in-c","body":"Some functions want to be *callers*. Some want to be *callees*. The trouble starts\nwhen two pieces of code both want to be the caller.\n\nPicture a decompressor that walks a byte stream and emits one character at a time,\nand a parser that consumes characters one at a time. 
Each is most natural as a loop\nthat *drives* the other:\n\n<Diagram caption=\"Both want to be the loop. Only one can be — the other must invert into a state machine.\">\n  <svg\n    viewBox=\"0 0 600 220\"\n    role=\"img\"\n    aria-label=\"Two functions, a decompressor and a parser, each naturally a loop that wants to drive the other.\"\n    style={{ width: \"100%\", height: \"auto\", color: \"var(--foreground)\" }}\n  >\n    <defs>\n      <marker id=\"cf-arrow\" viewBox=\"0 0 10 10\" refX=\"8\" refY=\"5\" markerWidth=\"6\" markerHeight=\"6\" orient=\"auto-start-reverse\">\n        <path d=\"M0,0 L10,5 L0,10 z\" fill=\"currentColor\" />\n      </marker>\n    </defs>\n\n    {/* left: decompressor loop */}\n    <rect x=\"20\" y=\"50\" width=\"200\" height=\"120\" rx=\"10\" fill=\"none\" stroke=\"currentColor\" strokeOpacity=\"0.5\" />\n    <text x=\"120\" y=\"78\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"14\" fill=\"currentColor\">decompressor</text>\n    <text x=\"120\" y=\"98\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"currentColor\" opacity=\"0.6\">while (bytes) emit(c)</text>\n    {/* loop arrow */}\n    <path d=\"M 92 120 A 28 28 0 1 1 148 120\" fill=\"none\" stroke=\"currentColor\" strokeWidth=\"1.5\" markerEnd=\"url(#cf-arrow)\" />\n    <text x=\"120\" y=\"128\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"currentColor\" opacity=\"0.6\">loop</text>\n    <text x=\"120\" y=\"190\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"currentColor\" opacity=\"0.55\">wants to push</text>\n\n    {/* right: parser loop */}\n    <rect x=\"380\" y=\"50\" width=\"200\" height=\"120\" rx=\"10\" fill=\"none\" stroke=\"currentColor\" strokeOpacity=\"0.5\" />\n    <text x=\"480\" y=\"78\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"14\" fill=\"currentColor\">parser</text>\n    <text x=\"480\" y=\"98\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"11\" 
fill=\"currentColor\" opacity=\"0.6\">while (chars) use(c)</text>\n    <path d=\"M 452 120 A 28 28 0 1 1 508 120\" fill=\"none\" stroke=\"currentColor\" strokeWidth=\"1.5\" markerEnd=\"url(#cf-arrow)\" />\n    <text x=\"480\" y=\"128\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"currentColor\" opacity=\"0.6\">loop</text>\n    <text x=\"480\" y=\"190\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"currentColor\" opacity=\"0.55\">wants to pull</text>\n\n    {/* the clash in the middle */}\n    <line x1=\"232\" y1=\"104\" x2=\"368\" y2=\"104\" stroke=\"currentColor\" strokeWidth=\"1.5\" markerEnd=\"url(#cf-arrow)\" opacity=\"0.8\" />\n    <line x1=\"368\" y1=\"124\" x2=\"232\" y2=\"124\" stroke=\"currentColor\" strokeWidth=\"1.5\" markerEnd=\"url(#cf-arrow)\" opacity=\"0.8\" />\n    <text x=\"300\" y=\"150\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"22\" fill=\"currentColor\" fontWeight=\"bold\">?</text>\n    <text x=\"300\" y=\"172\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"currentColor\" opacity=\"0.55\">who calls whom</text>\n  </svg>\n</Diagram>\n\nWhichever one you make a *callee*, you have to turn inside-out: rip out its loop,\nhoist its locals into `static` state, and reconstruct \"where was I?\" by hand every\ntime it's called. The algorithm disappears into a state machine.\n\nA **coroutine** is the escape hatch: a function you can `return` from *in the middle*\nand later resume *exactly where it left off*, locals and loop position intact. C\ndoesn't have them. But — as Simon Tatham showed in his\n[classic note](https://www.chiark.greenend.org.uk/~sgtatham/coroutines.html) — you\ncan fake them with a `switch` statement and one preprocessor macro.\n\n## The painful version first\n\nHere's that decompressor rewritten as a callee the honest way — a hand-rolled state\nmachine. 
It works, and it's miserable:\n\n```c\nint decompressor(void) {\n  static int state = 0, len, c;\n  switch (state) {\n    case 0:                 /* fresh start */\n      while (1) {\n        c = getchar();\n        if (c == EOF) return EOF;\n        if (c == 0xFF) {    /* run-length escape */\n          len = getchar();\n          c = getchar();\n          while (len--) {\n            state = 1; return c;   /* <-- emit, remember we're here */\n            case 1: ;              /* <-- ...come back to here */\n          }\n        } else {\n          state = 2; return c;\n          case 2: ;\n        }\n      }\n  }\n}\n```\n\nEvery `return` needs a unique number, a matching `case`, and an assignment to\n`state`. Add a branch and you renumber everything. The bookkeeping *is* the bug\nsurface.\n\n<Callout type=\"note\">\n  Notice the `case 1:` sitting **inside** the `while` loop, underneath a `switch`\n  that's outside it. That's legal C — `case` labels can live in any sub-block of a\n  `switch`. This is the same quirk that powers Duff's device, and it's the whole\n  trick.\n</Callout>\n\n## The insight: let `__LINE__` be the state\n\nThe numbers are pure noise. We never *read* them — we only need each `return` to\nhave a label unique to its position, and a way to jump back to it. The C preprocessor\nalready hands out a unique number per position: `__LINE__`.\n\nSo: on the way out, save `__LINE__`. On the way back in, `switch` on the saved value\nand let a `case __LINE__:` right after the `return` catch it. Three macros:\n\n```c\n#define crBegin     static int state = 0; switch (state) { case 0:\n#define crReturn(x) do { state = __LINE__; return x; \\\n                         case __LINE__: ; } while (0)\n#define crFinish    }\n```\n\nThat's the entire idea. `crBegin` opens a `switch` on the saved state. `crReturn`\nstamps the current line into `state`, returns, and drops a `case` label at that exact\nline so the next call resumes one statement later. 
`crFinish` closes the brace.\n\n## Watch it run\n\nA three-value generator — `next()` returns 0, 1, 2, then -1 — makes the control flow\nvisible. Step through it: watch `state` get stamped with a line number on the way out,\nand the `switch` teleport straight back into the middle of the `for` loop on the way\nback in.\n\n<CoroutineStepper />\n\nThe magic moment is the jump from `switch (state)` to `case __LINE__:` *inside* the\nloop. The function never \"starts over\" — it lands back exactly where it returned, with\n`i` right where it was.\n\n## How the macros expand\n\nIt reads like ordinary code, but here's what the preprocessor actually produces, one\nlayer at a time:\n\n<StepThrough titles={[\"you write\", \"expand crBegin\", \"expand crReturn\", \"what runs\"]}>\n\nYou write the coroutine in its natural, loop-shaped form:\n\n```c\nint next(void) {\n  static int i;\n  crBegin;\n  for (i = 0; i < 3; i++)\n    crReturn(i);\n  crFinish;\n  return -1;\n}\n```\n\n`crBegin` becomes a `switch` on the saved state, entered at `case 0` on the first call:\n\n```c\nint next(void) {\n  static int i;\n  static int state = 0; switch (state) { case 0:\n  for (i = 0; i < 3; i++)\n    crReturn(i);\n  }\n  return -1;\n}\n```\n\n`crReturn(i)` stamps the line number, returns, and leaves a `case` label for that same line number:\n\n```c\nfor (i = 0; i < 3; i++) {\n  state = __LINE__; return i;\n  case __LINE__: ;\n}\n```\n\nSo the next call jumps from `switch (state)` *directly* to that `case` — back inside\nthe `for` loop, with `i` preserved. No re-entry, no restart:\n\n```c\nswitch (state) {     /* state == that line number */\n  case 0: ...\n  case 17: ;         /* <-- lands here, mid-loop */\n}\n```\n\n</StepThrough>\n\n## Where it bites\n\nThis is a beautiful hack, and like every beautiful hack it has sharp edges. 
Tatham is\ncandid about them, and you should be too:\n\n<Callout type=\"warn\">\n  **Only `static` locals survive.** A normal `auto` variable is undefined after a\n  `crReturn` — its storage isn't preserved across the return. Loop counters and any\n  state you care about must be `static`. **One `crReturn` per line** (two share a\n  `__LINE__` and collide). And you **can't put a `crReturn` inside your own `switch`** — it\n  would capture the `case` labels meant for the coroutine.\n</Callout>\n\nThe `static` rule hides a worse problem: `static` means *one shared instance*. Two\ncallers can't run the same coroutine independently — they'd stomp each other's `state`\nand `i`. Fine for a single global decompressor; fatal for anything reentrant or\nthreaded.\n\n## Making it reentrant\n\nThe fix is to stop using `static` and instead thread all the state through a context\nstruct the caller owns. Every \"serious\" local becomes a field; the macros read and\nwrite `ctx->state` instead of a function-local `static`:\n\n```c\nstruct coro {\n  int state;\n  int i, len, c;   /* everything that must survive a yield */\n};\n\n#define crBegin(ctx)     switch ((ctx)->state) { case 0:\n#define crReturn(ctx, x) do { (ctx)->state = __LINE__; return x; \\\n                              case __LINE__: ; } while (0)\n#define crFinish         }\n\nint next(struct coro *ctx) {\n  crBegin(ctx);\n  for (ctx->i = 0; ctx->i < 3; ctx->i++)\n    crReturn(ctx, ctx->i);\n  crFinish;\n  return -1;\n}\n```\n\nNow each caller allocates its own zero-initialised `struct coro` (so `state` starts at 0), and you can run a hundred independent\ngenerators at once. The price is cosmetic — `ctx->i` everywhere you'd have written\n`i` — and Tatham's own verdict is the honest one: *\"virtually all your serious\nvariables become elements of the coroutine context structure.\"* You trade a little\nsyntax for reentrancy. 
Usually worth it.\n\n## Why this matters beyond the trick\n\nYou don't reach for these macros often — real codebases use explicit state machines,\nthreads, or a language with `async`/`yield` built in. But the idea underneath is worth\nkeeping: **a coroutine is just a state machine where the compiler tracks the state for\nyou.** `async/await` in Rust, generators in Python, goroutines parked on a channel —\nall of them are, at bottom, \"save where I am, return, resume later.\" Tatham's macro is\nthat idea stripped to its absolute minimum: one `switch`, one `__LINE__`, and the\nnerve to put a `case` label inside a loop.\n\n---\n\n*Built on Simon Tatham's [Coroutines in C](https://www.chiark.greenend.org.uk/~sgtatham/coroutines.html) (2000) — still the clearest thing ever written on the subject.*\n","readingTimeMins":7,"url":"https://ai.thesatyajit.com/articles/coroutines-in-c","signal":{"interest":4,"helpful":5,"score":9,"level":5,"label":"Essential"}},{"title":"How self-attention works in transformers","description":"A from-scratch explainer of scaled dot-product attention — queries, keys, values, the softmax, and why the √d scaling matters.","date":"2026-06-02","tags":["transformers","deep-learning","explainer"],"draft":false,"featured":false,"interest":3,"helpful":5,"kind":"articles","slug":"how-transformers-attention-works","body":"Self-attention is the single mechanism that lets a transformer decide, for every\ntoken in a sequence, which other tokens are worth listening to. Older architectures\nlike RNNs squeezed an entire sentence through a fixed-size hidden state and read it\nleft to right. Attention throws that bottleneck out: every token can look directly\nat every other token in one parallel step, and it learns *how much* to look.\n\nThe trick is to give each token three learned vectors. 
The **query** asks a question\n(\"what am I looking for?\"), the **key** advertises what a token offers (\"here is what\nI am about\"), and the **value** is the actual content that gets passed along once a\nmatch is found. You compute these by multiplying the input embeddings by three\nlearned weight matrices, $W_Q$, $W_K$, and $W_V$, giving matrices $Q$, $K$, and $V$.\n\nA token attends to another by comparing its query against that token's key with a\ndot product — a large dot product means the two vectors point in a similar direction,\nso the question and the offer line up. Do this for every query against every key and\nyou get a full grid of raw compatibility scores.\n\n$$\n\\text{Attention}(Q, K, V) = \\text{softmax}\\!\\left(\\frac{Q K^{\\top}}{\\sqrt{d_k}}\\right) V\n$$\n\nThat one line is the whole operation. The matrix below shows the resulting weights\nfor a tiny three-token sequence: each row is one query token, each column is a key it\nmight attend to, and the cell shading is how much weight that pair receives after the\nsoftmax. Hover a row to see where that token looks.\n\n<AttentionMatrix tokens={[\"the\", \"cat\", \"sat\"]} />\n\nIt helps to walk the formula from the inside out. Each step below takes the previous\nresult and transforms it; together they go from raw vectors to a context-aware output.\n\n<StepThrough titles={[\"scores\", \"weights\", \"mix\"]}>\n\n**Q·Kᵀ — raw scores.** Multiply the query matrix by the transpose of the key matrix.\nThe entry at row *i*, column *j* is the dot product of token *i*'s query with token\n*j*'s key — an unnormalised score for how relevant token *j* is to token *i*. The\nresult is a square matrix, one score for every ordered pair of tokens.\n\n**Scale, then softmax — attention weights.** Divide every score by $\\sqrt{d_k}$, the\nsquare root of the key dimension. 
Without this, large dimensions produce dot products\nwith a big variance, pushing the softmax into saturated regions where gradients\nvanish; the scaling keeps the distribution well-behaved. Then apply softmax across\neach row so the weights are non-negative and sum to one — a proper distribution over\n\"where this token attends.\"\n\n**Weighted sum — the output.** Multiply the weight matrix by the value matrix $V$.\nEach output row is a weighted average of all value vectors, blended according to that\ntoken's attention weights. A token that attended strongly to \"cat\" inherits most of\n\"cat\"'s value, so its new representation is now informed by the context around it.\n\n</StepThrough>\n\nStack several of these in parallel — each with its own $W_Q$, $W_K$, $W_V$ — and you\nget **multi-head attention**, where different heads specialise in different relations\n(syntax, coreference, positional patterns). Concatenate the heads, project once more,\nand that becomes one transformer sub-layer. Repeat across depth and the model builds\nincreasingly abstract, context-rich representations of the sequence.\n\n<Callout type=\"tip\">\n  The √dₖ scaling is easy to skip when implementing attention from scratch, but\n  dropping it is one of the most common reasons a hand-rolled transformer trains\n  slowly or not at all — the softmax saturates and gradients stop flowing.\n</Callout>\n\nThat is the entire idea: project tokens into queries, keys, and values; score every\npair with a scaled dot product; turn the scores into a distribution with softmax; and\nread out a weighted mix of values. Everything else in a transformer — feed-forward\nlayers, residual connections, layer norm, positional encodings — exists to support\nand stack this one operation.\n","readingTimeMins":3,"url":"https://ai.thesatyajit.com/articles/how-transformers-attention-works","signal":{"interest":3,"helpful":5,"score":8,"level":4,"label":"High"}}]