2026-06-28 · 11 min · llm · diffusion · language-models · architecture · explainer
Almost every language model you use is autoregressive: it factorizes text left-to-right, $p(x) = \prod_{i=1}^{L} p(x_i \mid x_{<i})$, with a causal attention mask, and generates one token per forward pass. It works so well that the alternatives barely get airtime.
iLLaDA is one of the alternatives, scaled up until it's hard to ignore. It's an 8B masked diffusion language model with fully bidirectional attention, trained from scratch on 12 trillion tokens by a team from Renmin University and ByteDance Seed — the direct successor to LLaDA. No causal mask, no left-to-right factorization. The question it's built to answer: can a bidirectional diffusion model, trained from scratch, actually keep up with a strong autoregressive model? The honest answer turns out to be yes for base models, not yet for instruct — and the path to that answer is worth understanding.
How a masked diffusion language model works
Forget Gaussian noise. The "diffusion" here is masking — a discrete, absorbing-state process over tokens.
The forward process: corrupt by masking
Pick a masking ratio $t \in [0, 1]$. Each token is independently replaced by a special [MASK] token with probability $t$. At $t = 0$ the sequence is clean; at $t = 1$ it's fully masked.
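The model sees the corrupted sequence and is trained to predict the original tokens at every masked position at once. The objective is a masked cross-entropy, computed only on masked positions and reweighted by $1/t$:

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{t,\,x_0,\,x_t}\left[\frac{1}{t}\sum_{i=1}^{L}\mathbf{1}\left[x_t^i = \text{[MASK]}\right]\log p_\theta\left(x_0^i \mid x_t\right)\right]$$

Here $x_0$ is the clean sequence of length $L$ and $x_t$ is its masked version at ratio $t$. The indicator $\mathbf{1}[\cdot]$ restricts the loss to masked positions; the factor $1/t$ re-normalizes so heavily- and lightly-masked samples both contribute correctly. Averaged over $t$, this is a Monte-Carlo upper bound on the negative log-likelihood: a principled training objective, not a heuristic. iLLaDA keeps this same objective through pre-training and supervised fine-tuning.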
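To make the masking and the $1/t$ weighting concrete, here is a minimal sketch in plain Python. It is not the authors' training code: `forward_mask`, `masked_diffusion_loss`, and the flat 0.25 "model probability" are stand-ins invented for illustration, and a real implementation would work on batched token IDs with an actual network.

```python
import math
import random

MASK = "[MASK]"

def forward_mask(tokens, t, rng):
    """Forward process: replace each token with MASK independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]

def masked_diffusion_loss(noisy, t, prob_of_true):
    """1/t-weighted cross-entropy, counted on masked positions only.

    prob_of_true[i] is the probability a (hypothetical) model assigns to the
    original token at position i, given the noisy sequence. Requires t > 0.
    """
    total = 0.0
    for noisy_tok, p in zip(noisy, prob_of_true):
        if noisy_tok == MASK:  # the indicator: visible positions contribute nothing
            total += -math.log(p)
    return total / t

rng = random.Random(0)
tokens = "the cat sat on the mat".split()
t = 0.5
noisy = forward_mask(tokens, t, rng)

# Stand-in for a real model: pretend every original token gets probability 0.25.
probs = [0.25] * len(tokens)
loss = masked_diffusion_loss(noisy, t, probs)
print(noisy, round(loss, 3))
```

In training you would sample a fresh $t$ per sequence and average this quantity over many samples; that average is what approximates the bound.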
Bidirectional attention comes for free
An autoregressive model must hide the future: if token $i$ could attend to token $i+1$, it would just read the answer it's supposed to predict. A masked diffusion model predicts masked positions anywhere in the sequence, not "the next" one, so there's nothing to hide. Every position attends to every other, left and right.
That full context is the structural argument for diffusion LMs: on infilling and tasks where later text disambiguates earlier text, seeing both sides at every layer should help.
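The difference between the two attention patterns is easy to see as a pair of boolean matrices. This is only an illustrative sketch: the helper names are made up, and real implementations pass an equivalent mask (or none at all) into the attention kernel.

```python
def causal_mask(n):
    """allowed[q][k] is True when query position q may attend to key position k."""
    return [[k <= q for k in range(n)] for q in range(n)]

def bidirectional_mask(n):
    """No restriction: every query sees every key, future included."""
    return [[True] * n for _ in range(n)]

def render(mask):
    """Draw the mask as rows of 'x' (visible) and '.' (hidden)."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)

n = 6
print("causal:")
print(render(causal_mask(n)))
print("bidirectional:")
print(render(bidirectional_mask(n)))
```

The causal matrix is lower-triangular, so position 0 sees one token and position 5 sees six; the bidirectional matrix is all `x`, which is the full-context property the section describes.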
Generation: unmask in parallel, over a few steps
Generation runs the process backward. Start from a block of all-[MASK] tokens. Each denoising step, the model predicts every masked position, commits its most confident predictions, and re-masks the low-confidence ones to try again next step. A whole block resolves over a handful of steps, in confidence order rather than reading order.
Diffusion decodes the whole block at once and commits its most confident predictions each step, so a 12-token answer lands in ~4 passes — and because there's no causal mask, every position conditions on the entire sequence, left and right. The cost is that quality depends on how many denoising steps you spend.
This is the crux. Autoregression spends one forward pass per output token, in series. Diffusion spends a fixed, smaller number of denoising passes over the whole block — the promise being fewer sequential steps, at the cost of needing enough steps for quality.
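Here is a toy version of that decoding loop, under loud assumptions: `toy_predict` is a fake denoiser that already knows the target and invents its confidence scores, and the "commit an equal share per step" schedule is just one simple choice, not necessarily the sampler iLLaDA uses. The point is only the control flow: one forward pass per step, commit the top-confidence positions, leave the rest masked.

```python
import math

MASK = "[MASK]"

def toy_predict(seq, target):
    """Fake denoiser: returns {position: (token, confidence)} for each masked slot.

    Confidence is invented: it rises when neighbouring tokens are already revealed.
    """
    preds = {}
    for i, tok in enumerate(seq):
        if tok != MASK:
            continue
        neighbours = [seq[j] for j in (i - 1, i + 1) if 0 <= j < len(seq)]
        revealed = sum(n != MASK for n in neighbours)
        confidence = 0.3 + 0.3 * revealed + 0.01 * ((i * 7) % 5)
        preds[i] = (target[i], confidence)
    return preds

def diffusion_decode(length, steps, predict):
    """Unmask a block of `length` tokens in at most `steps` forward passes."""
    seq = [MASK] * length
    passes = 0
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        preds = predict(seq)  # one forward pass scores every masked position
        passes += 1
        k = math.ceil(len(masked) / (steps - step))  # equal share per remaining step
        ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:k]:  # commit the most confident predictions
            seq[i] = preds[i][0]
        # positions not committed simply stay masked for the next pass
    return seq, passes

target = "masked diffusion models fill in the most confident tokens first each step".split()
decoded, passes = diffusion_decode(len(target), steps=4,
                                   predict=lambda s: toy_predict(s, target))
print(passes, "passes for", len(target), "tokens")
```

An autoregressive decoder would need one pass per token for the same output. The trade-off is the `steps` knob: fewer steps means more tokens committed per pass on less context.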
What iLLaDA changes over LLaDA
iLLaDA is, more than anything, a careful scale-up of LLaDA — proof that the recipe keeps paying off with more tokens and a better post-training pass.
A bigger, leaner backbone
The architecture is a standard dense Transformer (RMSNorm, SwiGLU, RoPE, no biases), but re-tuned for cheaper inference:
| | iLLaDA | LLaDA |
|---|---|---|
| Attention heads | 32 | 32 |
| Key/Value heads | 8 (GQA) | 32 (MHA) |
| FFN dim | 14,336 | 12,288 |
| Vocabulary | 155,136 | 126,464 |
| Max sequence length | 8192 | 4096 |
| Embedding / LM head | tied | untied |
| Total parameters | 7.62B | 8.02B |
The load-bearing change is grouped-query attention (8 KV heads instead of 32), adopted to shrink the cached key/value footprint at inference — plus a larger vocab, doubled context, and tied embeddings.
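A quick back-of-the-envelope for why the KV-head count matters. The layer count and head dimension below are assumptions picked for illustration (the table above does not list them), so the absolute sizes are hypothetical; the 4× ratio between 32 and 8 KV heads holds regardless of those choices.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of cached keys and values for one sequence (2 bytes ~ fp16/bf16)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value  # 2 = K and V

N_LAYERS = 32    # assumption, not taken from the article
HEAD_DIM = 128   # assumption, not taken from the article
SEQ_LEN = 8192   # iLLaDA's max sequence length from the table

mha = kv_cache_bytes(N_LAYERS, 32, HEAD_DIM, SEQ_LEN)  # 32 KV heads (LLaDA-style MHA)
gqa = kv_cache_bytes(N_LAYERS, 8, HEAD_DIM, SEQ_LEN)   # 8 KV heads (iLLaDA-style GQA)

print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB, ratio: {mha / gqa:.0f}x")
```

Both configurations are evaluated at the same sequence length here to isolate the effect of the head count.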
The headline spend: 12T tokens
- Pre-training: 12T tokens, up ~5.2× from LLaDA's 2.3T. AdamW with weight decay 0.1; the learning rate is warmed up, held at its peak, then cosine-decayed.
- SFT: a 25B-token instruction corpus for 12 epochs. The new wrinkle: SFT now applies the same masking as pre-training across the entire sequence (prompt, response, EOS), rather than keeping the prompt fully visible — a more consistent objective end to end.
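The SFT change is small in code terms: it is only a question of which positions are eligible for masking. The sketch below contrasts the two options; `mask_for_sft`, the `mask_prompt` flag, and the `<EOS>` token string are hypothetical names for illustration, not the paper's implementation.

```python
import random

MASK = "[MASK]"
EOS = "<EOS>"  # placeholder end-of-sequence token, name assumed

def mask_for_sft(prompt, response, t, rng, mask_prompt):
    """Build one noisy SFT sample at masking ratio t.

    mask_prompt=False keeps the prompt fully visible and masks only response + EOS.
    mask_prompt=True applies the pre-training masking to the entire sequence,
    which is the scheme described above for iLLaDA.
    """
    seq = prompt + response + [EOS]
    n_visible = 0 if mask_prompt else len(prompt)
    return [tok if i < n_visible or rng.random() >= t else MASK
            for i, tok in enumerate(seq)]

prompt = ["what", "is", "2+2", "?"]
response = ["4"]
print(mask_for_sft(prompt, response, 0.5, random.Random(0), mask_prompt=False))
print(mask_for_sft(prompt, response, 0.5, random.Random(0), mask_prompt=True))
```

With `mask_prompt=True` the model sometimes has to reconstruct prompt tokens too, so the SFT objective has the same form as the pre-training one.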
The fine-tuning also clearly hadn't saturated: the SFT-epoch ablation rises monotonically through all 12 epochs (they stopped on compute, not convergence).
Two inference-side tricks
- Variable-length generation. Instead of committing to a fixed output block and denoising all of it, iLLaDA appends a mask block, runs the sampler, commits confident tokens, and continues until termination — so it only denoises as many positions as the answer needs, rather than padding to a worst case. (The paper argues the efficiency, but — notably — does not report latency or step-count numbers for it.)
- Confidence-based multiple-choice scoring. Rather than a likelihood estimate, they score a candidate by revealing its tokens one at a time, each step unmasking the highest-confidence position, and summing the log-probs of the revealed tokens. The authors are upfront that this is "not a likelihood estimate" but a task-specific surrogate. Its ablation is modest: +1.3 PIQA, +0.6 ARC-C, +2.3 HellaSwag over likelihood scoring.
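The scoring trick in the second bullet can be sketched as a short loop. Two assumptions to flag: "confidence" is taken here to mean the probability the model assigns to the candidate's own token at a masked position, which is one plausible reading of the description, and `toy_token_prob` is an invented stand-in for a real model call.

```python
import math

MASK = "[MASK]"

def confidence_score(prompt, candidate, token_prob):
    """Reveal candidate tokens one at a time, most confident first, summing log-probs.

    token_prob(seq, i, token) -> probability of `token` at position i given `seq`
    (a hypothetical interface standing in for one forward pass).
    """
    seq = list(prompt) + [MASK] * len(candidate)
    offset = len(prompt)
    score = 0.0
    while MASK in seq:
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        conf = {i: token_prob(seq, i, candidate[i - offset]) for i in masked}
        best = max(masked, key=lambda i: conf[i])  # highest-confidence position
        score += math.log(conf[best])
        seq[best] = candidate[best - offset]       # reveal it, then re-score the rest
    return score

LIKELY = {"is", "paris"}  # toy knowledge baked into the fake model

def toy_token_prob(seq, i, token):
    """Fake model: likely tokens score higher, more so once the left neighbour is visible."""
    base = 0.6 if token in LIKELY else 0.1
    left_revealed = i > 0 and seq[i - 1] != MASK
    return min(0.95, base + (0.3 if left_revealed else 0.0))

prompt = "which city is the capital of france ?".split()
score_a = confidence_score(prompt, ["it", "is", "paris"], toy_token_prob)
score_b = confidence_score(prompt, ["it", "is", "berlin"], toy_token_prob)
print(score_a, score_b)
```

The candidate with the higher score is picked as the answer. Because the reveal order depends on the model's own confidence, the sum is a ranking heuristic rather than a log-likelihood, which matches the authors' caveat.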
The results, honestly
Two stories live in these tables, and they point in different directions.
Base models: genuine parity with Qwen2.5
As a base model, iLLaDA improves broadly over LLaDA and lands even with Qwen2.5-7B on average, winning several benchmarks outright.
The per-benchmark picture, with the gains over LLaDA that the abstract leads on:
| Base | iLLaDA | LLaDA | Qwen2.5 | Δ vs LLaDA |
|---|---|---|---|---|
| MMLU | 74.8 | 65.9 | 71.9 | +8.9 |
| BBH | 71.3 | 49.7 | 63.9 | +21.6 |
| ARC-C | 60.8 | 45.9 | 51.5 | +14.9 |
| HellaSwag | 76.6 | 70.5 | 79.0 | +6.1 |
| GSM8K | 81.9 | 70.3 | 78.9 | +11.6 |
| MATH | 38.4 | 31.4 | 41.1 | +7.0 |
| HumanEval | 50.0 | 35.4 | 56.7 | +14.6 |
| MBPP | 57.8 | 40.0 | 63.6 | +17.8 |
iLLaDA-Base beats Qwen2.5-Base on MMLU, BBH, ARC-C, and GSM8K; Qwen still wins on HellaSwag, MATH, and code. But the average edges ahead — and that's the real result: a from-scratch bidirectional diffusion model matching a strong autoregressive base.
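As a sanity check on "edges ahead", the averages can be recomputed from the eight rows shown in the table above. This covers only those rows; the paper's own average may be taken over a different benchmark set, so treat the means as a check on this table rather than as the paper's headline figures.

```python
# (iLLaDA, LLaDA, Qwen2.5) base-model scores, copied from the table above
BASE = {
    "MMLU":      (74.8, 65.9, 71.9),
    "BBH":       (71.3, 49.7, 63.9),
    "ARC-C":     (60.8, 45.9, 51.5),
    "HellaSwag": (76.6, 70.5, 79.0),
    "GSM8K":     (81.9, 70.3, 78.9),
    "MATH":      (38.4, 31.4, 41.1),
    "HumanEval": (50.0, 35.4, 56.7),
    "MBPP":      (57.8, 40.0, 63.6),
}

def column_mean(rows, col):
    """Unweighted mean of one model's column across all listed benchmarks."""
    return sum(scores[col] for scores in rows.values()) / len(rows)

illada, llada, qwen = (column_mean(BASE, c) for c in range(3))
illada_wins = [name for name, (a, _, q) in BASE.items() if a > q]

print(f"iLLaDA {illada:.2f}  LLaDA {llada:.2f}  Qwen2.5 {qwen:.2f}")
print("iLLaDA ahead of Qwen2.5 on:", ", ".join(illada_wins))
```

On these rows the iLLaDA and Qwen2.5 means differ by well under a point, so "parity" is the right word for it.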
Instruct models: the gap that's left
This is the part the abstract's "competitive on several benchmarks" softens. After instruction tuning, iLLaDA trails Qwen2.5 by ~10 points on average, with double-digit gaps on the hard reasoning and coding tasks:
| Instruct | iLLaDA | LLaDA | Qwen2.5 | Δ vs LLaDA |
|---|---|---|---|---|
| MMLU | 71.6 | 65.5 | 76.6 | +6.1 |
| MMLU-Pro | 52.3 | 37.0 | 56.3 | +15.3 |
| GSM8K | 89.0 | 77.5 | 91.6 | +11.5 |
| MATH | 56.7 | 42.2 | 75.5 | +14.5 |
| HumanEval | 65.9 | 49.4 | 84.8 | +16.5 |
| MBPP | 58.0 | 41.0 | 79.2 | +17.0 |
The improvement over LLaDA is huge and real (+12.6 average). The gap to Qwen on MATH (56.7 vs 75.5), HumanEval (65.9 vs 84.8), and MBPP (58.0 vs 79.2) is also real, and the authors don't hide it — they point to the lack of RL alignment as part of the cause.
What I make of it
- The base-model result is the one that matters, and it's solid. A bidirectional masked diffusion LM, trained from scratch, reaching autoregressive base parity is a genuine data point: diffusion LMs scale like AR LMs. The paradigm is viable, not a curiosity.
- "Competitive" is doing some work in the abstract. On instruction-tuned MATH, HumanEval, and MBPP, AR still wins by roughly 19 to 21 points. Read the instruct tables before repeating the headline.
- The efficiency case is asserted, not measured. GQA and variable-length generation are motivated by cost, but the paper reports no sampling-step counts, no latency, no tokens/sec — and the number of denoising passes is exactly diffusion's central liability. "More efficient" is a design argument here, not a demonstrated result.
- Parity wasn't cheap. 12T tokens at 8B is a frontier-scale data spend, ~5× LLaDA and on par with what strong AR models of this size consume. iLLaDA shows diffusion can reach AR base parity — by paying full AR-scale training cost, and still trailing after post-training. There's also an honest failure mode noted: the sampler can fall into repetitive reasoning loops that need inference-time mitigation.
The fair summary: bidirectional masked diffusion is now a scalable paradigm at parity with autoregressive base models — no longer something you can wave off — but not yet a proven win on post-trained quality, and not yet a proven efficiency advantage. That's a meaningful place to have gotten to, stated without the gloss.
Built on Improved Large Language Diffusion Models (Nie et al., Renmin University & ByteDance Seed, 2026) and its predecessor LLaDA. All numbers are from the paper's Tables 1–3; the SFT-epoch curves are redrawn from its Figure 1.