iLLaDA: how far a masked-diffusion language model scales

2026-06-28 · 11 min · llm · diffusion · language-models · architecture · explainer

Almost every language model you use is autoregressive: it factorizes text left-to-right, $p(x) = \prod_i p(x_i \mid x_{<i})$, with a causal attention mask, and generates one token per forward pass. It works so well that the alternatives barely get airtime.

iLLaDA is one of the alternatives, scaled up until it's hard to ignore. It's an 8B masked diffusion language model with fully bidirectional attention, trained from scratch on 12 trillion tokens by a team from Renmin University and ByteDance Seed — the direct successor to LLaDA. No causal mask, no left-to-right factorization. The question it's built to answer: can a bidirectional diffusion model, trained from scratch, actually keep up with a strong autoregressive model? The honest answer turns out to be yes for base models, not yet for instruct — and the path to that answer is worth understanding.

How a masked diffusion language model works

Forget Gaussian noise. The "diffusion" here is masking — a discrete, absorbing-state process over tokens.

The forward process: corrupt by masking

Pick a masking ratio $t \sim \mathcal{U}[0,1]$. Each token is independently replaced by a special [MASK] token with probability $t$. At $t=0$ the sequence is clean; at $t=1$ it's fully masked. Raising $t$ masks more tokens and lowers the $1/t$ weight that the training loss applies to them.

Forward masking process (interactive figure in the original), shown at masking ratio t = 0.40: masked [MASK] predicts the [MASK] tokens [MASK] every masked [MASK] in one [MASK]. Five of the 13 tokens are masked; the loss is scored on those five positions only and weighted by 1/t = 2.50.

The model $p_\theta$ sees the corrupted sequence $x_t$ and is trained to predict the original tokens at every masked position at once. The objective is a masked cross-entropy, computed only on masked positions and reweighted by $1/t$:

$$\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{t,\,x_0,\,x_t}\!\left[\frac{1}{t}\sum_{i=1}^{L} \mathbf{1}\!\left[x_t^{i} = \mathrm{M}\right]\,\log p_\theta\!\left(x_0^{i} \mid x_t\right)\right]$$

The indicator $\mathbf{1}[x_t^i = \mathrm{M}]$ restricts the loss to masked positions; the $1/t$ factor re-normalizes so heavily- and lightly-masked samples both contribute correctly. Averaged over $t$, this is a Monte-Carlo upper bound on the negative log-likelihood: a principled training objective, not a heuristic. iLLaDA keeps this same objective through pre-training and supervised fine-tuning.
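To make the objective concrete, here is a minimal sketch of one training sample: mask each token with probability t, then sum the log-probabilities at the masked positions and scale by 1/t. The names `forward_mask`, `masked_diffusion_loss`, and `uniform_log_prob` are illustrative and not taken from the paper's code, and the "model" is a stand-in that spreads probability uniformly over an assumed 1000-token vocabulary.

```python
import math
import random

MASK = "[MASK]"  # placeholder symbol; a real model uses a reserved token id


def forward_mask(tokens, t, rng):
    """Independently replace each token with MASK with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def masked_diffusion_loss(xt, t, log_prob):
    """One-sample estimate: -(1/t) * sum of log p over masked positions (t > 0).

    log_prob(i, xt) is a hypothetical callback returning
    log p_theta(x0[i] | xt) for the original token at position i.
    """
    total = sum(log_prob(i, xt) for i, tok in enumerate(xt) if tok == MASK)
    return -total / t


def uniform_log_prob(vocab_size):
    """Stand-in 'model' assigning probability 1/vocab_size to every token."""
    return lambda i, xt: -math.log(vocab_size)


rng = random.Random(0)
x0 = "masked diffusion predicts the original tokens at every masked position in one pass".split()
t = 0.4
xt = forward_mask(x0, t, rng)
loss = masked_diffusion_loss(xt, t, uniform_log_prob(vocab_size=1000))
print(xt)
print(f"masked {xt.count(MASK)}/{len(xt)}, weight 1/t = {1 / t:.2f}, loss = {loss:.3f}")
```

In real training the expectation is approximated by drawing a fresh t and a fresh mask for every sequence in the batch.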

Bidirectional attention comes for free

An autoregressive model must hide the future: if token $i$ could attend to token $i{+}1$, it would just read the answer it's supposed to predict. A masked diffusion model predicts masked positions anywhere in the sequence, not "the next" one, so there's nothing to hide. Every position attends to every other, left and right.

Attention mask (interactive figure in the original). Bidirectional: every query attends to all 6 positions, future included.

That full context is the structural argument for diffusion LMs: on infilling and tasks where later text disambiguates earlier text, seeing both sides at every layer should help.
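The difference between the two attention patterns is easy to see as a pair of boolean matrices. This is only an illustrative sketch with made-up helper names; real implementations build these as tensors inside the attention kernel.

```python
def causal_mask(n):
    """allowed[i][j] is True when query i may attend to key j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every query may attend to every key, including later positions."""
    return [[True] * n for _ in range(n)]


def render(mask):
    """Draw the mask as a grid: 'x' = may attend, '.' = hidden."""
    return "\n".join(" ".join("x" if ok else "." for ok in row) for row in mask)


n = 6
print("causal (autoregressive):")
print(render(causal_mask(n)))
print("bidirectional (masked diffusion):")
print(render(bidirectional_mask(n)))
```

For 6 positions the causal mask allows 21 query-key pairs (the lower triangle), while the bidirectional mask allows all 36.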

Generation: unmask in parallel, over a few steps

Generation runs the process backward. Start from a block of all-[MASK] tokens. At each denoising step, the model predicts every masked position, commits its most confident predictions, and re-masks the low-confidence ones to try again at the next step. A whole block resolves over a handful of steps, in confidence order, not reading order.

Generation order (interactive figure in the original), on the 12-token block "a diffusion LM unmasks many tokens at once not one by one". Diffusion decodes the whole block at once and commits its most confident predictions each step, so the 12 tokens land in about 4 passes, with bidirectional attention throughout. The cost is that quality depends on how many denoising steps you spend.

This is the crux. Autoregression spends one forward pass per output token, in series. Diffusion spends a fixed, smaller number of denoising passes over the whole block — the promise being fewer sequential steps, at the cost of needing enough steps for quality.
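The decoding loop described above can be sketched in a few lines. Everything here is a toy under stated assumptions: `toy_predict` is a hypothetical stand-in for the model (it looks up the target and attaches a random confidence), and committing an equal share of the remaining positions at each step is a schedule chosen for illustration, not necessarily the one the paper uses.

```python
import math
import random

MASK = "[MASK]"


def toy_predict(block, target, rng):
    """Hypothetical stand-in for the model: for every masked position return
    (predicted_token, confidence). A real model computes both from the block."""
    return {i: (target[i], rng.random()) for i, tok in enumerate(block) if tok == MASK}


def diffusion_decode(target, num_steps, rng):
    """Start fully masked; each step commits the most confident predictions."""
    block = [MASK] * len(target)
    passes = 0
    for step in range(num_steps):
        preds = toy_predict(block, target, rng)  # one forward pass over the block
        passes += 1
        # Assumed schedule: commit an equal share of what is left each step.
        keep = math.ceil(len(preds) / (num_steps - step))
        most_confident = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:keep]
        for i in most_confident:
            block[i] = preds[i][0]
        # Positions not committed stay masked, i.e. they are retried next step.
    return block, passes


target = "a diffusion LM unmasks many tokens at once not one by one".split()
block, passes = diffusion_decode(target, num_steps=4, rng=random.Random(0))
print(" ".join(block))
print(f"{passes} passes for {len(target)} tokens; one token per pass would need {len(target)}")
```

The number of passes is set by `num_steps`, not by the block length, which is where both the potential speedup and the quality trade-off come from.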

The two directions of the same model. Training: mask a fraction t of the tokens and predict the originals (loss on masked positions only). Generation: start fully masked and iteratively unmask the confident predictions, re-masking the rest, until the block resolves.

What iLLaDA changes over LLaDA

iLLaDA is, more than anything, a careful scale-up of LLaDA — proof that the recipe keeps paying off with more tokens and a better post-training pass.

A bigger, leaner backbone

The architecture is a standard dense Transformer (RMSNorm, SwiGLU, RoPE, no biases), but re-tuned for cheaper inference:

	iLLaDA	LLaDA
Attention heads	32	32
Key/Value heads	8 (GQA)	32 (MHA)
FFN dim	14,336	12,288
Vocabulary	155,136	126,464
Max sequence length	8,192	4,096
Embedding / LM head	tied	untied
Total parameters	7.62B	8.02B

The load-bearing change is grouped-query attention (8 KV heads instead of 32), adopted to shrink the cached key/value footprint at inference — plus a larger vocab, doubled context, and tied embeddings.
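A back-of-the-envelope sketch shows why the KV-head count matters. The layer count (32) and head dimension (128) below are assumptions for illustration only, since the table does not list them, and 16-bit storage is assumed as well. The ratio between the two configurations depends only on the KV-head counts from the table.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor 2 covers keys plus values; bytes_per_value=2 assumes
    16-bit storage.
    """
    return 2 * seq_len * num_layers * num_kv_heads * head_dim * bytes_per_value


# ASSUMED for illustration: not given in the comparison table.
NUM_LAYERS = 32
HEAD_DIM = 128

mha = kv_cache_bytes(seq_len=8192, num_layers=NUM_LAYERS, num_kv_heads=32, head_dim=HEAD_DIM)
gqa = kv_cache_bytes(seq_len=8192, num_layers=NUM_LAYERS, num_kv_heads=8, head_dim=HEAD_DIM)
print(f"32 KV heads (MHA): {mha / 2**30:.2f} GiB per sequence")
print(f" 8 KV heads (GQA): {gqa / 2**30:.2f} GiB per sequence")
print(f"reduction: {mha / gqa:.0f}x")
```

Whatever the true layer count and head size are, going from 32 to 8 KV heads cuts the cached footprint by a factor of 4 at the same sequence length.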

The headline spend: 12T tokens

Pre-training: 12T tokens, up ~5.2× from LLaDA's 2.3T. AdamW, weight decay 0.1, LR warmed to $2\times10^{-4}$, held, then cosine-decayed to $5\times10^{-6}$.
SFT: a 25B-token instruction corpus for 12 epochs. The new wrinkle: SFT now applies the same masking as pre-training across the entire sequence (prompt, response, EOS), rather than keeping the prompt fully visible — a more consistent objective end to end.
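The pre-training learning-rate schedule above (warm up, hold, cosine-decay) can be sketched as a single function. Only the peak and final rates come from the text; the linear warmup shape and all step counts below are hypothetical values chosen to make the example run.

```python
import math

PEAK_LR = 2e-4   # from the text
FINAL_LR = 5e-6  # from the text


def learning_rate(step, warmup_steps, hold_steps, total_steps):
    """Warmup (assumed linear), constant hold, then cosine decay to FINAL_LR."""
    if step < warmup_steps:
        return PEAK_LR * (step + 1) / warmup_steps
    decay_start = warmup_steps + hold_steps
    if step < decay_start:
        return PEAK_LR
    progress = min(1.0, (step - decay_start) / max(1, total_steps - decay_start))
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1 + math.cos(math.pi * progress))


# Hypothetical step counts, for illustration only.
for step in (0, 999, 30_000, 60_000, 80_000, 100_000):
    lr = learning_rate(step, warmup_steps=1_000, hold_steps=59_000, total_steps=100_000)
    print(f"step {step:>7}: lr = {lr:.2e}")
```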

And the fine-tuning clearly hadn't saturated. The SFT-epoch ablation rises monotonically through all 12 epochs (they stopped on compute, not convergence):

SFT-epoch ablation (from the paper, Figure 1), redrawn. Accuracy on GSM8K, MATH, and MMLU-Pro keeps climbing through 12 epochs of fine-tuning — the curve had not flattened where they stopped.

Two inference-side tricks

Variable-length generation. Instead of committing to a fixed output block and denoising all of it, iLLaDA appends a mask block, runs the sampler, commits confident tokens, and continues until termination — so it only denoises as many positions as the answer needs, rather than padding to a worst case. (The paper argues the efficiency, but — notably — does not report latency or step-count numbers for it.)
Confidence-based multiple-choice scoring. Rather than a likelihood estimate, they score a candidate by revealing its tokens one at a time, each step unmasking the highest-confidence position, and summing the log-probs: $S_{\text{conf}}(y \mid p) \;=\; \sum_k \log p_\theta\!\left(y^{i_k} \mid p,\, \tilde{y}_{k-1}\right), \quad i_k = \arg\max_{i \in \mathcal{M}_{k-1}} p_\theta\!\left(y^i \mid p,\, \tilde{y}_{k-1}\right)$, where $\mathcal{M}_{k-1}$ is the set of positions still masked after $k-1$ reveals and $\tilde{y}_{k-1}$ is the partially revealed candidate. The authors are upfront that this is "not a likelihood estimate" but a task-specific surrogate. The gain in its ablation is modest: +1.3 PIQA, +0.6 ARC-C, +2.3 HellaSwag over likelihood scoring.
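The confidence-based score can be written as a short greedy loop. This is a sketch of the formula as stated, not the authors' implementation: `prob_fn` is a hypothetical callback standing in for a forward pass of the model, and `toy_prob` is an invented rule used only so the example runs.

```python
import math


def confidence_score(candidate, prob_fn):
    """Sum of log-probs when revealing the candidate one token at a time,
    always unmasking the position the model is currently most confident about.

    prob_fn(i, revealed) is a hypothetical callback returning
    p_theta(candidate[i] | prompt, currently revealed tokens).
    """
    masked = set(range(len(candidate)))
    revealed = {}  # position -> token revealed so far
    score = 0.0
    order = []
    while masked:
        probs = {i: prob_fn(i, revealed) for i in masked}  # one forward pass
        best = max(sorted(probs), key=probs.get)  # ties go to the lowest index
        score += math.log(probs[best])
        revealed[best] = candidate[best]
        masked.remove(best)
        order.append(best)
    return score, order


def toy_prob(i, revealed):
    """Invented model: later positions start out more certain, and every
    revealed token raises confidence at the remaining positions."""
    return min(0.95, 0.3 + 0.1 * i + 0.2 * len(revealed))


score, order = confidence_score(["answer", "option", "text"], toy_prob)
print(f"reveal order: {order}, S_conf = {score:.3f}")
```

In a multiple-choice setting one would compute this score for each candidate answer and pick the highest.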

The results, honestly

Two stories live in these tables, and they point in different directions.

Base models: genuine parity with Qwen2.5

As a base model, iLLaDA improves broadly over LLaDA and lands even with Qwen2.5-7B on average — winning several benchmarks outright:

Base models, average over 8 benchmarks (%)
Model	Average
iLLaDA 8B	63.9
Qwen2.5 7B	63.3
Dream 7B	61.4
LLaDA 8B	51.1

The per-benchmark picture, with the gains over LLaDA that the abstract leads on:

Base	iLLaDA	LLaDA	Qwen2.5	Δ vs LLaDA
MMLU	74.8	65.9	71.9	+8.9
BBH	71.3	49.7	63.9	+21.6
ARC-C	60.8	45.9	51.5	+14.9
HellaSwag	76.6	70.5	79.0	+6.1
GSM8K	81.9	70.3	78.9	+11.6
MATH	38.4	31.4	41.1	+7.0
HumanEval	50.0	35.4	56.7	+14.6
MBPP	57.8	40.0	63.6	+17.8

iLLaDA-Base beats Qwen2.5-Base on MMLU, BBH, ARC-C, and GSM8K; Qwen still wins on HellaSwag, MATH, and code. But the average edges ahead — and that's the real result: a from-scratch bidirectional diffusion model matching a strong autoregressive base.
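The averages and the win/loss split can be checked directly from the per-benchmark table. The snippet below only re-does that arithmetic on the numbers copied from the table; it assumes the reported average is the unweighted mean of the eight scores.

```python
# benchmark: (iLLaDA, LLaDA, Qwen2.5), copied from the base-model table
base = {
    "MMLU": (74.8, 65.9, 71.9),
    "BBH": (71.3, 49.7, 63.9),
    "ARC-C": (60.8, 45.9, 51.5),
    "HellaSwag": (76.6, 70.5, 79.0),
    "GSM8K": (81.9, 70.3, 78.9),
    "MATH": (38.4, 31.4, 41.1),
    "HumanEval": (50.0, 35.4, 56.7),
    "MBPP": (57.8, 40.0, 63.6),
}


def mean(values):
    return sum(values) / len(values)


# Unweighted mean per model, in the same column order as the table.
illada_avg, llada_avg, qwen_avg = (
    mean([row[k] for row in base.values()]) for k in range(3)
)
wins = [name for name, (illada, _, qwen) in base.items() if illada > qwen]
losses = [name for name, (illada, _, qwen) in base.items() if illada < qwen]

print(f"averages: iLLaDA {illada_avg:.2f}, LLaDA {llada_avg:.2f}, Qwen2.5 {qwen_avg:.2f}")
print("iLLaDA ahead of Qwen2.5 on:", ", ".join(wins))
print("Qwen2.5 ahead of iLLaDA on:", ", ".join(losses))
```

The recomputed means agree with the reported averages to within rounding, and the split is four benchmarks each way.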

Instruct models: the gap that's left

This is the part the abstract's "competitive on several benchmarks" softens. After instruction tuning, iLLaDA trails Qwen2.5 by ~10 points on average, with double-digit gaps on the hard reasoning and coding tasks:

Instruct models, average over 7 benchmarks (%)
Model	Average
Qwen2.5 7B	77.1
iLLaDA 8B	67.1
Dream 7B	60.2
LLaDA 8B	54.5

Instruct	iLLaDA	LLaDA	Qwen2.5	Δ vs LLaDA
MMLU	71.6	65.5	76.6	+6.1
MMLU-Pro	52.3	37.0	56.3	+15.3
GSM8K	89.0	77.5	91.6	+11.5
MATH	56.7	42.2	75.5	+14.5
HumanEval	65.9	49.4	84.8	+16.5
MBPP	58.0	41.0	79.2	+17.0

The improvement over LLaDA is huge and real (+12.6 average). The gap to Qwen on MATH (56.7 vs 75.5), HumanEval (65.9 vs 84.8), and MBPP (58.0 vs 79.2) is also real, and the authors don't hide it — they point to the lack of RL alignment as part of the cause.
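To see where the instruct gap concentrates, it helps to compute the per-benchmark difference to Qwen2.5 from the rows listed above. This is plain arithmetic on the table values, nothing more.

```python
# benchmark: (iLLaDA, Qwen2.5), copied from the instruct table
instruct = {
    "MMLU": (71.6, 76.6),
    "MMLU-Pro": (52.3, 56.3),
    "GSM8K": (89.0, 91.6),
    "MATH": (56.7, 75.5),
    "HumanEval": (65.9, 84.8),
    "MBPP": (58.0, 79.2),
}

# Points by which Qwen2.5 leads, rounded to the table's one decimal place.
gaps = {name: round(qwen - illada, 1) for name, (illada, qwen) in instruct.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<10} Qwen2.5 leads by {gap:4.1f}")
```

The gaps fall into two groups: about 19 to 21 points on MATH, HumanEval, and MBPP, and about 3 to 5 points on MMLU, MMLU-Pro, and GSM8K.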

What I make of it

The base-model result is the one that matters, and it's solid. A bidirectional masked diffusion LM, trained from scratch, reaching autoregressive base parity is a genuine data point: diffusion LMs scale like AR LMs. The paradigm is viable, not a curiosity.
"Competitive" is doing some work in the abstract. On instruction-tuned MATH, HumanEval, and MBPP, Qwen2.5 still leads by roughly 19 to 21 points; on MMLU, MMLU-Pro, and GSM8K the gap is about 3 to 5 points. Read the instruct tables before repeating the headline.
The efficiency case is asserted, not measured. GQA and variable-length generation are motivated by cost, but the paper reports no sampling-step counts, no latency, no tokens/sec — and the number of denoising passes is exactly diffusion's central liability. "More efficient" is a design argument here, not a demonstrated result.
Parity wasn't cheap. 12T tokens at 8B is a frontier-scale data spend, ~5× LLaDA and on par with what strong AR models of this size consume. iLLaDA shows diffusion can reach AR base parity — by paying full AR-scale training cost, and still trailing after post-training. There's also an honest failure mode noted: the sampler can fall into repetitive reasoning loops that need inference-time mitigation.

The fair summary: bidirectional masked diffusion is now a scalable paradigm at parity with autoregressive base models — no longer something you can wave off — but not yet a proven win on post-trained quality, and not yet a proven efficiency advantage. That's a meaningful place to have gotten to, stated without the gloss.

Built on Improved Large Language Diffusion Models (Nie et al., Renmin University & ByteDance Seed, 2026) and its predecessor LLaDA. All numbers are from the paper's Tables 1–3; the SFT-epoch curves are redrawn from its Figure 1.