# iLLaDA: how far a masked-diffusion language model scales

> Satyajit Ghana — Head of Engineering @ Inkers Technology
> canonical: https://ai.thesatyajit.com/articles/illada-diffusion-language-model
> date: 2026-06-28
> tags: llm, diffusion, language-models, architecture, explainer

Almost every language model you use is **autoregressive**: it factorizes text
left-to-right, $p(x) = \prod_i p(x_i \mid x_{<i})$, with a causal attention mask, and
generates one token per forward pass. It works so well that the alternatives barely get
airtime.

**iLLaDA** is one of the alternatives, scaled up until it's hard to ignore. It's an 8B-class
(7.62B-parameter) **masked diffusion** language model with *fully bidirectional* attention, trained from
scratch on 12 trillion tokens by a team from Renmin University and ByteDance Seed — the
direct successor to [LLaDA](https://arxiv.org/abs/2502.09992). No causal mask, no
left-to-right factorization. The question it's built to answer: can a bidirectional
diffusion model, trained from scratch, actually keep up with a strong autoregressive
model? The honest answer turns out to be *yes for base models, not yet for instruct* —
and the path to that answer is worth understanding.

## How a masked diffusion language model works

Forget Gaussian noise. The "diffusion" here is **masking** — a discrete, absorbing-state
process over tokens.

### The forward process: corrupt by masking

Pick a masking ratio $t \sim \mathcal{U}[0,1]$. Each token is independently replaced by a
special `[MASK]` token with probability $t$. At $t=0$ the sequence is clean; at $t=1$
it's fully masked. Drag $t$ and watch the corruption — and the loss weighting — change:

<MaskingProcess />
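The corruption step is small enough to write out. This is a minimal sketch, not the paper's code: the `forward_mask` helper and the string `"[MASK]"` are stand-ins I've made up for illustration, where a real model would use a reserved token id.

```python
import random

MASK = "[MASK]"  # illustrative placeholder; a real tokenizer reserves an id for this

def forward_mask(tokens, t, rng):
    """Independently replace each token with MASK with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]

rng = random.Random(0)
tokens = "the cat sat on the mat".split()
for t in (0.0, 0.5, 1.0):
    # t = 0.0 leaves the sequence clean, t = 1.0 masks every position
    print(t, forward_mask(tokens, t, rng))
```

Because `rng.random()` returns a value in the half-open interval from 0 to 1, the two endpoints behave exactly as described: nothing is masked at $t=0$ and everything is masked at $t=1$.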

The model $p_\theta$ sees the corrupted sequence $x_t$ and is trained to predict the
*original* tokens at every masked position at once. The objective is a masked
cross-entropy, computed only on masked positions and reweighted by $1/t$:

$$
\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{t,\,x_0,\,x_t}\!\left[\frac{1}{t}\sum_{i=1}^{L}
\mathbf{1}\!\left[x_t^{i} = \mathrm{M}\right]\,\log p_\theta\!\left(x_0^{i} \mid x_t\right)\right]
$$

The indicator $\mathbf{1}[x_t^i = \mathrm{M}]$ restricts the loss to masked positions.
On average a fraction $t$ of the $L$ tokens is masked, so dividing by $t$ rescales the
sum and keeps lightly- and heavily-masked samples on the same footing. Averaged over
$t$, this is a Monte-Carlo estimate of an **upper bound on the negative
log-likelihood**: a principled training objective, not a heuristic. iLLaDA keeps this
*same* objective through pre-training **and** supervised fine-tuning.
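To make the objective concrete, here is a one-sample estimate of it in plain Python. It's a sketch under stated assumptions: `log_prob` is a hypothetical callable standing in for the model, and the "model" used below is just a uniform distribution over a four-word vocabulary.

```python
import math

MASK = "[MASK]"  # illustrative placeholder for the mask token

def masked_diffusion_loss(x0, xt, t, log_prob):
    """One-sample estimate: -(1/t) * sum over masked i of log p(x0[i] | xt).

    log_prob(i, token, xt) is a hypothetical stand-in for the model's
    log-probability of `token` at position i given the corrupted sequence.
    """
    total = 0.0
    for i, (clean, noisy) in enumerate(zip(x0, xt)):
        if noisy == MASK:  # the indicator: only masked positions count
            total += log_prob(i, clean, xt)
    return -total / t

def uniform_log_prob(i, token, xt):
    # toy model: uniform over a 4-word vocabulary
    return math.log(1 / 4)

x0 = ["a", "b", "c", "d"]
xt = ["a", MASK, "c", MASK]  # 2 of 4 positions masked, matching t = 0.5
loss = masked_diffusion_loss(x0, xt, t=0.5, log_prob=uniform_log_prob)
print(loss)
```

With two of four tokens masked at $t = 0.5$, the result is $4 \log 4$: the $1/t$ factor scales the two-token sum back up to what a full four-token sequence would cost under this toy model.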

### Bidirectional attention comes for free

An autoregressive model *must* hide the future — if token $i$ could attend to token
$i{+}1$, it would just read the answer it's supposed to predict. A masked diffusion model
predicts *masked* positions anywhere in the sequence, not "the next" one, so there's
nothing to hide. Every position attends to every other, left and right:

<AttentionModes />

That full context is the structural argument for diffusion LMs: on infilling and tasks
where later text disambiguates earlier text, seeing both sides at every layer should
help.
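The difference between the two attention patterns fits in a few lines. This sketch builds the boolean "may attend" matrices directly; the function names are mine, and a real implementation would pass an equivalent mask (or none at all) to its attention kernel.

```python
def causal_mask(n):
    """allowed[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]

def show(mask):
    # 'x' = attention allowed, '.' = blocked
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)

print(show(causal_mask(4)))
print()
print(show(bidirectional_mask(4)))
```

The causal matrix is lower-triangular; the bidirectional one is all `x`, which is the same as applying no mask.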

### Generation: unmask in parallel, over a few steps

Generation runs the process backward. Start from a block of all-`[MASK]` tokens. Each
denoising step, the model predicts every masked position, **commits its most confident
predictions**, and **re-masks the low-confidence ones** to try again next step. A whole
block resolves over a handful of steps, in confidence order — not reading order. Flip
between the two paradigms:

<Unmasking />

This is the crux. Autoregression spends one forward pass per output token, in series.
Diffusion spends a fixed, smaller number of denoising passes over the whole block — the
promise being fewer sequential steps, at the cost of needing enough steps for quality.
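Here is a toy version of that loop, to show where the step count comes from. Everything model-related is assumed: `toy_predict` returns random confidences in place of a real network, and the schedule (commit an equal share of the remaining masks each step) is one simple choice of mine, not necessarily the paper's sampler.

```python
import math
import random

MASK = "[MASK]"  # illustrative placeholder for the mask token

def toy_predict(seq, rng):
    """Stand-in for the model: (token, confidence) for every masked position."""
    return {i: (f"tok{i}", rng.random()) for i, tok in enumerate(seq) if tok == MASK}

def denoise_block(length, steps, predict, rng):
    seq = [MASK] * length
    passes = 0
    for step in range(steps):
        if MASK not in seq:
            break
        preds = predict(seq, rng)  # one forward pass over the whole block
        passes += 1
        # assumed schedule: commit an equal share of what is still masked
        k = math.ceil(len(preds) / (steps - step))
        ranked = sorted(preds, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:k]:
            seq[i] = preds[i][0]  # commit the most confident predictions
        # everything else stays MASK, i.e. is re-masked for the next step
    return seq, passes

block, passes = denoise_block(length=8, steps=4, predict=toy_predict, rng=random.Random(0))
print(passes, block)
```

Eight tokens resolve in four forward passes here, where an autoregressive decoder would need eight. The catch is the same one as in the text: `steps` is a quality knob, and pushing it down is not free.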

<Diagram caption="The two directions of the same model. Training: mask a fraction t of the tokens and predict the originals (loss on masked positions only). Generation: start fully masked and iteratively unmask the confident predictions, re-masking the rest, until the block resolves.">
  <svg viewBox="0 0 640 220" role="img" aria-label="Training corrupts a clean sequence by masking and predicts the originals; generation starts fully masked and iteratively unmasks confident tokens." style={{ width: "100%", height: "auto" }}>
    {/* training row */}
    <text x="16" y="34" fontFamily="monospace" fontSize="11" fill="var(--muted-foreground)">training</text>
    <rect x="16" y="44" width="120" height="34" rx="6" fill="var(--background)" stroke="var(--border)" />
    <text x="76" y="65" textAnchor="middle" fontFamily="monospace" fontSize="10" fill="var(--foreground)">clean x₀</text>
    <text x="150" y="65" textAnchor="middle" fontFamily="monospace" fontSize="11" fill="var(--muted-foreground)">→ mask (t) →</text>
    <rect x="216" y="44" width="120" height="34" rx="6" fill="oklch(0.72 0.13 60)" opacity="0.3" stroke="var(--border)" />
    <text x="276" y="65" textAnchor="middle" fontFamily="monospace" fontSize="10" fill="var(--foreground)">corrupted xₜ</text>
    <text x="356" y="65" textAnchor="middle" fontFamily="monospace" fontSize="11" fill="var(--muted-foreground)">→ predict →</text>
    <rect x="430" y="44" width="130" height="34" rx="6" fill="oklch(0.72 0.13 150)" opacity="0.3" stroke="var(--border)" />
    <text x="495" y="65" textAnchor="middle" fontFamily="monospace" fontSize="10" fill="var(--foreground)">x̂₀ on masked</text>
    {/* generation row */}
    <text x="16" y="134" fontFamily="monospace" fontSize="11" fill="var(--muted-foreground)">generation</text>
    <rect x="16" y="144" width="120" height="34" rx="6" fill="oklch(0.72 0.13 60)" opacity="0.45" stroke="var(--border)" />
    <text x="76" y="165" textAnchor="middle" fontFamily="monospace" fontSize="10" fill="var(--foreground)">all [MASK]</text>
    <text x="172" y="165" textAnchor="middle" fontFamily="monospace" fontSize="11" fill="var(--muted-foreground)">→ unmask conf. →</text>
    <rect x="246" y="144" width="120" height="34" rx="6" fill="oklch(0.72 0.14 150)" opacity="0.55" stroke="var(--border)" />
    <text x="306" y="165" textAnchor="middle" fontFamily="monospace" fontSize="10" fill="var(--foreground)">partly filled</text>
    <text x="402" y="165" textAnchor="middle" fontFamily="monospace" fontSize="11" fill="var(--muted-foreground)">→ repeat →</text>
    <rect x="466" y="144" width="120" height="34" rx="6" fill="oklch(0.72 0.15 150)" opacity="0.85" stroke="var(--border)" />
    <text x="526" y="165" textAnchor="middle" fontFamily="monospace" fontSize="10" fill="oklch(0.2 0 0)">complete</text>
    {/* loop arrow */}
    <path d="M 306 178 q 0 26 -120 26 q -120 0 -120 -26" fill="none" stroke="var(--muted-foreground)" strokeWidth="1" strokeDasharray="3 3" />
    <text x="186" y="214" textAnchor="middle" fontFamily="monospace" fontSize="9" fill="var(--muted-foreground)">re-mask low-confidence positions</text>
  </svg>
</Diagram>

## What iLLaDA changes over LLaDA

iLLaDA is, more than anything, a careful **scale-up** of LLaDA — proof that the recipe
keeps paying off with more tokens and a better post-training pass.

### A bigger, leaner backbone

The architecture is a standard dense Transformer (RMSNorm, SwiGLU, RoPE, no biases), but
re-tuned for cheaper inference:

| Setting | iLLaDA | LLaDA |
|---|---|---|
| Attention heads | 32 | 32 |
| Key/Value heads | **8 (GQA)** | 32 (MHA) |
| FFN dim | 14,336 | 12,288 |
| Vocabulary | 155,136 | 126,464 |
| Max sequence length | **8,192** | 4,096 |
| Embedding / LM head | **tied** | untied |
| Total parameters | 7.62B | 8.02B |

The load-bearing change is **grouped-query attention** (8 KV heads instead of 32),
adopted to shrink the cached key/value footprint at inference — plus a larger vocab,
doubled context, and tied embeddings.
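A back-of-the-envelope calculation shows what the KV-head change buys. Only the head counts come from the table; the layer count (32), head dimension (128), and 2-byte values are assumptions I've plugged in for illustration, so treat the absolute sizes as indicative. The ratio depends only on the head counts.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to store keys and values for one token across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = K and V

# ASSUMED: 32 layers, head_dim 128, 16-bit values (not stated in the table above)
mha = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)
gqa = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)

print(mha // 1024, "KiB per token with 32 KV heads")
print(gqa // 1024, "KiB per token with 8 KV heads")
print(mha / gqa, "x smaller")
```

Going from 32 to 8 KV heads cuts the stored keys and values by 4x regardless of the assumed dimensions.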

### The headline spend: 12T tokens

- **Pre-training: 12T tokens**, up ~5.2× from LLaDA's 2.3T. AdamW, weight decay 0.1, LR
  warmed to $2\times10^{-4}$, held, then cosine-decayed to $5\times10^{-6}$.
- **SFT: a 25B-token instruction corpus for 12 epochs.** The new wrinkle: SFT now applies
  the *same* masking as pre-training across the entire sequence (prompt, response, EOS),
  rather than keeping the prompt fully visible — a more consistent objective end to end.
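The SFT change is easiest to see side by side. This is a sketch of the two corruption policies as I read them from the description above; `sft_corrupt`, the `mask_prompt` flag, and the placeholder tokens are illustrative names, not the paper's.

```python
import random

MASK, EOS = "[MASK]", "[EOS]"  # illustrative placeholders

def sft_corrupt(prompt, response, t, rng, mask_prompt):
    """Mask an SFT example at ratio t.

    mask_prompt=False: prompt stays fully visible, only response + EOS are masked.
    mask_prompt=True:  the whole sequence is masked, as in pre-training.
    """
    seq = prompt + response + [EOS]
    out = []
    for i, tok in enumerate(seq):
        maskable = mask_prompt or i >= len(prompt)
        out.append(MASK if maskable and rng.random() < t else tok)
    return out

prompt, response = ["What", "is", "2+2", "?"], ["4"]
print(sft_corrupt(prompt, response, 0.5, random.Random(0), mask_prompt=False))
print(sft_corrupt(prompt, response, 0.5, random.Random(0), mask_prompt=True))
```

With `mask_prompt=True` the fine-tuning batches look like pre-training batches, which is the consistency the bullet above is pointing at.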

The fine-tuning clearly hadn't saturated. In the SFT-epoch ablation, MATH and MMLU-Pro
rise at every checkpoint, GSM8K dips at epoch 6 before recovering, and all three are
still climbing at epoch 12 (they stopped on compute, not convergence):

<Diagram caption="SFT-epoch ablation (from the paper, Figure 1), redrawn. MATH and MMLU-Pro climb at every checkpoint through 12 epochs of fine-tuning; GSM8K dips at epoch 6, then ends at its best value. All three were still rising at the final checkpoint.">
  <svg viewBox="0 0 560 240" role="img" aria-label="Three curves of accuracy versus SFT epoch for GSM8K, MATH, and MMLU-Pro. MATH and MMLU-Pro rise at every checkpoint; GSM8K dips at epoch 6 and ends highest at epoch 12." style={{ width: "100%", height: "auto" }}>
    {/* axes */}
    <line x1="48" y1="200" x2="520" y2="200" stroke="var(--border)" strokeWidth="1" />
    <line x1="48" y1="20" x2="48" y2="200" stroke="var(--border)" strokeWidth="1" />
    {/* y ticks 45..90 mapped 200..20 */}
    {[45,60,75,90].map((v) => (
      <g key={v}>
        <text x="40" y={200 - ((v-45)/45)*180 + 3} textAnchor="end" fontFamily="monospace" fontSize="9" fill="var(--muted-foreground)">{v}</text>
        <line x1="48" y1={200 - ((v-45)/45)*180} x2="520" y2={200 - ((v-45)/45)*180} stroke="var(--border)" strokeOpacity="0.3" strokeWidth="1" />
      </g>
    ))}
    {/* x ticks epochs 3,6,9,12 mapped 48..508 */}
    {[3,6,9,12].map((e) => (
      <text key={e} x={48 + ((e-3)/9)*460} y="216" textAnchor="middle" fontFamily="monospace" fontSize="9" fill="var(--muted-foreground)">{e}</text>
    ))}
    <text x="284" y="234" textAnchor="middle" fontFamily="monospace" fontSize="10" fill="var(--muted-foreground)">SFT epoch</text>
    {/* helper: x(e)=48+((e-3)/9)*460 ; y(v)=200-((v-45)/45)*180 */}
    {/* GSM8K 86.7,84.9,88.4,89.0 */}
    <polyline points="48,33.2 201,40.4 355,26.4 508,24.0" fill="none" stroke="oklch(0.72 0.15 150)" strokeWidth="2" />
    <text x="512" y="27" fontFamily="monospace" fontSize="9" fill="oklch(0.72 0.15 150)">GSM8K</text>
    {/* MATH 49.6,51.9,55.6,56.3 */}
    <polyline points="48,181.6 201,172.4 355,157.6 508,154.8" fill="none" stroke="oklch(0.72 0.15 250)" strokeWidth="2" />
    <text x="512" y="158" fontFamily="monospace" fontSize="9" fill="oklch(0.72 0.15 250)">MATH</text>
    {/* MMLU-Pro 48.4,51.5,51.8,52.2 */}
    <polyline points="48,186.4 201,174.0 355,172.8 508,171.2" fill="none" stroke="oklch(0.72 0.15 40)" strokeWidth="2" />
    <text x="512" y="174" fontFamily="monospace" fontSize="9" fill="oklch(0.72 0.15 40)">MMLU-Pro</text>
  </svg>
</Diagram>

### Two inference-side tricks

- **Variable-length generation.** Instead of committing to a fixed output block and
  denoising all of it, iLLaDA appends a mask block, runs the sampler, commits confident
  tokens, and continues until termination — so it only denoises as many positions as the
  answer needs, rather than padding to a worst case. (The paper argues this is more
  efficient but, notably, does not report latency or step-count numbers for it.)
- **Confidence-based multiple-choice scoring.** Rather than a likelihood estimate, they
  score a candidate by revealing its tokens one at a time, each step unmasking the
  highest-confidence position, and summing the log-probs:
  $$
  S_{\text{conf}}(y \mid p) \;=\; \sum_k \log p_\theta\!\left(y^{i_k} \mid p,\, \tilde{y}_{k-1}\right),
  \quad i_k = \arg\max_{i \in \mathcal{M}_{k-1}} p_\theta\!\left(y^i \mid p,\, \tilde{y}_{k-1}\right)
  $$
  The authors are upfront that this is "not a likelihood estimate" but a task-specific
  surrogate. Its ablation is modest: +1.3 PIQA, +0.6 ARC-C, +2.3 HellaSwag over
  likelihood scoring.
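The scoring rule in that formula translates almost line for line into code. In this sketch `prob_fn` is a hypothetical stand-in for the model: given a position and the tokens revealed so far, it returns the probability assigned to the candidate's token there. The toy `prob_fn` below is invented purely to exercise the loop.

```python
import math

def confidence_score(candidate, prob_fn):
    """Reveal the candidate one token at a time, most confident position first,
    and sum the log-probabilities of the revealed tokens."""
    masked = set(range(len(candidate)))
    revealed = {}
    score = 0.0
    while masked:
        probs = {i: prob_fn(i, revealed) for i in masked}
        best = max(probs, key=probs.get)  # i_k: argmax over still-masked positions
        score += math.log(probs[best])
        revealed[best] = candidate[best]
        masked.remove(best)
    return score

BASE = [0.9, 0.2, 0.5]  # made-up starting confidences per position

def toy_prob_fn(i, revealed):
    # toy behaviour: each revealed token adds 0.1 confidence everywhere else
    return min(1.0, BASE[i] + 0.1 * len(revealed))

score = confidence_score(["Paris", "is", "correct"], toy_prob_fn)
print(score)
```

In the toy run the positions are revealed in the order 0, 2, 1, so the score is $\log(0.9 \cdot 0.6 \cdot 0.4)$. To answer a multiple-choice question you would compute this for each candidate and pick the highest.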

## The results, honestly

Two stories live in these tables, and they point in different directions.

### Base models: genuine parity with Qwen2.5

As a base model, iLLaDA improves broadly over LLaDA and lands **even with Qwen2.5-7B** on
average — winning several benchmarks outright:

<BenchBars
  title="Base models — average over 8 benchmarks (%)"
  unit=""
  bars={[
    { label: "iLLaDA 8B", value: 63.9, highlight: true },
    { label: "Qwen2.5 7B", value: 63.3 },
    { label: "Dream 7B", value: 61.4 },
    { label: "LLaDA 8B", value: 51.1 },
  ]}
/>

The per-benchmark picture, with the gains over LLaDA that the abstract leads on:

| Base | iLLaDA | LLaDA | Qwen2.5 | Δ vs LLaDA |
|---|---|---|---|---|
| MMLU | 74.8 | 65.9 | 71.9 | +8.9 |
| BBH | 71.3 | 49.7 | 63.9 | **+21.6** |
| ARC-C | 60.8 | 45.9 | 51.5 | **+14.9** |
| HellaSwag | 76.6 | 70.5 | 79.0 | +6.1 |
| GSM8K | 81.9 | 70.3 | 78.9 | +11.6 |
| MATH | 38.4 | 31.4 | 41.1 | +7.0 |
| HumanEval | 50.0 | 35.4 | 56.7 | +14.6 |
| MBPP | 57.8 | 40.0 | 63.6 | +17.8 |

iLLaDA-Base beats Qwen2.5-Base on MMLU, BBH, ARC-C, and GSM8K; Qwen still wins on
HellaSwag, MATH, and code. But the average edges ahead — and *that's the real result*:
a from-scratch bidirectional diffusion model matching a strong autoregressive base.
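The averages and the win/loss split can be recomputed from the table directly. The numbers below are copied from the rows above; nothing else is assumed.

```python
# (iLLaDA, LLaDA, Qwen2.5) per benchmark, copied from the base-model table
base = {
    "MMLU":      (74.8, 65.9, 71.9),
    "BBH":       (71.3, 49.7, 63.9),
    "ARC-C":     (60.8, 45.9, 51.5),
    "HellaSwag": (76.6, 70.5, 79.0),
    "GSM8K":     (81.9, 70.3, 78.9),
    "MATH":      (38.4, 31.4, 41.1),
    "HumanEval": (50.0, 35.4, 56.7),
    "MBPP":      (57.8, 40.0, 63.6),
}

def mean(col):
    return sum(row[col] for row in base.values()) / len(base)

illada_avg, llada_avg, qwen_avg = mean(0), mean(1), mean(2)
illada_wins = [name for name, (i, _, q) in base.items() if i > q]

print(f"iLLaDA {illada_avg:.2f}  LLaDA {llada_avg:.2f}  Qwen2.5 {qwen_avg:.2f}")
print("iLLaDA ahead of Qwen2.5 on:", illada_wins)
```

The recomputed means agree with the bar chart to within rounding, and the margin over Qwen2.5 is well under a point, which is why "parity" is the right word rather than "win".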

### Instruct models: the gap that's left

This is the part the abstract's "competitive on several benchmarks" softens. After
instruction tuning, iLLaDA **trails Qwen2.5 by ~10 points on average**, with double-digit
gaps on the hard reasoning and coding tasks:

<BenchBars
  title="Instruct models — average over 7 benchmarks (%)"
  unit=""
  bars={[
    { label: "Qwen2.5 7B", value: 77.1 },
    { label: "iLLaDA 8B", value: 67.1, highlight: true },
    { label: "Dream 7B", value: 60.2 },
    { label: "LLaDA 8B", value: 54.5 },
  ]}
/>

| Instruct | iLLaDA | LLaDA | Qwen2.5 | Δ vs LLaDA |
|---|---|---|---|---|
| MMLU | 71.6 | 65.5 | 76.6 | +6.1 |
| MMLU-Pro | 52.3 | 37.0 | 56.3 | +15.3 |
| GSM8K | 89.0 | 77.5 | 91.6 | +11.5 |
| MATH | 56.7 | 42.2 | 75.5 | **+14.5** |
| HumanEval | 65.9 | 49.4 | 84.8 | **+16.5** |
| MBPP | 58.0 | 41.0 | 79.2 | +17.0 |

The improvement *over LLaDA* is huge and real (+12.6 average). The gap *to Qwen* on
MATH (56.7 vs 75.5), HumanEval (65.9 vs 84.8), and MBPP (58.0 vs 79.2) is also real, and
the authors don't hide it — they point to the lack of RL alignment as part of the cause.
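The per-benchmark gap to Qwen2.5 is worth computing rather than eyeballing. The scores are copied from the instruct table above; the only thing added is the subtraction.

```python
# (iLLaDA, Qwen2.5) per benchmark, copied from the instruct table
instruct = {
    "MMLU":      (71.6, 76.6),
    "MMLU-Pro":  (52.3, 56.3),
    "GSM8K":     (89.0, 91.6),
    "MATH":      (56.7, 75.5),
    "HumanEval": (65.9, 84.8),
    "MBPP":      (58.0, 79.2),
}

# negative = iLLaDA behind Qwen2.5
gaps = {name: round(i - q, 1) for name, (i, q) in instruct.items()}
for name, gap in gaps.items():
    print(f"{name:10s} {gap:+.1f}")
```

The split is clear: single-digit gaps on MMLU, MMLU-Pro, and GSM8K, and gaps of roughly 19 to 21 points on MATH, HumanEval, and MBPP.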

## What I make of it

- **The base-model result is the one that matters, and it's solid.** A bidirectional
  masked diffusion LM, trained from scratch, reaching autoregressive base parity is a
  genuine data point: diffusion LMs *scale* like AR LMs. The paradigm is viable, not a
  curiosity.
- **"Competitive" is doing some work in the abstract.** On instruction-tuned MATH,
  HumanEval, and MBPP, AR still wins by roughly 19 to 21 points. Read the instruct tables
  before repeating the headline.
- **The efficiency case is asserted, not measured.** GQA and variable-length generation
  are motivated by cost, but the paper reports no sampling-step counts, no latency, no
  tokens/sec — and the number of denoising passes is *exactly* diffusion's central
  liability. "More efficient" is a design argument here, not a demonstrated result.
- **Parity wasn't cheap.** 12T tokens at 8B is a frontier-scale data spend, ~5× LLaDA and
  on par with what strong AR models of this size consume. iLLaDA shows diffusion can reach
  AR base parity — by paying full AR-scale training cost, and still trailing after
  post-training. There's also an honest failure mode noted: the sampler can fall into
  repetitive reasoning loops that need inference-time mitigation.

The fair summary: bidirectional masked diffusion is now a **scalable paradigm at parity
with autoregressive base models** — no longer something you can wave off — but not yet a
proven win on post-trained quality, and not yet a proven efficiency advantage. That's a
meaningful place to have gotten to, stated without the gloss.

---

*Built on [Improved Large Language Diffusion Models](https://arxiv.org/abs/2606.25331)
(Nie et al., Renmin University & ByteDance Seed, 2026) and its predecessor
[LLaDA](https://arxiv.org/abs/2502.09992). All numbers are from the paper's Tables 1–3;
the SFT-epoch curves are redrawn from its Figure 1.*
