# iLLaDA: how far a masked-diffusion language model scales

> Satyajit Ghana, Head of Engineering @ Inkers Technology
> canonical: https://ai.thesatyajit.com/articles/illada-diffusion-language-model
> date: 2026-06-28
> tags: llm, diffusion, language-models, architecture, explainer

Almost every language model you use is **autoregressive**: it factorizes text left-to-right, $p(x) = \prod_i p(x_i \mid x_{<i})$.

Masked diffusion instead starts from a corrupted copy of the text: given a clean sequence $x_0$ of length $L$ and a masking level $t \in (0, 1]$, each token is replaced by the mask token $\mathrm{M}$ with probability $t$, giving $x_t$.

The model $p_\theta$ sees the corrupted sequence $x_t$ and is trained to predict the *original* tokens at every masked position at once. The objective is a masked cross-entropy, computed only on masked positions and reweighted by $1/t$:

$$
\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{t,\,x_0,\,x_t}\!\left[\frac{1}{t}\sum_{i=1}^{L} \mathbf{1}\!\left[x_t^{i} = \mathrm{M}\right]\,\log p_\theta\!\left(x_0^{i} \mid x_t\right)\right]
$$

The indicator $\mathbf{1}[x_t^i = \mathrm{M}]$ restricts the loss to masked positions; the $1/t$ factor re-normalizes so heavily- and lightly-masked samples both contribute correctly. Averaged over $t$, this is a Monte-Carlo **upper bound on the negative log-likelihood**: a principled training objective, not a heuristic. iLLaDA keeps this *same* objective through pre-training **and** supervised fine-tuning.

### Bidirectional attention comes for free

An autoregressive model *must* hide the future: if token $i$ could attend to token $i{+}1$, it would just read the answer it's supposed to predict. A masked diffusion model predicts *masked* positions anywhere in the sequence, not "the next" one, so there's nothing to hide. Every position attends to every other, left and right.

That full context is the structural argument for diffusion LMs: on infilling and tasks where later text disambiguates earlier text, seeing both sides at every layer should help.

### Generation: unmask in parallel, over a few steps

Generation runs the process backward. Start from a block of all-`[MASK]` tokens.
At each denoising step, the model predicts every masked position, **commits its most confident predictions**, and **re-masks the low-confidence ones** to try again next step. A whole block resolves over a handful of steps, in confidence order rather than reading order.

This is the crux. Autoregression spends one forward pass per output token, in series. Diffusion spends a fixed, smaller number of denoising passes over the whole block: the promise is fewer sequential steps, at the cost of needing enough steps for quality.

## What iLLaDA changes over LLaDA

iLLaDA is, more than anything, a careful **scale-up** of LLaDA: proof that the recipe keeps paying off with more tokens and a better post-training pass.

### A bigger, leaner backbone

The architecture is a standard dense Transformer (RMSNorm, SwiGLU, RoPE, no biases), but re-tuned for cheaper inference:

| | iLLaDA | LLaDA |
|---|---|---|
| Attention heads | 32 | 32 |
| Key/Value heads | **8 (GQA)** | 32 (MHA) |
| FFN dim | 14,336 | 12,288 |
| Vocabulary | 155,136 | 126,464 |
| Max sequence length | **8192** | 4096 |
| Embedding / LM head | **tied** | untied |
| Total parameters | 7.62B | 8.02B |

The load-bearing change is **grouped-query attention** (8 KV heads instead of 32), adopted to shrink the cached key/value footprint at inference, alongside a larger vocab, doubled context, and tied embeddings.

### The headline spend: 12T tokens

- **Pre-training: 12T tokens**, up ~5.2× from LLaDA's 2.3T. AdamW, weight decay 0.1, LR warmed to $2\times10^{-4}$, held, then cosine-decayed to $5\times10^{-6}$.
- **SFT: a 25B-token instruction corpus for 12 epochs.** The new wrinkle: SFT now applies the *same* masking as pre-training across the entire sequence (prompt, response, EOS), rather than keeping the prompt fully visible, which gives a more consistent objective end to end.

And the fine-tuning clearly hadn't saturated.
The SFT-epoch ablation rises monotonically through all 12 epochs (they stopped on compute, not convergence).

### Two inference-side tricks

- **Variable-length generation.** Instead of committing to a fixed output block and denoising all of it, iLLaDA appends a mask block, runs the sampler, commits confident tokens, and continues until termination, so it only denoises as many positions as the answer needs rather than padding to a worst case. (The paper argues the efficiency but, notably, does not report latency or step-count numbers for it.)
- **Confidence-based multiple-choice scoring.** Rather than a likelihood estimate, they score a candidate by revealing its tokens one at a time, each step unmasking the highest-confidence position, and summing the log-probs:

  $$
  S_{\text{conf}}(y \mid p) \;=\; \sum_k \log p_\theta\!\left(y^{i_k} \mid p,\, \tilde{y}_{k-1}\right), \quad i_k = \arg\max_{i \in \mathcal{M}_{k-1}} p_\theta\!\left(y^i \mid p,\, \tilde{y}_{k-1}\right)
  $$

  The authors are upfront that this is "not a likelihood estimate" but a task-specific surrogate. Its ablation is modest: +1.3 PIQA, +0.6 ARC-C, +2.3 HellaSwag over likelihood scoring.

## The results, honestly

Two stories live in these tables, and they point in different directions.
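Before the tables, it may help to see the confidence-based scoring rule $S_{\text{conf}}$ from the inference section as code. What follows is a minimal NumPy sketch under stated assumptions, not the authors' implementation: `MASK_ID`, `confidence_score`, and `toy_model` are names invented for illustration, and `toy_model` is a hand-written stand-in for $p_\theta$ (it predicts "previous token + 1" once the left neighbour is revealed) so the example runs without a real network.

```python
import numpy as np

MASK_ID = -1  # hypothetical mask-token id, chosen only for this sketch


def toy_model(tokens, vocab_size=4):
    """Hypothetical stand-in for p_theta (NOT a real model).

    Returns an array of shape (len(tokens), vocab_size). A masked slot whose
    left neighbour is revealed puts 0.7 on (left + 1) % vocab_size; every
    other row stays uniform.
    """
    n = len(tokens)
    probs = np.full((n, vocab_size), 1.0 / vocab_size)
    for i in range(n):
        if tokens[i] != MASK_ID:
            continue
        left = tokens[i - 1] if i > 0 else MASK_ID
        if left != MASK_ID:
            probs[i] = 0.1  # with vocab_size=4: 3 * 0.1 + 0.7 = 1.0
            probs[i, (left + 1) % vocab_size] = 0.7
    return probs


def confidence_score(predict_fn, prompt, candidate):
    """Score a candidate by revealing its tokens in confidence order.

    At step k, pick the still-masked position i_k where the model assigns the
    highest probability to the candidate's own token, add its log-prob to the
    score, then reveal it. Returns (score, reveal_order).
    """
    prompt, candidate = list(prompt), list(candidate)
    revealed = [MASK_ID] * len(candidate)  # y-tilde_0: candidate fully masked
    masked = set(range(len(candidate)))    # M_0: positions still masked
    score, order = 0.0, []
    while masked:
        probs = predict_fn(np.array(prompt + revealed))
        # sorted() makes tie-breaking deterministic (lowest index wins)
        i_k = max(sorted(masked),
                  key=lambda i: probs[len(prompt) + i, candidate[i]])
        score += float(np.log(probs[len(prompt) + i_k, candidate[i_k]]))
        revealed[i_k] = candidate[i_k]
        masked.remove(i_k)
        order.append(i_k)
    return score, order


# Usage: pick the higher-scoring of two candidate answers for the same prompt.
candidates = [[1, 2, 3], [3, 2, 1]]
scores = [confidence_score(toy_model, [0], c)[0] for c in candidates]
best = candidates[int(np.argmax(scores))]
```

Note that each reveal costs one forward pass, so a candidate of $n$ tokens needs $n$ passes in this naive form. The score is also order-dependent by construction, which fits the authors' point that it is a surrogate rather than a likelihood.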
### Base models: genuine parity with Qwen2.5

As a base model, iLLaDA improves broadly over LLaDA and lands **even with Qwen2.5-7B** on average, winning several benchmarks outright.

The per-benchmark picture, with the gains over LLaDA that the abstract leads on:

| Base | iLLaDA | LLaDA | Qwen2.5 | Δ vs LLaDA |
|---|---|---|---|---|
| MMLU | 74.8 | 65.9 | 71.9 | +8.9 |
| BBH | 71.3 | 49.7 | 63.9 | **+21.6** |
| ARC-C | 60.8 | 45.9 | 51.5 | **+14.9** |
| HellaSwag | 76.6 | 70.5 | 79.0 | +6.1 |
| GSM8K | 81.9 | 70.3 | 78.9 | +11.6 |
| MATH | 38.4 | 31.4 | 41.1 | +7.0 |
| HumanEval | 50.0 | 35.4 | 56.7 | +14.6 |
| MBPP | 57.8 | 40.0 | 63.6 | +17.8 |

iLLaDA-Base beats Qwen2.5-Base on MMLU, BBH, ARC-C, and GSM8K; Qwen still wins on HellaSwag, MATH, and code. But the average edges ahead, and *that's the real result*: a from-scratch bidirectional diffusion model matching a strong autoregressive base.

### Instruct models: the gap that's left

This is the part the abstract's "competitive on several benchmarks" softens. After instruction tuning, iLLaDA **trails Qwen2.5 by ~10 points on average**, with double-digit gaps on the hard reasoning and coding tasks:

| Instruct | iLLaDA | LLaDA | Qwen2.5 | Δ vs LLaDA |
|---|---|---|---|---|
| MMLU | 71.6 | 65.5 | 76.6 | +6.1 |
| MMLU-Pro | 52.3 | 37.0 | 56.3 | +15.3 |
| GSM8K | 89.0 | 77.5 | 91.6 | +11.5 |
| MATH | 56.7 | 42.2 | 75.5 | **+14.5** |
| HumanEval | 65.9 | 49.4 | 84.8 | **+16.5** |
| MBPP | 58.0 | 41.0 | 79.2 | +17.0 |

The improvement *over LLaDA* is huge and real (+12.6 average). The gap *to Qwen* on MATH (56.7 vs 75.5), HumanEval (65.9 vs 84.8), and MBPP (58.0 vs 79.2) is also real, and the authors don't hide it; they point to the lack of RL alignment as part of the cause.

## What I make of it

- **The base-model result is the one that matters, and it's solid.** A bidirectional masked diffusion LM, trained from scratch, reaching autoregressive base parity is a genuine data point: diffusion LMs *scale* like AR LMs.
The paradigm is viable, not a curiosity.
- **"Competitive" is doing some work in the abstract.** On instruction-tuned reasoning and code, AR still wins by roughly 19–21 points (MATH, HumanEval, MBPP). Read the instruct tables before repeating the headline.
- **The efficiency case is asserted, not measured.** GQA and variable-length generation are motivated by cost, but the paper reports no sampling-step counts, no latency, no tokens/sec, and the number of denoising passes is *exactly* diffusion's central liability. "More efficient" is a design argument here, not a demonstrated result.
- **Parity wasn't cheap.** 12T tokens at ~7.6B parameters is a frontier-scale data spend, ~5× LLaDA and on par with what strong AR models of this size consume. iLLaDA shows diffusion can reach AR base parity, but only by paying full AR-scale training cost, and it still trails after post-training.

There's also an honest failure mode noted: the sampler can fall into repetitive reasoning loops that need inference-time mitigation.

The fair summary: bidirectional masked diffusion is now a **scalable paradigm at parity with autoregressive base models**, no longer something you can wave off, but not yet a proven win on post-trained quality and not yet a proven efficiency advantage. That's a meaningful place to have gotten to, stated without the gloss.

---

*Built on [Improved Large Language Diffusion Models](https://arxiv.org/abs/2606.25331) (Nie et al., Renmin University & ByteDance Seed, 2026) and its predecessor [LLaDA](https://arxiv.org/abs/2502.09992). All numbers are from the paper's Tables 1–3; the SFT-epoch curves are redrawn from its Figure 1.*