# GLM 5.2: long-horizon coding at a million tokens

> Satyajit Ghana — Head of Engineering @ Inkers Technology
> canonical: https://ai.thesatyajit.com/articles/glm-5-2
> date: 2026-06-23
> tags: llm, glm, long-context, agentic-coding, explainer

GLM 5.2, from Z.ai (Zhipu AI), is the flagship of the GLM-5 line: a
744-billion-parameter mixture-of-experts with 40B active per token, MIT-licensed open weights,
and — the headline — a genuine **1-million-token context**. It is tuned for one thing
in particular: long-horizon agentic coding, the sessions that run hundreds of rounds
and thousands of tool calls without losing the thread.

There is no standalone GLM 5.2 paper. It builds on the GLM-5 technical report
([arXiv 2602.15763](https://arxiv.org/abs/2602.15763)) and, for the context trick at
its center, a method paper — IndexCache / IndexShare
([arXiv 2603.12201](https://arxiv.org/abs/2603.12201)). This article draws on both, plus the
[release blog](https://z.ai/blog/glm-5.2).

## What changed from 5.1

GLM-5 → 5.1 → 5.2 share the same 744B/40B backbone. What 5.2 adds:

- a real **1M-token context**, up from 200K;
- **IndexShare**, the architecture change that makes that context affordable;
- a shift to **critic-based PPO** for very long RL rollouts;
- faster speculative decoding (**+20% acceptance length**);
- a **thinking-effort** dial (High / Max).

The first two are the load-bearing pair: the long context, and the trick that keeps
it cheap.

## The model

744B total parameters, 40B active per token — a mixture-of-experts on an 80-layer,
256-expert backbone. Attention is **DeepSeek Sparse Attention (DSA)**: Multi-head
Latent Attention plus a lightweight *indexer* that, for each query, selects the
top-$k$ tokens worth attending to instead of the whole sequence. That sparsity is
what makes a million-token context tractable at all.
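
To make the indexer idea concrete, here is a minimal sketch of top-$k$ selection for a single query. The scoring function (a plain dot product over tiny vectors) and every name in it are illustrative assumptions; the real DSA indexer is a learned component described in the papers, not this.

```python
import heapq

def select_top_k(query, keys, k):
    """Return the positions of the k keys that score highest against the query.

    Scores are plain dot products, standing in for the learned indexer.
    """
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    # Rank positions by score and keep only the best k
    top = heapq.nlargest(k, range(len(keys)), key=lambda i: scores[i])
    return sorted(top)

# Toy example: 6 cached tokens, attend to only the 2 most relevant ones.
keys = [[1, 0], [0, 1], [2, 0], [0, 3], [3, 1], [0, 0]]
selected = select_top_k(query=[1, 0], keys=keys, k=2)
print(selected)
```

Attention then runs over the selected positions only, so its cost tracks $k$ rather than the full sequence length. The selection step itself still has to score every cached token, which is the cost the next section is about.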

<Figure
  src="/articles/glm-5-2/glm52-architecture-1m.png"
  alt="GLM 5.2 architecture for 1M context — DeepSeek Sparse Attention with the IndexShare layout."
  caption="GLM 5.2's architecture for 1M context: sparse attention with a shared indexer (from the Z.ai release)."
/>

## IndexShare: making 1M context cheap

DSA has a catch. The indexer runs at *every layer*, and as the context grows toward
1M tokens, that per-query top-$k$ search becomes the dominant cost. The IndexCache
paper's observation is the whole insight: **adjacent DSA layers select almost the same
tokens** — 70–100% of their top-$k$ overlap.

<Figure
  src="/articles/glm-5-2/indexcache-fig4-overlap-heatmap.png"
  alt="Heatmap of top-k token-selection overlap between every pair of layers, mostly 70-100%."
  caption="Pairwise overlap of each layer's selected tokens. Neighbouring layers pick nearly identical sets — so recomputing the indexer for each is wasted work."
/>
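
One simple way to measure that overlap is the fraction of one layer's selected positions that the next layer also selects. This is a sketch of such a metric, assuming set intersection over top-$k$ positions; the paper's exact definition may differ, and the position lists below are made up.

```python
def top_k_overlap(selected_a, selected_b):
    """Fraction of layer A's selected token positions that layer B also selected."""
    a, b = set(selected_a), set(selected_b)
    if not a:
        return 0.0
    return len(a & b) / len(a)

# Hypothetical selections from two neighbouring layers (k = 8).
layer_12 = [3, 17, 42, 99, 256, 1024, 4096, 8191]
layer_13 = [3, 17, 42, 99, 256, 1024, 5000, 8191]
overlap = top_k_overlap(layer_12, layer_13)
print(overlap)
```

Here the two layers disagree on one position out of eight, which falls inside the 70–100% band the heatmap shows.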

So compute the indexer once per group of layers and reuse its selection for the rest.
GLM 5.2 shares one indexer across every 4 layers — skipping it in 3 of every 4:

<IndexShare />

If the indexer's cost per layer scales with selecting top-$k$ over $L$ tokens, then
sharing it across a group of $g$ layers amortizes that cost to $O(L/g)$ per layer.
With $g = 4$ and the rest of each layer unchanged, GLM 5.2 reports **2.9× lower
per-token FLOPs at a 1M-token context**, with quality essentially intact.
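
A back-of-envelope way to read that figure, under a simple Amdahl-style model that is my assumption and not taken from the paper: let $f$ be the fraction of per-token FLOPs spent in the indexer at 1M tokens, with everything else fixed. Sharing across $g$ layers divides only the indexer term:

$$
\text{speedup}(f, g) = \frac{1}{(1 - f) + f/g}
$$

With $g = 4$, a 2.9× reduction means $1/2.9 = 1 - \tfrac{3}{4} f$, so $f \approx 0.87$. In this model the indexer would account for roughly 87% of per-token FLOPs at that length, and the ceiling as $f \to 1$ is 4×. The value of $f$ is inferred here, not reported.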

<Figure
  src="/articles/glm-5-2/indexcache-fig2-architecture.png"
  alt="IndexCache inference loop: F-layers compute and cache indices, S-layers reuse them."
  caption="The mechanism: an F-layer computes the indices and caches them; the following S-layers reuse the cache, skipping the indexer entirely."
/>
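
The F-layer / S-layer loop can be sketched in a few lines. This assumes every layer uses DSA and that groups start at layer 0; `compute_indices` is a hypothetical stand-in for the indexer, not a real API.

```python
def run_layers(num_layers, group_size, compute_indices):
    """Walk the layer stack, recomputing top-k indices only on the first layer of each group.

    compute_indices(layer) stands in for the indexer and returns selected positions.
    Returns (per-layer indices, number of indexer calls).
    """
    cached = None
    per_layer = []
    indexer_calls = 0
    for layer in range(num_layers):
        if layer % group_size == 0:
            # F-layer: run the indexer and refresh the cache
            cached = compute_indices(layer)
            indexer_calls += 1
        # S-layers skip the indexer and reuse the cached selection
        per_layer.append(cached)
    return per_layer, indexer_calls

# Hypothetical indexer: pretend the selection drifts slowly with depth.
fake_indexer = lambda layer: [layer // 4, 100, 200]

per_layer, calls = run_layers(num_layers=80, group_size=4, compute_indices=fake_indexer)
print(calls)
```

With 80 layers and groups of 4, the indexer runs on a quarter of the layers, and the other three in each group read the same cached list.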

The honest tradeoff: push reuse too far — share across 8 layers instead of 4 — and
long-context fidelity starts to degrade. One indexer per four layers is the sweet
spot the paper settles on.

## Faster decoding: MTP and KVShare

GLM 5.2 also sharpens its multi-token-prediction layer (speculative decoding). With
IndexShare, KVShare, and end-to-end training, the average **acceptance length rises
~20% — from 4.56 to 5.47 tokens** per verification pass. More accepted tokens per
pass means faster generation, which matters most when you are streaming long agent
traces.
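
For readers new to the metric, here is a sketch of how acceptance length is typically counted under greedy verification. Whether Z.ai's figure includes the extra token the verifier contributes is an assumption on my part; the function and token ids are illustrative.

```python
def accepted_length(draft_tokens, verified_tokens):
    """Tokens emitted in one verification pass under greedy matching.

    Counts the longest prefix where the draft agrees with the verifier,
    plus the one token the verifier supplies at the first disagreement
    (or after the last draft token if the whole draft is accepted).
    """
    matched = 0
    for drafted, verified in zip(draft_tokens, verified_tokens):
        if drafted != verified:
            break
        matched += 1
    return matched + 1

# Draft proposes 5 tokens; the verifier agrees on the first 3.
one_pass = accepted_length([11, 22, 33, 44, 55], [11, 22, 33, 40, 50, 60])
print(one_pass)

# The reported improvement, expressed as a relative gain.
gain = 5.47 / 4.56 - 1
print(f"{gain:.1%}")
```

Averaged over many passes, this count is what moves from 4.56 to 5.47, which works out to the quoted gain of about 20%.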

<Figure
  src="/articles/glm-5-2/glm52-mtp-indexshare-kvshare.png"
  alt="Two-step MTP inference with IndexShare and KVShare keeping train/infer KV consistent."
  caption="Speculative decoding with IndexShare + KVShare — keeping the draft and verify passes consistent."
/>

## Training for the long horizon

Pretraining scaled to **28.5T tokens** (up from GLM-4.5's 23T). But the interesting
change in 5.2 is the agentic post-training. It moves from group-relative RL to a
**critic-based PPO** that estimates token-level advantages from individual rollouts —
which accommodates *trajectory compaction* without capping how long a trace can get.
That is exactly what you need when a single agent run is thousands of tool calls long
and won't fit in one rollout.
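
The sources do not name the advantage estimator. Generalized advantage estimation (GAE) is the usual choice in PPO, so the sketch below assumes it; the rewards, critic values, and hyperparameters are toy numbers. The point it illustrates is that a critic gives a per-token baseline from a single rollout, where a group-relative method needs several rollouts of the same prompt to form one.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Token-level advantages for one rollout via generalized advantage estimation.

    rewards[t] is the reward received after token t; values[t] is the critic's
    estimate at token t. The value after the final token is taken to be 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # One-step TD error at token t
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# One short rollout with a single terminal reward.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.8]
advantages = gae_advantages(rewards, values, gamma=1.0, lam=1.0)
print(advantages)
```

With `gamma` and `lam` both set to 1, each advantage reduces to the final reward minus the critic's estimate at that token, which makes the toy case easy to check by hand.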

It also adds an **anti-reward-hacking module**: a rule-based filter first catches
likely hacks (tuned for recall), then an LLM judge checks intent; on a detected hack
the system blocks the call and returns dummy information so the rollout continues
instead of being thrown away. All of it runs on Zhipu's open asynchronous RL
framework, **slime**.
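
The two-stage shape of that module can be sketched as follows. Everything here is hypothetical: the rule patterns, the placeholder observation, and the `judge_is_hack` callback standing in for the LLM judge are my own illustrations of the described flow, not Z.ai's implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolCall:
    name: str
    arguments: str

# Hypothetical high-recall rules: cheap patterns that over-flag on purpose.
SUSPICIOUS_SNIPPETS = ("tests/", "expected_output", "git log")

def rule_filter(call: ToolCall) -> bool:
    """Stage 1: flag anything that might be a hack, accepting false positives."""
    return any(snippet in call.arguments for snippet in SUSPICIOUS_SNIPPETS)

def guarded_execute(
    call: ToolCall,
    execute: Callable[[ToolCall], str],
    judge_is_hack: Callable[[ToolCall], bool],
) -> str:
    """Run the call unless both stages agree it is a hack; then return a dummy result."""
    if rule_filter(call) and judge_is_hack(call):
        # Block the call but keep the rollout alive with a placeholder observation
        return "[no output]"
    return execute(call)

# Stand-ins for the real tool runner and the LLM judge.
run_tool = lambda call: f"ran {call.name}"
judge = lambda call: "expected_output" in call.arguments

blocked = guarded_execute(ToolCall("shell", "cat tests/expected_output.json"), run_tool, judge)
cleared = guarded_execute(ToolCall("shell", "ls tests/"), run_tool, judge)
untouched = guarded_execute(ToolCall("shell", "make build"), run_tool, judge)
print(blocked, cleared, untouched, sep="\n")
```

Because the judge is only consulted for calls the rules flag, the expensive check stays off the common path, and a false positive from the rules costs one judge call rather than a discarded rollout.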

## Benchmarks

The headline result: GLM 5.2 is the **strongest open-weights model on standard and
long-horizon coding**, closing much of the gap to Claude Opus 4.8 and GPT-5.5.

<Figure
  src="/articles/glm-5-2/glm52-coding-bench.png"
  alt="GLM 5.2 standard coding benchmark chart vs competitors."
  caption="Standard coding benchmarks — GLM 5.2 as the strongest open model (Z.ai)."
/>

<BenchBars
  title="SWE-Bench Pro (%)"
  unit=""
  bars={[
    { label: "Claude Opus 4.8", value: 69.2 },
    { label: "GLM 5.2", value: 62.1, highlight: true },
    { label: "Qwen3.7-Max", value: 60.6 },
    { label: "GPT-5.5", value: 58.6 },
    { label: "GLM 5.1", value: 58.4 },
    { label: "DeepSeek-V4-Pro", value: 55.4 },
  ]}
/>

Where it stands out most is *long-horizon* coding: runs that have to stay coherent
over many rounds. On FrontierSWE it nearly catches Opus 4.8, edges past GPT-5.5, and
leaves Gemini 3.1 Pro and GLM 5.1 far behind:

<Figure
  src="/articles/glm-5-2/glm52-longhorizon-bench.png"
  alt="Long-horizon coding benchmarks: FrontierSWE, PostTrainBench, SWE-Marathon."
  caption="Long-horizon benchmarks (FrontierSWE, PostTrainBench, SWE-Marathon) — the gap to the frontier is small."
/>

<BenchBars
  title="FrontierSWE, long-horizon coding (%)"
  unit=""
  bars={[
    { label: "Claude Opus 4.8", value: 75.1 },
    { label: "GLM 5.2", value: 74.4, highlight: true },
    { label: "GPT-5.5", value: 72.6 },
    { label: "Gemini 3.1 Pro", value: 39.6 },
    { label: "GLM 5.1", value: 30.5 },
  ]}
/>

Reasoning is strong — a near-perfect AIME — though it trails the very top closed
models on the hardest knowledge benchmarks (GPQA, HLE):

<BenchBars
  title="AIME 2026 (%)"
  unit=""
  bars={[
    { label: "GLM 5.2", value: 99.2, highlight: true },
    { label: "GPT-5.5", value: 98.3 },
    { label: "Gemini 3.1 Pro", value: 98.2 },
    { label: "Claude Opus 4.8", value: 95.7 },
    { label: "GLM 5.1", value: 95.3 },
  ]}
/>

## Thinking effort, and what 1M costs to serve

GLM 5.2 exposes two reasoning-effort levels — `high` for everyday speed and `max` for
hard multi-step coding — and Z.ai positions its capability between Claude Opus 4.7 and
4.8 at similar token spend.

<Figure
  src="/articles/glm-5-2/glm52-effort-tokenbudget.png"
  alt="Agentic coding performance vs token budget at High and Max effort levels."
  caption="Effort vs token budget — Max trades more tokens for more capability on hard tasks."
/>

The 1M context is not free to serve. The bottleneck moves from raw compute to
**KV-cache capacity, long-context kernels, and CPU-side overhead**. The throughput
advantage grows with context length, but you need 8×H100-class hardware and ~1.5 TB
for the weights, and the API meters at 3× during peak hours.
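
The ~1.5 TB figure is what you get from 744B parameters at 2 bytes each. The storage precision is my assumption, as is the FP8 row shown for comparison; the sketch counts weights only and ignores KV cache and activations.

```python
TOTAL_PARAMS = 744e9  # total parameter count quoted in this article

def weight_bytes(params, bytes_per_param):
    """Raw storage for the weights alone, ignoring KV cache and activations."""
    return params * bytes_per_param

# Assumed storage widths: 2 bytes (bf16) and 1 byte (fp8).
for name, width in [("bf16", 2), ("fp8", 1)]:
    terabytes = weight_bytes(TOTAL_PARAMS, width) / 1e12
    print(f"{name}: {terabytes:.2f} TB")
```

For scale, eight accelerators with 80 GB each hold 640 GB in total, so under these assumptions the weights alone call for more than one such node, offloading, or a narrower format, before any KV cache for a 1M-token context is counted.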

<Figure
  src="/articles/glm-5-2/glm52-1m-throughput.png"
  alt="Serving throughput vs context length — GLM 5.2's advantage grows as context grows."
  caption="The IndexShare payoff at serving time: the throughput edge widens as context approaches 1M tokens."
/>

## What I make of it

- **The genuinely new bit is IndexShare** — a clean, well-motivated systems trick
  (reuse what's nearly identical instead of recomputing it), with a paper that shows
  *why* it's almost lossless. That's what turns "1M context" from a spec-sheet number
  into something you can actually serve.
- **It's the strongest open-weights model for long-horizon agentic coding**, and it's
  MIT-licensed. That combination matters more than the benchmark deltas — you can run
  and fine-tune it yourself.
- **It still trails the best closed frontier models** on most hard coding and
  reasoning axes (SWE-Bench Pro 62.1 vs Opus 4.8's 69.2), and it is heavy to
  self-host. The bet was never "beat Opus 4.8 everywhere" — it's "match the frontier on
  long-horizon work, in the open, at a million tokens." On that, it largely delivers.

---

*Sources: the [GLM 5.2 release blog](https://z.ai/blog/glm-5.2), the GLM-5 technical
report ([arXiv 2602.15763](https://arxiv.org/abs/2602.15763)), and the IndexCache
method paper behind IndexShare ([arXiv 2603.12201](https://arxiv.org/abs/2603.12201)).
Benchmark figures are from Z.ai; numbers quoted as reported.*
