~/satyajit

How LLM inference works: prefill, decode, and where the time goes


2026-06-28 · 7 min · llm · inference-optimization · systems · kv-cache · explainer

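When a model feels slow in production, the first question I ask is which phase is slow, because a single generate() call isn't one workload. It's two, with opposite bottlenecks, running on the same GPU: prefill, which is compute-bound and sets Time To First Token (TTFT), and decode, which is memory-bound and sets Inter-Token Latency (ITL).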

Almost every inference optimization you'll read about targets one of these two phases. So before reaching for a fix, you have to know which one is hurting. Here's the whole pipeline, and where the time actually goes.

[Interactive figure: one generate() call, prefill → decode. Prefill is compute-bound (metric: TTFT); decode is memory-bound (metric: ITL). The example prompt "Explain how LLM inference works" is processed all at once in one matmul, filling a 5-token KV cache, and the output "It runs two phases on one GPU." is then generated. Prefill view: compute ~95% utilized, memory bandwidth ~40%.]

Prefill runs the whole prompt through every layer in parallel — a big matrix-matrix multiply that saturates the GPU's math units. Its latency is Time To First Token (TTFT), and it leaves behind the KV cache.

From text to vectors

Before either phase, the text becomes numbers. A tokenizer — usually byte-pair encoding (BPE) — splits the string into integer IDs from a vocabulary of roughly 50,000 entries. Each ID indexes a row of the embedding table, a learned [vocab_size, hidden_dim] matrix, so for hidden_dim = 4096 every token becomes a 4096-dimensional vector.
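The lookup itself is nothing more than row indexing. Here is a minimal sketch with NumPy, using toy sizes instead of 50,000 × 4096; the token IDs are made up rather than produced by a real tokenizer, and the table is random rather than learned.

```python
import numpy as np

# Toy sizes for illustration; the example in the text is ~50,000 x 4096.
vocab_size, hidden_dim = 1000, 8

rng = np.random.default_rng(0)
# Stand-in for the learned embedding table of shape [vocab_size, hidden_dim].
embedding_table = rng.standard_normal((vocab_size, hidden_dim)).astype(np.float32)

# Hypothetical token IDs, as if a BPE tokenizer had produced them.
token_ids = np.array([17, 542, 9, 17])

# Embedding is row indexing: one row of the table per token ID.
embedded = embedding_table[token_ids]

print(embedded.shape)  # (4, 8)
```

The same ID always maps to the same vector (positions 0 and 3 above are identical), which is why position has to be added separately.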

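Position has to get in somewhere too. Modern models use rotary position embeddings (RoPE), which encode position by rotating each query/key vector by an angle proportional to its index, rather than adding a separate positional vector to the embedding. The rotation is applied to Q and K inside each attention layer. It's cheap, and because the resulting attention scores depend only on the relative offset between two tokens, it helps the same weights cope with different sequence lengths.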
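To make the rotation concrete, here is a rough sketch of RoPE on a single vector. The pairing of adjacent dimensions and the base of 10000 are common conventions that I'm assuming here, not details of any particular model; real implementations differ in layout.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate consecutive pairs of dimensions of a 1-D vector x.

    Each pair is rotated by position * frequency, with a different
    frequency per pair (assumed convention: base ** (-2i / dim)).
    """
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal(8)

# Same relative offset (2) at two different absolute positions.
score_near = rope(q, 5) @ rope(k, 3)
score_far = rope(q, 12) @ rope(k, 10)
print(np.isclose(score_near, score_far))  # True
```

The two scores match because rotating q by position m and k by position n leaves a dot product that depends only on m − n.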

Inside a layer

The embedded sequence flows through a stack of transformer layers — 32 for a 7B, 80+ for the big ones. Each layer is two operations:

  1. Self-attention projects every token into a query Q, key K, and value V. Each token's query is scored against the key of every token at or before it (later positions are masked out); the scores are scaled and softmaxed into weights that mix the values. This is the only place information moves between positions.
  2. Feed-forward network (FFN) — a two-layer MLP applied to each token independently. Attention routes information across positions; the FFN transforms it in place.
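The two operations above can be sketched in a few lines of NumPy. This is a deliberately stripped-down version: one attention head, a causal mask (as decoder-only models use), ReLU as a stand-in activation, and no residual connections or normalization, all of which a real layer would have.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(h, Wq, Wk, Wv):
    """Single-head causal attention over a sequence h of shape [n, d]."""
    Q, K, V = h @ Wq, h @ Wk, h @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    n = h.shape[0]
    # Mask out keys that come after the query position.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ V

def ffn(h, W1, W2):
    """Two-layer MLP applied to each row (token) independently."""
    return np.maximum(h @ W1, 0.0) @ W2

rng = np.random.default_rng(0)
n, d = 4, 8
h = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
W1 = rng.standard_normal((d, 4 * d)) / np.sqrt(d)
W2 = rng.standard_normal((4 * d, d)) / np.sqrt(4 * d)

out = ffn(self_attention(h, Wq, Wk, Wv), W1, W2)
print(out.shape)  # (4, 8)
```

Note where the sequence dimension matters: attention builds an n × n score matrix, while the FFN never looks across rows.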

After the last layer, the final position's hidden state is projected back to vocabulary size, softmaxed, and sampled — that's one output token. How that projection-and-sample gets driven is exactly what differs between the two phases.

Prefill: compute-bound

Prefill processes the entire prompt at once. Q, K, V are computed for every prompt token in parallel, and attention is a big matrix-matrix multiply. That's dense arithmetic, and it saturates the GPU's math units — utilization runs near 100%. The metric that captures this phase is Time To First Token (TTFT): how long before the first output appears.

Prefill also populates the KV cache — the K and V tensors for every layer get written to GPU memory so they never have to be recomputed. That cache is what makes the next phase cheap, and also what makes it expensive.

Decode: memory-bound

Once the first token exists, generation switches to one token per step. For each new token the model computes Q, K, V for that token only; the keys and values for everything before it are already cached. So the attention is one query vector against a cached key matrix — a matrix-vector multiply, almost no arithmetic.
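A small sketch shows that the cached path gives the same answer as recomputing everything. It uses a single head and a single layer with random weights, and treats the token vectors as the layer input directly; those are simplifications for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, K, V):
    """One query vector q [d] against key/value matrices K, V [t, d]."""
    scores = K @ q / np.sqrt(q.shape[-1])  # matrix-vector product
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d = 16
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
prompt = rng.standard_normal((5, d))  # 5 prompt tokens
new_tok = rng.standard_normal(d)      # the token being decoded

# Prefill: project K and V for the whole prompt once and keep them.
K_cache = prompt @ Wk
V_cache = prompt @ Wv

# Decode step: project only the new token, append its K/V, attend over the cache.
q = new_tok @ Wq
K_cache = np.vstack([K_cache, new_tok @ Wk])
V_cache = np.vstack([V_cache, new_tok @ Wv])
out_cached = attend(q, K_cache, V_cache)

# Reference: recompute K and V for all six tokens from scratch.
full = np.vstack([prompt, new_tok])
out_full = attend(full[-1] @ Wq, full @ Wk, full @ Wv)

print(np.allclose(out_cached, out_full))  # True
```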

And yet decode is the slow part per token, because the GPU still has to stream every weight matrix and the entire KV cache out of memory to do that tiny computation. The bottleneck flips from arithmetic to memory bandwidth. The metric here is Inter-Token Latency (ITL) — the gap between consecutive tokens, which is what makes a stream feel fast or sluggish. GPU utilization during decode can sit at 30% on a fully loaded server, because the math units are starved waiting on memory.
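A back-of-envelope calculation makes the bandwidth bottleneck tangible. The model size and cache size below reuse the 13B numbers from later in this post; the 2 TB/s memory bandwidth is a round number I'm assuming for illustration, not the spec of a specific GPU, and the estimate ignores everything except streaming weights and cache once per token.

```python
def decode_step_floor_ms(weight_bytes, kv_cache_bytes, bandwidth_bytes_per_s):
    """Lower bound on one decode step: time to read weights + cache once."""
    return (weight_bytes + kv_cache_bytes) / bandwidth_bytes_per_s * 1e3

params = 13e9
weight_bytes = params * 2        # FP16: 2 bytes per parameter
kv_cache_bytes = 4096 * 1e6      # 4K context at ~1 MB per token
bandwidth = 2e12                 # ASSUMED: 2 TB/s of memory bandwidth

floor_ms = decode_step_floor_ms(weight_bytes, kv_cache_bytes, bandwidth)
print(f"{floor_ms:.1f} ms per token")  # 15.0 ms per token

# Arithmetic intensity at batch size 1: roughly 2 FLOPs (multiply + add)
# per parameter, against 2 bytes read per parameter.
flops_per_byte = (2 * params) / weight_bytes
print(flops_per_byte)  # 1.0
```

About one FLOP per byte read is far below what GPU math units can sustain, which is why they sit idle while the memory system does the real work.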

| | prefill | decode |
|---|---|---|
| Work | whole prompt, parallel | one token at a time |
| Attention shape | matrix × matrix | matrix × vector |
| Bottleneck | compute (arithmetic) | memory bandwidth |
| Metric | TTFT | ITL |
| GPU util | ~95% | ~30% |
| Optimize by | more FLOPs, better kernels | smaller cache, faster memory, batching |

The KV cache runs the economics

The cache is the single most important object in LLM serving. Without it, generating a 1000-token response would recompute K and V for the whole growing sequence at every step, so total projection work grows quadratically with length. With it, each step projects K/V for one new token and appends them, so that work is constant per step and linear overall. (The new query still reads every cached key, which is part of why decode is memory-bound.)

[Interactive figure: KV cache, work per decode step. With the cache on, per-step work stays flat at 1×; total work over 14 steps is 14 (linear), a ~7.5× speedup over recomputing.]

With the cache, each step appends one token's K/V and does O(1) new projection work, so generation is linear in length. Over the 14 steps shown that is a ~7.5× speedup over recomputing, and the gap widens for longer outputs.
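The numbers in that figure are consistent with a simple count, sketched below. The unit of work is "project K/V for one token", and I'm assuming the uncached path re-projects every token seen so far at each step; attention reads over the cache are not counted.

```python
def kv_projection_work(steps, use_cache):
    """Total K/V projections needed to generate `steps` tokens."""
    if use_cache:
        return steps                     # one new token per step
    return sum(range(1, steps + 1))      # step t re-projects all t tokens

cached = kv_projection_work(14, use_cache=True)
uncached = kv_projection_work(14, use_cache=False)
print(cached, uncached, uncached / cached)  # 14 105 7.5

# The ratio is (steps + 1) / 2, so it keeps growing with output length.
print(kv_projection_work(1000, False) / kv_projection_work(1000, True))  # 500.5
```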

[Interactive figure: KV cache size versus context length for a 13B model at ~1 MB/token. At 4k tokens the cache is 3.9 GB, which allows roughly 20 concurrent requests on an 80 GB GPU.]

The trade is brutal and unavoidable: the cache grows linearly with sequence length and with layer count. For a 13B model it's roughly 1 MB per token, so a 4K context is ~4 GB of VRAM spent on cache alone, per request, on top of the weights. That memory competes directly with batch size: every gigabyte on cache is a gigabyte not serving another request. Long contexts are expensive not because of compute, but because they evict concurrency.

The standard mitigations all attack the cache from different angles: quantizing it (INT8/INT4), windowing it, sharing K/V across heads (GQA), and paging it (PagedAttention).
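As a rough sketch of how two of those levers move the number, here is the per-token cache size as plain arithmetic. The shape (40 layers, 40 heads of dimension 128) is what I'm assuming for a 13B-class model, and the 8-KV-head GQA variant is hypothetical; neither is taken from a specific model card.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value):
    """Cache bytes added per token: one K and one V vector per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value

# ASSUMED 13B-class shape: 40 layers, 40 attention heads, head_dim 128.
mha_fp16 = kv_bytes_per_token(40, 40, 128, 2)
mha_int8 = kv_bytes_per_token(40, 40, 128, 1)   # quantized cache
gqa_fp16 = kv_bytes_per_token(40, 8, 128, 2)    # hypothetical: 8 shared KV heads

print(mha_fp16, mha_int8, gqa_fp16)  # 819200 409600 163840

context = 4096
print(round(mha_fp16 * context / 1e9, 2))  # 3.36 (GB for one 4K request)
```

That lands at about 0.8 MB per token, in line with the rough ~1 MB figure. Quantizing to INT8 halves it, and sharing K/V across groups of heads divides it by the group size.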

Redesigning attention around the cache

The deeper move is to make the cache structurally smaller from the start, by changing attention itself. DeepSeek's V4 series does this with a hybrid of two compressed mechanisms: Compressed Sparse Attention (compress KV ~4× with softmax-gated pooling, then attend sparsely) and Heavily Compressed Attention (consolidate KV across 128 tokens into one entry, attend densely over those). At a 1M-token context, V4-Pro needs about 27% of the single-token inference FLOPs and 10% of the KV cache of its predecessor — in absolute terms, ~9.62 GiB of cache per sequence in bf16 versus an estimated ~83.9 GiB for the older design, and fp4/fp8 halves it again. (I went deeper on V4's drafter in the DSpark write-up.) The cache has become the constraint the architecture is being designed around.

Quantization

Training needs FP32/BF16 for gradient stability. Inference doesn't. Dropping bit-width saves memory linearly, and quality barely moves.

[Interactive figure: weights memory = params × bytes per param. Example: a 7B model at INT4 is 3.5 GB and stays within 1–2 points of full precision (GPTQ / AWQ). Counting weights only, it fits on a 6 GB laptop GPU, a 24 GB RTX 4090, a 40 GB A100, and an 80 GB H100.]

The savings are linear in bit-width, and inference quality barely moves. Going FP16 → INT8 often roughly halves latency with negligible quality loss, which follows from decode being memory-bound: half the bytes per weight means roughly half the weight data to stream per token. INT4 lands within 1–2 points of full precision on standard benchmarks when methods like GPTQ and AWQ use per-channel scaling to contain the loss. That is why a 7B model at INT4 (3.5 GB) runs on a 4–6 GB laptop GPU where its FP16 form (14 GB) wouldn't fit, and why quantization is usually the highest-leverage single change for a deployment.
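The sizes quoted here are just params × bytes-per-param. A tiny sketch, counting weights only (a real deployment also needs room for the KV cache and activations, so "fits" below is optimistic):

```python
BITS = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}

def weights_gb(params, precision):
    """Weight memory in GB (10^9 bytes) at the given precision."""
    return params * BITS[precision] / 8 / 1e9

def fits(params, precision, vram_gb):
    """Weights-only check; ignores KV cache and activation memory."""
    return weights_gb(params, precision) <= vram_gb

print(weights_gb(7e9, "FP16"))  # 14.0
print(weights_gb(7e9, "INT4"))  # 3.5
print(fits(7e9, "FP16", 6), fits(7e9, "INT4", 6))  # False True
```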

The serving layer

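On top of the prefill/decode loop sits the infrastructure that makes a GPU economical, most visibly batching many requests together and managing KV-cache memory (for example by paging it).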

Frameworks like vLLM, TensorRT-LLM, and TGI combine all of this. The throughput they get comes mostly from the fact that decode is memory-bound, so there's spare arithmetic lying around for batching to soak up.
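Here is a toy model of why batching is nearly free during decode. Each step takes whichever is longer: streaming the weights once (shared by the whole batch) or doing the arithmetic for every sequence in the batch. The bandwidth and peak-FLOPs figures are round numbers I'm assuming, not real hardware specs, and KV-cache traffic, which does grow with batch size, is left out.

```python
PARAMS = 13e9
WEIGHT_BYTES = PARAMS * 2   # FP16
BANDWIDTH = 2e12            # ASSUMED: 2 TB/s memory bandwidth
PEAK_FLOPS = 4e14           # ASSUMED: 400 TFLOP/s peak compute

def step_seconds(batch):
    """One decode step for the whole batch, under the toy model above."""
    memory_time = WEIGHT_BYTES / BANDWIDTH         # weights are read once per step
    compute_time = batch * 2 * PARAMS / PEAK_FLOPS  # ~2 FLOPs per param per token
    return max(memory_time, compute_time)

def tokens_per_second(batch):
    return batch / step_seconds(batch)

for batch in (1, 32, 200, 400):
    print(batch, round(step_seconds(batch) * 1e3, 1), round(tokens_per_second(batch)))
```

Under these assumptions the step time stays at 13 ms from batch 1 through roughly batch 200, so throughput scales almost linearly with batch size; past that point compute becomes the limit and throughput flattens.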

The full path

  1. Tokenize — text → integer IDs via BPE.
  2. Embed — IDs → vectors; RoPE rotates in position.
  3. Prefill — all prompt tokens through every layer in parallel; compute-bound; KV cache populated; first token emitted (TTFT).
  4. Decode loop — one token per step: project Q, attend over cached K/V, run FFN, sample, append to cache; memory-bound (ITL).
  5. Detokenize — IDs → text, streamed out.
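The five steps above can be strung together in a toy generate loop. Everything here is a stand-in: a character-level vocabulary instead of BPE, random weights, one attention head, no RoPE, no FFN, and greedy argmax sampling. It only shows where prefill ends and the decode loop begins.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = sorted(set("abcdefgh "))          # toy character vocabulary, not BPE
d = 16
E = rng.standard_normal((len(vocab), d))  # embedding table, reused as output projection
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def tokenize(text):
    return [vocab.index(ch) for ch in text]

def detokenize(ids):
    return "".join(vocab[i] for i in ids)

def next_token(h_last, K, V):
    """Attend from the last position over cached K/V, then take the argmax token."""
    scores = K @ (h_last @ Wq) / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    logits = (h_last + w @ V) @ E.T
    return int(np.argmax(logits))

def generate(prompt, max_new_tokens):
    ids = tokenize(prompt)                # 1. tokenize
    h = E[ids]                            # 2. embed
    K, V = h @ Wk, h @ Wv                 # 3. prefill: K/V for all prompt tokens at once
    out = [next_token(h[-1], K, V)]       #    first token emitted (TTFT ends here)
    for _ in range(max_new_tokens - 1):   # 4. decode loop, one token per step
        h_new = E[out[-1]]
        K = np.vstack([K, h_new @ Wk])    #    append one token's K/V to the cache
        V = np.vstack([V, h_new @ Wv])
        out.append(next_token(h_new, K, V))
    return detokenize(out)                # 5. detokenize

text = generate("abc", 5)
print(len(text))  # 5
```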

How to actually use this

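The whole point of splitting it this way is diagnosis. When something is slow, the comparison between the two phases tells you where to look:

  1. High TTFT means prefill is the problem, and prefill is compute-bound: more FLOPs and better kernels help.
  2. High ITL means decode is the problem, and decode is memory-bound: a smaller cache, faster memory, and batching help.
  3. Low GPU utilization during decode is expected, not a sign that the server needs more compute.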

That last instinct is the one I'd internalize: during decode the arithmetic units are mostly idle, so when a decode-bound server is slow, throwing a bigger compute budget at it does nothing. The bottleneck is the memory bus. Optimize the thing that's actually full.
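Before any of that, you need the two numbers. A minimal sketch of pulling TTFT and mean ITL out of a streamed response, assuming you have recorded the request start time and the arrival time of each token (the timestamps below are made up):

```python
def ttft_and_itl(request_start, token_times):
    """Return (TTFT, mean ITL) in the same units as the timestamps."""
    ttft = token_times[0] - request_start
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_itl

# Hypothetical arrival times in seconds for a four-token response.
ttft, itl = ttft_and_itl(0.0, [0.40, 0.45, 0.50, 0.55])
print(round(ttft, 3), round(itl, 3))  # 0.4 0.05
```

A large TTFT with a small ITL points at prefill; the reverse points at decode.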
