~/satyajit

How self-attention works in transformers

mdjsonmcp

2026-06-02 · 3 min · transformers · deep-learning · explainer

Self-attention is the single mechanism that lets a transformer decide, for every token in a sequence, which other tokens are worth listening to. Older architectures like RNNs squeezed an entire sentence through a fixed-size hidden state and read it left to right. Attention throws that bottleneck out: every token can look directly at every other token in one parallel step, and it learns how much to look.

The trick is to give each token three learned vectors. The query asks a question ("what am I looking for?"), the key advertises what a token offers ("here is what I am about"), and the value is the actual content that gets passed along once a match is found. You compute these by multiplying the input embeddings by three learned weight matrices, WQW_Q, WKW_K, and WVW_V, giving matrices QQ, KK, and VV.

A token attends to another by comparing its query against that token's key with a dot product — a large dot product means the two vectors point in a similar direction, so the question and the offer line up. Do this for every query against every key and you get a full grid of raw compatibility scores.

Attention(Q,K,V)=softmax ⁣(QKdk)V\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

That one line is the whole operation. The matrix below shows the resulting weights for a tiny three-token sequence: each row is one query token, each column is a key it might attend to, and the cell shading is how much weight that pair receives after the softmax. Hover a row to see where that token looks.

thecatsat
the
cat
sat
attention weights — hover a query token (row) to see where it attends

It helps to walk the formula from the inside out. Each step below takes the previous result and transforms it; together they go from raw vectors to a context-aware output.

Q·Kᵀ — raw scores. Multiply the query matrix by the transpose of the key matrix. The entry at row i, column j is the dot product of token i's query with token j's key — an unnormalised score for how relevant token j is to token i. The result is a square matrix, one score for every ordered pair of tokens.

step 1/3

Stack several of these in parallel — each with its own WQW_Q, WKW_K, WVW_V — and you get multi-head attention, where different heads specialise in different relations (syntax, coreference, positional patterns). Concatenate the heads, project once more, and that becomes one transformer sub-layer. Repeat across depth and the model builds increasingly abstract, context-rich representations of the sequence.

That is the entire idea: project tokens into queries, keys, and values; score every pair with a scaled dot product; turn the scores into a distribution with softmax; and read out a weighted mix of values. Everything else in a transformer — feed-forward layers, residual connections, layer norm, positional encodings — exists to support and stack this one operation.

share