# How self-attention works in transformers

> Satyajit Ghana — Head of Engineering @ Inkers Technology
> canonical: https://ai.thesatyajit.com/articles/how-transformers-attention-works
> date: 2026-06-02
> tags: transformers, deep-learning, explainer

Self-attention is the single mechanism that lets a transformer decide, for every
token in a sequence, which other tokens are worth listening to. Older architectures
like RNNs squeezed an entire sentence through a fixed-size hidden state and read it
left to right. Attention throws that bottleneck out: every token can look directly
at every other token in one parallel step, and it learns *how much* to look.

The trick is to give each token three learned vectors. The **query** asks a question
("what am I looking for?"), the **key** advertises what a token offers ("here is what
I am about"), and the **value** is the actual content that gets passed along once a
match is found. You compute these by multiplying the input embeddings by three
learned weight matrices, $W_Q$, $W_K$, and $W_V$, giving matrices $Q$, $K$, and $V$.

A token attends to another by comparing its query against that token's key with a
dot product — a large dot product means the two vectors point in a similar direction,
so the question and the offer line up. Do this for every query against every key and
you get a full grid of raw compatibility scores.

$$
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

That one line is the whole operation. The matrix below shows the resulting weights
for a tiny three-token sequence: each row is one query token, each column is a key it
might attend to, and the cell shading is how much weight that pair receives after the
softmax. Hover a row to see where that token looks.

<AttentionMatrix tokens={["the", "cat", "sat"]} />

It helps to walk the formula from the inside out. Each step below takes the previous
result and transforms it; together they go from raw vectors to a context-aware output.

<StepThrough titles={["scores", "weights", "mix"]}>

**Q·Kᵀ — raw scores.** Multiply the query matrix by the transpose of the key matrix.
The entry at row *i*, column *j* is the dot product of token *i*'s query with token
*j*'s key — an unnormalised score for how relevant token *j* is to token *i*. The
result is a square matrix, one score for every ordered pair of tokens.

**Scale, then softmax — attention weights.** Divide every score by $\sqrt{d_k}$, the
square root of the key dimension. Without this, large dimensions produce dot products
with a big variance, pushing the softmax into saturated regions where gradients
vanish; the scaling keeps the distribution well-behaved. Then apply softmax across
each row so the weights are non-negative and sum to one — a proper distribution over
"where this token attends."

**Weighted sum — the output.** Multiply the weight matrix by the value matrix $V$.
Each output row is a weighted average of all value vectors, blended according to that
token's attention weights. A token that attended strongly to "cat" inherits most of
"cat"'s value, so its new representation is now informed by the context around it.

</StepThrough>

Stack several of these in parallel — each with its own $W_Q$, $W_K$, $W_V$ — and you
get **multi-head attention**, where different heads specialise in different relations
(syntax, coreference, positional patterns). Concatenate the heads, project once more,
and that becomes one transformer sub-layer. Repeat across depth and the model builds
increasingly abstract, context-rich representations of the sequence.

<Callout type="tip">
  The √dₖ scaling is easy to skip when implementing attention from scratch, but
  dropping it is one of the most common reasons a hand-rolled transformer trains
  slowly or not at all — the softmax saturates and gradients stop flowing.
</Callout>

That is the entire idea: project tokens into queries, keys, and values; score every
pair with a scaled dot product; turn the scores into a distribution with softmax; and
read out a weighted mix of values. Everything else in a transformer — feed-forward
layers, residual connections, layer norm, positional encodings — exists to support
and stack this one operation.
