[{"title":"Coroutines in C, intuitively","description":"How to pause a function in the middle and resume it later — using nothing but a switch statement and __LINE__. An intuitive tour of Simon Tatham's classic trick, with a step-through animation.","date":"2026-06-09","tags":["c","coroutines","systems","explainer"],"draft":false,"kind":"articles","slug":"coroutines-in-c","body":"\nSome functions want to be *callers*. Some want to be *callees*. The trouble starts\nwhen two pieces of code both want to be the caller.\n\nPicture a decompressor that walks a byte stream and emits one character at a time,\nand a parser that consumes characters one at a time. Each is most natural as a loop\nthat *drives* the other:\n\n<Diagram caption=\"Both want to be the loop. Only one can be — the other must invert into a state machine.\">\n  <svg\n    viewBox=\"0 0 600 220\"\n    role=\"img\"\n    aria-label=\"Two functions, a decompressor and a parser, each naturally a loop that wants to drive the other.\"\n    style={{ width: \"100%\", height: \"auto\", color: \"var(--foreground)\" }}\n  >\n    <defs>\n      <marker id=\"cf-arrow\" viewBox=\"0 0 10 10\" refX=\"8\" refY=\"5\" markerWidth=\"6\" markerHeight=\"6\" orient=\"auto-start-reverse\">\n        <path d=\"M0,0 L10,5 L0,10 z\" fill=\"currentColor\" />\n      </marker>\n    </defs>\n\n    {/* left: decompressor loop */}\n    <rect x=\"20\" y=\"50\" width=\"200\" height=\"120\" rx=\"10\" fill=\"none\" stroke=\"currentColor\" strokeOpacity=\"0.5\" />\n    <text x=\"120\" y=\"78\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"14\" fill=\"currentColor\">decompressor</text>\n    <text x=\"120\" y=\"98\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"currentColor\" opacity=\"0.6\">while (bytes) emit(c)</text>\n    {/* loop arrow */}\n    <path d=\"M 92 120 A 28 28 0 1 1 148 120\" fill=\"none\" stroke=\"currentColor\" strokeWidth=\"1.5\" markerEnd=\"url(#cf-arrow)\" />\n    <text x=\"120\" y=\"128\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"currentColor\" opacity=\"0.6\">loop</text>\n    <text x=\"120\" y=\"190\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"currentColor\" opacity=\"0.55\">wants to push</text>\n\n    {/* right: parser loop */}\n    <rect x=\"380\" y=\"50\" width=\"200\" height=\"120\" rx=\"10\" fill=\"none\" stroke=\"currentColor\" strokeOpacity=\"0.5\" />\n    <text x=\"480\" y=\"78\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"14\" fill=\"currentColor\">parser</text>\n    <text x=\"480\" y=\"98\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"currentColor\" opacity=\"0.6\">while (chars) use(c)</text>\n    <path d=\"M 452 120 A 28 28 0 1 1 508 120\" fill=\"none\" stroke=\"currentColor\" strokeWidth=\"1.5\" markerEnd=\"url(#cf-arrow)\" />\n    <text x=\"480\" y=\"128\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"currentColor\" opacity=\"0.6\">loop</text>\n    <text x=\"480\" y=\"190\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"currentColor\" opacity=\"0.55\">wants to pull</text>\n\n    {/* the clash in the middle */}\n    <line x1=\"232\" y1=\"104\" x2=\"368\" y2=\"104\" stroke=\"currentColor\" strokeWidth=\"1.5\" markerEnd=\"url(#cf-arrow)\" opacity=\"0.8\" />\n    <line x1=\"368\" y1=\"124\" x2=\"232\" y2=\"124\" stroke=\"currentColor\" strokeWidth=\"1.5\" markerEnd=\"url(#cf-arrow)\" opacity=\"0.8\" />\n    <text x=\"300\" y=\"150\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"22\" fill=\"currentColor\" fontWeight=\"bold\">?</text>\n    <text x=\"300\" y=\"172\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"currentColor\" opacity=\"0.55\">who calls whom</text>\n  </svg>\n</Diagram>\n\nWhichever one you make a *callee*, you have to turn inside-out: rip out its loop,\nhoist its locals into `static` state, and reconstruct \"where was I?\" by hand every\ntime it's called. The algorithm disappears into a state machine.\n\nA **coroutine** is the escape hatch: a function you can `return` from *in the middle*\nand later resume *exactly where it left off*, locals and loop position intact. C\ndoesn't have them. But — as Simon Tatham showed in his\n[classic note](https://www.chiark.greenend.org.uk/~sgtatham/coroutines.html) — you\ncan fake them with a `switch` statement and one preprocessor macro.\n\n## The painful version first\n\nHere's that decompressor rewritten as a callee the honest way — a hand-rolled state\nmachine. It works, and it's miserable:\n\n```c\nint decompressor(void) {\n  static int state = 0, len, c;\n  switch (state) {\n    case 0:                 /* fresh start */\n      while (1) {\n        c = getchar();\n        if (c == EOF) return EOF;\n        if (c == 0xFF) {    /* run-length escape */\n          len = getchar();\n          c = getchar();\n          while (len--) {\n            state = 1; return c;   /* <-- emit, remember we're here */\n            case 1: ;              /* <-- ...come back to here */\n          }\n        } else {\n          state = 2; return c;\n          case 2: ;\n        }\n      }\n  }\n}\n```\n\nEvery `return` needs a unique number, a matching `case`, and an assignment to\n`state`. Add a branch and you renumber everything. The bookkeeping *is* the bug\nsurface.\n\n<Callout type=\"note\">\n  Notice the `case 1:` sitting **inside** the `while` loop, underneath a `switch`\n  that's outside it. That's legal C — `case` labels can live in any sub-block of a\n  `switch`. This is the same quirk that powers Duff's device, and it's the whole\n  trick.\n</Callout>\n\n## The insight: let `__LINE__` be the state\n\nThe numbers are pure noise. We never *read* them — we only need each `return` to\nhave a label unique to its position, and a way to jump back to it. The C preprocessor\nalready hands out a unique number per position: `__LINE__`.\n\nSo: on the way out, save `__LINE__`. On the way back in, `switch` on the saved value\nand let a `case __LINE__:` right after the `return` catch it. Two macros:\n\n```c\n#define crBegin     static int state = 0; switch (state) { case 0:\n#define crReturn(x) do { state = __LINE__; return x; \\\n                         case __LINE__: ; } while (0)\n#define crFinish    }\n```\n\nThat's the entire idea. `crBegin` opens a `switch` on the saved state. `crReturn`\nstamps the current line into `state`, returns, and drops a `case` label at that exact\nline so the next call resumes one statement later. `crFinish` closes the brace.\n\n## Watch it run\n\nA three-value generator — `next()` returns 0, 1, 2, then -1 — makes the control flow\nvisible. Step through it: watch `state` get stamped with a line number on the way out,\nand the `switch` teleport straight back into the middle of the `for` loop on the way\nback in.\n\n<CoroutineStepper />\n\nThe magic moment is the jump from `switch (state)` to `case __LINE__:` *inside* the\nloop. The function never \"starts over\" — it lands back exactly where it returned, with\n`i` right where it was.\n\n## How the macros expand\n\nIt reads like ordinary code, but here's what the preprocessor actually produces, one\nlayer at a time:\n\n<StepThrough titles={[\"you write\", \"expand crBegin\", \"expand crReturn\", \"what runs\"]}>\n\nYou write the coroutine in its natural, loop-shaped form:\n\n```c\nint next(void) {\n  static int i;\n  crBegin;\n  for (i = 0; i < 3; i++)\n    crReturn(i);\n  crFinish;\n}\n```\n\n`crBegin` becomes a `switch` on the saved state, entered at `case 0` on the first call:\n\n```c\nint next(void) {\n  static int i;\n  static int state = 0; switch (state) { case 0:\n  for (i = 0; i < 3; i++)\n    crReturn(i);\n  }\n}\n```\n\n`crReturn(i)` stamps the line number, returns, and leaves a `case` label one line on:\n\n```c\nfor (i = 0; i < 3; i++) {\n  state = __LINE__; return i;\n  case __LINE__: ;\n}\n```\n\nSo the next call jumps from `switch (state)` *directly* to that `case` — back inside\nthe `for` loop, with `i` preserved. No re-entry, no restart:\n\n```c\nswitch (state) {     /* state == that line number */\n  case 0: ...\n  case 17: ;         /* <-- lands here, mid-loop */\n}\n```\n\n</StepThrough>\n\n## Where it bites\n\nThis is a beautiful hack, and like every beautiful hack it has sharp edges. Tatham is\ncandid about them, and you should be too:\n\n<Callout type=\"warn\">\n  **Only `static` locals survive.** A normal `auto` variable is undefined after a\n  `crReturn` — its storage isn't preserved across the return. Loop counters and any\n  state you care about must be `static`. **One `crReturn` per line** (two share a\n  `__LINE__` and collide). And you **can't wrap the body in your own `switch`** — it\n  would capture the `case` labels meant for the coroutine.\n</Callout>\n\nThe `static` rule hides a worse problem: `static` means *one shared instance*. Two\ncallers can't run the same coroutine independently — they'd stomp each other's `state`\nand `i`. Fine for a single global decompressor; fatal for anything reentrant or\nthreaded.\n\n## Making it reentrant\n\nThe fix is to stop using `static` and instead thread all the state through a context\nstruct the caller owns. Every \"serious\" local becomes a field; the macros read and\nwrite `ctx->state` instead of a file-scoped one:\n\n```c\nstruct coro {\n  int state;\n  int i, len, c;   /* everything that must survive a yield */\n};\n\n#define crBegin(ctx)     switch ((ctx)->state) { case 0:\n#define crReturn(ctx, x) do { (ctx)->state = __LINE__; return x; \\\n                              case __LINE__: ; } while (0)\n#define crFinish         }\n\nint next(struct coro *ctx) {\n  crBegin(ctx);\n  for (ctx->i = 0; ctx->i < 3; ctx->i++)\n    crReturn(ctx, ctx->i);\n  crFinish;\n  return -1;\n}\n```\n\nNow each caller allocates its own `struct coro`, and you can run a hundred independent\ngenerators at once. The price is cosmetic — `ctx->i` everywhere you'd have written\n`i` — and Tatham's own verdict is the honest one: *\"virtually all your serious\nvariables become elements of the coroutine context structure.\"* You trade a little\nsyntax for reentrancy. Usually worth it.\n\n## Why this matters beyond the trick\n\nYou don't reach for these macros often — real codebases use explicit state machines,\nthreads, or a language with `async`/`yield` built in. But the idea underneath is worth\nkeeping: **a coroutine is just a state machine where the compiler tracks the state for\nyou.** `async/await` in Rust, generators in Python, goroutines parked on a channel —\nall of them are, at bottom, \"save where I am, return, resume later.\" Tatham's macro is\nthat idea stripped to its absolute minimum: one `switch`, one `__LINE__`, and the\nnerve to put a `case` label inside a loop.\n\n---\n\n*Built on Simon Tatham's [Coroutines in C](https://www.chiark.greenend.org.uk/~sgtatham/coroutines.html) (2000) — still the clearest thing ever written on the subject.*\n","readingTimeMins":7,"url":"https://ai.thesatyajit.com/articles/coroutines-in-c"},{"title":"How self-attention works in transformers","description":"A from-scratch explainer of scaled dot-product attention — queries, keys, values, the softmax, and why the √d scaling matters.","date":"2026-06-02","tags":["transformers","deep-learning","explainer"],"draft":false,"kind":"articles","slug":"how-transformers-attention-works","body":"\nSelf-attention is the single mechanism that lets a transformer decide, for every\ntoken in a sequence, which other tokens are worth listening to. Older architectures\nlike RNNs squeezed an entire sentence through a fixed-size hidden state and read it\nleft to right. Attention throws that bottleneck out: every token can look directly\nat every other token in one parallel step, and it learns *how much* to look.\n\nThe trick is to give each token three learned vectors. The **query** asks a question\n(\"what am I looking for?\"), the **key** advertises what a token offers (\"here is what\nI am about\"), and the **value** is the actual content that gets passed along once a\nmatch is found. You compute these by multiplying the input embeddings by three\nlearned weight matrices, $W_Q$, $W_K$, and $W_V$, giving matrices $Q$, $K$, and $V$.\n\nA token attends to another by comparing its query against that token's key with a\ndot product — a large dot product means the two vectors point in a similar direction,\nso the question and the offer line up. Do this for every query against every key and\nyou get a full grid of raw compatibility scores.\n\n$$\n\\text{Attention}(Q, K, V) = \\text{softmax}\\!\\left(\\frac{Q K^{\\top}}{\\sqrt{d_k}}\\right) V\n$$\n\nThat one line is the whole operation. The matrix below shows the resulting weights\nfor a tiny three-token sequence: each row is one query token, each column is a key it\nmight attend to, and the cell shading is how much weight that pair receives after the\nsoftmax. Hover a row to see where that token looks.\n\n<AttentionMatrix tokens={[\"the\", \"cat\", \"sat\"]} />\n\nIt helps to walk the formula from the inside out. Each step below takes the previous\nresult and transforms it; together they go from raw vectors to a context-aware output.\n\n<StepThrough titles={[\"scores\", \"weights\", \"mix\"]}>\n\n**Q·Kᵀ — raw scores.** Multiply the query matrix by the transpose of the key matrix.\nThe entry at row *i*, column *j* is the dot product of token *i*'s query with token\n*j*'s key — an unnormalised score for how relevant token *j* is to token *i*. The\nresult is a square matrix, one score for every ordered pair of tokens.\n\n**Scale, then softmax — attention weights.** Divide every score by $\\sqrt{d_k}$, the\nsquare root of the key dimension. Without this, large dimensions produce dot products\nwith a big variance, pushing the softmax into saturated regions where gradients\nvanish; the scaling keeps the distribution well-behaved. Then apply softmax across\neach row so the weights are non-negative and sum to one — a proper distribution over\n\"where this token attends.\"\n\n**Weighted sum — the output.** Multiply the weight matrix by the value matrix $V$.\nEach output row is a weighted average of all value vectors, blended according to that\ntoken's attention weights. A token that attended strongly to \"cat\" inherits most of\n\"cat\"'s value, so its new representation is now informed by the context around it.\n\n</StepThrough>\n\nStack several of these in parallel — each with its own $W_Q$, $W_K$, $W_V$ — and you\nget **multi-head attention**, where different heads specialise in different relations\n(syntax, coreference, positional patterns). Concatenate the heads, project once more,\nand that becomes one transformer sub-layer. Repeat across depth and the model builds\nincreasingly abstract, context-rich representations of the sequence.\n\n<Callout type=\"tip\">\n  The √dₖ scaling is easy to skip when implementing attention from scratch, but\n  dropping it is one of the most common reasons a hand-rolled transformer trains\n  slowly or not at all — the softmax saturates and gradients stop flowing.\n</Callout>\n\nThat is the entire idea: project tokens into queries, keys, and values; score every\npair with a scaled dot product; turn the scores into a distribution with softmax; and\nread out a weighted mix of values. Everything else in a transformer — feed-forward\nlayers, residual connections, layer norm, positional encodings — exists to support\nand stack this one operation.\n","readingTimeMins":3,"url":"https://ai.thesatyajit.com/articles/how-transformers-attention-works"}]