{"blog":[{"title":"My site's chatbot was stuffing 273k tokens into every message. I gave it tools instead.","description":"The assistant on this site used to paste the entire content corpus — every article, every log — into the system prompt on every request: ~273k tokens, past the model's own context window. I rebuilt it as a small tool-using agent that retrieves on demand: three tools (BM25 search, fetch-a-page, and a think scratchpad), a ~5k-token catalog, and the bodies pulled only when needed. Notes on the Kimi / pi / Anthropic ideas I borrowed.","date":"2026-07-20","tags":["agents","llm","tools","retrieval","systems"],"draft":false,"kind":"blog","slug":"site-agent-dynamic-tools","body":"The assistant embedded on this site — the `⌘K` console, `/api/ask`, and the `ask_satyajit`\nMCP tool — worked. It was also doing something faintly absurd. Every single request built its\nsystem prompt like this:\n\n```ts\n// the old lib/chat.ts, abridged\nconst sections = []\nfor (const page of [\"about\", \"resume\", \"health\", \"now\", \"uses\", \"reading\"])\n  sections.push(await dataPageMarkdown(page))\nfor (const item of getAllContent())           // every article, blog, log, digest…\n  sections.push(contentMarkdown(item.kind, item.slug))\n\nreturn [PERSONA, RULES, \"=== CONTENT CORPUS ===\", sections.join(\"\\n---\\n\")].join(\"\\n\")\n```\n\nIt pasted **the entire site** into the prompt and let the provider's context cache eat the cost.\nThat's a defensible move when the corpus is a page or two. Mine isn't anymore:\n\n```text\nold full-corpus system prompt:  1,090,854 chars  ≈  273,000 tokens\n```\n\nThe code even carried a comment — *\"if the corpus ever approaches ~100K tokens, switch to\nretrieval\"* — that reality had quietly sailed past nearly 3× over. Two hundred seventy-three\nthousand tokens is past the input window of the model serving it. The chat was running on\nwhatever survived truncation. 
Time to do the thing the comment said.\n\n## The fix: retrieve, don't dump\n\nI rebuilt the assistant as a small **tool-using agent**. The system prompt now carries only a\ncompact **catalog** — every page's `kind/slug`, title, and one-line description, about **5k\ntokens** — and the agent pulls the bodies it actually needs through tools. Three of them:\n\n```ts\n// lib/chat.ts — the whole tool set\nsearch_content({ query, limit })  // BM25 over the site (ranked, returns section + snippet)\nget_content({ kind, slug })       // fetch one page's full markdown, on demand\nthink({ thought })                // a reasoning scratchpad; records the thought, returns nothing new\n```\n\nA question now flows: read the catalog → `search_content` (or jump straight to `get_content` if\nthe catalog already names the page) → read one or two pages → optionally `think` to reconcile\nthem → answer, with citations. The base prompt is fixed and small; the variable cost is the\none-to-three pages it fetched, not the other sixty-five it didn't.\n\n```text\nbefore:  ~273k tokens, every request, whether relevant or not\nafter:   ~5k catalog  +  ~2–10k per page actually fetched\n```\n\nThat's the same **dynamic loading** idea from\n[Kimi K3's tool-calling guide](https://platform.kimi.ai/docs/guide/kimi-k3-tool-calling-best-practice) —\ntheir point is that a big upfront payload \"eats up context and makes the model more likely to\npick the wrong thing.\" Kimi loads *tool schemas* on demand; my corpus is the payload, so I load\n*content* on demand. Same principle, one layer down.\n\n## Three tools, on purpose\n\nThe tool count is a decision, not an accident. 
Mario Zechner's\n[pi coding agent](https://mariozechner.at/posts/2025-11-30-pi-coding-agent/) makes the case that\n\"four tools are all you need\" — `read`, `write`, `edit`, `bash` — and that MCP servers which\n\"dump their entire tool descriptions into your context on every session\" are the anti-pattern.\nA read-only site agent needs fewer still: search, fetch, think. Each description is a couple of\nsentences. The combined tool surface is under a thousand tokens, so keeping it declared upfront\n(rather than lazily loading schemas, which only pays off with dozens of tools) is the right call\nhere — the honest version of \"dynamic tools\" for a small surface is *don't have a big one*.\n\nThe one genuinely new tool is `think`, from Anthropic's\n[think-tool post](https://www.anthropic.com/engineering/claude-think-tool). It does nothing —\nliterally logs the thought and returns:\n\n```ts\nthink: tool({\n  description: \"Think out loud: plan which pages to fetch, or check a draft answer against the sources you read. Records the thought and returns nothing new.\",\n  inputSchema: z.object({ thought: z.string() }),\n  execute: async ({ thought }) => ({ ok: true, thought }),\n}),\n```\n\nThat looks pointless until you watch a multi-step tool run. Between `search_content` and the\nfinal answer, the model has raw tool output sitting in context and no designated place to reason\nover it before committing. `think` is that place — a scratchpad that keeps the \"what did I just\nread, and does it actually answer the question\" step from being skipped. 
Anthropic reports it\nbuying a large margin on multi-step tool tasks; the appeal for a retrieval agent is exactly that\nit makes the model *check the page it fetched* instead of answering from the snippet.\n\n## The loop\n\nThe agent loop itself is one call, courtesy of the Vercel AI SDK — `stopWhen` bounds how many\ntool round-trips it may take before it has to answer:\n\n```ts\nconst result = streamText({\n  model: chatModel(\"main\"),\n  system,                       // persona + rules + the 5k-token catalog\n  messages,\n  tools: agentTools(),          // search_content, get_content, think\n  stopWhen: stepCountIs(6),     // search → read → think → answer is ~4; 6 is headroom\n})\n```\n\nThe same tool set backs all three surfaces — the streaming `⌘K` console (`/api/chat`), the\none-shot `/api/ask`, and the `ask_satyajit` MCP tool — so there's one harness, not three. And\n`search_content` is the [Contextual BM25](/blog/site-search-contextual-bm25) engine I'd already\nbuilt for the site's `/search`, now doing double duty as the agent's retriever. The pieces\ncompose: the search work made the harness possible.\n\n<Callout type=\"warn\">\nHonest scope. This is retrieval over a small, single-author corpus, not a coding agent — the\nlessons transfer but the stakes are lower. `think`'s benefit is task-dependent (Anthropic saw\nbig gains on policy-heavy multi-step tasks, near-zero on simple ones); on a two-hop lookup it\nmostly earns its keep by stopping the model from answering off the snippet. And the retriever is\nlexical BM25 — no embeddings yet, so a pure paraphrase with no shared terms can still miss the\nright page. The catalog is the safety net there: the model can see every title even when search\ncomes up short.\n</Callout>\n\n## The take\n\nThe old design wasn't wrong when it was written — it was wrong at 273k tokens. 
The rebuild is the\n[agent-harness](/articles/agent-harness) framing applied to my own site: the model barely changed,\nbut the loop, the tool set, and the context policy around it changed completely. Give the agent a\nmap and three tools and let it fetch, instead of force-feeding it the whole library and hoping the\nanswer survives the truncation. Retrieve, don't dump.\n","readingTimeMins":5,"url":"https://ai.thesatyajit.com/blog/site-agent-dynamic-tools"},{"title":"Giving my own search the Contextual BM25 treatment","description":"This site's search was a substring indexOf — first match wins, no ranking, and any multi-word query that isn't a verbatim phrase returns nothing. I replaced it with the two things I'd just written up: BM25 for ranking, and Contextual Retrieval for chunks — except the context is deterministic frontmatter instead of a Claude call, so it runs at request time with no model. Here's the rebuild, with the real before/after.","date":"2026-07-20","tags":["search","retrieval","bm25","rag","information-retrieval"],"draft":false,"kind":"blog","slug":"site-search-contextual-bm25","body":"I recently wrote two pieces back to back: one on [BM25](/articles/bm25), the ranking\nfunction that refuses to die, and one on [Contextual Retrieval](/blog/contextual-retrieval),\nAnthropic's fix for chunks that forget where they came from. Then I opened `lib/search.ts` —\nthe thing powering this site's own `/search` and the `/api/search` endpoint agents hit — and\nfound this:\n\n```ts\n// the old lib/search.ts, abridged\nfor (const item of getAllContent()) {\n  const fields = [item.title, item.description, item.tags, item.body]\n  for (const text of fields) {\n    if (text.toLowerCase().includes(q)) {   // q = the whole lowercased query\n      results.push({ /* … a ±120-char window around the hit */ })\n      break\n    }\n  }\n}\n```\n\nA substring `indexOf`. It has three problems, and the third is the one that actually bites:\n\n1. 
**No ranking.** The first document that contains the substring wins, in date order. A tight,\n   on-topic match and an incidental mention are indistinguishable.\n2. **No notion of rarity or length.** Matching \"the\" counts the same as matching \"kalman\".\n3. **Multi-word queries fall off a cliff.** `q` is the *entire* query string, so\n   `includes(q)` needs that **exact phrase** somewhere in the text. Nobody searches in verbatim\n   phrases. `\"bm25 length normalization\"` returns **zero results** — those three words are all\n   in the BM25 article, just never contiguous.\n\nSo I ate my own cooking. Here's the rebuild.\n\n## BM25, the same formula I just wrote up\n\nThe [BM25 walkthrough](/articles/bm25) has the whole derivation; the code is a direct\ntranscription of it. Lucene's non-negative IDF, term-frequency saturation at `k1 = 1.2`, length\nnormalization at `b = 0.75`:\n\n```ts\n// lib/search.ts — the ranking core, straight from the article's formula\nconst K1 = 1.2\nconst B = 0.75\n\nfunction idf(term: string, ix: Index): number {\n  const n = ix.df.get(term) ?? 0\n  return Math.log(1 + (ix.n - n + 0.5) / (n + 0.5))\n}\n\n// per query term present in a chunk:\nconst f = chunk.tf.get(term) ?? 0\nconst norm = K1 * (1 - B + (B * chunk.len) / ix.avgdl)\nscore += idf(term, ix) * (f * (K1 + 1)) / (f + norm)\n```\n\nTwo things fall out for free. The query is **tokenized** into terms and each is scored\nindependently, so multi-word queries just work — no phrase has to exist. And rare terms\ndominate: `kalman` carries far more weight than `filter`, because `idf` collapses for common\nwords. No stopword list, same as the article promised.\n\nI keep a **postings map** (`term → chunk indices`) so a query only touches chunks that actually\nshare a term, instead of scanning the whole corpus. 
The corpus is one person's writing, so this\nis overkill — but it's the same inverted-index shape a real engine uses, and it's three lines.\n\n## Contextual chunks, minus the Claude call\n\nBM25 alone still has the chunk-amnesia problem from the [Contextual Retrieval\npost](/blog/contextual-retrieval): if I split an article into passages and index each one alone,\na passage that reads \"it drops 41% at parity\" has no idea it's *about the Harness Effect, in the\nsection on the controlled swap*. A query for either misses it.\n\nAnthropic's fix is to have Claude write a one-line context for every chunk before indexing.\nMine is cheaper and dumber: the context a chunk lost is sitting right there in the document's\nfrontmatter and headings. So before indexing, every chunk inherits its document's **title,\ndescription, tags, and nearest heading** — deterministically, at request time, no model, no\nbuild step:\n\n```ts\n// each chunk is packed with weighted terms: its own body, its heading, and the\n// document context it would otherwise have lost when the body was split.\nconst contextTokens = tokenize([title, description, tags.join(\" \")].join(\" \"))\nfor (const t of bodyTokens) add(t, 1)\nfor (const t of headingTokens) add(t, 1)          // section-local: full weight\nfor (const t of contextTokens) add(t, CONTEXT_WEIGHT) // 0.5 — present, not dominant\n```\n\nThe honest caveat: this is a **poor man's** Contextual Retrieval. Claude's per-chunk context can\nsay things the frontmatter can't (\"the previous quarter's revenue was \\$314M\"); my version can\nonly replay the structural context the document already carries. And prepending the *same* title\nto every chunk of a document inflates those terms' document-frequency, which is exactly why the\ncontext terms get `CONTEXT_WEIGHT = 0.5` instead of full weight — present enough to make the\nchunk findable by its subject, quiet enough not to drown the passage's own words. 
It's the 80%\nof the win for 0% of the inference cost, which is the right trade for a static personal site.\n\nEach result also reports **which section** it matched — the nearest heading rides along as the\nresult's `field`, so `/api/search` tells an agent not just *which* document but *where* in it.\n\n## Before / after\n\nSame queries, old substring search vs. the new Contextual BM25, on the live corpus:\n\n```text\nquery                          substring indexOf         contextual BM25 (top hit)\n--------------------------------------------------------------------------------------\n\"bm25 length normalization\"    0 results (no verbatim     articles/bm25\n                                phrase)                    [Start with TF-IDF, and its two flaws]\n\n\"revenue grew quarter\"         0 results                  blog/contextual-retrieval\n                                                           [The problem: chunks forget …]\n\n\"kalman filter\"                first dated doc that        articles/kalman-filter  (8.4)\n                                contains the phrase,        + fast-lio2 (8.6, cites it)\n                                unranked                    ranked by relevance, not date\n\n\"reciprocal rank fusion\"       0 results                  blog/contextual-retrieval\n                                                           [A dependency-free repro]\n```\n\nThe multi-word queries are the story. Under substring, anything that isn't a verbatim phrase —\nwhich is almost everything a person types — returned nothing. Under BM25 every term contributes,\nso the query finds the document even when its words are scattered across a paragraph, and the\ncontextual prefix pulls in matches on the *subject* of a passage, not just its literal words.\n\n## It's live\n\nTry it: [`/search?q=length+normalization`](/search?q=length+normalization). 
Agents get the\nranked, scored JSON with the section label:\n\n```bash\ncurl -s \"https://ai.thesatyajit.com/api/search?q=contextual%20retrieval&limit=3\" | jq '.results[] | {slug, field, score}'\n```\n\n<Callout type=\"warn\">\nWhat this still isn't: it's the **lexical half** only. The [contextual retrieval\npost](/blog/contextual-retrieval) makes the case that the real win is *fusing* BM25 with a dense\nretriever, and that reranking the shortlist is what takes the failure rate the last mile. There\nare no embeddings here yet — a query has to share actual terms with a chunk, so a pure paraphrase\nwith no lexical overlap can still miss. For a corpus this size, lexical BM25 over contextualized\nchunks is the honest 90% solution; the semantic half is the next commit, not this one.\n</Callout>\n\n## The takeaway\n\nThe whole change is `lib/search.ts` — a few hundred lines, no dependencies, no index server, no\nmodel at request time. It's the two ideas I'd just written about, applied to the smallest\npossible target: my own search bar. BM25 for ranking, structural context for the chunk-amnesia\nproblem, and an honest note about the half I haven't built yet. Writing about a technique is a\ngood way to understand it; running it in your own site is a better one.\n","readingTimeMins":6,"url":"https://ai.thesatyajit.com/blog/site-search-contextual-bm25"},{"title":"Contextual Retrieval, with a runnable repro and a browser playground","description":"Anthropic's Contextual Retrieval fixes the oldest RAG bug — chunks that lose their context — by having Claude write a one-line context for each chunk before you index it. 
I built a dependency-free repro that reproduces the ranking lift, plus an in-browser playground so you can watch the right chunk climb.","date":"2026-07-17","tags":["rag","retrieval","claude","embeddings","playground"],"draft":false,"kind":"blog","slug":"contextual-retrieval","body":"[Anthropic's Contextual Retrieval post](https://www.anthropic.com/engineering/contextual-retrieval)\nhas been on my to-read list for a while. It targets the oldest bug in RAG, so I finally sat down,\nbuilt a dependency-free reproduction, and wired up a browser playground. This is the write-up.\n\n## The problem: chunks forget where they came from\n\nStandard RAG splits documents into chunks and indexes each chunk on its own. That destroys context.\nAnthropic's example is perfect:\n\n> `The company's revenue grew by 3% over the previous quarter.`\n\nWhich company? Which quarter? The chunk can't say. A query like *\"how did ACME's Q2 revenue change?\"*\nhas **no distinguishing terms to match** against that chunk: only the generic \"revenue\" overlaps,\nwhile \"ACME\" and \"Q2\" live in the *document*, not the *chunk*. Both lexical (BM25) and semantic\n(embedding) retrieval miss it.\n\n## The fix: let Claude situate each chunk\n\nContextual Retrieval prepends a short, chunk-specific context to each chunk **before** you embed it and\n**before** you build the BM25 index — Anthropic calls these **Contextual Embeddings** and **Contextual\nBM25**. The context is generated by Claude, given the whole document. The same chunk becomes:\n\n> `This chunk is from an SEC filing on ACME corp's performance in Q2 2023; the previous quarter's revenue was $314 million. The company's revenue grew by 3% over the previous quarter.`\n\nNow the owner and the period are *in the chunk*, so retrieval can find it.\n\n## Watch it happen\n\nHere is the whole pipeline in your browser — the document, its chunks, the chunks with context prepended,\nthen retrieval. 
The scoring is a real BM25 and a TF-IDF cosine (a stand-in for an embedding model); the\ncontextual prefixes stand in for Claude's output. Step to **4 · retrieve**, pick a query, and watch the\nright chunk climb from the standard column to the contextual one:\n\n<ContextualRetrievalPlayground />\n\nToggle between BM25, embeddings, and their fusion. The pattern holds across all three: the answer chunk\nis buried on standard chunks and near the top once each chunk carries its context.\n\n## The one prompt that does the work\n\nThis is the prompt from the post, verbatim. It runs once per chunk, with the full document supplied as\ncontext:\n\n```text\n<document>\n{{WHOLE_DOCUMENT}}\n</document>\nHere is the chunk we want to situate within the whole document\n<chunk>\n{{CHUNK_CONTENT}}\n</chunk>\nPlease give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.\n```\n\nThe obvious worry is cost: you re-send the whole document once per chunk. **Prompt caching** kills that.\nCache the document once and every chunk in it reads from the cache, which is why Anthropic quotes a\none-time **$1.02 per million document tokens**. The document is the stable prefix, so it takes the\n`cache_control` breakpoint; the chunk and instruction vary and come after it:\n\n```python\nfrom anthropic import Anthropic\n\nclient = Anthropic()\n\nCONTEXT_PROMPT = \"\"\"Here is the chunk we want to situate within the whole document\n<chunk>\n{chunk}\n</chunk>\nPlease give a short succinct context to situate this chunk within the overall document \\\nfor the purposes of improving search retrieval of the chunk. Answer only with the \\\nsuccinct context and nothing else.\"\"\"\n\ndef situate(doc: str, chunk: str) -> str:\n    resp = client.messages.create(\n        # One Claude call per chunk. 
For a large index you'd typically drop to a\n        # cheaper model like claude-haiku-4-5 — which is what Anthropic's ~$1.02 /\n        # 1M-doc-token estimate assumes — trading a little context quality for cost.\n        model=\"claude-opus-4-8\",\n        max_tokens=200,\n        messages=[{\n            \"role\": \"user\",\n            \"content\": [\n                # The whole document is the stable prefix: cache it ONCE, then every\n                # chunk of this document reads from the cache instead of re-paying for it.\n                {\"type\": \"text\",\n                 \"text\": f\"<document>\\n{doc}\\n</document>\",\n                 \"cache_control\": {\"type\": \"ephemeral\"}},\n                {\"type\": \"text\", \"text\": CONTEXT_PROMPT.format(chunk=chunk)},\n            ],\n        }],\n    )\n    return \"\".join(b.text for b in resp.content if b.type == \"text\").strip()\n```\n\n<Callout type=\"tip\">\nVerify the cache is actually working: `resp.usage.cache_read_input_tokens` should be non-zero on every\nchunk after the first for a given document. If it's zero, something upstream is mutating the document\nbytes (a timestamp, non-deterministic JSON) and invalidating the prefix.\n</Callout>\n\n## A dependency-free repro\n\nTo convince myself the lift was real and not marketing, I wrote a ~180-line pure-stdlib script: naive\nchunking, BM25 from scratch, a TF-IDF cosine as an embedding stand-in, and reciprocal rank fusion. 
The\nretrieval core is small — here is BM25 and the fusion step:\n\n```python\nclass BM25:\n    def score(self, query, i):\n        dl, tf, s = len(self.docs[i]), self.tf[i], 0.0\n        for t in query:\n            if t not in tf:\n                continue\n            num = tf[t] * (self.k1 + 1)\n            den = tf[t] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)\n            s += self.idf(t) * num / den\n        return s\n\ndef rrf(rankings, k=60):  # reciprocal rank fusion of BM25 + embedding rankings\n    scores = Counter()\n    for ranking in rankings:\n        for rank, item in enumerate(ranking):\n            scores[item] += 1 / (k + rank + 1)\n    return [i for i, _ in scores.most_common()]\n```\n\nThen I index the chunks two ways — plain, and with a one-line context prepended — and measure where the\ncorrect chunk lands for a couple of \"which entity, which period\" queries. The actual output:\n\n```text\nquery                                                method   plain rank  ctx rank\n------------------------------------------------------------------------------------\nHow did ACME Corp revenue change in Q2 2023?         bm25              4         2\nHow did ACME Corp revenue change in Q2 2023?         emb               4         1\nHow did ACME Corp revenue change in Q2 2023?         hybrid            4         2\n\nWhat happened to Beta Industries revenue in Q3 2023? bm25              5         2\nWhat happened to Beta Industries revenue in Q3 2023? emb               5         1\nWhat happened to Beta Industries revenue in Q3 2023? hybrid            5         2\n\nrecall@1 (fraction of queries where the right chunk ranks #1):\n  bm25    plain 0/2   contextual 0/2\n  emb     plain 0/2   contextual 2/2\n```\n\nContextualizing the chunks moves the answer from rank **4–5** to rank **1–2** across BM25, embeddings, and\ntheir fusion — and embeddings recall@1 goes from **0/2 to 2/2**. 
On this toy corpus fusion lands the answer\nat #2 rather than #1 (a small-N artifact of RRF), which is a good reminder that the fusion win is an\n*aggregate* effect — which is exactly what Anthropic's real evaluation measures.\n\n## What the real evaluation found\n\nOn Anthropic's benchmark (top-20 retrieval failure rate, i.e. `1 - recall@20`), stacking the techniques\ncompounds:\n\n- **Contextual Embeddings** alone: `5.7% → 3.7%` — a **35%** cut in the failure rate.\n- **+ Contextual BM25**: `5.7% → 2.9%` — **49%**.\n- **+ reranking**: `5.7% → 1.9%` — **67%**.\n\nThe lexical half matters more than you'd guess — BM25 nails exact identifiers (error codes, ticker symbols,\nfunction names) that embeddings smear together, so contextualizing *both* indexes and fusing them beats\neither alone.\n\n## Things worth copying from the post\n\n- **Retrieve top-20, not top-5/10.** Anthropic found 20 the most performant cut for the final context.\n- **Rerank the shortlist.** Retrieve ~150 candidates, then rerank down to 20 for the answer prompt — that's\n  the step that takes the failure rate from 2.9% to 1.9%.\n- **Embedding model matters.** Gemini and Voyage embeddings were the standouts in their tests.\n- **Chunking still matters.** Size, boundary, and overlap all move the numbers — Contextual Retrieval sits\n  on top of good chunking, it doesn't replace it.\n- **A domain-tuned context prompt beats the generic one.** The template above is a floor, not a ceiling.\n\n<Callout type=\"warn\">\nBe honest about what this costs. Contextual Retrieval adds a Claude call **per chunk** at index time —\ncheap per token with caching, but real latency and spend when you're indexing millions of chunks, and it\nhas to re-run when documents change. My repro is a minimal reproduction of the *core mechanism* (context\nlifts rank); the `35 / 49 / 67%` figures are Anthropic's, on their corpus. 
And my playground scores with a\nTF-IDF cosine, not a real embedding model — it shows the *shape* of the effect, not production numbers.\n</Callout>\n\n## The takeaway\n\nThe move is almost embarrassingly simple: spend a cheap, cached Claude call per chunk to write down the\ncontext a human would need to make sense of it, then index that. It attacks the failure at its source\ninstead of papering over it downstream, it helps lexical and semantic retrieval at the same time, and — as\nthe little repro above shows — you can watch the right chunk climb the rankings the moment the context goes\nin.\n\n---\n\n*Source: [Introducing Contextual Retrieval](https://www.anthropic.com/engineering/contextual-retrieval)\n(Anthropic). The prompt and the `35 / 49 / 67%` numbers are theirs; the repro and the browser playground are\nmine, and the playground's scoring runs entirely client-side.*\n","readingTimeMins":7,"url":"https://ai.thesatyajit.com/blog/contextual-retrieval"},{"title":"A 14B model that matches a 671B one — by knowing its domain","description":"Paper notes on Qwen-BIM: fine-tuning a small open model on a reasoning-supervised, domain-specific dataset beats a 50×-larger general model on BIM design tasks. The interesting part isn't the result — it's why.","date":"2026-06-09","tags":["llm","fine-tuning","domain-models","bim","paper-notes"],"draft":false,"kind":"blog","slug":"qwen-bim-domain-beats-scale","body":"Here's the headline from [Qwen-BIM](https://arxiv.org/abs/2602.20812) (Lin et al.,\nTsinghua, Feb 2026): a fine-tuned **14B** model scores **0.83 on G-Eval** for\nBIM-based design tasks — essentially tied with **DeepSeek-R1 at 671B** (0.84), and\nahead of its own 72B sibling. A model ~48× smaller, matching the frontier on a\nspecific domain.\n\nThat result is not surprising on its own — \"fine-tune a small model on your domain\"\nis folklore by now. 
What makes the paper worth reading is the *anatomy*: where exactly\ngeneral LLMs fall over on engineering work, and which one design choice did most of the\nlifting. I work on industrial AI at [Inkers](https://inkers.ai), so domain models over\n3D/BIM data are close to home. These are my notes.\n\n## The actual problem: a BIM model isn't text\n\nA Building Information Model is a structured graph of components — walls, slabs, beams,\neach with geometry, materials, and relationships. An LLM can't read it. So step one of\n*any* LLM-on-BIM pipeline is an unglamorous one the field mostly skips past: **turn the\nmodel into text.**\n\nThe authors do exactly that, carefully. Revit models of five building types (malls,\noffices, dormitories, teaching buildings, museums) are sliced into spatial blocks of\n~10–15 components each, defects are injected, and each block is serialized to plain text\nplus 22 templated questions with **hard-coded reference answers**. That last detail\nmatters: the ground truth is computed by rules, not by another model, so the benchmark\nisn't measuring one LLM against another LLM's opinion.\n\n<Callout type=\"note\">\n  The questions ladder up in difficulty on purpose: from \"list the wall IDs\"\n  (extraction) through \"compute each slab's area\" (calculation) to \"is this wall\n  thickness suspicious given residential norms?\" (domain reasoning). 
It's a clean way to\n  see *which rung* a model falls off.\n</Callout>\n\nThe whole pipeline, end to end, is just: project the structured model into text, turn it\ninto supervised Q&A, add reasoning traces, and LoRA-fine-tune a small open model on it.\n\n<Diagram caption=\"The Qwen-BIM pipeline: a faithful text projection becomes rule-checked Q&A, then reasoning-supervised QRA, then a LoRA fine-tune of a 14B model.\">\n  <svg viewBox=\"0 0 620 130\" role=\"img\" aria-label=\"Pipeline from BIM model to Qwen-BIM\" style={{ width: \"100%\", height: \"auto\", color: \"var(--foreground)\" }}>\n    <defs>\n      <marker id=\"qb-arrow\" viewBox=\"0 0 10 10\" refX=\"8\" refY=\"5\" markerWidth=\"6\" markerHeight=\"6\" orient=\"auto-start-reverse\">\n        <path d=\"M0,0 L10,5 L0,10 z\" fill=\"currentColor\" />\n      </marker>\n    </defs>\n    {[\n      { x: 8, t1: \"BIM model\", t2: \"Revit graph\" },\n      { x: 132, t1: \"textualize\", t2: \"+ 22 questions\" },\n      { x: 256, t1: \"BIM-QA\", t2: \"2,129 pairs\" },\n      { x: 380, t1: \"BIM-QRA\", t2: \"1,364 + reasoning\" },\n      { x: 504, t1: \"Qwen-BIM\", t2: \"LoRA · 14B\" },\n    ].map((n, i) => (\n      <g key={i}>\n        <rect x={n.x} y={42} width={108} height={46} rx={8} fill=\"none\" stroke=\"currentColor\" strokeOpacity={i === 4 ? 0.9 : 0.45} />\n        <text x={n.x + 54} y={64} textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"13\" fill=\"currentColor\">{n.t1}</text>\n        <text x={n.x + 54} y={79} textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"currentColor\" opacity=\"0.6\">{n.t2}</text>\n        {i < 4 ? 
<line x1={n.x + 108} y1={65} x2={n.x + 124} y2={65} stroke=\"currentColor\" strokeWidth=\"1.5\" markerEnd=\"url(#qb-arrow)\" /> : null}\n      </g>\n    ))}\n    <text x={310} y={20} textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"currentColor\" opacity=\"0.55\">structured 3D → text → supervised data → small fine-tuned model</text>\n  </svg>\n</Diagram>\n\n## Where general LLMs actually fail\n\nThey evaluated 11 general models (ChatGLM, Qwen, DeepSeek — including the 671B\nDeepSeek-V3/R1). The failure modes are specific and, honestly, familiar from any\nengineering-LLM project:\n\n- **Arithmetic.** Asked for a slab's planar area, Qwen-max picks the *right formula* and\n  still returns the wrong number. The bottleneck isn't understanding — it's calculation.\n- **Natural-language literalism.** Models misread parentheses in the answer template, or\n  a naming rule (\"wall IDs start with Q\"), and confidently apply the wrong transform.\n- **Missing domain knowledge.** Asked to infer a building's floor height, the 14B base\n  model reasons that floor height ≈ slab thickness (120 mm) — coherent chain of thought,\n  wrong mental model, because it was never taught what \"floor height\" means in practice.\n\nThe pattern: general models clear extraction and counting, then degrade sharply on\ncalculation, multi-step reasoning, and anything needing design common sense. On the\ndomain-specific design-review tasks, G-Eval is mostly **below 0.8** — not reliable\nenough to trust.\n\n## The one choice that mattered: reasoning supervision\n\nThis is the part I'd underline. They built two datasets from the same BIM text:\n\n- **BIM-QA** — 2,129 plain question→answer pairs.\n- **BIM-QRA** — 1,364 question→**reasoning**→answer triples, where the intermediate\n  steps are supervised, not just the final answer.\n\nThen they LoRA-fine-tuned Qwen2.5-14B on different mixes. 
The result is the kind of\nfinding that should change how you build these datasets:\n\n| Fine-tuning data | Size | G-Eval |\n|---|---|---|\n| 100% QA | 2,129 | 0.69 |\n| 80% QA + 20% QRA | 2,661 | 0.77 |\n| 60% QA + 40% QRA | 2,500 | 0.77 |\n| **100% QRA** | **1,364** | **0.83** |\n\nThe **smallest** dataset — pure reasoning triples — won, by a wide margin. G-Eval never\nfell as the share of reasoning supervision rose (the two mixed sets tied at 0.77), and\nquality beat quantity outright. Teaching the model *how to get there*, on about two-thirds\nas much data (1,364 vs 2,129 examples), beat teaching it *what the answer is* on the full set.\n\n## Bigger is not better (and the paper shows it twice)\n\nTwo clean data points against scale-maximalism:\n\n1. On the general benchmark, **QwQ-32B out-scored DeepSeek-R1 (671B)** on G-Eval. The\n   giant model's verbose reasoning actually *hurt* — it padded answers with redundant\n   text, tanking format and text-similarity scores without improving correctness.\n2. After fine-tuning, **Qwen-BIM (14B) matched DeepSeek-R1 (671B)** and beat the 72B and\n   32B Qwen models on the domain G-Eval.\n\n<Diagram caption=\"G-Eval on BIM design tasks. 
Fine-tuning lifts a 14B model from 0.69 to 0.83 — level with a 671B reasoning model ~48× its size.\">\n  <svg viewBox=\"0 0 560 240\" role=\"img\" aria-label=\"G-Eval comparison: base 14B vs Qwen-BIM 14B vs DeepSeek-R1 671B\" style={{ width: \"100%\", height: \"auto\", color: \"var(--foreground)\" }}>\n    {/* y gridlines at 0.5..0.9 — chart area y 20..200 maps score 0.9..0.5 */}\n    {[0.5, 0.6, 0.7, 0.8, 0.9].map((v) => {\n      const y = 200 - ((v - 0.5) / 0.4) * 180\n      return (\n        <g key={v}>\n          <line x1=\"56\" y1={y} x2=\"540\" y2={y} stroke=\"currentColor\" strokeOpacity=\"0.12\" />\n          <text x=\"48\" y={y + 4} textAnchor=\"end\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"currentColor\" opacity=\"0.5\">{v.toFixed(1)}</text>\n        </g>\n      )\n    })}\n    {[\n      { label: \"base 14B\", sub: \"Qwen2.5\", score: 0.69, hl: false },\n      { label: \"Qwen-BIM\", sub: \"14B · fine-tuned\", score: 0.83, hl: true },\n      { label: \"DeepSeek-R1\", sub: \"671B\", score: 0.84, hl: false },\n    ].map((b, i) => {\n      const x = 96 + i * 150\n      const y = 200 - ((b.score - 0.5) / 0.4) * 180\n      return (\n        <g key={i}>\n          <rect x={x} y={y} width=\"92\" height={200 - y} rx=\"4\" fill=\"currentColor\" fillOpacity={b.hl ? 
0.85 : 0.3} />\n          <text x={x + 46} y={y - 8} textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"13\" fontWeight=\"bold\" fill=\"currentColor\">{b.score.toFixed(2)}</text>\n          <text x={x + 46} y=\"218\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"currentColor\">{b.label}</text>\n          <text x={x + 46} y=\"232\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"currentColor\" opacity=\"0.6\">{b.sub}</text>\n        </g>\n      )\n    })}\n  </svg>\n</Diagram>\n\nThe improvement from fine-tuning is also *targeted* exactly where you'd want it:\n\n| G-Eval | Base 14B | Qwen-BIM | Δ |\n|---|---|---|---|\n| General tasks | 0.810 | 0.874 | +0.06 |\n| Domain-specific tasks | 0.588 | 0.801 | **+0.21** |\n| Overall | 0.689 | 0.834 | +0.15 |\n\nGeneral ability moved only slightly (it was already fine); most of the gain is concentrated\nin the domain tasks that were broken. That's the signature of fine-tuning doing the right\nthing — adding domain competence without trading away the base model's generality.\n\n## What I'd flag\n\nIt's a careful paper, but keep the scope honest:\n\n- **2D only.** Early tests showed the models couldn't do 3D geometry (collision\n  detection), so those questions were cut. The hard part of real BIM reasoning is 3D.\n- **One narrow task family**, five building types, rule-generated Q&A. G-Eval is an\n  LLM-as-judge metric — better-correlated with humans than BLEU/ROUGE here, but still a\n  proxy. The data is \"available on request\" rather than released.\n- The \"matches 671B\" comparison is on *this* benchmark. It's a domain-competence claim,\n  not a general-capability one.\n\n## Why it's the right playbook anyway\n\nStrip away the BIM specifics and this is a template for industrial domain models, the\nkind I think about constantly: you rarely need a frontier model. 
You need (1) a faithful\n**text projection of your structured/3D data**, (2) a benchmark with **rule-computed\nground truth** so you're measuring competence, not vibes, and (3) **reasoning-supervised**\nfine-tuning data — quality and chain-of-thought over raw volume. Get those three right\nand a 14B model on two A6000s reaches the same place a 671B model does, at a fraction of\nthe inference cost. For anyone shipping AI into a real engineering vertical, that economics\nis the whole game.\n\n---\n\n*Paper: [Developing large language model for BIM-based design with domain-specific\nbenchmark and dataset](https://arxiv.org/abs/2602.20812) — Lin, Cai, Ni, Zhou, Pan\n(2026), arXiv:2602.20812.*\n","readingTimeMins":7,"url":"https://ai.thesatyajit.com/blog/qwen-bim-domain-beats-scale"},{"title":"This site is managed by Claude","description":"Why my homepage is an AI-native, agent-operated artifact — and how the content layer works.","date":"2026-06-03","tags":["meta","ai","nextjs"],"draft":false,"kind":"blog","slug":"hello-world","body":"GitHub READMEs are dead. After Claude and the wave of coding agents, your homepage\nisn't a static profile — it's a living, agent-readable artifact you can hand to an LLM.\n\nThis site is **dual-native**: every page is both a clean human document and a\nmachine-readable surface. Try fetching [`/blog/hello-world.md`](/blog/hello-world.md)\nor [`/llms.txt`](/llms.txt) — an agent gets structured text, you get the rendered page.\n\n<Callout type=\"tip\">\n  The whole site is maintained by a crew of Claude agents. New posts, logs, and\n  data updates are authored by skills that validate themselves before shipping.\n</Callout>\n\n## What's under the hood\n\nThe content layer is a single source of truth: MDX files validated with Zod, surfaced\nidentically to humans (this page), to agents (the `.md` variant), and to tools\n(the MCP server). 
More on that soon.\n","readingTimeMins":1,"url":"https://ai.thesatyajit.com/blog/hello-world"}],"logs":[{"title":"Scaffolding the AI site","date":"2026-06-03","tags":["build-log"],"kind":"logs","slug":"2026-06-03-scaffolding","body":"Kicked off `ai.thesatyajit.com`. Wired the content layer (MDX + gray-matter + Zod 4),\nswapped fonts to Hanken Grotesk + IBM Plex Mono, and set up `@next/mdx` with the\nTurbopack string-plugin config. Next: the editorial layout shell and core pages.\n","readingTimeMins":1,"url":"https://ai.thesatyajit.com/logs/2026-06-03-scaffolding"}]}