{"blog":[{"title":"A 14B model that matches a 671B one — by knowing its domain","description":"Paper notes on Qwen-BIM: fine-tuning a small open model on a reasoning-supervised, domain-specific dataset beats a 50×-larger general model on BIM design tasks. The interesting part isn't the result — it's why.","date":"2026-06-09","tags":["llm","fine-tuning","domain-models","bim","paper-notes"],"draft":false,"kind":"blog","slug":"qwen-bim-domain-beats-scale","body":"\nHere's the headline from [Qwen-BIM](https://arxiv.org/abs/2602.20812) (Lin et al.,\nTsinghua, Feb 2026): a fine-tuned **14B** model scores **0.83 on G-Eval** for\nBIM-based design tasks — essentially tied with **DeepSeek-R1 at 671B** (0.84), and\nahead of its own 72B sibling. A model ~48× smaller, matching the frontier on a\nspecific domain.\n\nThat result is not surprising on its own — \"fine-tune a small model on your domain\"\nis folklore by now. What makes the paper worth reading is the *anatomy*: where exactly\ngeneral LLMs fall over on engineering work, and which one design choice did most of the\nlifting. I work on industrial AI at [Inkers](https://inkers.ai), so domain models over\n3D/BIM data are close to home. These are my notes.\n\n## The actual problem: a BIM model isn't text\n\nA Building Information Model is a structured graph of components — walls, slabs, beams,\neach with geometry, materials, and relationships. An LLM can't read it. So step one of\n*any* LLM-on-BIM pipeline is an unglamorous one the field mostly skips past: **turn the\nmodel into text.**\n\nThe authors do exactly that, carefully. Revit models of five building types (malls,\noffices, dormitories, teaching buildings, museums) are sliced into spatial blocks of\n~10–15 components each, defects are injected, and each block is serialized to plain text\nplus 22 templated questions with **hard-coded reference answers**. 
That last detail\nmatters: the ground truth is computed by rules, not by another model, so the benchmark\nisn't measuring one LLM against another LLM's opinion.\n\n<Callout type=\"note\">\n  The questions ladder up in difficulty on purpose: from \"list the wall IDs\"\n  (extraction) through \"compute each slab's area\" (calculation) to \"is this wall\n  thickness suspicious given residential norms?\" (domain reasoning). It's a clean way to\n  see *which rung* a model falls off.\n</Callout>\n\nThe whole pipeline, end to end, is just: project the structured model into text, turn it\ninto supervised Q&A, add reasoning traces, and LoRA-fine-tune a small open model on it.\n\n<Diagram caption=\"The Qwen-BIM pipeline: a faithful text projection becomes rule-checked Q&A, then reasoning-supervised QRA, then a LoRA fine-tune of a 14B model.\">\n  <svg viewBox=\"0 0 620 130\" role=\"img\" aria-label=\"Pipeline from BIM model to Qwen-BIM\" style={{ width: \"100%\", height: \"auto\", color: \"var(--foreground)\" }}>\n    <defs>\n      <marker id=\"qb-arrow\" viewBox=\"0 0 10 10\" refX=\"8\" refY=\"5\" markerWidth=\"6\" markerHeight=\"6\" orient=\"auto-start-reverse\">\n        <path d=\"M0,0 L10,5 L0,10 z\" fill=\"currentColor\" />\n      </marker>\n    </defs>\n    {[\n      { x: 8, t1: \"BIM model\", t2: \"Revit graph\" },\n      { x: 132, t1: \"textualize\", t2: \"+ 22 questions\" },\n      { x: 256, t1: \"BIM-QA\", t2: \"2,129 pairs\" },\n      { x: 380, t1: \"BIM-QRA\", t2: \"1,364 + reasoning\" },\n      { x: 504, t1: \"Qwen-BIM\", t2: \"LoRA · 14B\" },\n    ].map((n, i) => (\n      <g key={i}>\n        <rect x={n.x} y={42} width={108} height={46} rx={8} fill=\"none\" stroke=\"currentColor\" strokeOpacity={i === 4 ? 
0.9 : 0.45} />\n        <text x={n.x + 54} y={64} textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"13\" fill=\"currentColor\">{n.t1}</text>\n        <text x={n.x + 54} y={79} textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"currentColor\" opacity=\"0.6\">{n.t2}</text>\n        {i < 4 ? <line x1={n.x + 108} y1={65} x2={n.x + 124} y2={65} stroke=\"currentColor\" strokeWidth=\"1.5\" markerEnd=\"url(#qb-arrow)\" /> : null}\n      </g>\n    ))}\n    <text x={310} y={20} textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"currentColor\" opacity=\"0.55\">structured 3D → text → supervised data → small fine-tuned model</text>\n  </svg>\n</Diagram>\n\n## Where general LLMs actually fail\n\nThey evaluated 11 general models (ChatGLM, Qwen, DeepSeek — including the 671B\nDeepSeek-V3/R1). The failure modes are specific and, honestly, familiar from any\nengineering-LLM project:\n\n- **Arithmetic.** Asked for a slab's planar area, Qwen-max picks the *right formula* and\n  still returns the wrong number. The bottleneck isn't understanding — it's calculation.\n- **Natural-language literalism.** Models misread parentheses in the answer template, or\n  a naming rule (\"wall IDs start with Q\"), and confidently apply the wrong transform.\n- **Missing domain knowledge.** Asked to infer a building's floor height, the 14B base\n  model reasons that floor height ≈ slab thickness (120 mm) — coherent chain of thought,\n  wrong mental model, because it was never taught what \"floor height\" means in practice.\n\nThe pattern: general models clear extraction and counting, then degrade sharply on\ncalculation, multi-step reasoning, and anything needing design common sense. On the\ndomain-specific design-review tasks, G-Eval is mostly **below 0.8** — not reliable\nenough to trust.\n\n## The one choice that mattered: reasoning supervision\n\nThis is the part I'd underline. 
They built two datasets from the same BIM text:\n\n- **BIM-QA** — 2,129 plain question→answer pairs.\n- **BIM-QRA** — 1,364 question→**reasoning**→answer triples, where the intermediate\n  steps are supervised, not just the final answer.\n\nThen they LoRA-fine-tuned Qwen2.5-14B on different mixes. The result is the kind of\nfinding that should change how you build these datasets:\n\n| Fine-tuning data | Size | G-Eval |\n|---|---|---|\n| 100% QA | 2,129 | 0.69 |\n| 80% QA + 20% QRA | 2,661 | 0.77 |\n| 60% QA + 40% QRA | 2,500 | 0.77 |\n| **100% QRA** | **1,364** | **0.83** |\n\nThe **smallest** dataset — pure reasoning triples — won, by a wide margin. G-Eval never\ndropped as the share of reasoning supervision rose, and quality beat quantity outright.\nTeaching the model *how to get there*, on roughly two-thirds as much data (1,364 vs.\n2,129 examples), beat teaching it *what the answer is* on the larger set.\n\n## Bigger is not better (and the paper shows it twice)\n\nTwo clean data points against scale-maximalism:\n\n1. On the general benchmark, **QwQ-32B out-scored DeepSeek-R1 (671B)** on G-Eval. The\n   giant model's verbose reasoning actually *hurt* — it padded answers with redundant\n   text, tanking format and text-similarity scores without improving correctness.\n2. After fine-tuning, **Qwen-BIM (14B) matched DeepSeek-R1 (671B)** and beat the 72B and\n   32B Qwen models on the domain G-Eval.\n\n<Diagram caption=\"G-Eval on BIM design tasks. 
Fine-tuning lifts a 14B model from 0.69 to 0.83 — level with a 671B reasoning model ~48× its size.\">\n  <svg viewBox=\"0 0 560 240\" role=\"img\" aria-label=\"G-Eval comparison: base 14B vs Qwen-BIM 14B vs DeepSeek-R1 671B\" style={{ width: \"100%\", height: \"auto\", color: \"var(--foreground)\" }}>\n    {/* y gridlines at 0.5..0.9 — chart area y 20..200 maps score 0.9..0.5 */}\n    {[0.5, 0.6, 0.7, 0.8, 0.9].map((v) => {\n      const y = 200 - ((v - 0.5) / 0.4) * 180\n      return (\n        <g key={v}>\n          <line x1=\"56\" y1={y} x2=\"540\" y2={y} stroke=\"currentColor\" strokeOpacity=\"0.12\" />\n          <text x=\"48\" y={y + 4} textAnchor=\"end\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"currentColor\" opacity=\"0.5\">{v.toFixed(1)}</text>\n        </g>\n      )\n    })}\n    {[\n      { label: \"base 14B\", sub: \"Qwen2.5\", score: 0.69, hl: false },\n      { label: \"Qwen-BIM\", sub: \"14B · fine-tuned\", score: 0.83, hl: true },\n      { label: \"DeepSeek-R1\", sub: \"671B\", score: 0.84, hl: false },\n    ].map((b, i) => {\n      const x = 96 + i * 150\n      const y = 200 - ((b.score - 0.5) / 0.4) * 180\n      return (\n        <g key={i}>\n          <rect x={x} y={y} width=\"92\" height={200 - y} rx=\"4\" fill=\"currentColor\" fillOpacity={b.hl ? 
0.85 : 0.3} />\n          <text x={x + 46} y={y - 8} textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"13\" fontWeight=\"bold\" fill=\"currentColor\">{b.score.toFixed(2)}</text>\n          <text x={x + 46} y=\"218\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"currentColor\">{b.label}</text>\n          <text x={x + 46} y=\"232\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"currentColor\" opacity=\"0.6\">{b.sub}</text>\n        </g>\n      )\n    })}\n  </svg>\n</Diagram>\n\nThe improvement from fine-tuning is also *targeted* exactly where you'd want it:\n\n| G-Eval | Base 14B | Qwen-BIM | Δ |\n|---|---|---|---|\n| General tasks | 0.810 | 0.874 | +0.06 |\n| Domain-specific tasks | 0.588 | 0.801 | **+0.21** |\n| Overall | 0.689 | 0.834 | +0.15 |\n\nGeneral ability moved only slightly (+0.06; it was already fine); most of the gain is\nconcentrated in the domain tasks that were broken. That's the signature of fine-tuning doing the right\nthing — adding domain competence without trading away the base model's generality.\n\n## What I'd flag\n\nIt's a careful paper, but keep the scope honest:\n\n- **2D only.** Early tests showed the models couldn't do 3D geometry (collision\n  detection), so those questions were cut. The hard part of real BIM reasoning is 3D.\n- **One narrow task family**, five building types, rule-generated Q&A. G-Eval is an\n  LLM-as-judge metric — better-correlated with humans than BLEU/ROUGE here, but still a\n  proxy. \"Data available on request\" rather than released.\n- The \"matches 671B\" comparison is on *this* benchmark. It's a domain-competence claim,\n  not a general-capability one.\n\n## Why it's the right playbook anyway\n\nStrip away the BIM specifics and this is a template for industrial domain models, the\nkind I think about constantly: you rarely need a frontier model. 
You need (1) a faithful\n**text projection of your structured/3D data**, (2) a benchmark with **rule-computed\nground truth** so you're measuring competence, not vibes, and (3) **reasoning-supervised**\nfine-tuning data — quality and chain-of-thought over raw volume. Get those three right\nand a 14B model on two A6000s reaches the same place a 671B model does, at a fraction of\nthe inference cost. For anyone shipping AI into a real engineering vertical, that economics\nis the whole game.\n\n---\n\n*Paper: [Developing large language model for BIM-based design with domain-specific\nbenchmark and dataset](https://arxiv.org/abs/2602.20812) — Lin, Cai, Ni, Zhou, Pan\n(2026), arXiv:2602.20812.*\n","readingTimeMins":7,"url":"https://ai.thesatyajit.com/blog/qwen-bim-domain-beats-scale"},{"title":"This site is managed by Claude","description":"Why my homepage is an AI-native, agent-operated artifact — and how the content layer works.","date":"2026-06-03","tags":["meta","ai","nextjs"],"draft":false,"kind":"blog","slug":"hello-world","body":"\nGitHub READMEs are dead. After Claude and the wave of coding agents, your homepage\nisn't a static profile — it's a living, agent-readable artifact you can hand to an LLM.\n\nThis site is **dual-native**: every page is both a clean human document and a\nmachine-readable surface. Try fetching [`/blog/hello-world.md`](/blog/hello-world.md)\nor [`/llms.txt`](/llms.txt) — an agent gets structured text, you get the rendered page.\n\n<Callout type=\"tip\">\n  The whole site is maintained by a crew of Claude agents. New posts, logs, and\n  data updates are authored by skills that validate themselves before shipping.\n</Callout>\n\n## What's under the hood\n\nThe content layer is a single source of truth: MDX files validated with Zod, surfaced\nidentically to humans (this page), to agents (the `.md` variant), and to tools\n(the MCP server). 
More on that soon.\n","readingTimeMins":1,"url":"https://ai.thesatyajit.com/blog/hello-world"}],"logs":[{"title":"Scaffolding the AI site","date":"2026-06-03","tags":["build-log"],"kind":"logs","slug":"2026-06-03-scaffolding","body":"\nKicked off `ai.thesatyajit.com`. Wired the content layer (MDX + gray-matter + Zod 4),\nswapped fonts to Hanken Grotesk + IBM Plex Mono, and set up `@next/mdx` with the\nTurbopack string-plugin config. Next: the editorial layout shell and core pages.\n","readingTimeMins":1,"url":"https://ai.thesatyajit.com/logs/2026-06-03-scaffolding"}]}