# A 14B model that matches a 671B one — by knowing its domain

> Satyajit Ghana — Head of Engineering @ Inkers Technology
> canonical: https://ai.thesatyajit.com/blog/qwen-bim-domain-beats-scale
> date: 2026-06-09
> tags: llm, fine-tuning, domain-models, bim, paper-notes

Here's the headline from [Qwen-BIM](https://arxiv.org/abs/2602.20812) (Lin et al., Tsinghua, Feb 2026): a fine-tuned **14B** model scores **0.83 on G-Eval** for BIM-based design tasks — essentially tied with **DeepSeek-R1 at 671B** (0.84), and ahead of its own 72B sibling. A model ~48× smaller, matching the frontier on a specific domain.

That result is not surprising on its own — "fine-tune a small model on your domain" is folklore by now. What makes the paper worth reading is the *anatomy*: where exactly general LLMs fall over on engineering work, and which one design choice did most of the lifting.

I work on industrial AI at [Inkers](https://inkers.ai), so domain models over 3D/BIM data are close to home. These are my notes.

## The actual problem: a BIM model isn't text

A Building Information Model is a structured graph of components — walls, slabs, beams, each with geometry, materials, and relationships. An LLM can't read it. So step one of *any* LLM-on-BIM pipeline is an unglamorous one the field mostly skips past: **turn the model into text.**

The authors do exactly that, carefully. Revit models of five building types (malls, offices, dormitories, teaching buildings, museums) are sliced into spatial blocks of ~10–15 components each, defects are injected, and each block is serialized to plain text plus 22 templated questions with **hard-coded reference answers**. That last detail matters: the ground truth is computed by rules, not by another model, so the benchmark isn't measuring one LLM against another LLM's opinion.

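
To make the serialization step concrete, here is a minimal Python sketch of projecting one spatial block to text and computing reference answers by rule. The `Component` class, its fields, the slab ID `B1`, the question wording, and the text layout are all assumptions for illustration; the paper's actual schema and its 22 templates are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical, simplified stand-in for a BIM component (not the paper's schema)."""
    comp_id: str
    kind: str    # e.g. "wall" or "slab"
    dims: dict   # dimension name -> value in millimetres


def serialize_block(components):
    """Project a spatial block into plain text, one component per line."""
    lines = []
    for c in components:
        dims = ", ".join(f"{name}={value:g} mm" for name, value in c.dims.items())
        lines.append(f"{c.kind} {c.comp_id}: {dims}")
    return "\n".join(lines)


def slab_area_m2(slab):
    """Planar area in square metres, computed by rule rather than by a model."""
    return (slab.dims["length"] / 1000) * (slab.dims["width"] / 1000)


def make_qa(components):
    """Build templated questions whose reference answers come from code."""
    walls = [c for c in components if c.kind == "wall"]
    slabs = [c for c in components if c.kind == "slab"]
    return [
        {
            "question": "List the IDs of all walls.",
            "answer": ", ".join(w.comp_id for w in walls),
        },
        {
            "question": "Compute the planar area of each slab in square metres.",
            "answer": "; ".join(f"{s.comp_id}: {slab_area_m2(s):.2f}" for s in slabs),
        },
    ]


# A toy block; wall IDs start with "Q", following the naming rule mentioned in the post.
block = [
    Component("Q1", "wall", {"length": 6000, "height": 3000, "thickness": 200}),
    Component("Q2", "wall", {"length": 4000, "height": 3000, "thickness": 200}),
    Component("B1", "slab", {"length": 6000, "width": 4000, "thickness": 120}),
]
context = serialize_block(block)
qa_pairs = make_qa(block)
```

Because every answer is derived from the same structured data as the text, a model's output can be scored against a value that no other model had a hand in producing.
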
The questions ladder up in difficulty on purpose: from "list the wall IDs" (extraction) through "compute each slab's area" (calculation) to "is this wall thickness suspicious given residential norms?" (domain reasoning). It's a clean way to see *which rung* a model falls off.

The whole pipeline, end to end, is just: project the structured model into text, turn it into supervised Q&A, add reasoning traces, and LoRA-fine-tune a small open model on it.

## Where general LLMs actually fail

They evaluated 11 general models (ChatGLM, Qwen, DeepSeek — including the 671B DeepSeek-V3/R1). The failure modes are specific and, honestly, familiar from any engineering-LLM project:

- **Arithmetic.** Asked for a slab's planar area, Qwen-max picks the *right formula* and still returns the wrong number. The bottleneck isn't understanding — it's calculation.
- **Natural-language literalism.** Models misread parentheses in the answer template, or a naming rule ("wall IDs start with Q"), and confidently apply the wrong transform.
- **Missing domain knowledge.** Asked to infer a building's floor height, the 14B base model reasons that floor height ≈ slab thickness (120 mm) — coherent chain of thought, wrong mental model, because it was never taught what "floor height" means in practice.

The pattern: general models clear extraction and counting, then degrade sharply on calculation, multi-step reasoning, and anything needing design common sense. On the domain-specific design-review tasks, G-Eval is mostly **below 0.8** — not reliable enough to trust.

## The one choice that mattered: reasoning supervision

This is the part I'd underline. They built two datasets from the same BIM text:

- **BIM-QA** — 2,129 plain question→answer pairs.
- **BIM-QRA** — 1,364 question→**reasoning**→answer triples, where the intermediate steps are supervised, not just the final answer.

Then they LoRA-fine-tuned Qwen2.5-14B on different mixes.

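
The difference between the two record shapes, and what a "mix" means, can be sketched in a few lines of Python. Everything here is hypothetical: the field names, the `Reasoning:`/`Answer:` layout, and the `build_mix` sampling scheme are assumptions for illustration, not the paper's data format or its mixing procedure.

```python
import random


def format_example(record):
    """Render a record as a prompt/target pair for supervised fine-tuning.

    QRA records place the reasoning inside the target, so the training loss
    covers the intermediate steps and not only the final answer.
    """
    prompt = f"{record['context']}\n\nQuestion: {record['question']}"
    if "reasoning" in record:
        target = f"Reasoning: {record['reasoning']}\nAnswer: {record['answer']}"
    else:
        target = f"Answer: {record['answer']}"
    return {"prompt": prompt, "target": target}


def build_mix(qa, qra, qra_fraction, total, seed=0):
    """Sample `total` records, with about `qra_fraction` of them reasoning triples."""
    rng = random.Random(seed)
    n_qra = round(total * qra_fraction)
    n_qa = total - n_qra
    if n_qra > len(qra) or n_qa > len(qa):
        raise ValueError("not enough records for the requested mix")
    mix = rng.sample(qra, n_qra) + rng.sample(qa, n_qa)
    rng.shuffle(mix)
    return [format_example(r) for r in mix]


# Toy records built from one serialized slab (hypothetical text layout).
context = "slab B1: length=6000 mm, width=4000 mm, thickness=120 mm"
qa_record = {
    "context": context,
    "question": "What is the planar area of slab B1 in square metres?",
    "answer": "24.00",
}
qra_record = {
    **qa_record,
    "reasoning": "Planar area is length times width: 6.0 m x 4.0 m = 24.00 square metres.",
}

mix = build_mix([qa_record] * 8, [qra_record] * 2, qra_fraction=0.2, total=10)
```

The only structural change between the two datasets is whether the target contains the worked steps, which makes the comparison between mixes easy to read: same questions, same model, different supervision signal.
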
The fine-tuning result is the kind of finding that should change how you build these datasets:

| Fine-tuning data | Training examples | G-Eval |
|---|---|---|
| 100% QA | 2,129 | 0.69 |
| 80% QA + 20% QRA | 2,661 | 0.77 |
| 60% QA + 40% QRA | 2,500 | 0.77 |
| **100% QRA** | **1,364** | **0.83** |

The **smallest** dataset — pure reasoning triples — won, by a wide margin. G-Eval never dropped as the share of reasoning supervision rose, and quality beat quantity outright. Teaching the model *how to get there*, on roughly two-thirds as many examples (1,364 vs. 2,129), beat teaching it *what the answer is* on the full QA set.

## Bigger is not better (and the paper shows it twice)

Two clean data points against scale-maximalism:

1. On the general benchmark, **QwQ-32B out-scored DeepSeek-R1 (671B)** on G-Eval. The giant model's verbose reasoning actually *hurt* — it padded answers with redundant text, tanking format and text-similarity scores without improving correctness.
2. After fine-tuning, **Qwen-BIM (14B) matched DeepSeek-R1 (671B)** and beat the 72B and 32B Qwen models on the domain G-Eval.

The improvement from fine-tuning is also *targeted* exactly where you'd want it:

| G-Eval by task group | Base 14B | Qwen-BIM | Δ |
|---|---|---|---|
| General tasks | 0.810 | 0.874 | +0.06 |
| Domain-specific tasks | 0.588 | 0.801 | **+0.21** |
| Overall | 0.689 | 0.834 | +0.15 |

General ability moved only slightly (it was already fine); most of the gain is concentrated in the domain tasks that were broken. That's the signature of fine-tuning doing the right thing — adding domain competence without trading away the base model's generality.

## What I'd flag

It's a careful paper, but keep the scope honest:

- **2D only.** Early tests showed the models couldn't do 3D geometry (collision detection), so those questions were cut. The hard part of real BIM reasoning is 3D.
- **One narrow task family**, five building types, rule-generated Q&A. G-Eval is an LLM-as-judge metric — better-correlated with humans than BLEU/ROUGE here, but still a proxy. "Data available on request" rather than released.
- The "matches 671B" comparison is on *this* benchmark. It's a domain-competence claim, not a general-capability one.

## Why it's the right playbook anyway

Strip away the BIM specifics and this is a template for industrial domain models, the kind I think about constantly: you rarely need a frontier model. You need (1) a faithful **text projection of your structured/3D data**, (2) a benchmark with **rule-computed ground truth** so you're measuring competence, not vibes, and (3) **reasoning-supervised** fine-tuning data — quality and chain-of-thought over raw volume.

Get those three right and a 14B model on two A6000s reaches the same place a 671B model does, at a fraction of the inference cost. For anyone shipping AI into a real engineering vertical, those economics are the whole game.

---

*Paper: [Developing large language model for BIM-based design with domain-specific benchmark and dataset](https://arxiv.org/abs/2602.20812) — Lin, Cai, Ni, Zhou, Pan (2026), arXiv:2602.20812.*