2026-06-09 · 7 min · llm · fine-tuning · domain-models · bim · paper-notes
Here's the headline from Qwen-BIM (Lin et al., Tsinghua, Feb 2026): a fine-tuned 14B model scores 0.83 on G-Eval for BIM-based design tasks — essentially tied with DeepSeek-R1 at 671B (0.84), and ahead of its own 72B sibling. A model ~48× smaller, matching the frontier on a specific domain.
That result is not surprising on its own — "fine-tune a small model on your domain" is folklore by now. What makes the paper worth reading is the anatomy: where exactly general LLMs fall over on engineering work, and which one design choice did most of the lifting. I work on industrial AI at Inkers, so domain models over 3D/BIM data are close to home. These are my notes.
The actual problem: a BIM model isn't text
A Building Information Model is a structured graph of components — walls, slabs, beams, each with geometry, materials, and relationships. An LLM can't read it. So step one of any LLM-on-BIM pipeline is an unglamorous one the field mostly skips past: turn the model into text.
The authors do exactly that, carefully. Revit models of five building types (malls, offices, dormitories, teaching buildings, museums) are sliced into spatial blocks of ~10–15 components each, defects are injected, and each block is serialized to plain text plus 22 templated questions with hard-coded reference answers. That last detail matters: the reference answers are computed by rules, not by another model, so the ground truth isn't one LLM's opinion of what another LLM should have said.
The whole pipeline, end to end, is just: project the structured model into text, turn it into supervised Q&A, add reasoning traces, and LoRA-fine-tune a small open model on it.
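To make the first two steps concrete, here is a minimal sketch of projecting a block into text and attaching a templated question whose answer comes from a rule. Everything in it is my own illustration, not the paper's code: the `Component` fields, the line format, the `B-` slab prefix and the question wording are all hypothetical. Only the "wall IDs start with Q" convention comes from the paper's examples.

```python
from dataclasses import dataclass


@dataclass
class Component:
    comp_id: str      # hypothetical ID scheme; "Q" for walls follows the naming rule quoted later
    category: str     # "wall", "slab", "beam", ...
    material: str
    length_mm: float
    width_mm: float


def serialize_block(components):
    """Project one spatial block into plain text, one component per line."""
    return "\n".join(
        f"{c.comp_id}: category={c.category}, material={c.material}, "
        f"length={c.length_mm:.0f} mm, width={c.width_mm:.0f} mm"
        for c in components
    )


def count_walls_question(components):
    """Templated question; the reference answer is computed by a rule, not by a model."""
    n_walls = sum(1 for c in components if c.category == "wall")
    return {"question": "How many walls are in this block?", "answer": str(n_walls)}


# Illustrative block (made-up values, far smaller than the ~10-15 components per block).
block = [
    Component("Q-01", "wall", "concrete", 6000, 200),
    Component("Q-02", "wall", "concrete", 4000, 200),
    Component("B-01", "slab", "concrete", 6000, 4000),
]

text = serialize_block(block)
qa = count_walls_question(block)
print(text)
print(qa)
```

The useful property is that `text` and `qa` are derived from the same structured source, so the reference answer cannot drift from what the model is shown.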
Where general LLMs actually fail
They evaluated 11 general models (ChatGLM, Qwen, DeepSeek — including the 671B DeepSeek-V3/R1). The failure modes are specific and, honestly, familiar from any engineering-LLM project:
- Arithmetic. Asked for a slab's planar area, Qwen-max picks the right formula and still returns the wrong number. The bottleneck isn't understanding — it's calculation.
- Natural-language literalism. Models misread parentheses in the answer template, or a naming rule ("wall IDs start with Q"), and confidently apply the wrong transform.
- Missing domain knowledge. Asked to infer a building's floor height, the 14B base model reasons that floor height ≈ slab thickness (120 mm) — coherent chain of thought, wrong mental model, because it was never taught what "floor height" means in practice.
The pattern: general models clear extraction and counting, then degrade sharply on calculation, multi-step reasoning, and anything needing design common sense. On the domain-specific design-review tasks, G-Eval is mostly below 0.8 — not reliable enough to trust.
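Two of those failures are easy to pin down with deterministic rules, which is also roughly how a rule-computed reference answer can be produced. The sketch below is an illustration under my own assumptions (the function names, units and sample numbers are made up, not taken from the paper): planar area via the shoelace formula, and floor height as the spacing between consecutive levels rather than slab thickness.

```python
def polygon_area_m2(vertices_mm):
    """Shoelace formula; vertices in millimetres, listed in order around the outline."""
    twice_area = 0.0
    n = len(vertices_mm)
    for i in range(n):
        x1, y1 = vertices_mm[i]
        x2, y2 = vertices_mm[(i + 1) % n]  # wrap around to close the polygon
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0 / 1_000_000  # mm^2 -> m^2


def floor_heights_mm(level_elevations_mm):
    """Floor height = distance between consecutive levels, not the slab thickness."""
    ordered = sorted(level_elevations_mm)
    return [upper - lower for lower, upper in zip(ordered, ordered[1:])]


# Made-up example: a 6 m x 4 m rectangular slab and three levels 3.6 m apart.
slab_outline = [(0, 0), (6000, 0), (6000, 4000), (0, 4000)]
area = polygon_area_m2(slab_outline)         # 24.0
heights = floor_heights_mm([0, 3600, 7200])  # [3600, 3600]
print(area, heights)
```

Neither function is hard. The point is that a model has to both pick the right rule and execute it exactly, and the failure cases above break on one or the other.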
The one choice that mattered: reasoning supervision
This is the part I'd underline. They built two datasets from the same BIM text:
- BIM-QA — 2,129 plain question→answer pairs.
- BIM-QRA — 1,364 question→reasoning→answer triples, where the intermediate steps are supervised, not just the final answer.
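The difference between the two is easiest to see as records. This is a sketch of what one example of each kind might look like; the `prompt`/`completion` schema, the field layout and the sample slab are my assumptions, not the paper's actual format.

```python
import json


def make_qa_record(block_text, question, answer):
    """Question -> answer: only the final answer is in the training target."""
    return {
        "prompt": f"{block_text}\n\nQuestion: {question}",
        "completion": f"Answer: {answer}",
    }


def make_qra_record(block_text, question, reasoning_steps, answer):
    """Question -> reasoning -> answer: the intermediate steps are supervised too."""
    steps = "\n".join(f"{i}. {step}" for i, step in enumerate(reasoning_steps, start=1))
    return {
        "prompt": f"{block_text}\n\nQuestion: {question}",
        "completion": f"Reasoning:\n{steps}\nAnswer: {answer}",
    }


# Hypothetical block text and question, for illustration only.
block_text = "B-01: category=slab, length=6000 mm, width=4000 mm"
question = "What is the planar area of slab B-01 in square metres?"

qa = make_qa_record(block_text, question, "24")
qra = make_qra_record(
    block_text,
    question,
    [
        "Convert to metres: 6000 mm = 6 m, 4000 mm = 4 m.",
        "Planar area of a rectangle = length x width = 6 x 4 = 24.",
    ],
    "24",
)
print(json.dumps(qra, indent=2))
```

Same prompt, same final answer. The only thing that changes is whether the loss also covers the path to it.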
Then they LoRA-fine-tuned Qwen2.5-14B on different mixes. The result is the kind of finding that should change how you build these datasets:
| Fine-tuning data | Size | G-Eval |
|---|---|---|
| 100% QA | 2,129 | 0.69 |
| 80% QA + 20% QRA | 2,661 | 0.77 |
| 60% QA + 40% QRA | 2,500 | 0.77 |
| 100% QRA | 1,364 | 0.83 |
The smallest dataset, pure reasoning triples, won by a wide margin. G-Eval never dropped as the share of reasoning supervision rose (0.69 → 0.77 → 0.77 → 0.83, flat between the 20% and 40% mixes), and quality beat quantity outright. Teaching the model how to get there, on about a third less data (1,364 examples vs 2,129), beat teaching it only what the answer is.
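If you want to run the same ablation on your own data, the mixing step is a few lines. This is a sketch under my own assumptions, since these notes don't cover how the authors sampled their mixes: `mix_datasets`, its parameters and the placeholder records are hypothetical.

```python
import random


def mix_datasets(qa, qra, qra_share, total, seed=0):
    """Sample a training mix with the requested share of reasoning (QRA) examples."""
    if not 0.0 <= qra_share <= 1.0:
        raise ValueError("qra_share must be between 0 and 1")
    n_qra = round(total * qra_share)
    n_qa = total - n_qra
    if n_qa > len(qa) or n_qra > len(qra):
        raise ValueError("not enough examples for the requested mix")
    rng = random.Random(seed)  # fixed seed so the ablation is reproducible
    mixed = rng.sample(qa, n_qa) + rng.sample(qra, n_qra)
    rng.shuffle(mixed)
    return mixed


# Placeholder records sized like the two datasets described above.
qa_pool = [{"kind": "QA", "id": i} for i in range(2129)]
qra_pool = [{"kind": "QRA", "id": i} for i in range(1364)]

mix = mix_datasets(qa_pool, qra_pool, qra_share=0.2, total=2661)
n_qra = sum(1 for r in mix if r["kind"] == "QRA")
print(len(mix), n_qra)
```

With `total=2661` and a 20% share this uses all 2,129 QA pairs plus 532 QRA triples, which is consistent with the size reported for the 80/20 row.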
Bigger is not better (and the paper shows it twice)
Two clean data points against scale-maximalism:
- On the general benchmark, QwQ-32B out-scored DeepSeek-R1 (671B) on G-Eval. The giant model's verbose reasoning actually hurt — it padded answers with redundant text, tanking format and text-similarity scores without improving correctness.
- After fine-tuning, Qwen-BIM (14B) matched DeepSeek-R1 (671B) and beat the 72B and 32B Qwen models on the domain G-Eval.
The improvement from fine-tuning is also targeted exactly where you'd want it:
| G-Eval | Base 14B | Qwen-BIM | Δ |
|---|---|---|---|
| General tasks | 0.810 | 0.874 | +0.06 |
| Domain-specific tasks | 0.588 | 0.801 | +0.21 |
| Overall | 0.689 | 0.834 | +0.15 |
General ability moved only slightly (+0.06; it was already fine); the bulk of the gain (+0.21) is concentrated in the domain tasks that were broken. That's the signature of fine-tuning doing the right thing: adding domain competence without trading away the base model's generality.
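The deltas are worth recomputing in relative terms, because that's where the asymmetry shows. The scores below are copied from the table; the relative-gain column is my own addition, not a number from the paper.

```python
# (base 14B, Qwen-BIM) G-Eval scores, copied from the table above.
scores = {
    "General tasks": (0.810, 0.874),
    "Domain-specific tasks": (0.588, 0.801),
    "Overall": (0.689, 0.834),
}

deltas = {}
relative = {}
for name, (base, tuned) in scores.items():
    deltas[name] = tuned - base
    relative[name] = (tuned - base) / base  # gain as a fraction of the base score
    print(f"{name}: +{deltas[name]:.3f} absolute, +{relative[name]:.1%} relative")
```

On these numbers the domain-specific gain is roughly 36% relative to the base score, against roughly 8% for general tasks.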
What I'd flag
It's a careful paper, but keep the scope honest:
- 2D only. Early tests showed the models couldn't do 3D geometry (collision detection), so those questions were cut. The hard part of real BIM reasoning is 3D.
- One narrow task family, five building types, rule-generated Q&A. G-Eval is an LLM-as-judge metric: better correlated with humans than BLEU/ROUGE here, but still a proxy. And the data is "available on request" rather than released.
- The "matches 671B" comparison is on this benchmark. It's a domain-competence claim, not a general-capability one.
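For context on why G-Eval is a proxy: in the original G-Eval formulation (Liu et al., 2023), the judge model's candidate scores are weighted by the probability it assigns to each score token, instead of taking its single sampled rating. Here is a minimal sketch of that weighting. The 1-5 scale, the division by the maximum score to land in [0, 1], and the sample probabilities are my assumptions. These notes don't cover how the Qwen-BIM authors configured or normalized their judge.

```python
def weighted_judge_score(score_probs, max_score=5):
    """Probability-weighted judge score, rescaled by max_score (rescaling is an assumption)."""
    total = sum(score_probs.values())
    if total <= 0:
        raise ValueError("score probabilities must sum to a positive value")
    # Renormalize in case the probabilities over score tokens don't sum to exactly 1.
    expected = sum(score * p for score, p in score_probs.items()) / total
    return expected / max_score


# Made-up judge output: probability mass over the ratings 3, 4 and 5.
g = weighted_judge_score({3: 0.1, 4: 0.6, 5: 0.3})
print(round(g, 2))
```

Every input to that function comes from another language model, which is why a rule-computed reference answer underneath it matters so much.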
Why it's the right playbook anyway
Strip away the BIM specifics and this is a template for industrial domain models, the kind I think about constantly: you rarely need a frontier model. You need (1) a faithful text projection of your structured/3D data, (2) a benchmark with rule-computed ground truth so you're measuring competence, not vibes, and (3) reasoning-supervised fine-tuning data, with quality and chain-of-thought prioritized over raw volume. Get those three right and a 14B model on two A6000s reaches the same place on the domain benchmark that a 671B model does, at a fraction of the inference cost. For anyone shipping AI into a real engineering vertical, those economics are the whole game.
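A back-of-envelope version of that cost gap, counting only the memory needed to hold the weights. The assumptions are mine, not the paper's: 2 bytes per parameter (16-bit weights), 48 GB per A6000, and nothing for KV cache, activations or throughput.

```python
import math

A6000_MEMORY_GB = 48  # memory per RTX A6000 card


def weight_memory_gb(params_billion, bytes_per_param):
    """Weights only: 1e9 params x bytes per param / 1e9 bytes per GB."""
    return params_billion * bytes_per_param


def min_cards_for_weights(params_billion, bytes_per_param, card_gb=A6000_MEMORY_GB):
    """Lower bound on cards needed just to hold the weights."""
    return math.ceil(weight_memory_gb(params_billion, bytes_per_param) / card_gb)


for name, params in [("14B", 14), ("671B", 671)]:
    gb = weight_memory_gb(params, 2)        # 2 bytes/param is an assumption
    cards = min_cards_for_weights(params, 2)
    print(f"{name}: {gb} GB of weights, at least {cards} x 48 GB cards")
```

Under those assumptions the 14B model's weights fit on a single card with room to spare, while the 671B model needs dozens of them before it serves a single token.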
Paper: Developing large language model for BIM-based design with domain-specific benchmark and dataset — Lin, Cai, Ni, Zhou, Pan (2026), arXiv:2602.20812.