~/satyajit

A 14B model that matches a 671B one — by knowing its domain


2026-06-09 · 7 min · llm · fine-tuning · domain-models · bim · paper-notes

Here's the headline from Qwen-BIM (Lin et al., Tsinghua, Feb 2026): a fine-tuned 14B model scores 0.83 on G-Eval for BIM-based design tasks — essentially tied with DeepSeek-R1 at 671B (0.84), and ahead of its own 72B sibling. A model ~48× smaller, matching the frontier on a specific domain.

That result is not surprising on its own — "fine-tune a small model on your domain" is folklore by now. What makes the paper worth reading is the anatomy: where exactly general LLMs fall over on engineering work, and which one design choice did most of the lifting. I work on industrial AI at Inkers, so domain models over 3D/BIM data are close to home. These are my notes.

The actual problem: a BIM model isn't text

A Building Information Model is a structured graph of components — walls, slabs, beams, each with geometry, materials, and relationships. An LLM can't read it. So step one of any LLM-on-BIM pipeline is an unglamorous one the field mostly skips past: turn the model into text.

The authors do exactly that, carefully. Revit models of five building types (malls, offices, dormitories, teaching buildings, museums) are sliced into spatial blocks of ~10–15 components each, defects are injected, and each block is serialized to plain text plus 22 templated questions with hard-coded reference answers. That last detail matters: the ground truth is computed by rules, not by another model, so the benchmark isn't measuring one LLM against another LLM's opinion.
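To make the "serialize, then compute the answer by rule" idea concrete, here is a minimal sketch. The `Component` fields, the text layout, and the question wording are all my own illustrative assumptions; the paper's actual serialization format and its 22 templates are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Component:
    # Hypothetical, simplified stand-in for a Revit element.
    comp_id: str
    category: str      # e.g. "wall", "slab", "beam"
    material: str
    length_m: float


def serialize_block(components):
    """Project a block of components into plain text, one line each."""
    lines = [
        f"{c.comp_id}: category={c.category}, material={c.material}, "
        f"length={c.length_m:.2f} m"
        for c in components
    ]
    return "\n".join(lines)


def count_question(components, category):
    """Build a templated counting question with a rule-computed answer."""
    question = f"How many components of category '{category}' are in this block?"
    # The reference answer comes from the data itself, not from a model.
    answer = str(sum(1 for c in components if c.category == category))
    return {"question": question, "answer": answer}


block = [
    Component("W1", "wall", "concrete", 4.20),
    Component("W2", "wall", "brick", 3.00),
    Component("B1", "beam", "steel", 6.50),
]
text = serialize_block(block)
qa = count_question(block, "wall")
print(text)
print(qa)
```

The point of the design is that `answer` is derived from the same structured data the text was generated from, so the text and the ground truth cannot drift apart.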

The whole pipeline, end to end, is just: project the structured model into text, turn it into supervised Q&A, add reasoning traces, and LoRA-fine-tune a small open model on it.

[Figure: BIM model (Revit graph) → textualize + 22 questions → BIM-QA (2,129 pairs) → BIM-QRA (1,364 + reasoning) → Qwen-BIM (LoRA · 14B). Structured 3D → text → supervised data → small fine-tuned model.]
The Qwen-BIM pipeline: a faithful text projection becomes rule-checked Q&A, then reasoning-supervised QRA, then a LoRA fine-tune of a 14B model.

Where general LLMs actually fail

They evaluated 11 general models (ChatGLM, Qwen, DeepSeek — including the 671B DeepSeek-V3/R1). The failure modes are specific and, honestly, familiar from any engineering-LLM project:

The pattern: general models clear extraction and counting, then degrade sharply on calculation, multi-step reasoning, and anything needing design common sense. On the domain-specific design-review tasks, G-Eval is mostly below 0.8 — not reliable enough to trust.

The one choice that mattered: reasoning supervision

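This is the part I'd underline. They built two datasets from the same BIM text: BIM-QA, with 2,129 question-answer pairs, and BIM-QRA, with 1,364 question-reasoning-answer triples.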

Then they LoRA-fine-tuned Qwen2.5-14B on different mixes. The result is the kind of finding that should change how you build these datasets:

| Fine-tuning data | Size | G-Eval |
|---|---|---|
| 100% QA | 2,129 | 0.69 |
| 80% QA + 20% QRA | 2,661 | 0.77 |
| 60% QA + 40% QRA | 2,500 | 0.77 |
| 100% QRA | 1,364 | 0.83 |

The smallest dataset, pure reasoning triples, won clearly: 0.83 against 0.77 for either mix and 0.69 for QA alone. Adding reasoning supervision never lowered G-Eval (the 20% and 40% QRA mixes tied at 0.77), and quality beat quantity outright. Teaching the model how to get there, on roughly two-thirds as many examples (1,364 vs 2,129), beat teaching it what the answer is on the full QA set.
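The difference between the two kinds of supervision is easiest to see in the training records themselves. This is a sketch under my own assumptions: the `prompt`/`completion` field names, the `Reasoning:`/`Answer:` layout, and the example block are hypothetical, not the paper's format.

```python
def format_qa(bim_text, question, answer):
    """Answer-only supervision: the target is just the final answer."""
    prompt = f"{bim_text}\n\nQuestion: {question}"
    return {"prompt": prompt, "completion": answer}


def format_qra(bim_text, question, reasoning, answer):
    """Reasoning supervision: the target spells out the steps, then the answer."""
    prompt = f"{bim_text}\n\nQuestion: {question}"
    completion = f"Reasoning: {reasoning}\nAnswer: {answer}"
    return {"prompt": prompt, "completion": completion}


# Hypothetical two-wall block and question, for illustration only.
bim_text = "W1: category=wall, length=4.20 m\nW2: category=wall, length=3.00 m"
question = "What is the total wall length in this block?"

qa = format_qa(bim_text, question, "7.20 m")
qra = format_qra(
    bim_text,
    question,
    "W1 and W2 are walls. 4.20 m + 3.00 m = 7.20 m.",
    "7.20 m",
)
print(qa["completion"])
print(qra["completion"])
```

Both records share the same prompt; only the target differs. With the QRA record, the loss is computed over the intermediate steps as well as the final answer, which is one plausible reading of why the smaller dataset carries more signal per example.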

Bigger is not better (and the paper shows it twice)

Two clean data points against scale-maximalism:

  1. On the general benchmark, QwQ-32B out-scored DeepSeek-R1 (671B) on G-Eval. The giant model's verbose reasoning actually hurt — it padded answers with redundant text, tanking format and text-similarity scores without improving correctness.
  2. After fine-tuning, Qwen-BIM (14B) matched DeepSeek-R1 (671B) and beat the 72B and 32B Qwen models on the domain G-Eval.

[Figure: bar chart of G-Eval scores, axis 0.5 to 0.9. Qwen2.5 base 14B: 0.69; Qwen-BIM (14B, fine-tuned): 0.83; DeepSeek-R1 (671B): 0.84.]
G-Eval on BIM design tasks. Fine-tuning lifts a 14B model from 0.69 to 0.83 — level with a 671B reasoning model ~48× its size.
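On the first point above, it helps to see why padding alone can sink a text-similarity score even when the answer inside it is correct. The sketch below uses token-overlap F1 as a stand-in; that choice is my assumption, and the paper's actual similarity metrics may differ. The example strings are made up.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between a prediction and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "the block contains 2 walls"
concise = "the block contains 2 walls"
verbose = (
    "let me think step by step about this question "
    "the block contains 2 walls so the final answer is 2 walls"
)

print(token_f1(concise, reference))
print(token_f1(verbose, reference))
```

Both predictions contain the full reference, so recall is perfect in each case. The verbose one loses on precision: every extra token dilutes the overlap, and its F1 falls well below that of the concise answer.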

The improvement from fine-tuning is also targeted exactly where you'd want it:

| G-Eval | Base 14B | Qwen-BIM | Δ |
|---|---|---|---|
| General tasks | 0.810 | 0.874 | +0.06 |
| Domain-specific tasks | 0.588 | 0.801 | +0.21 |
| Overall | 0.689 | 0.834 | +0.15 |

General ability moved only slightly (+0.06; it was already fine), while the domain tasks that were broken gained +0.21, more than three times as much. That's the signature of fine-tuning doing the right thing: adding domain competence without trading away the base model's generality.
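The arithmetic behind those deltas and the size comparison is quick to check. The scores and parameter counts are the ones quoted in this post; the relative-gain column is my own addition, not something the paper reports.

```python
# (base 14B, Qwen-BIM) G-Eval scores as quoted above.
scores = {
    "General tasks": (0.810, 0.874),
    "Domain-specific tasks": (0.588, 0.801),
    "Overall": (0.689, 0.834),
}

deltas = {name: round(tuned - base, 3) for name, (base, tuned) in scores.items()}
for name, delta in deltas.items():
    base = scores[name][0]
    print(f"{name}: +{delta:.3f} absolute, +{100 * delta / base:.0f}% relative")

# Parameter counts in billions: DeepSeek-R1 vs the 14B fine-tune.
size_ratio = 671 / 14
print(f"Parameter ratio: {size_ratio:.1f}x")
```

The domain delta is more than three times the general one in absolute terms, and the parameter ratio is where the "~48×" figure comes from.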

What I'd flag

It's a careful paper, but keep the scope honest:

Why it's the right playbook anyway

Strip away the BIM specifics and this is a template for industrial domain models, the kind I think about constantly: you rarely need a frontier model. You need (1) a faithful text projection of your structured/3D data, (2) a benchmark with rule-computed ground truth so you're measuring competence, not vibes, and (3) reasoning-supervised fine-tuning data — quality and chain-of-thought over raw volume. Get those three right and a 14B model on two A6000s reaches the same place a 671B model does, at a fraction of the inference cost. For anyone shipping AI into a real engineering vertical, that economics is the whole game.


Paper: Developing large language model for BIM-based design with domain-specific benchmark and dataset — Lin, Cai, Ni, Zhou, Pan (2026), arXiv:2602.20812.
