# A 14B model that matches a 671B one — by knowing its domain

> Satyajit Ghana — Head of Engineering @ Inkers Technology
> canonical: https://ai.thesatyajit.com/blog/qwen-bim-domain-beats-scale
> date: 2026-06-09
> tags: llm, fine-tuning, domain-models, bim, paper-notes

Here's the headline from [Qwen-BIM](https://arxiv.org/abs/2602.20812) (Lin et al.,
Tsinghua, Feb 2026): a fine-tuned **14B** model scores **0.83 on G-Eval** for
BIM-based design tasks — essentially tied with **DeepSeek-R1 at 671B** (0.84), and
ahead of its own 72B sibling. A model ~48× smaller, matching the frontier on a
specific domain.

That result is not surprising on its own — "fine-tune a small model on your domain"
is folklore by now. What makes the paper worth reading is the *anatomy*: where exactly
general LLMs fall over on engineering work, and which one design choice did most of the
lifting. I work on industrial AI at [Inkers](https://inkers.ai), so domain models over
3D/BIM data are close to home. These are my notes.

## The actual problem: a BIM model isn't text

A Building Information Model is a structured graph of components — walls, slabs, beams,
each with geometry, materials, and relationships. An LLM can't read it. So step one of
*any* LLM-on-BIM pipeline is an unglamorous one the field mostly skips past: **turn the
model into text.**

The authors do exactly that, carefully. Revit models of five building types (malls,
offices, dormitories, teaching buildings, museums) are sliced into spatial blocks of
~10–15 components each, defects are injected, and each block is serialized to plain text
plus 22 templated questions with **hard-coded reference answers**. That last detail
matters: the ground truth is computed by rules, not by another model, so the benchmark
isn't measuring one LLM against another LLM's opinion.

<Callout type="note">
  The questions ladder up in difficulty on purpose: from "list the wall IDs"
  (extraction) through "compute each slab's area" (calculation) to "is this wall
  thickness suspicious given residential norms?" (domain reasoning). It's a clean way to
  see *which rung* a model falls off.
</Callout>

The whole pipeline, end to end, is just: project the structured model into text, turn it
into supervised Q&A, add reasoning traces, and LoRA-fine-tune a small open model on it.

<Diagram caption="The Qwen-BIM pipeline: a faithful text projection becomes rule-checked Q&A, then reasoning-supervised QRA, then a LoRA fine-tune of a 14B model.">
  <svg viewBox="0 0 620 130" role="img" aria-label="Pipeline from BIM model to Qwen-BIM" style={{ width: "100%", height: "auto", color: "var(--foreground)" }}>
    <defs>
      <marker id="qb-arrow" viewBox="0 0 10 10" refX="8" refY="5" markerWidth="6" markerHeight="6" orient="auto-start-reverse">
        <path d="M0,0 L10,5 L0,10 z" fill="currentColor" />
      </marker>
    </defs>
    {[
      { x: 8, t1: "BIM model", t2: "Revit graph" },
      { x: 132, t1: "textualize", t2: "+ 22 questions" },
      { x: 256, t1: "BIM-QA", t2: "2,129 pairs" },
      { x: 380, t1: "BIM-QRA", t2: "1,364 + reasoning" },
      { x: 504, t1: "Qwen-BIM", t2: "LoRA · 14B" },
    ].map((n, i) => (
      <g key={i}>
        <rect x={n.x} y={42} width={108} height={46} rx={8} fill="none" stroke="currentColor" strokeOpacity={i === 4 ? 0.9 : 0.45} />
        <text x={n.x + 54} y={64} textAnchor="middle" fontFamily="monospace" fontSize="13" fill="currentColor">{n.t1}</text>
        <text x={n.x + 54} y={79} textAnchor="middle" fontFamily="monospace" fontSize="9" fill="currentColor" opacity="0.6">{n.t2}</text>
        {i < 4 ? <line x1={n.x + 108} y1={65} x2={n.x + 124} y2={65} stroke="currentColor" strokeWidth="1.5" markerEnd="url(#qb-arrow)" /> : null}
      </g>
    ))}
    <text x={310} y={20} textAnchor="middle" fontFamily="monospace" fontSize="10" fill="currentColor" opacity="0.55">structured 3D → text → supervised data → small fine-tuned model</text>
  </svg>
</Diagram>

## Where general LLMs actually fail

They evaluated 11 general models (ChatGLM, Qwen, DeepSeek — including the 671B
DeepSeek-V3/R1). The failure modes are specific and, honestly, familiar from any
engineering-LLM project:

- **Arithmetic.** Asked for a slab's planar area, Qwen-max picks the *right formula* and
  still returns the wrong number. The bottleneck isn't understanding — it's calculation.
- **Natural-language literalism.** Models misread parentheses in the answer template, or
  a naming rule ("wall IDs start with Q"), and confidently apply the wrong transform.
- **Missing domain knowledge.** Asked to infer a building's floor height, the 14B base
  model reasons that floor height ≈ slab thickness (120 mm) — coherent chain of thought,
  wrong mental model, because it was never taught what "floor height" means in practice.

The pattern: general models clear extraction and counting, then degrade sharply on
calculation, multi-step reasoning, and anything needing design common sense. On the
domain-specific design-review tasks, G-Eval is mostly **below 0.8** — not reliable
enough to trust.

## The one choice that mattered: reasoning supervision

This is the part I'd underline. They built two datasets from the same BIM text:

- **BIM-QA** — 2,129 plain question→answer pairs.
- **BIM-QRA** — 1,364 question→**reasoning**→answer triples, where the intermediate
  steps are supervised, not just the final answer.

Then they LoRA-fine-tuned Qwen2.5-14B on different mixes. The result is the kind of
finding that should change how you build these datasets:

| Fine-tuning data | Size | G-Eval |
|---|---|---|
| 100% QA | 2,129 | 0.69 |
| 80% QA + 20% QRA | 2,661 | 0.77 |
| 60% QA + 40% QRA | 2,500 | 0.77 |
| **100% QRA** | **1,364** | **0.83** |

The **smallest** dataset — pure reasoning triples — won, by a wide margin. More
reasoning supervision monotonically improved G-Eval, and quality beat quantity outright.
Teaching the model *how to get there*, on a third of the data, beat teaching it *what the
answer is* on the full set.

## Bigger is not better (and the paper shows it twice)

Two clean data points against scale-maximalism:

1. On the general benchmark, **QwQ-32B out-scored DeepSeek-R1 (671B)** on G-Eval. The
   giant model's verbose reasoning actually *hurt* — it padded answers with redundant
   text, tanking format and text-similarity scores without improving correctness.
2. After fine-tuning, **Qwen-BIM (14B) matched DeepSeek-R1 (671B)** and beat the 72B and
   32B Qwen models on the domain G-Eval.

<Diagram caption="G-Eval on BIM design tasks. Fine-tuning lifts a 14B model from 0.69 to 0.83 — level with a 671B reasoning model ~48× its size.">
  <svg viewBox="0 0 560 240" role="img" aria-label="G-Eval comparison: base 14B vs Qwen-BIM 14B vs DeepSeek-R1 671B" style={{ width: "100%", height: "auto", color: "var(--foreground)" }}>
    {/* y gridlines at 0.5..0.9 — chart area y 20..200 maps score 0.9..0.5 */}
    {[0.5, 0.6, 0.7, 0.8, 0.9].map((v) => {
      const y = 200 - ((v - 0.5) / 0.4) * 180
      return (
        <g key={v}>
          <line x1="56" y1={y} x2="540" y2={y} stroke="currentColor" strokeOpacity="0.12" />
          <text x="48" y={y + 4} textAnchor="end" fontFamily="monospace" fontSize="10" fill="currentColor" opacity="0.5">{v.toFixed(1)}</text>
        </g>
      )
    })}
    {[
      { label: "base 14B", sub: "Qwen2.5", score: 0.69, hl: false },
      { label: "Qwen-BIM", sub: "14B · fine-tuned", score: 0.83, hl: true },
      { label: "DeepSeek-R1", sub: "671B", score: 0.84, hl: false },
    ].map((b, i) => {
      const x = 96 + i * 150
      const y = 200 - ((b.score - 0.5) / 0.4) * 180
      return (
        <g key={i}>
          <rect x={x} y={y} width="92" height={200 - y} rx="4" fill="currentColor" fillOpacity={b.hl ? 0.85 : 0.3} />
          <text x={x + 46} y={y - 8} textAnchor="middle" fontFamily="monospace" fontSize="13" fontWeight="bold" fill="currentColor">{b.score.toFixed(2)}</text>
          <text x={x + 46} y="218" textAnchor="middle" fontFamily="monospace" fontSize="11" fill="currentColor">{b.label}</text>
          <text x={x + 46} y="232" textAnchor="middle" fontFamily="monospace" fontSize="9" fill="currentColor" opacity="0.6">{b.sub}</text>
        </g>
      )
    })}
  </svg>
</Diagram>

The improvement from fine-tuning is also *targeted* exactly where you'd want it:

| G-Eval | Base 14B | Qwen-BIM | Δ |
|---|---|---|---|
| General tasks | 0.810 | 0.874 | +0.06 |
| Domain-specific tasks | 0.588 | 0.801 | **+0.21** |
| Overall | 0.689 | 0.834 | +0.15 |

General ability barely moved (it was already fine); the entire gain is concentrated in
the domain tasks that were broken. That's the signature of fine-tuning doing the right
thing — adding domain competence without trading away the base model's generality.

## What I'd flag

It's a careful paper, but keep the scope honest:

- **2D only.** Early tests showed the models couldn't do 3D geometry (collision
  detection), so those questions were cut. The hard part of real BIM reasoning is 3D.
- **One narrow task family**, five building types, rule-generated Q&A. G-Eval is an
  LLM-as-judge metric — better-correlated with humans than BLEU/ROUGE here, but still a
  proxy. "Data available on request" rather than released.
- The "matches 671B" comparison is on *this* benchmark. It's a domain-competence claim,
  not a general-capability one.

## Why it's the right playbook anyway

Strip away the BIM specifics and this is a template for industrial domain models, the
kind I think about constantly: you rarely need a frontier model. You need (1) a faithful
**text projection of your structured/3D data**, (2) a benchmark with **rule-computed
ground truth** so you're measuring competence, not vibes, and (3) **reasoning-supervised**
fine-tuning data — quality and chain-of-thought over raw volume. Get those three right
and a 14B model on two A6000s reaches the same place a 671B model does, at a fraction of
the inference cost. For anyone shipping AI into a real engineering vertical, that economics
is the whole game.

---

*Paper: [Developing large language model for BIM-based design with domain-specific
benchmark and dataset](https://arxiv.org/abs/2602.20812) — Lin, Cai, Ni, Zhou, Pan
(2026), arXiv:2602.20812.*
