
Benchmarking Knowledge and Capability of Large Language Models in Building Science Domain

Large language models (LLMs) are increasingly adopted across scientific and engineering fields. However, applying general-purpose LLMs to specialized engineering domains imposes stringent requirements for structured knowledge, rigorous reasoning, and technical precision. Thus, the suitability of current general-purpose LLMs for practical applications in engineering domains remains questionable. To understand how well LLMs master building science, a broad but specific engineering domain, we perform a comprehensive benchmark analysis (with a benchmark dataset of 1,487 questions) to evaluate the abilities of 15 state-of-the-art (SOTA) LLMs across 12 core subject topics in the building science domain. To enable scalable and robust evaluation, we propose and validate an AI-Judger for assessment across five dimensions of abilities, including knowledge and concept, logic and consistency, clarity of expression, and reflection and exploratory thinking. Overall, SOTA general-purpose LLMs achieve only ~50% accuracy on average in answering different types of questions. The capabilities of LLMs decrease progressively from linguistic expression and factual knowledge to logical reasoning, and then to reflection and exploratory thinking. Across task types, LLMs exhibit notably low accuracy on calculation (~13%), short-answer (~23%), and cloze tasks (~30%), in contrast to stronger performance on single-choice (74%) and multiple-choice questions (63%). Finally, LLM performance varies markedly across topics, with relatively low accuracy on physics fundamentals and HVAC&R-related questions (median of 20%-40%) compared to ~80% for building standards and codes. These gaps highlight the limitations of general-purpose LLMs in engineering contexts and point to the need for domain-specific LLMs tailored to engineering applications.

Innovation Press Co., Limited

Gang Jiang, Keyan Chen, Jiatao Liu, Jianli Chen, Shandian Zhe, Zhe Wang

Energy Use

2025



