Evaluating reasoning large language models with human-like thinking in ophthalmic question answering

Objectives: To evaluate the performance of reasoning large language models (LLMs) with human-like thinking in ophthalmic question answering.

Methods: We evaluated two state-of-the-art open-source reasoning LLMs (DeepSeek-R1 and QwQ-32B) and one conventional non-reasoning LLM (LLaMA-3.3-70B-Instruct) on ophthalmology questions, assessing not only answer accuracy (ACC) but also the quality of their reasoning processes. First, we curated MedQA-Eye, a dataset of 967 ophthalmology questions spanning 10 subspecialties, 3 scenarios, 5 medical entities and 3 languages. Second, we proposed a novel framework, based on human thinking patterns essential to medical practice, to evaluate the thinking performance of reasoning LLMs on MedQA-Eye.

Results: DeepSeek-R1 achieved higher overall ACC (90.59%, 95% CI 88.59% to 92.27%) than LLaMA-3.3-70B-Instruct (87.90%, 95% CI 85.69% to 89.81%, p=0.015) and QwQ-32B (84.28%, 95% CI 81.85% to 86.44%, p<0.001), with performance varying across subspecialties. Analysis of the reasoning LLMs revealed incorrect logical inference as the primary point of failure, accounting for 93.41%–94.74% of incorrectly answered questions. We further quantified semantic uncertainty in reasoning LLM thinking as a predictor of answer reliability. DeepSeek-R1 exhibited lower semantic uncertainty (1.04±3.63) than QwQ-32B (4.31±40.70; p<0.001).

Conclusion: Reasoning LLMs demonstrated superior performance in ophthalmology question answering, with DeepSeek-R1 achieving the highest ACC. Our findings demonstrate that reasoning LLMs can better simulate human-like thinking processes than conventional non-reasoning LLMs, suggesting their potential for more trustworthy LLM systems in ophthalmology.
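The accuracy figures and 95% CIs above can be checked with a short calculation. The sketch below is an illustration, not the authors' analysis code: the abstract does not state which interval method was used, and the correct-answer counts (876, 850 and 815 out of 967) are back-calculated here from the reported percentages. It uses the Wilson score interval, one common choice for a binomial proportion; `wilson_ci` and `correct_counts` are names introduced only for this example.

```python
from math import sqrt
from statistics import NormalDist


def wilson_ci(correct, total, confidence=0.95):
    """Return the Wilson score interval (lower, upper) for correct/total."""
    # Two-sided critical value of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


N_QUESTIONS = 967  # size of MedQA-Eye, as stated in the abstract

# Assumption: counts inferred from the reported ACC values
# (e.g. 876 / 967 = 90.59%); they are not given in the abstract.
correct_counts = {
    "DeepSeek-R1": 876,
    "LLaMA-3.3-70B-Instruct": 850,
    "QwQ-32B": 815,
}

for model, k in correct_counts.items():
    lower, upper = wilson_ci(k, N_QUESTIONS)
    print(f"{model}: ACC {k / N_QUESTIONS:.2%}, "
          f"95% CI {lower:.2%} to {upper:.2%}")
```

Under these assumptions the computed intervals agree with the reported ones to two decimal places, which suggests (but does not confirm) that a Wilson-type interval was used. The p values for the between-model comparisons cannot be reproduced this way, since they depend on per-question paired outcomes that the abstract does not report.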

BMJ

Zhouqian Wang, Chenjia Xu, Lei Wang, Wei Qiang, Yanzhen Li, Daoyuan Li, Fabao Xu, Yanyan Zhang, Jiewei Jiang, Zhongwen Li

BMJ Open Ophthalmology

2026
