Evaluating reasoning large language models with human-like thinking in ophthalmic question answering
Objectives
To evaluate the performance of reasoning large language models (LLMs) with human-like thinking in ophthalmic question answering.
Methods
We evaluated two state-of-the-art open-source reasoning LLMs (DeepSeek-R1 and QwQ-32B) and one conventional non-reasoning LLM (LLaMA-3.3-70B-Instruct) on ophthalmology questions, assessing not only answer accuracy (ACC) but also the quality of their reasoning processes. First, we curated MedQA-Eye, a dataset of 967 ophthalmology questions across 10 subspecialties, 3 scenarios, 5 medical entities and 3 languages. Second, we proposed a novel framework, based on human thinking patterns essential to medical practice, to evaluate the thinking performance of reasoning LLMs on MedQA-Eye.
Results
DeepSeek-R1 achieved higher overall ACC (90.59%, 95% CI 88.59% to 92.27%) than LLaMA-3.3-70B-Instruct (87.90%, 95% CI 85.69% to 89.81%, p=0.015) and QwQ-32B (84.28%, 95% CI 81.85% to 86.44%, p<0.001), with performance varying across subspecialties. Analysis of reasoning LLMs revealed incorrect logical inference as the primary point of failure, accounting for 93.41%–94.74% of incorrectly answered questions. We further quantified semantic uncertainty in reasoning LLM thinking as a predictor of answer reliability. DeepSeek-R1 exhibited lower semantic uncertainty (1.04±3.63) than QwQ-32B (4.31±40.70; p<0.001).
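The abstract does not state how the 95% CIs were computed. As an illustrative sketch only, the Python snippet below computes a Wilson score interval for a binomial proportion. The count of 876 correct answers is an assumption, back-calculated from 90.59% of 967 questions and not reported in the text; the function name `wilson_interval` is likewise hypothetical. Under these assumptions the result comes out at approximately 88.6% to 92.3%, which is consistent with the interval reported for DeepSeek-R1, but this should not be read as confirmation of the authors' method.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


N_QUESTIONS = 967      # dataset size stated in the abstract
ASSUMED_CORRECT = 876  # ASSUMPTION: back-calculated from 90.59% of 967, not reported

low, high = wilson_interval(ASSUMED_CORRECT, N_QUESTIONS)
print(f"ACC = {ASSUMED_CORRECT / N_QUESTIONS:.2%}, 95% CI {low:.2%} to {high:.2%}")
```

The Wilson interval is used here because it behaves better than the simple normal approximation when accuracy is close to 100%; other interval methods would give slightly different bounds.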
Conclusion
Reasoning LLMs demonstrated superior performance in ophthalmology question answering, with DeepSeek-R1 achieving the highest ACC. Our findings indicate that reasoning LLMs can better simulate human-like thinking processes than conventional non-reasoning LLMs, suggesting their potential for more trustworthy LLM systems in ophthalmology.

