
Five advanced chatbots solving European Diploma in Radiology (EDiR) text-based questions: differences in performance and consistency

Abstract

Background: We compared the performance, confidence, and response consistency of five chatbots powered by large language models in solving European Diploma in Radiology (EDiR) text-based multiple-response questions.

Methods: ChatGPT-4o, ChatGPT-4o-mini, Copilot, Gemini, and Claude 3.5 Sonnet were tested in two iterations using 52 text-based multiple-response questions from two previous EDiR sessions. Each chatbot was prompted to evaluate every answer as correct or incorrect and to grade its own confidence on a scale of 0 (not confident at all) to 10 (most confident). Scores per question were calculated using a weighted formula that accounted for correct and incorrect answers (range 0.0–1.0).

Results: Claude 3.5 Sonnet achieved the highest score per question (0.84 ± 0.26, mean ± standard deviation), compared to ChatGPT-4o (0.76 ± 0.31), ChatGPT-4o-mini (0.64 ± 0.35), Copilot (0.62 ± 0.37), and Gemini (0.54 ± 0.39) (p < 0.001). Self-reported confidence in answering the questions was highest for Claude 3.5 Sonnet (9.0 ± 0.9), followed by ChatGPT-4o (8.7 ± 1.1), ChatGPT-4o-mini (8.2 ± 1.3), Copilot (8.2 ± 2.2), and Gemini (8.2 ± 1.6) (p < 0.001). Claude 3.5 Sonnet was also the most consistent, changing its responses in 5.4% of cases between the two iterations, compared to ChatGPT-4o (6.5%), ChatGPT-4o-mini (8.8%), Copilot (13.8%), and Gemini (18.5%). All chatbots outperformed human candidates from previous EDiR sessions and achieved a passing grade in this part of the examination.

Conclusion: Claude 3.5 Sonnet exhibited superior accuracy, confidence, and consistency, with ChatGPT-4o performing nearly as well. The variation in performance among the evaluated models was substantial.

Relevance statement: Variation in performance, consistency, and confidence among chatbots in solving EDiR test-based questions highlights the need for cautious deployment, particularly in high-stakes clinical and educational settings.

Key Points
Claude 3.5 Sonnet outperformed other chatbots in accuracy and response consistency.
ChatGPT-4o ranked second, showing strong but slightly less reliable performance.
All chatbots surpassed EDiR candidates in text-based EDiR questions.
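
The consistency figures above are percentages of responses that changed between the two iterations. The abstract does not state exactly how a "case" was counted, so the following is only a minimal sketch under the assumption that each answer option's correct/incorrect evaluation counts as one case. The function name `response_change_rate` and the sample data are hypothetical and are not taken from the study.

```python
def response_change_rate(first, second):
    """Return the share of evaluations that differ between two iterations.

    Assumption (not from the source): each element is one chatbot evaluation
    of one answer option, True for "correct" and False for "incorrect".
    """
    if len(first) != len(second):
        raise ValueError("both iterations must cover the same answer options")
    if not first:
        raise ValueError("no responses to compare")
    # Count positions where the chatbot gave a different evaluation.
    changed = sum(a != b for a, b in zip(first, second))
    return changed / len(first)


# Hypothetical evaluations for ten answer options, not data from the study.
run_1 = [True, False, True, True, False, True, False, False, True, True]
run_2 = [True, False, False, True, False, True, False, True, True, True]

rate = response_change_rate(run_1, run_2)
print(f"{rate:.1%} of responses changed")
```

In this made-up example, two of the ten evaluations differ, so the rate is 20%. A lower rate corresponds to the more consistent behaviour the abstract reports.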
