Five advanced chatbots solving European Diploma in Radiology (EDiR) text-based questions: differences in performance and consistency
Abstract
Background
We compared the performance, confidence, and response consistency of five chatbots powered by large language models in solving European Diploma in Radiology (EDiR) text-based multiple-response questions.
Methods
ChatGPT-4o, ChatGPT-4o-mini, Copilot, Gemini, and Claude 3.5 Sonnet were each tested in two iterations on 52 text-based multiple-response questions from two previous EDiR sessions. The chatbots were prompted to classify each answer option as correct or incorrect and to grade their confidence on a scale from 0 (not confident at all) to 10 (most confident). Scores per question were calculated using a weighted formula that accounted for correct and incorrect answers (range 0.0–1.0).
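The exact weighting used in the examination's scoring scheme is not stated here; the following is a minimal Python sketch, assuming a scheme in which each correctly classified answer option earns credit, each misclassified option subtracts it, and the result is clamped to the reported 0.0–1.0 range. The function name and data layout are hypothetical.

    # Hypothetical per-question scoring sketch (the exact EDiR weighting is not
    # given in the abstract): each correctly classified answer option adds credit,
    # each misclassified one subtracts it, and the score is clamped to 0.0-1.0.
    def score_question(chatbot_labels, answer_key):
        """Return a weighted score in [0.0, 1.0] for one multiple-response question.

        chatbot_labels and answer_key are equal-length lists of booleans, one
        entry per answer option (True = marked correct, False = marked incorrect).
        """
        if len(chatbot_labels) != len(answer_key):
            raise ValueError("label lists must cover the same answer options")
        n_options = len(answer_key)
        hits = sum(c == k for c, k in zip(chatbot_labels, answer_key))
        misses = n_options - hits
        raw = (hits - misses) / n_options  # penalise wrong classifications
        return max(0.0, raw)               # clamp so the minimum score is 0.0

    # Example: 5 answer options, chatbot misclassifies one of them -> 0.6
    print(score_question([True, False, False, True, False],
                         [True, False, True, True, False]))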
Results
Claude 3.5 Sonnet achieved the highest score per question (0.84 ± 0.26, mean ± standard deviation) compared to ChatGPT-4o (0.76 ± 0.31), ChatGPT-4o-mini (0.64 ± 0.35), Copilot (0.62 ± 0.37), and Gemini (0.54 ± 0.39) (p < 0.001). Self-reported confidence in answering the questions was highest for Claude 3.5 Sonnet (9.0 ± 0.9), followed by ChatGPT-4o (8.7 ± 1.1), ChatGPT-4o-mini (8.2 ± 1.3), Copilot (8.2 ± 2.2), and Gemini (8.2 ± 1.6) (p < 0.001). Claude 3.5 Sonnet demonstrated superior consistency, changing responses in 5.4% of cases between the two iterations, compared to ChatGPT-4o (6.5%), ChatGPT-4o-mini (8.8%), Copilot (13.8%), and Gemini (18.5%). All chatbots outperformed human candidates from previous EDiR sessions, achieving a passing grade for this part of the examination.
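Consistency is reported as the percentage of responses that changed between the two iterations; the exact bookkeeping is not described, so the sketch below simply counts differing per-option classifications across the two runs (hypothetical helper, same assumed data layout as above).

    # Hypothetical sketch of the response-change rate between the two iterations.
    def change_rate(iteration_1, iteration_2):
        """Fraction of per-option classifications that differ between two runs."""
        if len(iteration_1) != len(iteration_2):
            raise ValueError("both iterations must cover the same responses")
        changed = sum(a != b for a, b in zip(iteration_1, iteration_2))
        return changed / len(iteration_1)

    # Example: 2 of 8 classifications flipped between runs -> 0.25 (25%)
    print(change_rate([1, 0, 0, 1, 1, 0, 1, 0],
                      [1, 0, 1, 1, 1, 0, 0, 0]))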
Conclusion
Claude 3.5 Sonnet exhibited superior accuracy, confidence, and consistency, with ChatGPT-4o performing nearly as well. The variation in performance among the evaluated models was substantial.
Relevance statement
Variation in performance, consistency, and confidence among chatbots in solving EDiR text-based questions highlights the need for cautious deployment, particularly in high-stakes clinical and educational settings.
Key Points
Claude 3.5 Sonnet outperformed other chatbots in accuracy and response consistency.
ChatGPT-4o ranked second, showing strong but slightly less reliable performance.
All chatbots surpassed EDiR candidates in text-based EDiR questions.
Graphical Abstract