
Comparison of ChatGPT 3.5 Turbo and Human Performance in taking the European Board of Ophthalmology Diploma (EBOD) Exam

Abstract
Background/Objectives: This paper aims to assess ChatGPT’s performance in answering European Board of Ophthalmology Diploma (EBOD) examination papers and to compare these results with pass benchmarks and candidate results.
Methods: This cross-sectional study used a sample of past exam papers from the 2012, 2013, and 2020–2023 EBOD examinations. It analysed ChatGPT’s responses to 392 Multiple Choice Questions (MCQs), each containing 5 true/false statements (1432 statements in total), and 48 Single Best Answer (SBA) questions.
Results: ChatGPT scored an average of 64.39% on MCQs, with precision (68.76%) its strongest metric. It performed best on Pathology questions (Grubbs test, p < .05), while Optics and Refraction had the lowest-scoring MCQ performance across all metrics. ChatGPT’s SBA performance averaged 28.43%, again with precision (29.36%) the strongest metric. Pathology SBA questions were consistently the lowest-scoring topic across most metrics. ChatGPT chose option 1 more often than the other options (p = 0.19). Human candidates scored higher than ChatGPT on SBAs in all metrics measured.
Conclusion: ChatGPT performed better on true/false questions, reaching the pass mark in most instances. Performance was poorer on SBA questions, partly because ChatGPT tended to choose the first of the four answer options. Our results suggest that ChatGPT’s ability in information retrieval is better than its ability in knowledge integration.
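The abstract reports accuracy-style scores and precision for ChatGPT's true/false responses. A minimal sketch of how such metrics could be computed, assuming each statement is scored as a binary classification with scikit-learn; the ten labels below are hypothetical and stand in for the study's 1432 statements:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical data: 1 = "true", 0 = "false" for each MCQ statement.
ground_truth   = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
chatgpt_answer = [1, 0, 0, 1, 1, 0, 1, 0, 1, 0]

print("Accuracy :", accuracy_score(ground_truth, chatgpt_answer))   # share of statements answered correctly
print("Precision:", precision_score(ground_truth, chatgpt_answer))  # share of "true" calls that were correct
print("Recall   :", recall_score(ground_truth, chatgpt_answer))     # share of true statements identified
```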
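The per-topic comparison cites a Grubbs outlier test (p < .05) to single out Pathology as the strongest MCQ topic. SciPy has no built-in Grubbs test, so the sketch below hand-rolls the standard two-sided form; the function name and the per-topic scores are illustrative assumptions, not the paper's data:

```python
import numpy as np
from scipy import stats

def grubbs_test(values, alpha=0.05):
    """Two-sided Grubbs test for a single outlier (largest deviation from the mean)."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    mean, sd = x.mean(), x.std(ddof=1)
    g = np.max(np.abs(x - mean)) / sd  # Grubbs statistic
    # Critical value from the t-distribution with n-2 degrees of freedom
    t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
    return g, g_crit, g > g_crit

# Hypothetical per-topic MCQ scores (%); one topic clearly above the rest.
topic_scores = [64.1, 62.8, 66.0, 63.5, 78.9, 61.2, 65.4]
g, g_crit, is_outlier = grubbs_test(topic_scores)
print(f"G = {g:.2f}, critical value = {g_crit:.2f}, outlier detected: {is_outlier}")
```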

Related Results

Exploring Large Language Models Integration in the Histopathologic Diagnosis of Skin Diseases: A Comparative Study
Abstract Introduction The exact manner in which large language models (LLMs) will be integrated into pathology is not yet fully comprehended. This study examines the accuracy, bene...
Assessment of Chat-GPT, Gemini, and Perplexity in Principle of Research Publication: A Comparative Study
Abstract Introduction Many researchers utilize artificial intelligence (AI) to aid their research endeavors. This study seeks to assess and contrast the performance of three sophis...
CHATGPT ASSISTANCE ON BIOCHEMISTRY LEARNING OUTCOMES OF PRE-SERVICE TEACHERS
This research investigates the effect of ChatGPT on the learning outcomes of pre-service biology teachers. Sampling was done by purposive sampling in class A (treated with ChatGPT)...
Appearance of ChatGPT and English Study
The purpose of this study is to examine the definition and characteristics of ChatGPT in order to present the direction of self-directed learning to learners, and to explore the po...
User Intentions to Use ChatGPT for Self-Diagnosis and Health-Related Purposes: Cross-sectional Survey Study (Preprint)
BACKGROUND With the rapid advancement of artificial intelligence (AI) technologies, AI-powered chatbots, such as Chat Generative Pretrained Transformer (Cha...
Can ChatGPT-3.5 Pass a Medical Exam? A Systematic Review of ChatGPT's Performance in Academic Testing
OBJECTIVE We, therefore, aim to conduct a systematic review to assess the academic potential of ChatGPT-3.5, along with its strengths and limitations when giving medical exams. MET...
