Assessment of Artificial Intelligence Chatbot Performance on the Canadian Otolaryngology and Head and Neck Surgery In-Training Exam: Insights from a Comparative Analysis (Preprint)
BACKGROUND
The introduction of large language models (LLMs) has rapidly transformed the field of healthcare. Their performance, often compared to that of physicians, has been closely scrutinized. ChatGPT-4, a fine-tuned supervised model, offers improved reasoning capabilities and visual input analysis.
OBJECTIVE
The purpose of this study is to evaluate the performance of ChatGPT-4 in the field of otolaryngology and head and neck surgery (OTOHNS) residency training.
METHODS
A total of 351 questions from the OTOHNS National In-Training Exam (NITE) for 2022 and 2023 were submitted to ChatGPT-4 from April 22 to May 12, 2024, using a new account. A new session was used for every question, except for follow-up questions. Answers were independently graded by two reviewers using the official grading rubric, and the average of the two scores was used. Cohen’s kappa coefficient was used to assess inter-rater reliability. Anonymized mean exam results from residents who had previously taken this exam were obtained from the lead faculty of the NITE. The sample size was calculated based on the total number of enrolled residents, as indicated on each university’s program website. Z-tests were used to compare ChatGPT-4’s performance to that of residents per sub-specialty and training level. The questions were categorized by type (image or text), task (diagnosis, additional exams, treatment or guidelines), sub-specialty, taxonomic level and prompt length. A one-way ANOVA, an independent t-test and a two-tailed Pearson correlation were used to examine variations between question categories. Analyses were performed in IBM SPSS 29.
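The two statistics named above, Cohen’s kappa and the Z-test, can be illustrated with a minimal Python sketch. This is not the study’s SPSS procedure: the function names, the categorical grade labels and the one-sample form of the Z-test (one score compared with a group mean, using the group’s standard deviation and size) are assumptions made for illustration, and all numbers in the usage note are hypothetical.

```python
from collections import Counter
from math import sqrt
from statistics import NormalDist


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items (hypothetical helper)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must grade the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (p_observed - p_chance) / (1 - p_chance)


def one_sample_z(score, group_mean, group_sd, group_size):
    """z statistic and two-sided p-value for one score against a group mean.

    Assumes the one-sample form z = (score - mean) / (sd / sqrt(n)).
    """
    z = (score - group_mean) / (group_sd / sqrt(group_size))
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided
```

As a usage example with made-up data, `cohens_kappa(["c", "c", "i", "i"], ["c", "c", "i", "c"])` gives 0.5 (75% observed agreement against 50% expected by chance), and `one_sample_z(66, 60, 10, 25)` gives a z statistic of 3.0.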
RESULTS
ChatGPT-4 scored 66.19% and 64.84% on the 2022 and 2023 exams, respectively. Inter-rater reliability between the two raters was 89.8% (standard error 0.018, P<.001). ChatGPT-4 outperformed the residents on both exams, amongst all training levels and within all sub-specialties except for the general/pediatrics section of the 2023 exam (Z-score -2.37). The performance gap narrowed with increasing residency training, as shown by the following Z-scores: PGY-2 16.08, PGY-3 9.31, PGY-4 3.49 in 2022 and PGY-2 15.57, PGY-3 8.60, PGY-4 3.21 in 2023. For the 2022 exam, ChatGPT-4 would rank in the 99th percentile amongst PGY-2, 95th percentile amongst PGY-3 and 73rd percentile amongst PGY-4 classmates. For the 2023 exam, it would rank in the 99th percentile amongst PGY-2, 94th percentile amongst PGY-3 and 71st percentile amongst PGY-4 classmates. ChatGPT-4 performed best on text-based questions (74.3%, P<.001), level one taxonomic questions (75.1%, P<.001) and guideline-based questions (70%, P=.048). Performance did not differ significantly by sub-specialty (P=.364) or prompt length (P=.385).
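A percentile rank like those reported above can be approximated from a cohort’s mean and standard deviation if scores are assumed to be roughly normally distributed. The sketch below shows only that generic calculation; the abstract does not state how the percentiles were derived, and the function name and example values are hypothetical.

```python
from statistics import NormalDist


def percentile_rank(score, cohort_mean, cohort_sd):
    """Percentile of `score` within a cohort, assuming normally distributed scores."""
    return 100 * NormalDist(mu=cohort_mean, sigma=cohort_sd).cdf(score)


# Hypothetical cohort: mean 60%, standard deviation 10%.
print(round(percentile_rank(70, 60, 10)))  # one SD above the mean
```

Note that a percentile rank depends on the cohort’s standard deviation, whereas a Z-test statistic depends on the standard error of the cohort mean, which shrinks as the cohort grows. A large test statistic can therefore coexist with a more moderate percentile rank.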
CONCLUSIONS
ChatGPT-4 not only achieved passing grades on two versions of the Canadian OTOHNS NITE, but it also outperformed residents at all training levels, underscoring a critical need to redesign residency assessment methods.
CLINICALTRIAL
N/A

