Assessment of Artificial Intelligence Chatbot Performance on the Canadian Otolaryngology and Head and Neck Surgery In-Training Exam: Insights from a Comparative Analysis (Preprint)
BACKGROUND
The introduction of large language models (LLMs) has rapidly transformed healthcare, and their performance, often compared with that of physicians, has been closely scrutinized. ChatGPT-4, a fine-tuned, supervised model, offers improved reasoning capabilities and can analyze visual input.
OBJECTIVE
The purpose of this study is to evaluate the performance of ChatGPT-4 in the context of otolaryngology and head and neck surgery (OTOHNS) residency training.
METHODS
A total of 351 questions from the 2022 and 2023 OTOHNS National In-Training Exams (NITE) were submitted to ChatGPT-4 between April 22 and May 12, 2024, using a new account. A new session was opened for every question, except for follow-up questions. Answers were independently graded by two reviewers using the official grading rubric, and the average score was used. Cohen's kappa coefficient was used to assess inter-rater reliability. Anonymized mean exam results from residents who had previously taken the exam were obtained from the lead faculty of the NITE. The sample size was calculated from the total number of enrolled residents, as indicated on each university's program website. Z-tests were used to compare ChatGPT-4's performance to that of residents by sub-specialty and training level. Questions were categorized by type (image or text), task (diagnosis, additional exams, treatment, or guidelines), sub-specialty, taxonomic level, and prompt length. One-way ANOVA, independent t-tests, and two-tailed Pearson correlations were used to examine variation across question categories. All analyses were performed in IBM SPSS 29.
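As a rough illustration of the two comparison statistics named above, the following Python sketch shows one plausible implementation; the grade labels, cohort mean, standard deviation, and sample size are invented placeholders, not the study's data.

```python
# Illustrative sketch only: all data below are placeholders, not the study's values.
from math import sqrt
from sklearn.metrics import cohen_kappa_score

# Inter-rater agreement on categorical rubric grades (hypothetical labels)
rater_a = ["full", "partial", "none", "full", "full"]
rater_b = ["full", "partial", "none", "full", "partial"]
kappa = cohen_kappa_score(rater_a, rater_b)

def one_sample_z(score, cohort_mean, cohort_sd, n):
    """z statistic for one score against a cohort mean, using the standard
    error of the mean (sd / sqrt(n)); one plausible reading of the
    per-level comparisons described above."""
    return (score - cohort_mean) / (cohort_sd / sqrt(n))

print(f"kappa = {kappa:.3f}")
print(f"z = {one_sample_z(66.19, 52.0, 6.0, 30):.2f}")
```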
RESULTS
ChatGPT-4 scored 66.19% and 64.84% on the 2022 and 2023 exams, respectively. Inter-rater agreement was 89.8% (standard error 0.018, P<.001). ChatGPT-4 outperformed the residents on both exams, across all training levels and within all sub-specialties except the general/pediatrics section of the 2023 exam (z = -2.37). The performance gap narrowed with increasing training level (z-scores: PGY-2, 16.08; PGY-3, 9.31; PGY-4, 3.49 in 2022; PGY-2, 15.57; PGY-3, 8.60; PGY-4, 3.21 in 2023). On the 2022 exam, ChatGPT-4 would rank in the 99th percentile among PGY-2, the 95th percentile among PGY-3, and the 73rd percentile among PGY-4 classmates; on the 2023 exam, in the 99th, 94th, and 71st percentiles, respectively. ChatGPT-4 performed best on text-based questions (74.3%, P<.001), taxonomic level one questions (75.1%, P<.001), and guideline-based questions (70%, P=.048). Performance did not differ significantly by sub-specialty (P=.364) or prompt length (P=.385).
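To make the percentile rankings concrete: a score's percentile within a cohort can be read off the normal CDF of its standardized distance from the cohort mean. A minimal sketch, assuming scipy and an invented cohort mean and SD chosen only to illustrate the arithmetic:

```python
# Illustrative only: the cohort mean and SD below are placeholders.
from scipy.stats import norm

def percentile_rank(score, cohort_mean, cohort_sd):
    """Percentile of `score` in a cohort assumed to be ~ N(mean, sd)."""
    return 100 * norm.cdf((score - cohort_mean) / cohort_sd)

# e.g., a score of 66.19 against a hypothetical cohort (mean 62, SD 7)
# lands near the 73rd percentile
print(f"{percentile_rank(66.19, 62.0, 7.0):.0f}th percentile")
```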
CONCLUSIONS
ChatGPT-4 not only achieved passing grades on two versions of the Canadian OTOHNS NITE but also outperformed residents by a substantial margin, underscoring a critical need to redesign residency assessment methods.
CLINICALTRIAL
N/A