Influence of Model Evolution and System Roles on ChatGPT’s Performance in Chinese Medical Licensing Exams: Comparative Study
Abstract
Background
With the increasing application of large language models such as ChatGPT in various industries, their potential in the medical domain, especially in standardized examinations, has become a focal point of research.
Objective
The aim of this study is to assess the clinical performance of ChatGPT, focusing on its accuracy and reliability in the Chinese National Medical Licensing Examination (CNMLE).
Methods
The CNMLE 2022 question set, consisting of 500 single-answer multiple-choice questions, was reclassified into 15 medical subspecialties. Each question was tested 8 to 12 times in Chinese on the OpenAI platform from April 24 to May 15, 2023. Three key factors were considered: the model version (GPT-3.5 vs GPT-4.0), the prompt’s designation of system roles tailored to medical subspecialties, and repetition for coherence. The passing accuracy threshold was set at 60%. χ² tests and κ values were used to evaluate the model’s accuracy and consistency.
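The consistency measure described above can be illustrated with a minimal sketch. It assumes Cohen's κ computed between two repeated runs of the same questions; the abstract does not state which κ variant or pairing the authors used, and the answer key and responses below are toy data, not study data. The function names `accuracy` and `cohen_kappa` are illustrative.

```python
from collections import Counter

PASS_THRESHOLD = 0.60  # passing accuracy threshold stated in the Methods


def accuracy(responses, answer_key):
    """Fraction of responses that match the answer key."""
    correct = sum(r == k for r, k in zip(responses, answer_key))
    return correct / len(answer_key)


def cohen_kappa(run_a, run_b):
    """Cohen's kappa for agreement between two runs over the same items.

    Assumption: the study may have used a different kappa variant.
    """
    n = len(run_a)
    # Observed agreement: share of items answered identically in both runs.
    observed = sum(a == b for a, b in zip(run_a, run_b)) / n
    # Chance agreement from each run's marginal label frequencies.
    freq_a, freq_b = Counter(run_a), Counter(run_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both runs gave one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy data (hypothetical, for illustration only).
answer_key = ["A", "C", "B", "D", "E", "A", "B", "C"]
run_1 = ["A", "C", "B", "D", "A", "A", "B", "D"]
run_2 = ["A", "C", "B", "B", "A", "A", "B", "C"]

acc_1 = accuracy(run_1, answer_key)
kappa = cohen_kappa(run_1, run_2)
passed = acc_1 >= PASS_THRESHOLD
print(f"accuracy={acc_1:.3f} passed={passed} kappa={kappa:.3f}")
```

κ corrects raw agreement for the agreement expected by chance, which is why it is reported alongside the variability rate of repeated responses.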
Results
GPT-4.0 achieved a passing accuracy of 72.7%, which was significantly higher than that of GPT-3.5 (54%; P<.001). The variability rate of repeated responses from GPT-4.0 was lower than that of GPT-3.5 (9% vs 19.5%; P<.001). However, both models showed relatively good response coherence, with κ values of 0.778 and 0.610, respectively. System roles numerically increased accuracy for both GPT-4.0 (0.3%‐3.7%) and GPT-3.5 (1.3%‐4.5%), and reduced variability by 1.7% and 1.8%, respectively (P>.05). In subgroup analysis, ChatGPT achieved comparable accuracy among different question types (P>.05). GPT-4.0 surpassed the accuracy threshold in 14 of 15 subspecialties, while GPT-3.5 did so in 7 of 15 on the first response.
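A comparison of two accuracy rates such as the one reported above can be sketched as a Pearson χ² test on a 2×2 table. This is a sketch under stated assumptions: the counts are hypothetical (100 responses per model, chosen only to resemble the reported percentages), no continuity correction is applied, and the abstract does not specify the exact test configuration the authors used. The function name `chi_square_2x2` is illustrative.

```python
import math


def chi_square_2x2(correct_1, wrong_1, correct_2, wrong_2):
    """Pearson chi-square statistic and P value for a 2x2 table (df = 1).

    No continuity correction is applied (an assumption of this sketch).
    """
    a, b, c, d = correct_1, wrong_1, correct_2, wrong_2
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the upper-tail probability of the
    # chi-square distribution equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts, NOT taken from the study.
chi2, p_value = chi_square_2x2(73, 27, 54, 46)
print(f"chi2={chi2:.3f} P={p_value:.4f}")
```

With the much larger number of responses in the study (500 questions, each repeated 8 to 12 times), the same difference in proportions would produce a far smaller P value than this toy table does.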
Conclusions
GPT-4.0 passed the CNMLE and outperformed GPT-3.5 in key areas such as accuracy, consistency, and medical subspecialty expertise. Adding a system role yielded small, statistically nonsignificant improvements in the model’s reliability and answer coherence. GPT-4.0 showed promising potential in medical education and clinical practice, meriting further study.
Related Results
Exploring Large Language Models Integration in the Histopathologic Diagnosis of Skin Diseases: A Comparative Study
Abstract
Introduction
The exact manner in which large language models (LLMs) will be integrated into pathology is not yet fully comprehended. This study examines the accuracy, bene...
Assessment of Chat-GPT, Gemini, and Perplexity in Principle of Research Publication: A Comparative Study
Abstract
Introduction
Many researchers utilize artificial intelligence (AI) to aid their research endeavors. This study seeks to assess and contrast the performance of three sophis...
Unlocking Educational Potential: Exploring Students’ Satisfaction and Sustainable Engagement with ChatGPT Using the ECM Model
Aim/Purpose: The main goal of this study is to investigate the factors affecting students’ satisfaction and continuous usage of ChatGPT in an educational context, using the Expecta...
Primerjalna književnost na prelomu tisočletja
In a comprehensive and at times critical manner, this volume seeks to shed light on the development of events in Western (i.e., European and North American) comparative literature ...
ChatGPT's Capabilities for Use in Anatomy Education and Anatomy Research
Dear Editors,
Recently, the discussion of an artificial intelligence (AI) - fueled platform in several articles in your journal has attracted the attention of many researchers [1, ...
Exploring the implementation of public involvement in local alcohol availability policy: the case of alcohol licensing decision‐making in England
Abstract
Background and Aims
In 2003, the UK government passed the Licensing Act for England and Wales. The Act provides a framework for regulating alcohol sale, including four licen...
ChatGPT and Medical Education: A Double-Edged Sword
ChatGPT has gained attention worldwide. In the medical education field, ChatGPT, or any similar large language model, provides a convenient way for students to access information a...
Appearance of ChatGPT and English Study
The purpose of this study is to examine the definition and characteristics of ChatGPT in order to present the direction of self-directed learning to learners, and to explore the po...

