
Influence of Model Evolution and System Roles on ChatGPT’s Performance in Chinese Medical Licensing Exams: Comparative Study

Abstract

Background: With the increasing application of large language models such as ChatGPT in various industries, their potential in the medical domain, especially in standardized examinations, has become a focal point of research.

Objective: This study aims to assess the clinical performance of ChatGPT, focusing on its accuracy and reliability on the Chinese National Medical Licensing Examination (CNMLE).

Methods: The CNMLE 2022 question set, consisting of 500 single-answer multiple-choice questions, was reclassified into 15 medical subspecialties. Each question was tested 8 to 12 times in Chinese on the OpenAI platform from April 24 to May 15, 2023. Three key factors were considered: the model version (GPT-3.5 vs GPT-4.0), the prompt's designation of system roles tailored to medical subspecialties, and repetition to assess coherence. The passing accuracy threshold was set at 60%. χ2 tests and κ values were used to evaluate the model's accuracy and consistency.

Results: GPT-4.0 achieved a passing accuracy of 72.7%, significantly higher than that of GPT-3.5 (54%; P<.001). The variability rate of repeated responses from GPT-4.0 was lower than that of GPT-3.5 (9% vs 19.5%; P<.001). However, both models showed relatively good response coherence, with κ values of 0.778 and 0.610, respectively. System roles numerically increased accuracy for both GPT-4.0 (0.3% to 3.7%) and GPT-3.5 (1.3% to 4.5%) and reduced variability by 1.7% and 1.8%, respectively, but these differences were not statistically significant (P>.05). In subgroup analysis, ChatGPT achieved comparable accuracy across question types (P>.05). On the first response, GPT-4.0 surpassed the accuracy threshold in 14 of 15 subspecialties, whereas GPT-3.5 did so in 7 of 15.

Conclusions: GPT-4.0 passed the CNMLE and outperformed GPT-3.5 in key areas such as accuracy, consistency, and medical subspecialty expertise. Adding a system role produced a small, statistically nonsignificant improvement in the model's reliability and answer coherence. GPT-4.0 showed promising potential in medical education and clinical practice, meriting further study.
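As an illustration of the two statistics named in the Methods, the following is a minimal Python sketch of a Pearson χ2 test on a 2×2 accuracy table and of Cohen's κ between two repeated answer runs. It is not the authors' analysis code. The function names, the counts, and the answer lists are hypothetical, and the abstract does not state which κ variant was used, so Cohen's κ for two runs is an assumption.

```python
from collections import Counter


def chi_square_2x2(correct_a, total_a, correct_b, total_b):
    """Pearson chi-square statistic (no continuity correction) for
    correct/incorrect counts of two models (hypothetical helper)."""
    observed = [
        [correct_a, total_a - correct_a],
        [correct_b, total_b - correct_b],
    ]
    grand = total_a + total_b
    row_sums = [total_a, total_b]
    col_sums = [correct_a + correct_b, grand - correct_a - correct_b]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / grand
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat


def cohen_kappa(run_1, run_2):
    """Cohen's kappa between two answer runs over the same questions
    (hypothetical helper; assumes the runs are not both constant)."""
    n = len(run_1)
    # Observed agreement: share of questions with identical answers.
    p_observed = sum(a == b for a, b in zip(run_1, run_2)) / n
    # Chance agreement from each run's marginal answer frequencies.
    counts_1, counts_2 = Counter(run_1), Counter(run_2)
    p_expected = sum(counts_1[k] * counts_2[k] for k in counts_1) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy numbers for illustration only, NOT data from the study.
stat = chi_square_2x2(70, 100, 50, 100)
run_1 = ["A", "B", "C", "D", "A", "B", "C", "D"]
run_2 = ["A", "B", "C", "D", "A", "B", "C", "A"]
kappa = cohen_kappa(run_1, run_2)
print(f"chi-square = {stat:.3f}, kappa = {kappa:.3f}")
```

With one degree of freedom, a χ2 statistic above 3.841 corresponds to P<.05. For more than two repeated runs per question, a multi-rater statistic such as Fleiss' κ may be the more natural choice.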

JMIR Publications Inc.

Shuai Ming Qingge Guo Wenjun Cheng Bo Lei

JMIR Medical Education

2024



