
Model Evolution and System Roles Influence the Performance of ChatGPT on Chinese Medical Licensing Exams: A Comparative Study (Preprint)

BACKGROUND With the increasing application of large language models (LLMs) such as ChatGPT in various industries, their potential in the medical domain, especially in standardized examinations, has become a focal point of research.

OBJECTIVE To assess the clinical performance of ChatGPT, focusing on its accuracy and reliability on the Chinese National Medical Licensing Examination (CNMLE).

METHODS The CNMLE 2022 question set, consisting of 500 single-answer multiple-choice questions, was reclassified into 15 medical sub-specialties. Each question was tested 8 to 12 times in Chinese on the OpenAI platform from April 24 to May 15, 2023. Three key factors were considered: the model version (GPT-3.5 or GPT-4.0), the use of prompts designating system roles tailored to medical sub-specialties, and repetition to assess coherence. The passing accuracy threshold was set at 60%. χ² tests and Kappa values were used to evaluate the model's accuracy and consistency.

RESULTS GPT-4.0 achieved a passing accuracy of 71.0% to 74.7%, significantly higher than that of GPT-3.5 (50.3% to 54.8%, P < 0.001). Both models showed relatively high coherence between the initial and second responses, with Kappa values of 0.778 and 0.610. System roles boosted accuracy for both GPT-4.0 (by 0.3% to 3.7%) and GPT-3.5 (by 1.3% to 4.5%), and increased Kappa by 0.023 and 0.035, respectively. In the multi-specialty analysis, GPT-4.0 passed the threshold in 14 of 15 sub-specialties, while GPT-3.5 did so in 7 of 15 on the first response.

CONCLUSIONS GPT-4.0 passed the CNMLE and outperformed GPT-3.5 in key areas such as accuracy, consistency, and medical sub-specialty expertise. Adding a system role enhanced the model's reliability and answer coherence. GPT-4.0 showed promising potential in medical education and clinical practice, meriting further study.
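The abstract names two statistics, the χ² test for comparing accuracy and Kappa for agreement between repeated responses, without giving formulas. The following is a minimal sketch of how such quantities are commonly computed, not the authors' analysis code. It assumes Cohen's kappa over per-question labels and an uncorrected Pearson χ² test on a 2×2 table of correct/incorrect counts. The function names and all numbers are hypothetical toy values, not data from the study.

```python
import math

PASS_THRESHOLD = 0.60  # passing accuracy stated in the abstract


def cohen_kappa(first, second):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(first) != len(second) or not first:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(first)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(first, second)) / n
    # Chance agreement: product of marginal label frequencies, summed over labels.
    labels = set(first) | set(second)
    expected = sum((first.count(l) / n) * (second.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both sequences use a single identical label
    return (observed - expected) / (1 - expected)


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column must have a non-zero total")
    stat = n * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical toy data: True means the answer was correct.
first_response = [True, True, False, True, False, False, True, True]
second_response = [True, False, False, True, False, True, True, True]
kappa = cohen_kappa(first_response, second_response)

# Hypothetical counts: model A 70 correct / 30 wrong, model B 50 correct / 50 wrong.
stat, p_value = chi_square_2x2(70, 30, 50, 50)

accuracy = sum(first_response) / len(first_response)
passed = accuracy >= PASS_THRESHOLD
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is a stricter consistency measure than simply counting identical answers across repetitions.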

JMIR Publications Inc.

Shuai Ming, Qingge Guo, Wenjun Cheng, Bo Lei

2023



