AI versus human-generated multiple-choice questions for medical education: a cohort study in a high-stakes examination

Abstract

Background: The creation of high-quality multiple-choice questions (MCQs) is essential for medical education assessments but is resource-intensive and time-consuming when done by human experts. Large language models (LLMs) such as ChatGPT-4o offer a promising alternative, but their efficacy remains unclear, particularly in high-stakes exams.

Objective: This study aimed to evaluate the quality and psychometric properties of ChatGPT-4o-generated MCQs compared to human-created MCQs in a high-stakes medical licensing exam.

Methods: A prospective cohort study was conducted among medical doctors preparing for the Primary Examination on Emergency Medicine (PEEM) organised by the Hong Kong College of Emergency Medicine in August 2024. Participants attempted two sets of 100 MCQs, one AI-generated and one human-generated. Expert reviewers assessed MCQs for factual correctness, relevance, difficulty, alignment with Bloom’s taxonomy (remember, understand, apply and analyse), and item writing flaws. Psychometric analyses were performed, including difficulty and discrimination indices and KR-20 reliability. Candidate performance and time efficiency were also evaluated.

Results: Among 24 participants, AI-generated MCQs were easier (mean difficulty index = 0.78 ± 0.22 vs. 0.69 ± 0.23, p < 0.01) but showed similar discrimination indices to human MCQs (mean = 0.22 ± 0.23 vs. 0.26 ± 0.26). Agreement was moderate (ICC = 0.62, p = 0.01, 95% CI: 0.12–0.84). Expert reviews identified more factual inaccuracies (6% vs. 4%), irrelevance (6% vs. 0%), and inappropriate difficulty levels (14% vs. 1%) in AI MCQs. AI questions primarily tested lower-order cognitive skills, while human MCQs better assessed higher-order skills (χ² = 14.27, p = 0.003). AI significantly reduced time spent on question generation (24.5 vs. 96 person-hours).

Conclusion: ChatGPT-4o demonstrates the potential for efficiently generating MCQs but lacks the depth needed for complex assessments. Human review remains essential to ensure quality. Combining AI efficiency with expert oversight could optimise question creation for high-stakes exams, offering a scalable model for medical education that balances time efficiency and content quality.
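The three psychometric quantities named in the Methods (difficulty index, discrimination index, KR-20 reliability) can be illustrated with a minimal sketch. This is not the authors' analysis code: the abstract does not say how the discrimination index was computed, so the upper/lower-group method with a 27% cut is an assumption, as is the use of population variance in KR-20. The function names and the toy response matrix are hypothetical.

```python
from statistics import pvariance


def difficulty_index(item_scores):
    """Proportion of candidates answering the item correctly (higher = easier)."""
    return sum(item_scores) / len(item_scores)


def discrimination_index(responses, item, fraction=0.27):
    """Upper-minus-lower group difference in proportion correct for one item.

    Assumption: upper/lower groups are the top and bottom `fraction` of
    candidates ranked by total score; other definitions exist.
    """
    totals = [sum(row) for row in responses]
    order = sorted(range(len(responses)), key=lambda i: totals[i], reverse=True)
    k = max(1, round(len(responses) * fraction))
    upper, lower = order[:k], order[-k:]
    p_upper = sum(responses[i][item] for i in upper) / k
    p_lower = sum(responses[i][item] for i in lower) / k
    return p_upper - p_lower


def kr20(responses):
    """Kuder-Richardson 20 reliability for dichotomously scored (0/1) items.

    Assumption: population variance of total scores is used. Undefined when
    all candidates have the same total score (zero variance).
    """
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    var_total = pvariance(totals)
    sum_pq = 0.0
    for j in range(n_items):
        p = difficulty_index([row[j] for row in responses])
        sum_pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - sum_pq / var_total)


# Hypothetical toy data: 5 candidates (rows) x 4 items (columns), 1 = correct.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]

item0_difficulty = difficulty_index([row[0] for row in responses])
item0_discrimination = discrimination_index(responses, 0)
item1_discrimination = discrimination_index(responses, 1)
reliability = kr20(responses)
```

In the toy data every candidate answers the first item correctly, so it is maximally easy and cannot separate strong from weak candidates. That is the pattern the difficulty and discrimination indices are meant to expose.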

Springer Science and Business Media LLC

Alex KK Law, Jerome So, Chun Tat Lui, Yu Fai Choi, Koon Ho Cheung, Kevin Kei-ching Hung, Colin Alexander Graham

BMC Medical Education

2025


