AI versus human-generated multiple-choice questions for medical education: a cohort study in a high-stakes examination
Abstract
Background
The creation of high-quality multiple-choice questions (MCQs) is essential for medical education assessments but is resource-intensive and time-consuming when done by human experts. Large language models (LLMs) like ChatGPT-4o offer a promising alternative, but their efficacy remains unclear, particularly in high-stakes exams.
Objective
This study aimed to evaluate the quality and psychometric properties of ChatGPT-4o-generated MCQs compared to human-created MCQs in a high-stakes medical licensing exam.
Methods
A prospective cohort study was conducted among medical doctors preparing for the Primary Examination on Emergency Medicine (PEEM) organised by the Hong Kong College of Emergency Medicine in August 2024. Participants attempted two sets of 100 MCQs—one AI-generated and one human-generated. Expert reviewers assessed MCQs for factual correctness, relevance, difficulty, alignment with Bloom’s taxonomy (remember, understand, apply and analyse), and item writing flaws. Psychometric analyses were performed, including difficulty and discrimination indices and KR-20 reliability. Candidate performance and time efficiency were also evaluated.
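The psychometric measures named above can be illustrated with a minimal sketch. The abstract does not give the formulas the authors used, so everything here is an assumption: the difficulty index is taken as the proportion of candidates answering an item correctly, the discrimination index uses the common upper/lower 27% group convention, and KR-20 uses the population variance of total scores. The `responses` matrix is made-up toy data, not study data.

```python
from statistics import pvariance

# Toy 0/1 response matrix (rows = candidates, columns = items).
# Hypothetical data for illustration only; not taken from the study.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]


def difficulty_index(matrix):
    """Proportion of candidates answering each item correctly (higher = easier)."""
    n_candidates = len(matrix)
    n_items = len(matrix[0])
    return [sum(row[i] for row in matrix) / n_candidates for i in range(n_items)]


def discrimination_index(matrix, fraction=0.27):
    """Upper-group minus lower-group proportion correct, per item.

    Assumes the upper/lower 27% convention; the study may have used
    a different method (e.g. a point-biserial correlation).
    """
    totals = [sum(row) for row in matrix]
    order = sorted(range(len(matrix)), key=lambda c: totals[c], reverse=True)
    group = max(1, round(len(matrix) * fraction))
    upper = [matrix[c] for c in order[:group]]
    lower = [matrix[c] for c in order[-group:]]
    n_items = len(matrix[0])
    return [
        (sum(r[i] for r in upper) - sum(r[i] for r in lower)) / group
        for i in range(n_items)
    ]


def kr20(matrix):
    """Kuder-Richardson 20 reliability for dichotomously scored items."""
    n_items = len(matrix[0])
    p = difficulty_index(matrix)
    total_variance = pvariance([sum(row) for row in matrix])
    if total_variance == 0:
        raise ValueError("KR-20 is undefined when all total scores are equal")
    item_variance = sum(pi * (1 - pi) for pi in p)
    return (n_items / (n_items - 1)) * (1 - item_variance / total_variance)


if __name__ == "__main__":
    print("difficulty:", difficulty_index(responses))
    print("discrimination:", discrimination_index(responses))
    print("KR-20:", round(kr20(responses), 3))
```

With only 24 candidates, as in this study, item-level indices computed this way rest on very small upper and lower groups, so individual values should be read with caution.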
Results
Among 24 participants, AI-generated MCQs were easier (mean difficulty index = 0.78 ± 0.22 vs. 0.69 ± 0.23, p < 0.01) but showed similar discrimination indices to human MCQs (mean = 0.22 ± 0.23 vs. 0.26 ± 0.26). Agreement was moderate (ICC = 0.62, p = 0.01, 95% CI: 0.12–0.84). Expert reviews identified more factual inaccuracies (6% vs. 4%), irrelevance (6% vs. 0%), and inappropriate difficulty levels (14% vs. 1%) in AI MCQs. AI questions primarily tested lower-order cognitive skills, while human MCQs better assessed higher-order skills (χ² = 14.27, p = 0.003). AI significantly reduced time spent on question generation (24.5 vs. 96 person-hours).
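As a quick check on the time-efficiency figures above, the relative saving implied by the two reported totals can be worked out directly. The percentage and ratio below are derived here from the reported person-hours and are not stated in the abstract.

```latex
\[
1 - \frac{24.5}{96} \approx 0.745,
\qquad
\frac{96}{24.5} \approx 3.9
\]
```

That is, question generation with AI took roughly 74.5% fewer person-hours, or about a quarter of the human effort.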
Conclusion
ChatGPT-4o demonstrates the potential for efficiently generating MCQs but lacks the depth needed for complex assessments. Human review remains essential to ensure quality. Combining AI efficiency with expert oversight could optimise question creation for high-stakes exams, offering a scalable model for medical education that balances time efficiency and content quality.
Springer Science and Business Media LLC

