Large Language Model Clinical Vignettes and Multiple-Choice Questions for Postgraduate Medical Education

Abstract

Problem: Clinical vignette–based multiple-choice questions (MCQs) have been used to assess postgraduate medical trainees but require substantial time and effort to develop. Large language models, a type of artificial intelligence (AI), can potentially expedite this task. This report describes prompt engineering techniques used with ChatGPT-4 to generate clinical vignettes and MCQs for obstetrics-gynecology residents and evaluates whether residents and attending physicians can differentiate between human- and AI-generated content.

Approach: The authors generated MCQs using a structured prompt engineering approach, incorporating authoritative source documents and an iterative prompt chaining technique, to refine output quality. Fifty human-generated and 50 AI-generated MCQs were randomly arranged into 10 quizzes (10 questions each). The AI-generated MCQs were developed in August 2024 and surveys conducted in September 2024. Obstetrics-gynecology residents and attending physician faculty members at Northwell Health or Donald and Barbara Zucker School of Medicine at Hofstra/Northwell completed an online survey, answering each MCQ and indicating whether they believed it was human or AI written or if they were uncertain.

Outcomes: Thirty-three participants (16 residents, 17 attendings) completed the survey (80.5% response rate). Respondents correctly identified MCQ authorship a median (interquartile range [IQR]) of 39.1% (30.0%–50.0%) of the time, indicating difficulty in distinguishing human- and AI-generated questions. The median (IQR) correct answer selection rate was 62.3% (50.0%–75.0%) for human-generated MCQs and 64.4% (50.0%–83.3%) for AI-generated MCQs (P = .74). The difficulty (0.69 vs 0.66, P = .83) and discriminatory (0.42 vs 0.38, P = .90) indexes showed no significant differences, supporting the feasibility of large language model–generated MCQs in medical education.
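
The abstract reports difficulty and discriminatory indexes but does not say how they were computed. As a minimal sketch only, assuming the conventional classical test theory definitions (difficulty as the proportion of respondents answering an item correctly, discrimination as the difference in that proportion between the highest- and lowest-scoring groups), the calculation might look like the following. The function names, the 27% group cutoff, and the toy data are all illustrative assumptions and are not taken from the study.

```python
from statistics import mean


def difficulty_index(item_scores):
    """Proportion of respondents who answered the item correctly (0 to 1).

    item_scores: list of 1 (correct) or 0 (incorrect), one entry per respondent.
    """
    return mean(item_scores)


def discrimination_index(item_scores, total_scores, fraction=0.27):
    """Upper-lower group discrimination: p(correct | top) - p(correct | bottom).

    total_scores: each respondent's overall quiz score, aligned with item_scores.
    fraction: share of respondents in each extreme group (0.27 is a common
              convention, assumed here; the study's choice is not stated).
    """
    n = len(item_scores)
    k = max(1, round(n * fraction))
    # Rank respondent positions from lowest to highest total score.
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]
    p_upper = mean(item_scores[i] for i in upper)
    p_lower = mean(item_scores[i] for i in lower)
    return p_upper - p_lower


if __name__ == "__main__":
    # Hypothetical toy data for one item and ten respondents (not study data).
    item = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]
    totals = [9, 8, 8, 3, 7, 2, 4, 6, 1, 5]
    print("difficulty:", difficulty_index(item))
    print("discrimination:", discrimination_index(item, totals))
```

Under these definitions, a difficulty index near 0.7 means roughly 70% of respondents answered the item correctly, and a higher discrimination index means the item better separates high scorers from low scorers.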
Next Steps: Future studies should explore the optimal balance between AI-generated content and expert review, identifying strategies to maximize efficiency without compromising accuracy. The authors will develop practice exams and assess their predictive validity by comparing scores with standardized exam results.

Oxford University Press (OUP)

Frank I. Jackson, Nathan A. Keller, Insaf Kouba, Wassil Kouba, Luis A. Bracero, Matthew J. Blitz

Academic Medicine

2025

