
Large Language Model Clinical Vignettes and Multiple-Choice Questions for Postgraduate Medical Education

Abstract

Problem
Clinical vignette–based multiple-choice questions (MCQs) have been used to assess postgraduate medical trainees but require substantial time and effort to develop. Large language models, a type of artificial intelligence (AI), can potentially expedite this task. This report describes prompt engineering techniques used with ChatGPT-4 to generate clinical vignettes and MCQs for obstetrics-gynecology residents and evaluates whether residents and attending physicians can differentiate between human- and AI-generated content.

Approach
The authors generated MCQs using a structured prompt engineering approach that incorporated authoritative source documents and an iterative prompt chaining technique to refine output quality. Fifty human-generated and 50 AI-generated MCQs were randomly arranged into 10 quizzes (10 questions each). The AI-generated MCQs were developed in August 2024, and the surveys were conducted in September 2024. Obstetrics-gynecology residents and attending physician faculty members at Northwell Health or Donald and Barbara Zucker School of Medicine at Hofstra/Northwell completed an online survey, answering each MCQ and indicating whether they believed it was human- or AI-written or whether they were uncertain.

Outcomes
Thirty-three participants (16 residents, 17 attendings) completed the survey (80.5% response rate). Respondents correctly identified MCQ authorship a median (interquartile range [IQR]) of 39.1% (30.0%–50.0%) of the time, indicating difficulty in distinguishing human- and AI-generated questions. The median (IQR) correct answer selection rate was 62.3% (50.0%–75.0%) for human-generated MCQs and 64.4% (50.0%–83.3%) for AI-generated MCQs (P = .74). The difficulty (0.69 vs 0.66, P = .83) and discriminatory (0.42 vs 0.38, P = .90) indexes showed no significant differences, supporting the feasibility of large language model–generated MCQs in medical education.

Next Steps
Future studies should explore the optimal balance between AI-generated content and expert review, identifying strategies to maximize efficiency without compromising accuracy. The authors will develop practice exams and assess their predictive validity by comparing scores with standardized exam results.
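
The abstract reports difficulty and discriminatory indexes without saying how they were computed. The sketch below shows one common classical item-analysis convention, offered as an assumption rather than the authors' method: difficulty is the proportion of respondents answering an item correctly, and discrimination is the difference in that proportion between the top and bottom 27% of respondents ranked by total score. The function names, the 27% cutoff, and the response matrix are all hypothetical and for illustration only.

```python
from typing import List, Sequence, Tuple


def difficulty_index(item_correct: Sequence[int]) -> float:
    """Proportion of respondents who answered the item correctly (1) vs not (0)."""
    return sum(item_correct) / len(item_correct)


def discrimination_index(
    item_correct: Sequence[int],
    total_scores: Sequence[int],
    fraction: float = 0.27,  # assumed upper/lower group size; not from the abstract
) -> float:
    """Upper-lower discrimination: p(correct | top group) - p(correct | bottom group)."""
    n = len(total_scores)
    k = max(1, round(n * fraction))
    # Rank respondents by total score, lowest first (ties keep input order).
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item_correct[i] for i in upper) / k
    p_lower = sum(item_correct[i] for i in lower) / k
    return p_upper - p_lower


def item_analysis(responses: List[List[int]]) -> List[Tuple[float, float]]:
    """Return (difficulty, discrimination) for each item (column) of a 0/1 matrix."""
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    results = []
    for j in range(n_items):
        column = [row[j] for row in responses]
        results.append(
            (difficulty_index(column), discrimination_index(column, totals))
        )
    return results


# Hypothetical data: 6 respondents (rows) x 4 items (columns), 1 = correct.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

for item, (p, d) in enumerate(item_analysis(responses), start=1):
    print(f"item {item}: difficulty={p:.2f}, discrimination={d:.2f}")
```

Under this convention, a difficulty near 0.66 to 0.69 means roughly two thirds of respondents answered correctly. Other definitions, such as the point-biserial correlation for discrimination, are also in use and would give different values.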

Related Results

Hubungan Perilaku Pola Makan dengan Kejadian Anak Obesitas
Učinak poučavanja razrednomu jeziku u izobrazbi nastavnika njemačkoga
The actual use of classroom language is principally limited to the classroom environment. As far as foreign language learning is concerned, the classroom often turns out to be the ...
1 Osler and the fellowship of postgraduate medicine
Abstract Sir William Osler’s legacy lives on through the Fellowship of Postgraduate Medicine (FPM). Osler was in 1911 founding President both of the Postgraduate Med...
THE CONTRIBUTION OF THE FELLOWSHIP OF POSTGRADUATE MEDICINE TO MEDICINE 1919–2025
Abstract Sir William Osler Bt, considered by many the greatest figure in the medical world at the time, arrived in Britain in 1905 when he was appointed Regius Pr...
Autonomy on Trial
Abstract This paper critically examines how US bioethics and health law conceptualize patient autonomy, contrasting the rights-based, individualist...
Exploring Large Language Models Integration in the Histopathologic Diagnosis of Skin Diseases: A Comparative Study
Abstract Introduction The exact manner in which large language models (LLMs) will be integrated into pathology is not yet fully comprehended. This study examines the accuracy, bene...
Assessment of Chat-GPT, Gemini, and Perplexity in Principle of Research Publication: A Comparative Study
Abstract Introduction Many researchers utilize artificial intelligence (AI) to aid their research endeavors. This study seeks to assess and contrast the performance of three sophis...
