Exceptional Performance of DeepSeek on Pediatric Board Examination Preparation Questions (Preprint)
BACKGROUND
The integration of artificial intelligence into medical education raises questions about the capabilities of large language models (LLMs) in specialized medical knowledge domains. Few studies have evaluated AI performance on standardized pediatric assessments.
OBJECTIVE
To evaluate and compare the performance of three leading LLMs on pediatric board examination preparation questions and contextualize their performance against human physician benchmarks.
METHODS
We conducted a comparative analysis of DeepSeek 7B v2.5, ChatGPT-4, and ChatGPT-4.5 using 266 multiple-choice questions from the 2023 PREP® Self-Assessment (American Academy of Pediatrics). Each model was presented with identical questions covering the full spectrum of pediatric knowledge domains. Performance was measured by calculating the percentage of correct responses and compared to published first-time pass rates for the American Board of Pediatrics (ABP) examination.
RESULTS
DeepSeek exhibited the highest accuracy at 98.12% (261/266 correct responses), exceeding typical human performance metrics. ChatGPT-4.5 achieved 96.6% accuracy (257/266), performing at the upper threshold of human performance. ChatGPT-4 demonstrated 82.7% accuracy (220/266), comparable to the lower range of human pass rates. Error pattern analysis revealed that AI models most commonly struggled with questions requiring integration of complex clinical presentations with rare disease knowledge.
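The reported percentages follow directly from the correct-answer counts. The sketch below recomputes them in Python. The 95% Wilson score interval is an illustrative addition that is not part of the abstract, and the function names are hypothetical; it is included only to show how much uncertainty a 266-question sample carries.

```python
import math

# Correct-answer counts reported in the abstract, each out of 266 questions.
TOTAL_QUESTIONS = 266
CORRECT = {"DeepSeek": 261, "ChatGPT-4.5": 257, "ChatGPT-4": 220}


def accuracy_pct(correct: int, total: int) -> float:
    """Percentage of correct responses, as described in METHODS."""
    return 100.0 * correct / total


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion.

    Assumption: this interval is not reported in the abstract; z = 1.96
    corresponds to an approximate 95% confidence level.
    """
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


for model, n_correct in CORRECT.items():
    low, high = wilson_interval(n_correct, TOTAL_QUESTIONS)
    print(
        f"{model}: {accuracy_pct(n_correct, TOTAL_QUESTIONS):.2f}% "
        f"(95% Wilson interval {100 * low:.1f}% to {100 * high:.1f}%)"
    )
```

Because every model answered the same 266 questions, a paired comparison such as McNemar's test would be a more appropriate way to compare models than separate intervals, but it requires per-question results that the abstract does not provide.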
CONCLUSIONS
Recent advancements in large language models have produced AI systems capable of performing at or above the level of board-certified pediatricians on standardized examination questions. These findings suggest potential applications in medical education, board examination preparation, and possibly clinical decision support. Further research should evaluate these AI systems on more complex clinical reasoning tasks and in simulated clinical scenarios.
Related Results
A Survey of DeepSeek Models
Advances in artificial intelligence (AI) rely on systems capable of human-like reasoning, a limitation for conventional Large Language Models (LLMs), which struggle with multi-step...
The Pediatric Anesthesiology Workforce: Projecting Supply and Trends 2015–2035
BACKGROUND:
A workforce analysis was conducted to predict whether the projected future supply of pediatric anesthesiologists is balanced with the requirements o...
The Geographic Distribution of Pediatric Anesthesiologists Relative to the US Pediatric Population
BACKGROUND:
The geographic relationship between pediatric anesthesiologists and the pediatric population has potentially important clinical and policy implications. In ...
Evaluation of ChatGPT vs. DeepSeek from a Privacy Perspective
The integration of artificial intelligence in healthcare has revolutionized research, diagnostics, and patient care. In particular, the emergence of ChatGPT and the recent rise of ...
Research on the Value, Risks, and Responses of DeepSeek Empowering Vocational Education
With the rapid development of artificial intelligence technology, the application of DeepSeek big model in higher vocational education is becoming increasingly widespread, promotin...
Performance of DeepSeek-R1 in Ophthalmology: An Evaluation of Clinical Decision-Making and Cost-Effectiveness
ABSTRACT
Purpose
To compare the performance and cost-effectiveness of DeepSeek-R1 with OpenAI o1 in diagnosing and managing oph...
How does DeepSeek-R1 perform on USMLE?
Abstract: DeepSeek, a Chinese artificial intelligence company, released its first free chatbot app based on its DeepSeek-R1 model. DeepSeek provides its models, algorithms, and train...
DeepSeek-R1 vs OpenAI o1 for Ophthalmic Diagnoses and Management Plans
Importance: Large language models (LLMs) are increasingly being explored in clinical decision-making, but few studies have evaluated their performance on complex ophthalmology cases ...

