DeepSeek-R1 vs OpenAI o1 for Ophthalmic Diagnoses and Management Plans
Importance: Large language models (LLMs) are increasingly being explored in clinical decision-making, but few studies have evaluated their performance on complex ophthalmology cases from clinical practice settings. Understanding whether open-weight, reasoning-enhanced LLMs can outperform proprietary models has implications for clinical utility and accessibility.

Objective: To evaluate the diagnostic accuracy, management decision-making, and cost of DeepSeek-R1 vs OpenAI o1 across diverse ophthalmic subspecialties.

Design, Setting, and Participants: This was a cross-sectional evaluation conducted using standardized prompts and model configurations. Clinical cases were sourced from JAMA Ophthalmology’s Clinical Challenge articles, which contain complex cases from clinical practice settings. Each case included an open-ended diagnostic question and a multiple-choice next-step decision. All cases were included without exclusions, and no human participants were involved. Data were analyzed from March 13 to March 30, 2025.

Exposures: DeepSeek-R1 and OpenAI o1 were evaluated using the Plan-and-Solve Plus (PS+) prompt engineering method.

Main Outcomes and Measures: Primary outcomes were diagnostic accuracy and next-step decision-making accuracy, defined as the proportion of correct responses. Token cost analyses were performed to estimate expenses. Intermodel agreement was evaluated using Cohen κ, and the McNemar test was used to compare performance.

Results: A total of 422 clinical cases were included, spanning 10 subspecialties. DeepSeek-R1 achieved a higher diagnostic accuracy of 70.4% (297 of 422 cases) compared with 63.0% (266 of 422 cases) for OpenAI o1, a 7.3% difference (95% CI, 1.0%-13.7%; P = .02). For next-step decisions, DeepSeek-R1 was correct in 82.7% of cases (349 of 422 cases) vs OpenAI o1’s accuracy of 75.8% (320 of 422 cases), a 6.9% difference (95% CI, 1.4%-12.3%; P = .01). Intermodel agreement was moderate (κ = 0.422; 95% CI, 0.375-0.469; P < .001). DeepSeek-R1 offered lower costs per query than OpenAI o1, with savings exceeding 66-fold (up to 98.5%) during off-peak pricing.

Conclusions and Relevance: DeepSeek-R1 outperformed OpenAI o1 in diagnosis and management across subspecialties while lowering operating costs, supporting the potential of open-weight, reinforcement learning–augmented LLMs as scalable and cost-saving tools for clinical decision support. Further investigations should evaluate safety guardrails and assess performance of self-hosted adaptations of DeepSeek-R1 with domain-specific ophthalmic expertise to optimize clinical utility.
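The reported figures can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' analysis code. The function names are hypothetical, and the abstract does not report the paired 2x2 table or say which McNemar variant was used, so the exact binomial form and the toy counts passed to `cohen_kappa` and `mcnemar_exact_p` are assumptions chosen only to illustrate the calculations. Only the counts 297, 266, 349, 320, and 422 and the 66-fold figure come from the abstract.

```python
from math import comb


def accuracy_pct(correct: int, total: int) -> float:
    """Accuracy as a percentage: the proportion of correct responses."""
    return 100 * correct / total


def fold_to_percent_savings(fold: float) -> float:
    """A price that is `fold` times lower corresponds to savings of 1 - 1/fold."""
    return 100 * (1 - 1 / fold)


def cohen_kappa(both_right: int, only_a: int, only_b: int, both_wrong: int) -> float:
    """Cohen kappa for two raters (models) scored right/wrong on the same cases."""
    n = both_right + only_a + only_b + both_wrong
    p_observed = (both_right + both_wrong) / n
    a_right = (both_right + only_a) / n
    b_right = (both_right + only_b) / n
    # Agreement expected by chance if the two models were independent.
    p_expected = a_right * b_right + (1 - a_right) * (1 - b_right)
    return (p_observed - p_expected) / (1 - p_expected)


def mcnemar_exact_p(only_a: int, only_b: int) -> float:
    """Two-sided exact McNemar test (assumed variant) on the discordant pairs."""
    n = only_a + only_b
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Values taken from the abstract.
total = 422
print(round(accuracy_pct(297, total), 1))        # 70.4 (DeepSeek-R1, diagnosis)
print(round(accuracy_pct(266, total), 1))        # 63.0 (OpenAI o1, diagnosis)
print(round(accuracy_pct(297 - 266, total), 1))  # 7.3  (difference)
print(round(accuracy_pct(349 - 320, total), 1))  # 6.9  (next-step difference)
print(round(fold_to_percent_savings(66), 1))     # 98.5 (savings at 66-fold)

# Hypothetical toy table (NOT from the study): 100 cases, 15 discordant.
print(cohen_kappa(both_right=40, only_a=10, only_b=5, both_wrong=45))
print(mcnemar_exact_p(only_a=10, only_b=5))
```

The McNemar test uses only the discordant cases (those where exactly one model was correct), which is why it suits a paired design in which both models answer the same 422 cases. Reproducing the reported P values or κ would require the study's actual paired counts.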
American Medical Association (AMA)
Related Results
Performance of DeepSeek-R1 in Ophthalmology: An Evaluation of Clinical Decision-Making and Cost-Effectiveness
Purpose: To compare the performance and cost-effectiveness of DeepSeek-R1 with OpenAI o1 in diagnosing and managing oph...
A Survey of DeepSeek Models
Advances in artificial intelligence (AI) rely on systems capable of human-like reasoning, a limitation for conventional Large Language Models (LLMs), which struggle with multi-step...
Research on the Value, Risks, and Responses of DeepSeek Empowering Vocational Education
With the rapid development of artificial intelligence technology, the application of DeepSeek big model in higher vocational education is becoming increasingly widespread, promotin...
Evaluation of ChatGPT vs. DeepSeek from a Privacy Perspective
The integration of artificial intelligence in healthcare has revolutionized research, diagnostics, and patient care. In particular, the emergence of ChatGPT and the recent rise of ...
How does DeepSeek-R1 perform on USMLE?
Abstract: DeepSeek, a Chinese artificial intelligence company, released its first free chatbot app based on its DeepSeek-R1 model. DeepSeek provides its models, algorithms, and train...
Performance of ChatGPT in Ophthalmic Registration and Clinical Diagnosis: Cross-Sectional Study (Preprint)
Background: Artificial intelligence (AI) chatbots such as ChatGPT are expected to impact vision health care significantly. Their potential to optimize the co...
A Timely Quick Literature Review on the Deepseek in Chinese Publication
The swift rise of DeepSeek—the Chinese generative artificial intelligence (AI) model that champions open‐source innovation—has ignited scholarly interests across frontiers. This ti...

