Multimodal Performance of GPT-4 in Complex Ophthalmology Cases
Objectives: The integration of multimodal capabilities into GPT-4 represents a transformative leap for artificial intelligence in ophthalmology, yet its utility in scenarios requiring advanced reasoning remains underexplored. This study evaluates GPT-4’s multimodal performance on open-ended diagnostic and next-step reasoning tasks in complex ophthalmology cases, comparing it against human expertise.
Methods: GPT-4 was assessed across three study arms: (1) text-based case details with figure descriptions, (2) cases with text and accompanying ophthalmic figures, and (3) cases with figures only (no figure descriptions). We compared GPT-4’s diagnostic and next-step accuracy across arms and benchmarked its performance against three board-certified ophthalmologists.
Results: GPT-4 achieved 38.4% (95% CI [33.9%, 43.1%]) diagnostic accuracy and 57.8% (95% CI [52.8%, 62.2%]) next-step accuracy when prompted with figures without descriptions. Diagnostic accuracy declined significantly compared to text-only prompts (p = 0.007), though the next-step performance was similar (p = 0.140). Adding figure descriptions restored diagnostic accuracy (49.3%) to near parity with text-only prompts (p = 0.684). Using figures without descriptions, GPT-4’s diagnostic accuracy was comparable to two ophthalmologists (p = 0.30, p = 0.41) but fell short of the highest-performing ophthalmologist (p = 0.0004). For next-step accuracy, GPT-4 was similar to one ophthalmologist (p = 0.22) but underperformed relative to the other two (p = 0.0015, p = 0.0017).
Conclusions: GPT-4’s diagnostic performance diminishes when relying solely on ophthalmic images without textual context, highlighting limitations in its current multimodal capabilities. Despite this, GPT-4 demonstrated comparable performance to at least one ophthalmologist on both diagnostic and next-step reasoning tasks, emphasizing its potential as an assistive tool. Future research should refine multimodal prompts and explore iterative or sequential prompting strategies to optimize AI-driven interpretation of complex ophthalmic datasets.
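The abstract reports accuracies with 95% confidence intervals but does not state the number of cases or the interval method. As a rough illustration only, the sketch below computes a Wilson score interval for a proportion. The Wilson method is an assumption here, not necessarily what the authors used, and the function name and the counts in the usage line are hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximately 95% interval.
    Assumed method for illustration; the abstract does not name one.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must lie between 0 and total")

    p_hat = successes / total
    denom = 1 + z ** 2 / total
    # Centre of the interval is pulled slightly toward 0.5.
    centre = (p_hat + z ** 2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / total + z ** 2 / (4 * total ** 2)) / denom
    )
    return centre - half_width, centre + half_width


# Hypothetical counts, not taken from the study.
low, high = wilson_interval(successes=40, total=100)
print(f"accuracy 40.0%, 95% CI [{low:.1%}, {high:.1%}]")
```

With the study's actual correct and total case counts substituted in, the same function would give an interval to compare against the reported values, provided the authors used a comparable method.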
Related Results
Performance of Novel GPT-4 in Otolaryngology Knowledge Assessment
Abstract
Purpose
GPT-4, recently released by OpenAI, improves upon GPT-3.5 with increased reliability and expanded capabilities, including user-spec...
Analysis of the Use of GPT in Learning Clinical Optics I at ARO Gapopin
The development of artificial intelligence (AI) technology, particularly large language models such as the Generative Pre-trained Transformer (GPT), has brought about a transformation...
Factors Influencing Choice of Medical Specialty among Ophthalmology and Non-Ophthalmology Residency Applicants
Abstract. Objective: The study aimed to investigate factors influencing choice of specialty among ophthalmology and non-ophthalmology residency applicants. Patients and Methods: Anonymo...
GPT-agents based on medical guidelines can improve the responsiveness and explainability of outcomes for traumatic brain injury rehabilitation
Abstract: This study explored the application of generative pre-trained transformer (GPT) agents based on medical guidelines using large language model (LLM) technology for traumatic...
Performance of ChatGPT in Ophthalmic Registration and Clinical Diagnosis: Cross-Sectional Study (Preprint)
BACKGROUND
Artificial intelligence (AI) chatbots such as ChatGPT are expected to impact vision health care significantly. Their potential to optimize the co...
Exploring Large Language Models Integration in the Histopathologic Diagnosis of Skin Diseases: A Comparative Study
Abstract
Introduction
The exact manner in which large language models (LLMs) will be integrated into pathology is not yet fully comprehended. This study examines the accuracy, bene...
Influence of Model Evolution and System Roles on ChatGPT’s Performance in Chinese Medical Licensing Exams: Comparative Study
Abstract
Background
With the increasing application of large language models like ChatGPT in various industries, its potential in the medical dom...
Multimodal Emotion Recognition and Human Computer Interaction for AI-Driven Mental Health Support (Preprint)
BACKGROUND
Mental health has become one of the most urgent global health issues of the twenty-first century. The World Health Organization (WHO) reports tha...

