Multimodal Performance of GPT-4 in Complex Ophthalmology Cases

Objectives: The integration of multimodal capabilities into GPT-4 represents a transformative leap for artificial intelligence in ophthalmology, yet its utility in scenarios requiring advanced reasoning remains underexplored. This study evaluates GPT-4’s multimodal performance on open-ended diagnostic and next-step reasoning tasks in complex ophthalmology cases, comparing it against human expertise. Methods: GPT-4 was assessed across three study arms: (1) text-based case details with figure descriptions, (2) cases with text and accompanying ophthalmic figures, and (3) cases with figures only (no figure descriptions). We compared GPT-4’s diagnostic and next-step accuracy across arms and benchmarked its performance against three board-certified ophthalmologists. Results: GPT-4 achieved 38.4% (95% CI [33.9%, 43.1%]) diagnostic accuracy and 57.8% (95% CI [52.8%, 62.2%]) next-step accuracy when prompted with figures without descriptions. Diagnostic accuracy declined significantly compared to text-only prompts (p = 0.007), though the next-step performance was similar (p = 0.140). Adding figure descriptions restored diagnostic accuracy (49.3%) to near parity with text-only prompts (p = 0.684). Using figures without descriptions, GPT-4’s diagnostic accuracy was comparable to two ophthalmologists (p = 0.30, p = 0.41) but fell short of the highest-performing ophthalmologist (p = 0.0004). For next-step accuracy, GPT-4 was similar to one ophthalmologist (p = 0.22) but underperformed relative to the other two (p = 0.0015, p = 0.0017). Conclusions: GPT-4’s diagnostic performance diminishes when relying solely on ophthalmic images without textual context, highlighting limitations in its current multimodal capabilities. Despite this, GPT-4 demonstrated comparable performance to at least one ophthalmologist on both diagnostic and next-step reasoning tasks, emphasizing its potential as an assistive tool. 
Future research should refine multimodal prompts and explore iterative or sequential prompting strategies to optimize AI-driven interpretation of complex ophthalmic datasets.
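The accuracies above are reported as proportions with 95% confidence intervals. The abstract does not state the number of cases or the interval method used, so the following is only a minimal sketch of one common choice, the Wilson score interval, applied to hypothetical counts. The function name `wilson_interval` and the example counts are illustrative assumptions and are not taken from the study.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z defaults to the standard normal quantile for a two-sided 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p_hat = successes / n
    denom = 1 + z**2 / n
    # The Wilson interval is centered on a value shrunk toward 0.5, not on p_hat.
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical counts for illustration only (NOT the study's data):
# 40 correct answers out of 100 cases.
low, high = wilson_interval(successes=40, n=100)
print(f"accuracy = {40 / 100:.1%}, 95% CI = [{low:.1%}, {high:.1%}]")
```

Because the Wilson interval is not centered on the observed proportion, its bounds are generally asymmetric around the point estimate, which is one reason it is often preferred over the simple normal approximation for moderate sample sizes.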

MDPI AG

David Mikhail, Daniel Milad, Fares Antaki, Jason Milad, Andrew Farah, Thomas Khairy, Jonathan El-Khoury, Kenan Bachour, Andrei-Alexandru Szigiato, Taylor Nayman, Guillaume A. Mullie, Renaud Duval

Journal of Personalized Medicine

2025

Related Results

Abstract Purpose GPT-4, recently released by OpenAI, improves upon GPT-3.5 with increased reliability and expanded capabilities, including user-spec...

Analisis Penggunaan GPT dalam Pembelajaran Klinik Optik I di ARO Gapopin [Analysis of the Use of GPT in Clinical Optics I Learning at ARO Gapopin]

The development of artificial intelligence (AI) technology, particularly large language models such as the Generative Pre-trained Transformer (GPT), has brought about a transformation...

Factors Influencing Choice of Medical Specialty among Ophthalmology and Non-Ophthalmology Residency Applicants

Abstract Objective The study aimed to investigate factors influencing choice of specialty among ophthalmology and non-ophthalmology residency applicants. Patients and Methods Anonymo...

GPT-agents based on medical guidelines can improve the responsiveness and explainability of outcomes for traumatic brain injury rehabilitation

Abstract This study explored the application of generative pre-trained transformer (GPT) agents based on medical guidelines using large language model (LLM) technology for traumatic...

Performance of ChatGPT in Ophthalmic Registration and Clinical Diagnosis: Cross-Sectional Study (Preprint)

BACKGROUND Artificial intelligence (AI) chatbots such as ChatGPT are expected to impact vision health care significantly. Their potential to optimize the co...

Exploring Large Language Models Integration in the Histopathologic Diagnosis of Skin Diseases: A Comparative Study

Abstract Introduction The exact manner in which large language models (LLMs) will be integrated into pathology is not yet fully comprehended. This study examines the accuracy, bene...

Influence of Model Evolution and System Roles on ChatGPT’s Performance in Chinese Medical Licensing Exams: Comparative Study

Abstract Background With the increasing application of large language models like ChatGPT in various industries, its potential in the medical dom...

Multimodal Emotion Recognition and Human Computer Interaction for AI-Driven Mental Health Support (Preprint)

BACKGROUND Mental health has become one of the most urgent global health issues of the twenty-first century. The World Health Organization (WHO) reports tha...
