
ChatGPT-4 Prompt: A Tool to Enhance Novice Radiologists' Diagnostic Capabilities in Cystic Renal Masses to Expert-Level Accuracy

Abstract

Background: The impact of prompt engineering in large language models (LLMs) on text-based questions has shown variability, whereas its influence on image-based diagnostic tasks remains largely unexplored.

Purpose: To evaluate the diagnostic performance of various GPT-4 prompts in assessing cystic renal masses (CRMs) with the contrast-enhanced ultrasound (CEUS) Bosniak classification, and to test whether GPT-4 prompts can assist radiologists with different levels of experience.

Materials and Methods: This retrospective study included 103 images of CRMs from patients who underwent CEUS and CT. GPT-4 (OpenAI) and six radiologists (three experts and three novices) independently assigned a Bosniak classification (BC) based solely on the original CEUS images. The radiologists then reassessed the images after seeing the BCs generated by the GPT-4 prompt and decided whether to modify their initial assessments. The diagnostic performance of the radiologists and the GPT-4 prompts was quantified using the area under the receiver operating characteristic curve (AUC).

Results: The AUCs achieved by GPT-4 prompts ranged from 0.549 to 0.778, while radiologists' AUCs ranged from 0.820 to 0.901. Among all prompting strategies, ROT prompting achieved the highest AUC, with performance comparable to that of novices (0.778 vs. 0.820, P = 0.39). Although its AUC was lower than that of experts (0.778 vs. 0.901, P = 0.01), ROT prompting improved the AUCs of novices: from 0.714 to 0.834 for novice 1, from 0.685 to 0.782 for novice 2, and from 0.704 to 0.783 for novice 3, with all three novices approaching expert-level performance.

Conclusion: GPT-4 showed variable performance in interpreting images depending on the prompt. ROT prompting, the best-performing style, achieved diagnostic accuracy comparable to that of novices and could help novices improve their diagnostic performance to expert level.
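As an illustration of the AUC metric named in the methods, the sketch below computes a rank-based (Mann-Whitney) AUC from ordinal Bosniak categories against a binary reference standard. It is not the authors' analysis code. The ordinal encoding `BOSNIAK_RANK`, the choice of malignancy as the reference label, and the eight toy cases are all assumptions made for the example, not data from the study.

```python
from itertools import product


def auc_from_scores(scores, labels):
    """Rank-based AUC: P(positive score > negative score) + 0.5 * P(tie)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    # Each positive/negative pair counts 1 if ranked correctly, 0.5 if tied.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Assumed ordinal encoding of Bosniak categories (illustrative only).
BOSNIAK_RANK = {"I": 1, "II": 2, "IIF": 3, "III": 4, "IV": 5}

# Toy ratings and hypothetical reference labels (1 = malignant, 0 = benign).
ratings = ["I", "II", "IIF", "III", "IV", "II", "III", "IV"]
malignant = [0, 0, 0, 1, 1, 1, 0, 1]

scores = [BOSNIAK_RANK[r] for r in ratings]
auc = auc_from_scores(scores, malignant)
print(f"AUC = {auc:.4f}")
```

Counting ties as half a win is one common way to handle ordinal ratings, where many cases share the same category.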

Springer Science and Business Media LLC

Dong-dong Jin, Nan Zhang, Yan Wang, Ke Lin, Bin Qiao, Yue Yang, Jin-hua Lin, Ding-xiang Xie, Xiao-yan Xie, Xiao-hua Xie, Bo-wen Zhuang

2025


