
Evaluating GPT-4 Responses on Scars or Keloids for Patient Education: Large Language Model Evaluation Study

Abstract

Background: Scars and keloids impose significant physical and psychological burdens on patients, often leading to functional limitations, cosmetic concerns, and mental health issues such as anxiety or depression. Patients increasingly turn to online platforms for information; however, existing web-based resources on scars and keloids are frequently unreliable, fragmented, or difficult to understand. Large language models such as GPT-4 show promise for delivering medical information, but their accuracy, readability, and potential to generate hallucinated content require validation for patient education applications.

Objective: This study aimed to systematically evaluate GPT-4's performance in providing patient education on scars and keloids, focusing on its accuracy, reliability, readability, and reference quality.

Methods: We collected 354 questions from Reddit communities (r/Keloids, r/SCAR, and r/PlasticSurgery), covering topics including treatment options, pre- and postoperative care, and psychological impacts. Each question was input into GPT-4 in an independent session to mimic real-world patient interactions. Responses were evaluated using multiple tools: the Patient Education Materials Assessment Tool-Artificial Intelligence for understandability and actionability, DISCERN-AI for treatment information quality, the Global Quality Scale for overall information quality, and standard readability metrics (Flesch Reading Ease score and Gunning Fog Index). Three plastic surgeons used the Natural Language Assessment Tool for Artificial Intelligence to rate accuracy, safety, and clinical appropriateness, while the Reference Evaluation for Artificial Intelligence tool checked references for hallucination, relevance, and source quality. We conducted the same analysis on GPT-4-generated responses to questions drawn from 3 medical websites.

Results: GPT-4 demonstrated high accuracy and reliability. The Patient Education Materials Assessment Tool-Artificial Intelligence showed 75.5% understandability, DISCERN-AI rated responses as "good" (26.3/35), and the Global Quality Scale score was 4.28 of 5. Surgeons' evaluations averaged 3.94 to 4.43 out of 5 across dimensions (accuracy 3.9, SD 0.7; safety 4.3, SD 0.8; clinical appropriateness 4.4, SD 0.5; actionability 4.1, SD 0.8; and effectiveness 4.1, SD 0.8). Readability analyses indicated moderate complexity (Flesch Reading Ease score: 50.13; Gunning Fog Index: 12.68), corresponding to a 12th-grade reading level. The Reference Evaluation for Artificial Intelligence tool identified 11.8% (383/3250) of references as hallucinated; the remaining 88.2% (2867/3250) were real, and 95.1% (2724/2867) of those came from authoritative sources (eg, government guidelines and the peer-reviewed literature). Results for the questions drawn from the medical websites were consistent with those for the Reddit questions.

Conclusions: GPT-4 shows substantial potential as a patient education tool for scars and keloids, offering reliable and accurate information. However, improving readability (to align with sixth- to eighth-grade standards) and reducing reference hallucinations are essential to enhance accessibility and trustworthiness. Future large language model optimizations should prioritize simplifying medical language and strengthening reference validation mechanisms to maximize clinical utility.
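For reference, the two readability metrics cited in the abstract are simple closed-form formulas over word, sentence, and syllable counts. The sketch below computes both; the function names are illustrative, and the naive vowel-group syllable counter is an assumption (validated tools rely on pronunciation dictionaries, so exact scores will differ slightly).

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count groups of consecutive vowels,
    # dropping a trailing silent "e". Real tools use dictionaries.
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Gunning Fog counts "complex" words: 3+ syllables.
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)

    wps = len(words) / len(sentences)   # mean words per sentence
    spw = syllables / len(words)        # mean syllables per word

    # Standard published formulas:
    flesch = 206.835 - 1.015 * wps - 84.6 * spw
    fog = 0.4 * (wps + 100 * complex_words / len(words))
    return {"flesch": round(flesch, 2), "fog": round(fog, 2)}
```

A Flesch score near 50 with a Fog index near 13, as reported for GPT-4's responses, maps to roughly a 12th-grade reading level; the sixth- to eighth-grade target the authors recommend corresponds to Flesch scores of about 60 to 80.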

Related Results

Hubungan Perilaku Pola Makan dengan Kejadian Anak Obesitas (The Relationship Between Eating Behavior and the Incidence of Childhood Obesity)
ANDROGEN-DEPENDENT DERMOPATHY IN WOMEN WITH KELOID SCARS
Objective: To explore the character of androgen-dependent dermopathy (ADD) in women with keloid scars. Methods: 100 girls and women aged 15-28 years were examined, of whom 47 were...
Autonomy on Trial
Photo by CHUTTERSNAP on Unsplash Abstract This paper critically examines how US bioethics and health law conceptualize patient autonomy, contrasting the rights-based, individualist...
Analisis Penggunaan GPT dalam Pembelajaran Klinik Optik I di ARO Gapopin (Analysis of GPT Use in Optical Clinic I Instruction at ARO Gapopin)
The development of artificial intelligence (AI) technology, particularly large language models such as the Generative Pre-trained Transformer (GPT), has brought a major transformation...
Exploring Large Language Models Integration in the Histopathologic Diagnosis of Skin Diseases: A Comparative Study
Abstract Introduction The exact manner in which large language models (LLMs) will be integrated into pathology is not yet fully comprehended. This study examines the accuracy, bene...
Comparative Methods for Building Chatbots: Open Source, Hybrid, and Fully Integrated Large Language Models
In the complex and dynamic realm of biodiversity informatics, the accessibility and comprehension of standards and vocabularies are pivotal for, but not limited to, effective data ...
TREATMENT OF KELOIDS AND HYPERTROPHIC SCARS BY COMBINED CRYOTHERAPY AND INTRALESIONAL TRIAMCINOLONE
Objectives: To evaluate the outcomes of combining cryotherapy and intralesional triamcinolone in the treatment of keloids and hypertrophic scars. Methods: 60 patients (31 males and...
