
Evaluating GPT-4 Responses on Scars or Keloids for Patient Education: Large Language Model Evaluation Study

Abstract

Background: Scars and keloids impose significant physical and psychological burdens on patients, often leading to functional limitations, cosmetic concerns, and mental health issues such as anxiety or depression. Patients increasingly turn to online platforms for information; however, existing web-based resources on scars and keloids are frequently unreliable, fragmented, or difficult to understand. Large language models such as GPT-4 show promise for delivering medical information, but their accuracy, readability, and potential to generate hallucinated content require validation for patient education applications.

Objective: This study aimed to systematically evaluate GPT-4's performance in providing patient education on scars and keloids, focusing on its accuracy, reliability, readability, and reference quality.

Methods: We collected 354 questions from Reddit communities (r/Keloids, r/SCAR, and r/PlasticSurgery), covering topics including treatment options, pre- and postoperative care, and psychological impacts. Each question was input into GPT-4 in an independent session to mimic real-world patient interactions. Responses were evaluated using multiple tools: the Patient Education Materials Assessment Tool-Artificial Intelligence for understandability and actionability, DISCERN-AI for treatment information quality, the Global Quality Scale for overall information quality, and standard readability metrics (Flesch Reading Ease score and Gunning Fog Index). Three plastic surgeons used the Natural Language Assessment Tool for Artificial Intelligence to rate accuracy, safety, and clinical appropriateness, while the Reference Evaluation for Artificial Intelligence tool checked references for hallucination, relevance, and source quality. We conducted the same analysis on GPT-4-generated responses to questions drawn from 3 medical websites.

Results: GPT-4 demonstrated high accuracy and reliability. The Patient Education Materials Assessment Tool-Artificial Intelligence showed 75.5% understandability, DISCERN-AI rated responses as "good" (26.3/35), and the Global Quality Scale score was 4.28 of 5. Surgeons' evaluations averaged 3.94 to 4.43 out of 5 across dimensions (accuracy 3.9, SD 0.7; safety 4.3, SD 0.8; clinical appropriateness 4.4, SD 0.5; actionability 4.1, SD 0.8; and effectiveness 4.1, SD 0.8). Readability analyses indicated moderate complexity (Flesch Reading Ease score: 50.13; Gunning Fog Index: 12.68), corresponding to a 12th-grade reading level. The Reference Evaluation for Artificial Intelligence tool identified 11.8% (383/3250) of references as hallucinated; the remaining 88.2% (2867/3250) were real, and 95.1% (2724/2867) of those came from authoritative sources (eg, government guidelines and the peer-reviewed literature). Results for the questions drawn from the medical websites were consistent with those for the Reddit questions.

Conclusions: GPT-4 shows substantial potential as a patient education tool for scars and keloids, offering reliable and accurate information. However, improving readability (to align with sixth- to eighth-grade standards) and reducing reference hallucinations are essential to enhance accessibility and trustworthiness. Future large language model optimizations should prioritize simplifying medical language and strengthening reference validation mechanisms to maximize clinical utility.
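For reference, the two readability metrics cited in the abstract are simple closed-form formulas over word, sentence, and syllable counts. The sketch below computes both; the function names are illustrative, and the naive vowel-group syllable counter is an assumption (validated tools rely on pronunciation dictionaries, so exact scores will differ slightly).

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count groups of consecutive vowels,
    # dropping a trailing silent "e". Real tools use dictionaries.
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Gunning Fog counts "complex" words: 3+ syllables.
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)

    wps = len(words) / len(sentences)   # mean words per sentence
    spw = syllables / len(words)        # mean syllables per word

    # Standard published formulas:
    flesch = 206.835 - 1.015 * wps - 84.6 * spw
    fog = 0.4 * (wps + 100 * complex_words / len(words))
    return {"flesch": round(flesch, 2), "fog": round(fog, 2)}
```

A Flesch score near 50 with a Fog index near 13, as reported for GPT-4's responses, maps to roughly a 12th-grade reading level; the sixth- to eighth-grade target the authors recommend corresponds to Flesch scores of about 60 to 80.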

Related Results

Hubungan Perilaku Pola Makan dengan Kejadian Anak Obesitas (The Relationship Between Eating Behavior and the Incidence of Childhood Obesity)
ANDROGEN-DEPENDENT DERMOPATHY IN WOMEN WITH KELOID SCARS
Objective: To explore the character of androgen-dependent dermopathy (ADD) in women with keloid scars. Methods: 100 girls and women aged 15-28 years were examined, of whom 47 were...
Autonomy on Trial
Photo by CHUTTERSNAP on Unsplash Abstract This paper critically examines how US bioethics and health law conceptualize patient autonomy, contrasting the rights-based, individualist...
Analisis Penggunaan GPT dalam Pembelajaran Klinik Optik I di ARO Gapopin (Analysis of GPT Use in Optical Clinic I Instruction at ARO Gapopin)
The development of artificial intelligence (AI) technology, particularly large language models such as the Generative Pre-trained Transformer (GPT), has brought a major transformation...
Exploring Large Language Models Integration in the Histopathologic Diagnosis of Skin Diseases: A Comparative Study
Abstract Introduction The exact manner in which large language models (LLMs) will be integrated into pathology is not yet fully comprehended. This study examines the accuracy, bene...
Comparative Methods for Building Chatbots: Open Source, Hybrid, and Fully Integrated Large Language Models
In the complex and dynamic realm of biodiversity informatics, the accessibility and comprehension of standards and vocabularies are pivotal for, but not limited to, effective data ...
TREATMENT OF KELOIDS AND HYPERTROPHIC SCARS BY COMBINED CRYOTHERAPY AND INTRALESIONAL TRIAMCINOLONE
Objectives: To evaluate the outcomes of combining cryotherapy and intralesional triamcinolone in the treatment of keloids and hypertrophic scars. Methods: 60 patients (31 males and...
