Large Language Model–Generated Patient Instructions for Prescriptions in Primary Health Care: Preclinical Algorithm Validation (Preprint)
BACKGROUND
The application of generative artificial intelligence to simplify medication use instructions has the potential to enhance people’s health by improving treatment adherence.
OBJECTIVE
We evaluated the performance of large language models (LLMs) in generating medication usage instructions to complement prescriptions in primary health care.
METHODS
This randomized, blinded experimental preclinical study used prescription-inducing scenarios, assigned to 62 health care professionals, to validate instructions generated by LLMs during electronic prescriptions. The instructions were generated by ChatGPT-4.0 (OpenAI), Llama3.1-8B (Meta), and Llama3.1-8B-RAG (Meta) using retrieval-augmented generation based on patient information leaflets. Performance metrics assessed adequacy, completeness, clarity, language simplification, usefulness, and errors in the generated instructions, with scores to analyze overall and individual metrics.
RESULTS
The 3 models yielded high overall scores for producing qualified instructions (ChatGPT-4.0: median 88.4, IQR 22.8; Llama3.1-8B: median 66.5, IQR 50.9; Llama3.1-8B-RAG: median 79.9, IQR 34.4; Kruskal-Wallis test, P=.003). Llama3.1-8B-RAG received overall scores similar to those of ChatGPT-4.0 (post hoc test, P=.05) and of Llama3.1-8B (post hoc test, P=.44). ChatGPT-4.0 outperformed Llama3.1-8B (Bonferroni test, P<.001). Regarding specific domains, Llama3.1-8B-RAG received scores equivalent to those of ChatGPT-4.0 for adequacy (mean 6.24, SD 2.3 vs mean 6.82, SD 2.1; post hoc test, P=.54), completeness (mean 5.94, SD 2.2 vs mean 6.55, SD 1.9; post hoc test, P=.38), clarity (mean 5.77, SD 2.4 vs mean 6.68, SD 1.9; post hoc test, P=.09), and usefulness (mean 5.42, SD 2.4 vs mean 5.96, SD 2.2; post hoc test, P=.63). ChatGPT-4.0 received higher scores in the language simplification criterion than Llama3.1-8B-RAG (mean 7.05, SD 1.5 vs mean 5.44, SD 2.6; post hoc test, P<.001). Interrater variability in assigning scores ranged from 4.2% (n=3) to 85.8% (n=6) among primary health care professionals. Instructions leading to incorrect use of the medication had similar frequency among the models (ChatGPT-4.0: n=15, 22.7%; Llama3.1-8B: n=19, 22.8%; Llama3.1-8B-RAG: n=19, 22.8%; chi-square test, P=.71). The frequencies of hallucination were similar (ChatGPT-4.0: n=7, 10.6%; Llama3.1-8B: n=9, 13.6%; Llama3.1-8B-RAG: n=6, 9.1%; chi-square test, P=.67).
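As an illustration of the kind of frequency comparison reported above, the following is a minimal sketch of a Pearson chi-square test of independence for three models and a binary outcome (hallucination present or absent). It is not the authors' analysis code. The per-model denominator of 66 evaluations is an assumption inferred from the reported percentages (for example, 7 of 66 is about 10.6%) and is not stated in the abstract; the function name `chi_square_3x2` is hypothetical. Because of these assumptions, the resulting P value need not match the reported one exactly.

```python
import math


def chi_square_3x2(event_counts, totals):
    """Pearson chi-square test of independence for a 3 x 2 table.

    event_counts: number of evaluations with the event, one per model.
    totals: number of evaluations per model (assumed, see lead-in).
    Returns (chi2 statistic, P value). Restricted to 3 groups so that
    the degrees of freedom equal 2, where the chi-square survival
    function has the closed form exp(-x / 2).
    """
    if len(event_counts) != 3 or len(totals) != 3:
        raise ValueError("this sketch only handles exactly 3 groups (df = 2)")

    # Build rows of (event, no event) for each model.
    rows = [(c, n - c) for c, n in zip(event_counts, totals)]
    grand_total = sum(totals)
    column_totals = [sum(row[j] for row in rows) for j in range(2)]

    chi2 = 0.0
    for row, n in zip(rows, totals):
        for j in range(2):
            # Expected count under independence of model and outcome.
            expected = n * column_totals[j] / grand_total
            chi2 += (row[j] - expected) ** 2 / expected

    p_value = math.exp(-chi2 / 2.0)  # exact for df = 2
    return chi2, p_value


# Hallucination counts from the abstract; 66 per model is an assumption.
hallucinations = [7, 9, 6]
assumed_totals = [66, 66, 66]
chi2, p_value = chi_square_3x2(hallucinations, assumed_totals)
print(f"chi2 = {chi2:.3f}, P = {p_value:.2f}")
```

A P value well above .05 under these assumptions is consistent with the statement that hallucination frequencies were similar across the models.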
CONCLUSIONS
The open-source LLM enhanced with external information performed similarly to the closed-source model, except in language simplification of messages, where ChatGPT-4.0 was superior. LLM generation demonstrated potential for instructing patients on medication use. Nonetheless, introducing this innovation into the electronic prescribing workflow demands prescriber validation for human oversight of the technology and requires a strategy for LLM performance governance.
INTERNATIONAL REGISTERED REPORT
RR2-https://doi.org/10.12688/verixiv.1359.1
JMIR Publications Inc.
Zilma Silveira Nogueira Reis
Elisa Tuler Albergaria
Adriana Silvina Pagano
Eura Martins Lage
Flávia Ribeiro de Oliveira
Cristiane dos Santos Dias
Juliana Almeida Oliveira
Gláucia Miranda Varella Pereira
Isaias Jose Ramos de Oliveira
Érico Franco Mineiro
Igor Carvalho Lima Oliveira
Davi dos Reis de Jesus
Antônio Pereira de Souza Júnior
Igor de Carvalho Gomes
Rodrigo André Cuevas Gaete
Ricardo Cruz-Correia
Leonardo Rocha

