Evaluating Locally Run Large Language Models (Gemma 2, Mistral Nemo, and Llama 3) for Outpatient Otorhinolaryngology Care: Retrospective Study
Abstract
Background
Large language models (LLMs) have great potential to improve the work of clinicians and make it more efficient. Previous studies have mainly focused on web-based services, such as ChatGPT, often with simulated cases. For the processing of personal patient data, web-based services raise major data protection concerns. Ensuring compliance with data protection and medical device regulations therefore remains a critical challenge for adopting LLMs in clinical settings.
Objective
This retrospective single-center study aimed to evaluate locally run LLMs (Gemma 2, Mistral Nemo, and Llama 3) in providing diagnostic and treatment recommendations for real-world outpatient cases in otorhinolaryngology (ORL).
Methods
Outpatient cases (n=30) from regular consultation hours and the emergency service at a university hospital ORL outpatient department were randomly selected. Documentation by ORL doctors, including anamnesis and examination results, was passed to the locally run LLMs (Gemma 2, Mistral Nemo, and Llama 3), which were asked to provide diagnostic and treatment strategies. Recommendations of the LLMs and the treating ORL doctors were rated by 3 experienced ORL consultants on a 6-point Likert scale for medical adequacy, conciseness, coherence, and comprehensibility. Moreover, consultants were asked whether the answers pose a risk to the patient’s safety. A modified Turing test was performed to distinguish responses generated by LLMs from those of doctors. Finally, the potential influence of the information generated by the LLMs on the raters’ own diagnosis and treatment opinions was evaluated.
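As a rough illustration of how ratings of this kind could be tallied, the following minimal Python sketch groups rating records by response source and computes a mean Likert score and the share of ratings flagged as hazardous. It is not the study's analysis code: the record layout, the source labels (`doctor`, `llm_x`), the function name `summarize`, and every value are hypothetical placeholders, and the direction of the 6-point scale is not assumed.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rating records, NOT study data:
# (case_id, rater_id, source, adequacy_score_1_to_6, flagged_hazardous)
ratings = [
    (1, "A", "doctor", 6, False),
    (1, "B", "doctor", 5, False),
    (1, "A", "llm_x", 3, True),
    (1, "B", "llm_x", 4, False),
    (2, "A", "doctor", 5, False),
    (2, "B", "doctor", 6, False),
    (2, "A", "llm_x", 2, True),
    (2, "B", "llm_x", 4, True),
]


def summarize(records):
    """Group ratings by response source and compute simple descriptive summaries."""
    by_source = defaultdict(list)
    for _case_id, _rater_id, source, score, hazardous in records:
        by_source[source].append((score, hazardous))

    summary = {}
    for source, rows in by_source.items():
        scores = [score for score, _ in rows]
        flags = [hazardous for _, hazardous in rows]
        summary[source] = {
            "n_ratings": len(rows),
            "mean_adequacy": mean(scores),
            # Fraction of individual ratings (not cases) flagged as hazardous.
            "hazard_share": sum(flags) / len(flags),
        }
    return summary


summary = summarize(ratings)
for source, stats in summary.items():
    print(source, stats)
```

Counting per rating rather than per case is a design choice made here for simplicity; a per-case figure would need an extra rule for combining the 3 raters' flags.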
Results
Across all categories, ORL doctors achieved superior ratings (P<.0005) compared with the locally run LLMs (Llama 3, Mistral Nemo, and Gemma 2). ORL doctors’ responses were considered hazardous for patients in only 1% of the ratings, whereas recommendations by Llama 3, Gemma 2, and Mistral Nemo were considered hazardous in 54%, 47%, and 32% of cases, respectively. According to the raters, the LLMs’ information rarely influenced their judgment: this was reported in 1%, 3%, and 4% of the ratings for Mistral Nemo, Gemma 2, and Llama 3, respectively.
Conclusions
Although locally run LLMs still underperform compared with their web-based counterparts, they achieved respectable results on outpatient treatment in this study. Nevertheless, the retrospective and single-center nature of the study, along with the clinicians’ documentation style, may have introduced bias in favor of human recommendations. In the future, locally run LLMs will help address data protection concerns; however, further refinement and prospective validation are still needed to meet strict medical device requirements. As locally run LLMs continue to evolve, they are likely to become comparable in capability to web-based LLMs and to become established as useful tools to support doctors in clinical practice.

