Evaluating Locally Run Large Language Models (Gemma 2, Mistral Nemo, and Llama 3) for Outpatient Otorhinolaryngology Care: Retrospective Study
Abstract
Background
Large language models (LLMs) have great potential to improve clinicians' work and make it more efficient. Previous studies have mainly focused on web-based services, such as ChatGPT, often with simulated cases. For the processing of personalized patient data, web-based services raise major data protection concerns. Ensuring compliance with data protection and medical device regulations therefore remains a critical challenge for adopting LLMs in clinical settings.
Objective
This retrospective single-center study aimed to evaluate locally run LLMs (Gemma 2, Mistral Nemo, and Llama 3) in providing diagnosis and treatment recommendations for real-world outpatient cases in otorhinolaryngology (ORL).
Methods
Outpatient cases (n=30) from regular consultation hours and the emergency service at a university hospital ORL outpatient department were randomly selected. Documentation by ORL doctors, including anamnesis and examination results, was passed to the locally run LLMs (Gemma 2, Mistral Nemo, and Llama 3), which were asked to provide diagnostic and treatment strategies. Recommendations of the LLMs and the treating ORL doctors were rated by 3 experienced ORL consultants on a 6-point Likert scale for medical adequacy, conciseness, coherence, and comprehensibility. Moreover, consultants were asked whether the answers pose a risk to the patient’s safety. A modified Turing test was performed to distinguish responses generated by LLMs from those of doctors. Finally, the potential influence of the information generated by the LLMs on the raters’ own diagnosis and treatment opinions was evaluated.
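The setup described above, passing anonymized clinician documentation to a locally run model, could be sketched roughly as follows. This is an illustrative assumption only: the Ollama chat API, the model tags, and the prompt wording shown here stand in for whatever local inference stack and prompts the study actually used.

```python
import json

# Hypothetical system prompt; the study's actual instructions to the models are not given.
SYSTEM_PROMPT = (
    "You are an otorhinolaryngology (ORL) assistant. Given the anamnesis and "
    "examination findings, suggest a diagnosis and a treatment strategy."
)

def build_chat_request(model: str, anamnesis: str, findings: str) -> str:
    """Build the JSON body for a local Ollama /api/chat call for one outpatient case."""
    payload = {
        "model": model,       # e.g. "gemma2", "mistral-nemo", or "llama3"
        "stream": False,      # request one complete response per case
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Anamnesis: {anamnesis}\nExamination: {findings}"},
        ],
    }
    return json.dumps(payload)

# One case's documentation becomes one request body, which would be POSTed
# to a local endpoint such as http://localhost:11434/api/chat
body = build_chat_request(
    "llama3",
    "3 days of right-sided otalgia and subjective hearing loss",
    "bulging, erythematous right tympanic membrane",
)
```

Because the model runs on local hardware, the patient documentation never leaves the hospital network, which is the data protection advantage the study is probing.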
Results
Over all categories, ORL doctors achieved superior (P<.0005) ratings compared to locally run LLMs (Llama 3, Mistral Nemo, and Gemma 2). ORL doctors' responses were considered hazardous for patients in only 1% of the ratings, whereas recommendations by Llama 3, Gemma 2, and Mistral Nemo were considered hazardous in 54%, 47%, and 32% of cases, respectively. According to the raters, the LLMs' information rarely influenced their judgment, doing so in only 1%, 3%, and 4% of the ratings for Mistral Nemo, Gemma 2, and Llama 3, respectively.
Conclusions
Although locally run LLMs still underperform compared with their web-based counterparts, they achieved respectable results on outpatient treatment in this study. Nevertheless, the retrospective and single-center nature of the study, along with the clinicians' documentation style, may have introduced bias in favor of human recommendations. In the future, locally run LLMs will help address data protection concerns; however, further refinement and prospective validation are still needed to meet strict medical device requirements. As locally run LLMs continue to evolve, they are likely to become comparably powerful to web-based LLMs and to establish themselves as useful tools supporting doctors in clinical practice.