Evaluating Locally Run Large Language Models (Gemma 2, Mistral Nemo, and Llama 3) for Outpatient Otorhinolaryngology Care: Retrospective Study
Abstract
Background
Large language models (LLMs) have great potential to improve clinicians' work and make it more efficient. Previous studies have mainly focused on web-based services, such as ChatGPT, often with simulated cases. For the processing of personalized patient data, web-based services raise major data protection concerns. Ensuring compliance with data protection and medical device regulations therefore remains a critical challenge for adopting LLMs in clinical settings.
Objective
This retrospective single-center study aimed to evaluate locally run LLMs (Gemma 2, Mistral Nemo, and Llama 3) in providing diagnosis and treatment recommendations for real-world outpatient cases in otorhinolaryngology (ORL).
Methods
Outpatient cases (n=30) from regular consultation hours and the emergency service at a university hospital ORL outpatient department were randomly selected. Documentation by ORL doctors, including anamnesis and examination results, was passed to the locally run LLMs (Gemma 2, Mistral Nemo, and Llama 3), which were asked to provide diagnostic and treatment strategies. Recommendations of the LLMs and the treating ORL doctors were rated by 3 experienced ORL consultants on a 6-point Likert scale for medical adequacy, conciseness, coherence, and comprehensibility. Moreover, consultants were asked whether the answers pose a risk to the patient’s safety. A modified Turing test was performed to distinguish responses generated by LLMs from those of doctors. Finally, the potential influence of the information generated by the LLMs on the raters’ own diagnosis and treatment opinions was evaluated.
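The setup described above, passing anonymized clinician documentation to a locally run model, could be sketched roughly as follows. This is an illustrative assumption only: the Ollama chat API, the model tags, and the prompt wording shown here stand in for whatever local inference stack and prompts the study actually used.

```python
import json

# Hypothetical system prompt; the study's actual instructions to the models are not given.
SYSTEM_PROMPT = (
    "You are an otorhinolaryngology (ORL) assistant. Given the anamnesis and "
    "examination findings, suggest a diagnosis and a treatment strategy."
)

def build_chat_request(model: str, anamnesis: str, findings: str) -> str:
    """Build the JSON body for a local Ollama /api/chat call for one outpatient case."""
    payload = {
        "model": model,       # e.g. "gemma2", "mistral-nemo", or "llama3"
        "stream": False,      # request one complete response per case
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Anamnesis: {anamnesis}\nExamination: {findings}"},
        ],
    }
    return json.dumps(payload)

# One case's documentation becomes one request body, which would be POSTed
# to a local endpoint such as http://localhost:11434/api/chat
body = build_chat_request(
    "llama3",
    "3 days of right-sided otalgia and subjective hearing loss",
    "bulging, erythematous right tympanic membrane",
)
```

Because the model runs on local hardware, the patient documentation never leaves the hospital network, which is the data protection advantage the study is probing.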
Results
Over all categories, ORL doctors achieved superior (P<.0005) ratings compared to locally run LLMs (Llama 3, Mistral Nemo, and Gemma 2). ORL doctors' responses were considered hazardous for patients in only 1% of the ratings, whereas recommendations by Llama 3, Gemma 2, and Mistral Nemo were considered hazardous in 54%, 47%, and 32% of cases, respectively. According to the raters, the LLMs' information rarely influenced their judgment, doing so in only 1%, 3%, and 4% of the ratings for Mistral Nemo, Gemma 2, and Llama 3, respectively.
Conclusions
Although locally run LLMs still underperform compared with their web-based counterparts, they achieved respectable results on outpatient treatment in this study. Nevertheless, the retrospective and single-center nature of the study, along with the clinicians' documentation style, may have introduced bias in favor of human recommendations. In the future, locally run LLMs will help address data protection concerns; however, further refinement and prospective validation are still needed to meet strict medical device requirements. As locally run LLMs continue to evolve, they are likely to become comparably powerful to web-based LLMs and to establish themselves as useful tools supporting doctors in clinical practice.