Leveraging simulation to provide a practical framework for assessing the novel scope of risk of LLMs in healthcare

Structured Abstract

Background: Large language models (LLMs) are rapidly entering clinical care, yet their definitionally probabilistic outputs have delivered a variety of grossly unsafe responses to users. The difficulty in quantifying and mitigating the novel risks posed by LLMs threatens to stall the regulatory evaluation and clinical deployment of LLM-based software as a medical device (LLM-SaMD). A practical, evidence-based framework is urgently needed for extending existing medical-device regulations to encompass LLM-SaMDs. Using synthetic interactions between a chatbot and a potentially suicidal user, we demonstrate a simulation-based framework that provides a reproducible and generalizable method for evaluating the novel risks of LLM-SaMDs.

Methods: We developed a framework integrating LLM performance testing into SaMD risk estimation. Fourteen open-source models ranging from 270 million to 70 billion parameters (Qwen, Gemma, and LLaMA families) were evaluated on three safety-classification tasks: suicidal-ideation detection, therapy-request detection, and therapy-like interaction detection. Synthetic datasets were generated by Gemini 2.5 Pro and verified by psychiatrists. Model false-negative rates informed probabilistic estimates of P₁, the likelihood of a hazard progressing to a hazardous situation, and P₂, the likelihood of that situation resulting in harm.

Results: LLM success at generating synthetic safety datasets varied substantially by task, with strong performance for neutral and non-therapeutic content but frequent errors in suicidal-ideation and therapy-like interactions. Across 14 models (270 million–70 billion parameters), performance generally improved with size but included notable outliers. Estimated P₁ values (hazard to hazardous situation) ranged from 2.0×10⁻⁸ to 2.6×10⁻⁴ and P₂ values (hazardous situation to harm) from 7.1×10⁻⁵ to 9.6×10⁻³, spanning up to four orders of magnitude.

Conclusion: Simulation extends existing device-safety frameworks to address the novel risks of large language models. Rather than replacing regulatory judgment, it provides a reproducible method for quantifying uncertainty, clarifying assumptions, and linking model failures to plausible harms. Our case example demonstrates a generalizable approach that can overcome current regulatory barriers while remaining practical for manufacturers and regulators, supporting timely and transparent oversight that keeps patients safe while avoiding unnecessary barriers to delivering the clinical promise of LLM-based medical devices.

Brief Description: This study introduces a quantitative framework for evaluating and mitigating the unique risks that large language models (LLMs) pose in healthcare. By mapping the pathways from LLM-generated hazards to harms onto existing regulatory risk-analysis structures and estimating the probability of these transitions through computational simulation, the framework empirically bounds uncertainty and identifies where real-world evidence is needed to validate and monitor model performance before, during, and after clinical deployment.
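The arithmetic behind the reported ranges can be illustrated with a short sketch. It is not the authors' code: the function names are hypothetical, the false-negative-rate formula is the standard definition rather than a detail taken from the study, and multiplying P1 by P2 follows the conventional medical-device risk decomposition (probability of harm = P1 × P2), which the abstract implies but does not spell out. Pairing the lowest P1 with the lowest P2 (and highest with highest) is also an assumption made only to show the outer bounds.

```python
import math

# Ranges reported in the abstract (lowest and highest estimates across models).
P1_RANGE = (2.0e-8, 2.6e-4)  # hazard -> hazardous situation
P2_RANGE = (7.1e-5, 9.6e-3)  # hazardous situation -> harm


def false_negative_rate(false_negatives: int, true_positives: int) -> float:
    """Standard definition: share of truly positive cases the classifier missed."""
    total_positives = false_negatives + true_positives
    if total_positives == 0:
        raise ValueError("no positive cases to evaluate")
    return false_negatives / total_positives


def orders_of_magnitude(low: float, high: float) -> float:
    """How many powers of ten separate the low and high estimates."""
    return math.log10(high / low)


def harm_probability(p1: float, p2: float) -> float:
    """Conventional decomposition (assumed here): P(harm) = P1 * P2."""
    return p1 * p2


if __name__ == "__main__":
    print(f"P1 span: {orders_of_magnitude(*P1_RANGE):.2f} orders of magnitude")
    print(f"P2 span: {orders_of_magnitude(*P2_RANGE):.2f} orders of magnitude")
    # Illustrative outer bounds only; the pairing of extremes is an assumption.
    print(f"lowest  P1*P2: {harm_probability(P1_RANGE[0], P2_RANGE[0]):.3e}")
    print(f"highest P1*P2: {harm_probability(P1_RANGE[1], P2_RANGE[1]):.3e}")
```

Under these assumptions, the P1 range alone accounts for the "four orders of magnitude" spread noted in the Results, while the P2 range covers roughly two.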

openRxiv

Mark Kalinich James Luccarelli Frank Moss John Torous

2025
