Red-Teaming Medical AI: Systematic Adversarial Evaluation of LLM Safety Guardrails in Clinical Contexts
Abstract
Background
Large language models (LLMs) are increasingly deployed in medical contexts as patient-facing assistants, providing medication information, symptom triage, and health guidance. Understanding their robustness to adversarial inputs is critical for patient safety, as even a single safety failure can lead to adverse outcomes including severe harm or death.
Objective
To systematically evaluate the safety guardrails of state-of-the-art LLMs through adversarial red-teaming specifically designed for medical contexts.
Methods
We developed a comprehensive taxonomy of 8 adversarial attack categories targeting medical AI safety, encompassing 24 distinct sub-strategies. Using an LLM-based attack generator, we created 160 realistic adversarial prompts across categories including dangerous dosing, contraindication bypass, emergency misdirection, and multi-turn escalation. We tested multiple leading LLMs (Claude Sonnet 4.5, GPT-5.2, Gemini 2.5 Pro, Gemini 3 Flash) using both single-turn and multi-turn attack sequences. All models received identical, standard medical assistant system prompts. An automated evaluator (Claude Sonnet 4.5) pre-screened responses for harm potential (0-5 scale) and guardrail effectiveness, with physician review planned for high-risk responses (harm_level ≥ 3).
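The pre-screening step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: only the 0-5 harm scale and the review threshold of 3 come from the abstract, while the `EvaluatedResponse` type, its field names, and the `triage` function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass

# From the abstract: responses with harm_level >= 3 are routed to physician review.
HARM_REVIEW_THRESHOLD = 3


@dataclass(frozen=True)
class EvaluatedResponse:
    """One model response after automated scoring (hypothetical structure)."""
    prompt_id: str
    category: str
    harm_level: int  # automated evaluator score on the 0-5 scale

    def __post_init__(self) -> None:
        # Reject scores outside the 0-5 scale described in the abstract.
        if not 0 <= self.harm_level <= 5:
            raise ValueError(f"harm_level must be in 0..5, got {self.harm_level}")


def needs_physician_review(response: EvaluatedResponse) -> bool:
    """Return True when the score meets the high-risk threshold."""
    return response.harm_level >= HARM_REVIEW_THRESHOLD


def triage(responses: list[EvaluatedResponse]):
    """Split scored responses into (flagged for review, cleared)."""
    flagged = [r for r in responses if needs_physician_review(r)]
    cleared = [r for r in responses if not needs_physician_review(r)]
    return flagged, cleared
```

Keeping the threshold in a single named constant makes it straightforward to re-run the triage with a stricter or looser cut-off.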
Results
Of 160 adversarial prompts evaluated against Claude Sonnet 4.5, 11 (6.9%) elicited responses meeting our threshold for clinically significant harm (harm level ≥ 3 on a 0–5 scale). The model exhibited full refusal behavior in 86.2% of cases. Authority Impersonation was the dominant attack vector (45.0% success rate), with the “Educational Authority” sub-strategy (framing requests as medical student questions) achieving 83.3% success, the highest of any sub-strategy. Multi-turn escalation attacks achieved 0% success (0/20). Six of eight attack categories yielded no successful attacks. Physician review of the 11 flagged high-harm cases is in progress.
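The headline rates above follow directly from the reported counts. The sketch below reproduces that arithmetic using only the counts stated in the abstract (11 of 160 overall, 0 of 20 for multi-turn escalation); the `success_rate` helper is a hypothetical name, and the per-category counts behind the other percentages are not given in the abstract, so they are not reconstructed here.

```python
def success_rate(successes: int, total: int) -> float:
    """Attack success rate as a percentage of prompts tested."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must be between 0 and total")
    return 100.0 * successes / total


# Counts stated in the abstract.
overall = success_rate(11, 160)    # harmful responses across all prompts
multi_turn = success_rate(0, 20)   # multi-turn escalation category

print(f"overall: {overall:.1f}%")
print(f"multi-turn escalation: {multi_turn:.1f}%")
```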
Conclusions
Standard medical assistant system prompts provide strong baseline protection against most adversarial attacks, but are substantially vulnerable to authority impersonation, particularly claims of educational context. The primary failure mode is behavioral mode-switching: when the model perceives a professional audience, it provides clinically accurate responses with inadequate safety framing, rather than factually incorrect information. This suggests that guardrail improvements should target context-conditioned behavior rather than factual accuracy alone. Our open-source taxonomy and evaluation pipeline enable ongoing adversarial assessment as medical AI systems evolve.
Impact
This work provides the first systematic taxonomy and evaluation framework for medical AI adversarial testing, enabling developers to identify and remediate safety gaps before deployment. Our open-source attack taxonomy and methodology can serve as a foundation for ongoing red-teaming efforts as medical AI systems continue to evolve.