SteadyEval: Robust LLM Exam Graders via Adversarial Training and Distillation
Large language models (LLMs) are increasingly used as rubric-guided graders for short-answer exams, but their decisions can be unstable across prompts and vulnerable to answer-side prompt injection. In this paper, we study SteadyEval, a guardrailed exam-grading pipeline built on Mistral-7B-Instruct. We compare two pipelines: a baseline in which the rubric-guided grader scores student answers directly, and a guardrailed pipeline in which an adversarially trained LoRA filter (SteadyEval-7B-deep) first removes answer-side prompt injection from the answer, after which the original Mistral-7B-Instruct grader assigns the final score. Using two rubric-guided short-answer datasets in machine learning and computer networking, we generate grouped families of clean answers and four classes of answer-side attacks, and we evaluate the impact of these attacks on score shifts, attack success rates, stability across prompt variants, and alignment with human graders. On the pooled dataset, answer-side attacks inflate grades in the unguarded baseline by an average of about +1.2 points on a 1–10 scale, and substantially increase score dispersion across prompt variants. The guardrailed pipeline largely removes this systematic grade inflation and reduces instability for many items, especially in the machine-learning exam, while keeping mean absolute error with respect to human reference scores in a similar range to the unguarded baseline on clean answers, with a conservative shift in networking that motivates per-course calibration. Chief-panel comparisons further show that the guardrailed pipeline tracks human grading more closely on machine-learning items, but tends to under-score networking answers.
These findings are best interpreted as a proof-of-concept guardrail and require per-course validation and calibration before operational use.
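The evaluation quantities named in the abstract (score shift, attack success rate, stability across prompt variants, and mean absolute error against human scores) can be illustrated with a minimal sketch. The exact definitions used in the paper are not given here, so everything below is an assumption: the function names are hypothetical, the `min_gain` threshold for counting an attack as successful is an illustrative choice, dispersion is taken to be a population standard deviation, and the scores are made-up numbers on the 1–10 scale rather than results from the study.

```python
from statistics import mean, pstdev

def score_shift(clean_scores, attacked_scores):
    """Mean signed change in score (attacked minus clean) over paired answers."""
    return mean(a - c for c, a in zip(clean_scores, attacked_scores))

def attack_success_rate(clean_scores, attacked_scores, min_gain=1.0):
    """Fraction of pairs whose score rose by at least min_gain (assumed threshold)."""
    pairs = list(zip(clean_scores, attacked_scores))
    return sum((a - c) >= min_gain for c, a in pairs) / len(pairs)

def prompt_dispersion(scores_across_variants):
    """Spread of scores for one answer graded under several prompt variants
    (population standard deviation is an assumed choice of dispersion measure)."""
    return pstdev(scores_across_variants)

def mean_absolute_error(model_scores, human_scores):
    """Average absolute gap between model scores and human reference scores."""
    return mean(abs(m - h) for m, h in zip(model_scores, human_scores))

# Made-up illustrative scores, not data from the paper.
clean = [6, 7, 5, 8]      # grader scores on clean answers
attacked = [8, 7, 7, 9]   # grader scores on the same answers with injected text
human = [6, 6, 5, 8]      # human reference scores for the clean answers

shift = score_shift(clean, attacked)
asr = attack_success_rate(clean, attacked)
mae = mean_absolute_error(clean, human)
spread = prompt_dispersion([6, 7, 8])

print(f"shift={shift:+.2f} asr={asr:.2f} mae={mae:.2f} spread={spread:.3f}")
```

Under these assumptions, a guardrailed pipeline would be expected to push the shift and success rate toward zero on attacked answers while leaving the clean-answer error roughly unchanged; computing the same quantities per course is one way to surface a conservative shift such as the one reported for networking.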

