Applied with Caution: Extreme-Scenario Testing Reveals Significant Risks in Using LLMs for Humanities and Social Sciences Paper Evaluation
The deployment of large language models (LLMs) in academic paper evaluation is increasingly widespread, yet their trustworthiness remains debated. To expose fundamental flaws that conventional testing often masks, this study used extreme-scenario testing to systematically probe the lower performance boundaries of LLMs in assessing the scientific validity and logical coherence of papers from the humanities and social sciences (HSS). In a rigorous quasi-experiment, 40 high-quality Chinese-language papers from philosophy, sociology, education, and psychology were selected, and domain experts produced counterpart versions with implanted “scientific flaws” and “logical flaws”. Three representative LLMs (GPT-4, DeepSeek, and Doubao) were evaluated against a baseline of 24 doctoral candidates, following a protocol that progressed from “broad” to “targeted” prompts. Key findings reveal poor evaluation consistency, with significantly low intra-rater and inter-rater reliability for the LLMs, and limited flaw-detection capability: under broad prompts, all models failed to distinguish original from flawed papers, unlike the human evaluators, and although targeted prompts improved detection, LLM performance remained substantially inferior, particularly on tasks requiring deep empirical insight and logical reasoning. The study proposes that LLMs operate on a fundamentally different “task decomposition-semantic understanding” mechanism, relying on limited text extraction and shallow semantic comparison rather than the human process of “worldscape reconstruction → meaning construction and critique”, which leaves them critically unable to assess argumentative plausibility and logical coherence. It concludes that current LLMs have fundamental limitations in evaluations requiring depth and critical thinking, are not reliable independent evaluators, and that over-trusting them carries substantial risks, necessitating rational human-AI collaborative frameworks, better model adaptation through downstream alignment techniques such as prompt engineering and fine-tuning, and improvements in general capabilities such as logical reasoning.
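
The abstract reports low intra-rater and inter-rater reliability but does not name the statistic used. As an illustration only, the minimal Python sketch below shows one plausible way such reliability could be quantified, using quadratic-weighted Cohen's kappa over ordinal ratings; the rating scale, the invented scores, and the use of scikit-learn are assumptions for demonstration, not details taken from the study.

# Illustrative sketch (not the authors' code): quantifying intra- and
# inter-rater reliability of LLM paper ratings with Cohen's kappa.
# Assumes each rater scores every paper on an ordinal 1-5 scale and that
# the same LLM is run twice on the same papers to check self-consistency.
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings: one score per paper, two runs of the same LLM
llm_run_1 = [4, 5, 3, 4, 2, 5, 4, 3]
llm_run_2 = [3, 5, 2, 5, 4, 3, 4, 2]

# Intra-rater reliability: agreement of one model with itself across runs
intra = cohen_kappa_score(llm_run_1, llm_run_2, weights="quadratic")
print(f"intra-rater kappa: {intra:.2f}")

# Hypothetical ratings from a second rater (another model or a human)
other_rater = [5, 4, 4, 3, 3, 5, 2, 4]

# Inter-rater reliability: agreement between two different raters
inter = cohen_kappa_score(llm_run_1, other_rater, weights="quadratic")
print(f"inter-rater kappa: {inter:.2f}")

Values near 0 would indicate agreement no better than chance, which is the pattern the study describes for the LLM evaluators.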