
When machines judge humanness : findings from an interactive reverse Turing test by large language models

While modern large language models (LLMs) increasingly pass short-form Turing tests and are sometimes rated as more human than humans, whether LLMs themselves can act as evaluators in this setting remains poorly investigated. We designed an interactive reverse Turing test in which seven LLMs (ChatGPT 4.5, Claude 3.7 Sonnet, Gemini Advanced 2.5 Pro, Mistral Large 2.1, Grok 3, DeepSeek V3, and Llama 4 Maverick) served as evaluators. Each LLM autonomously posed up to ten questions to hidden participants, who were either humans or other LLMs instructed with minimal or structured prompts. Thematic analysis was applied to both the questions and the reasons underlying final verdicts. Across 238 reverse Turing tests comprising 1,714 questions, AI evaluators identified AI participants as AI in only three tests. AI participants were judged more human than humans (mean probability of being human: 0.88 in AI participants vs 0.78 in humans; p<0.001). Thematic analysis of questioning strategies revealed an emphasis on emotions/feelings (14%), memory (13%) and behaviours (11%), with distinct model-specific patterns: for example, Claude 3.7 Sonnet emphasised mind and reasoning, Gemini 2.5 Pro focused on abstraction and creativity, and ChatGPT 4.5 focused on socio-emotional dimensions. Reasons cited for the final verdict most often referred to authenticity of personality (26%) and veracity of emotions (25%). Among tests with human participants, participants rated the questions as moderately difficult to answer (mean 4.65 out of 10), rated question relevance higher (mean 6.84), and took a mean of 18.7 minutes (12.4) to complete a test. In conclusion, current AI-based conversational screening appears insufficient for ensuring authenticity in dialogue. Future studies may explore longer, multimodal interactions, richer evaluator prompts co-designed with cognitive experts, and hybrid committees of human and AI evaluators.
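The protocol described in the abstract (an evaluator poses up to ten questions to a hidden participant, then gives a verdict expressed as a probability of being human) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `Transcript` type, and the three callables `ask`, `answer` and `judge` are hypothetical stand-ins for an evaluator LLM and a hidden participant. Only the ten-question cap and the counts passed to the helper come from the abstract.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

MAX_QUESTIONS = 10  # the abstract states "up to ten questions" per test


@dataclass
class Transcript:
    """Question/answer pairs exchanged in one test (hypothetical structure)."""
    turns: List[Tuple[str, str]] = field(default_factory=list)


def run_reverse_turing_test(
    ask: Callable[[Transcript], Optional[str]],  # evaluator: next question, or None to stop
    answer: Callable[[str], str],                # hidden participant (human or LLM)
    judge: Callable[[Transcript], float],        # evaluator: probability of being human
    max_questions: int = MAX_QUESTIONS,
) -> Tuple[Transcript, float]:
    """Run one test: question loop first, then a single final verdict."""
    transcript = Transcript()
    for _ in range(max_questions):
        question = ask(transcript)
        if question is None:  # the evaluator may stop before the cap
            break
        transcript.turns.append((question, answer(question)))
    p_human = judge(transcript)
    if not 0.0 <= p_human <= 1.0:
        raise ValueError("verdict must be a probability in [0, 1]")
    return transcript, p_human


def mean_questions_per_test(total_questions: int, total_tests: int) -> float:
    """Average number of questions asked per test."""
    return total_questions / total_tests


# Illustrative stubs only; a real study would call an LLM or relay to a person.
def demo_ask(t: Transcript) -> Optional[str]:
    return f"Question {len(t.turns) + 1}?" if len(t.turns) < 3 else None


transcript, p_human = run_reverse_turing_test(
    ask=demo_ask,
    answer=lambda q: "An answer.",
    judge=lambda t: 0.5,  # placeholder verdict, not a study result
)
```

Applied to the counts reported in the abstract, `mean_questions_per_test(1714, 238)` gives roughly 7.2 questions per test, which suggests that evaluators often stopped before reaching the ten-question cap.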

Related Results

The Relationship between Dietary Behaviour and the Incidence of Childhood Obesity (Hubungan Perilaku Pola Makan dengan Kejadian Anak Obesitas)
Turing machines
Turing machines are abstract computing devices, named after Alan Mathison Turing. A Turing machine operates on a potentially infinite tape uniformly divided into squares, and is ca...
Delilah—encrypting speech
Once Enigma was solved and the pioneering work on Tunny was done, Turing’s battering-ram mind was needed elsewhere. Routine codebreaking irked him and he was at his best when break...
Generalized Computational Systems
The definition of a computational system that I proposed in chapter 1 (definition 3) employs the concept of Turing computability. In this chapter, however, I will show that this co...
The Universal Turing Machine: A Half-Century Survey
This volume commemorates the work of Alan Turing, because it was Turing who not only introduced the most persuasive and influential concept of a machine mod...
Aviation English - A global perspective: analysis, teaching, assessment
This e-book brings together 13 chapters written by aviation English researchers and practitioners settled in six different countries, representing institutions and universities fro...
Turing Incomputable Computation
A new computing model, called the active element machine (AEM), is presented that demonstrates Turing incomputable computation using quantum random input. The AEM deterministically...
A Wideband mm-Wave Printed Dipole Antenna for 5G Applications
In this paper, a wideband millimeter-wave (mm-Wave) printed dipole antenna is proposed to be used for fifth generation (5G) communications. The single elem...
