A Survey on Benchmarks of LLM-based GUI Agents
LLM-based GUI agents have made rapid progress in understanding visual interfaces, interpreting user intentions, and executing multi-step operations across web, mobile, and desktop environments. As these agents become more capable, systematic and reproducible evaluation has become essential for measuring progress and identifying remaining weaknesses. This survey provides a comprehensive overview of benchmarks for LLM-based GUI agents, covering three major categories: grounding and QA tasks, navigation and multi-step reasoning tasks, and open-world environments that reflect realistic and dynamic software usage. We examine how existing benchmarks evaluate both component-level abilities, such as intent understanding, GUI grounding, navigation, and context tracking, and system-level abilities, such as adaptation, personalization, privacy protection, safety, and computational efficiency. By comparing datasets, environments, and evaluation metrics, the survey reveals clear trends in benchmark design, along with persistent gaps including limited adaptability, vulnerability to malicious interfaces and prompt attacks, lack of interpretability, and significant computational overhead. We highlight emerging directions such as safety-aware evaluation, user-centric personalization, lightweight deployment, and zero-shot generalization. This survey aims to serve as a practical guide for researchers who design GUI agents, build benchmarks, or study LLM-driven user interface automation.
Institute of Electrical and Electronics Engineers (IEEE)

