A Survey on Benchmarks of LLM-based GUI Agents

LLM-based GUI agents have made rapid progress in understanding visual interfaces, interpreting user intentions, and executing multi-step operations across web, mobile, and desktop environments. As these agents become more capable, systematic and reproducible evaluation has become essential for measuring progress and identifying remaining weaknesses. This survey provides a comprehensive overview of benchmarks for LLM-based GUI agents, covering three major categories: grounding and QA tasks, navigation and multi-step reasoning tasks, and open-world environments that reflect realistic and dynamic software usage. We examine how existing benchmarks evaluate both component-level abilities, such as intent understanding, GUI grounding, navigation, and context tracking, and system-level abilities, such as adaptation, personalization, privacy protection, safety, and computational efficiency. By comparing datasets, environments, and evaluation metrics, the survey reveals clear trends in benchmark design, along with persistent gaps including limited adaptability, vulnerability to malicious interfaces and prompt attacks, lack of interpretability, and significant computational overhead. We highlight emerging directions such as safety-aware evaluation, user-centric personalization, lightweight deployment, and zero-shot generalization. This survey aims to serve as a practical guide for researchers who design GUI agents, build benchmarks, or study LLM-driven user interface automation.

Institute of Electrical and Electronics Engineers (IEEE)

Yihong Chen, Shuai Wang, Yaqing Wang, Quanming Yao

2025
