GA: A Comprehensive Survey on LLM-based GUI Agent

The Graphical User Interface (GUI) is a visual method that allows users to interact with computers and mobile devices. Today, users rely on GUIs to complete everyday tasks such as browsing the web or using mobile applications, and they often have simple needs such as setting an alarm for 8:00 AM or checking tomorrow's weather. Some commercial agents have been integrated into users' personal phones to help them accomplish a series of basic tasks. Unfortunately, these commercial agents often rely on fixed templates or program scripts to ensure reliability, which limits their functionality to a few basic system applications. Recently, large language models (LLMs) have made significant breakthroughs in natural language processing (NLP). Notably, LLMs have demonstrated not only a strong ability to understand and generate text but also planning and reasoning capabilities. Some researchers have therefore considered using LLMs as the agent's brain, equipping agents with these capabilities. LLM-based agents are also being applied to help users automate tasks on their personal phones and computers. These agents can often understand the GUI environment on such devices, which allows them to make decisions to complete tasks. This is the origin of the term "GUI Agent". Our review surveys recent research on LLM-based GUI Agents. We summarize the capabilities of existing GUI Agents and discuss the GUI Agent task automation pipeline. A comprehensive list of the studies covered in this paper will be available in a GitHub repository.

Institute of Electrical and Electronics Engineers (IEEE)

Longzhao Huang Jun Liu Changwei Wang Rongtao Xu Wenhao Xu Zhiwei Xu Qi Zhang Yu Zhang Kexue Fu Longxiang Gao Yanran Xu Lei Zhang Li Guo Shibiao Xu

2025
