PlanAgent: Embodied Visual-Language Model for Grounded Task planning with Environment Map

Abstract: Embodied intelligence refers to an agent that interacts with its environment and perceives, plans, makes decisions, and acts as humans do; it is applicable to smart homes, drone inspection, and other domains. Embodied task planning, one of the main tasks of embodied intelligence, generates detailed step-by-step plans while perceiving the surrounding environment and understanding language instructions. Visual-language models, with their powerful multimodal representation capabilities, have been generalized to various tasks. When applied to embodied task planning, however, they still face two challenges. First, the complexity of the environment makes it difficult to model global environment information. Second, frequent turns in task paths demand strong spatial reasoning ability. To overcome these challenges, we propose PlanAgent, the first embodied visual-language model for embodied task planning. Specifically, an environment map is employed to model the global environment information, and an environment map encoder extracts task-related information from it. Further, to reduce path planning's dependence on strong spatial reasoning, we introduce a self-posture-aware training strategy that breaks long-term spatial reasoning down into short-term reasoning. We build the EmbodiedPlan-20k dataset for grounded planning in embodied tasks. Experiments on this dataset demonstrate that PlanAgent outperforms previous methods and that all of its components are effective.

Research Square Platform LLC

Yuanchang Yue Fanglong Yao Youzhi Liu Nayu Liu Li Jin Zequn Zhang Daobing Zhang Xian Sun Kun Fu

2024

