PlanAgent: Embodied Visual-Language Model for Grounded Task planning with Environment Map
Abstract
Embodied intelligence refers to an agent that interacts with its environment by perceiving, planning, making decisions, and executing actions as humans do, with applications in smart homes, drone inspection, and other domains. Embodied task planning, one of its core tasks, generates detailed step-by-step plans from perception of the surrounding environment and an understanding of the language instruction. Visual-language models, with their powerful multimodal representation capabilities, have been generalized to a variety of tasks, but applying them to embodied task planning still faces two challenges. First, the complexity of the environment makes it difficult to model global environment information. Second, task paths with frequent turns demand strong spatial reasoning ability. To overcome these challenges, we propose PlanAgent, the first embodied visual-language model for embodied task planning. Specifically, an environment map is employed to model the global environment information, and an environment map encoder extracts task-related information from the environment. Further, to reduce path planning's dependence on strong spatial reasoning, we introduce a self-posture-aware training strategy that breaks long-term spatial reasoning down into short-term reasoning. We build the EmbodiedPlan-20k dataset for grounded planning in embodied tasks. Experiments on this dataset demonstrate that PlanAgent outperforms previous methods and that each of its components is effective.
Research Square Platform LLC

