MME: Video Representation Learning as World Model for Understanding and Planning

Video representation learning, which seeks to learn general and discriminative video representations for video understanding and robotic planning, attracts extensive research in computer vision. The task is crucial but challenging because of the lack of human annotations and the large volume of data. Existing state-of-the-art video representation learning methods train a representation model by first masking out many regions of the input video and then asking the model to predict the appearance contents (e.g., video RGB pixels or hand-crafted image features) of these regions. However, simply masking and recovering appearance contents may not be sufficient to model temporal clues, as the appearance contents can be easily reconstructed from a single frame. To overcome this limitation, we present Masked Motion Encoding (MME), a new pre-training paradigm that reconstructs both appearance and motion information to explore temporal clues in videos. In MME, we focus on addressing two critical challenges to improve the representation performance: 1) how to represent possible long-term motion across multiple frames; and 2) how to obtain fine-grained temporal clues from sparsely sampled videos. Motivated by the fact that humans can recognize an action by tracking objects' position changes and shape changes, we propose to reconstruct a motion trajectory that represents these two kinds of change in the masked regions. Specifically, the motion trajectory is composed of two components: the relative position transitions across several consecutive frames, which are tracked using dense optical flow to form a trajectory, and Histogram of Oriented Gradients (HOG) features aligned with this trajectory. In addition, we require the model to reconstruct dense motion trajectories in both the spatial and temporal dimensions. In the temporal dimension, the model is asked to reconstruct motion trajectories at a higher frame rate, given a temporally sparse video as input. In the spatial dimension, the start points of the motion trajectories are aligned with a dense grid with a stride of 8 × 8. Pre-trained with our MME paradigm, the model is able to anticipate long-term and fine-grained motion details. Experimental results show that MME consistently improves the performance of existing video representation learning methods on action recognition (e.g., Kinetics-400, Something-Something V2, UCF101, and HMDB51) and action detection (e.g., AVA) benchmarks. More impressively, we find that MME captures discriminative object motion clues that enable a more robust world model, which achieves significant improvement over the baselines on a robot arm manipulation task. The source code and pre-trained models are available at https://github.com/XinyuSun/MME
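
The position component of the motion trajectory described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `grid_start_points` and `track_trajectories`, the array layout of the flow fields, and the nearest-neighbour flow lookup are all assumptions made for illustration, and the HOG component is omitted. The sketch places start points on a grid with a stride of 8 and follows each point through a sequence of precomputed dense optical flow fields, recording the displacement at every step.

```python
import numpy as np


def grid_start_points(height, width, stride=8):
    """Hypothetical helper: (x, y) start points, one per stride x stride cell."""
    ys = np.arange(stride // 2, height, stride)
    xs = np.arange(stride // 2, width, stride)
    gx, gy = np.meshgrid(xs, ys)
    return np.stack([gx.ravel(), gy.ravel()], axis=1).astype(np.float64)


def track_trajectories(flows, points):
    """Hypothetical helper: follow points through consecutive flow fields.

    flows:  (L, H, W, 2) array; flows[t, y, x] = (dx, dy) from frame t to t + 1
            (this layout is an assumption, not taken from the paper)
    points: (N, 2) array of (x, y) start positions
    returns (N, L, 2) array of per-step relative displacements
    """
    num_steps, height, width, _ = flows.shape
    pos = points.copy()
    steps = np.zeros((len(points), num_steps, 2))
    for t in range(num_steps):
        # Nearest-neighbour lookup of the flow at each current position,
        # clamped so points that drift outside the frame stay in bounds.
        xi = np.clip(np.rint(pos[:, 0]).astype(int), 0, width - 1)
        yi = np.clip(np.rint(pos[:, 1]).astype(int), 0, height - 1)
        disp = flows[t, yi, xi]
        steps[:, t] = disp
        pos = pos + disp  # move each point along its trajectory
    return steps


# Toy input: every pixel moves one pixel to the right in each of 3 steps.
flows = np.zeros((3, 16, 16, 2))
flows[..., 0] = 1.0
starts = grid_start_points(16, 16)
steps = track_trajectories(flows, starts)
print(steps.shape)  # (4, 3, 2)
```

In this toy example a 16 × 16 frame yields four start points, and each trajectory consists of three displacement vectors. In a pre-training setup such displacements would serve as a regression target for the masked regions; how they are normalised and combined with HOG features is not specified in the abstract and is left out here.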

Institute of Electrical and Electronics Engineers (IEEE)

Xinyu Sun, Changhao Li, Chen Jian, Chuang Gan, Peihao Chen, Mingkui Tan

2025

