Enhanced visual multi-modal fusion framework for dense video captioning

Abstract Dense video captioning is a machine translation task that aims to localize events in a full video and describe each of them separately. Human observers parsing a video are drawn to frames that carry a large amount of information and recur in the video, yet videos often contain frames unrelated to the subject. Existing works have largely ignored these details. To fully incorporate human visual perception into the process of understanding video, we propose an enhanced visual multi-modal fusion framework (Evmff), which uses captions of video keyframes to improve dense video captioning performance. We first extract video keyframes using timestamps and then apply the recently proposed image captioning method DLCT to obtain a temporally aligned caption for each keyframe. Evmff fuses the textual information of speech, the captions of the keyframes, video features, and audio features, and applies a transformer architecture to convert the data into text descriptions. The performance of our model is verified on the ActivityNet Captions dataset with four metrics: BLEU@N, METEOR, ROUGE_L, and CIDEr-D. Ablation experiments show that using the video keyframe descriptions as an input to the multi-modal model compensates for the deficiency in visual information understanding. Our code will be released.

Research Square Platform LLC

Ruizhe Zhong, Qingchuan Zhang, Min Zuo

2023
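
The abstract describes extracting keyframes by timestamp and obtaining a temporally aligned caption for each keyframe. The following is a minimal sketch of that idea only, not the authors' implementation. It assumes a fixed sampling interval (the abstract does not state how timestamps are chosen), and the names `Keyframe`, `sample_keyframes`, and `align_captions` are hypothetical. The captioning model itself (DLCT) is not called; captions are passed in as plain strings.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Keyframe:
    timestamp: float  # position in the video, in seconds
    frame_index: int  # index of the frame at that timestamp


def sample_keyframes(duration: float, fps: float, interval: float) -> List[Keyframe]:
    """Pick one keyframe every `interval` seconds.

    Assumption: uniform sampling stands in for whatever timestamp
    policy the paper actually uses.
    """
    if duration <= 0 or fps <= 0 or interval <= 0:
        raise ValueError("duration, fps, and interval must be positive")
    keyframes = []
    k = 0
    while k * interval < duration:
        t = k * interval
        keyframes.append(Keyframe(timestamp=t, frame_index=round(t * fps)))
        k += 1
    return keyframes


def align_captions(
    keyframes: List[Keyframe],
    captions: List[str],
    segment: Tuple[float, float],
) -> List[Tuple[float, str]]:
    """Keep the (timestamp, caption) pairs that fall inside an event segment.

    `captions[i]` is assumed to describe `keyframes[i]`; in the paper's
    pipeline these strings would come from an image captioning model.
    """
    if len(keyframes) != len(captions):
        raise ValueError("need exactly one caption per keyframe")
    start, end = segment
    return [
        (kf.timestamp, caption)
        for kf, caption in zip(keyframes, captions)
        if start <= kf.timestamp <= end
    ]
```

For example, a 6-second clip at 25 fps sampled every 2 seconds gives keyframes at 0, 2, and 4 seconds. Calling `align_captions` with an event segment of `(1.0, 5.0)` then keeps only the captions of the last two keyframes, which is the text that would be fused with the speech, video, and audio inputs for that event.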

