Denoise-then-Retrieve: Text-Conditioned Video Denoising for Video Moment Retrieval

Current text-driven Video Moment Retrieval (VMR) methods encode all video clips, including irrelevant ones, which disrupts multimodal alignment and hinders optimization. To address this, we propose a denoise-then-retrieve paradigm that explicitly filters text-irrelevant clips from videos and then retrieves the target moment using purified multimodal representations. Following this paradigm, we introduce the Denoise-then-Retrieve Network (DRNet), comprising Text-Conditioned Denoising (TCD) and Text-Reconstruction Feedback (TRF) modules. TCD integrates cross-attention and structured state space blocks to dynamically identify noisy clips and produce a noise mask that purifies multimodal video representations. TRF further distills a single query embedding from the purified video representations and aligns it with the text embedding, serving as auxiliary supervision for denoising during training. Finally, we perform conditional retrieval using text embeddings on purified video representations for accurate VMR. Experiments on Charades-STA and QVHighlights demonstrate that our approach surpasses state-of-the-art methods on all metrics. Furthermore, our denoise-then-retrieve paradigm is adaptable and can be seamlessly integrated into advanced VMR models to boost performance.
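The denoise-then-retrieve idea in the abstract can be illustrated with a small NumPy sketch. It is not DRNet itself. The function names, the cosine-relevance scoring, the fixed threshold, and the mean pooling are all assumptions made for illustration. The structured state space blocks and every learned parameter are omitted. The sketch shows only the general flow: score each clip against the text with one cross-attention step, build a binary noise mask, zero out masked clips, pool the surviving clips into a single query embedding, and compare that embedding with the text embedding.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def text_conditioned_mask(clips, words, threshold=0.5):
    """Hypothetical stand-in for the denoising step.

    clips: (T, d) clip features, words: (L, d) word features.
    Returns a binary keep-mask of shape (T,) and the relevance scores.
    The threshold is an illustrative assumption, not a value from the paper.
    """
    d = clips.shape[-1]
    # Each clip attends over the query words (single-head cross-attention).
    attn = softmax(clips @ words.T / np.sqrt(d), axis=-1)   # (T, L)
    context = attn @ words                                   # (T, d)
    # Cosine similarity between a clip and its attended text context.
    num = (clips * context).sum(axis=-1)
    den = np.linalg.norm(clips, axis=-1) * np.linalg.norm(context, axis=-1) + 1e-8
    relevance = num / den
    keep = (relevance >= threshold).astype(clips.dtype)
    return keep, relevance


def purify(clips, keep):
    """Zero out the clips flagged as text-irrelevant."""
    return clips * keep[:, None]


def reconstruct_query(purified, keep):
    """Pool the kept clips into one query embedding (masked mean, assumed)."""
    total = keep.sum()
    if total == 0:
        return np.zeros(purified.shape[-1], dtype=purified.dtype)
    return purified.sum(axis=0) / total


def alignment_loss(query, text):
    """1 - cosine similarity between the pooled query and the text embedding."""
    cos = query @ text / (np.linalg.norm(query) * np.linalg.norm(text) + 1e-8)
    return 1.0 - cos


# Toy usage: clips 0 and 2 point in the text direction, clip 1 does not.
clips = np.array([[1.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0],
                  [1.0, 1.0, 0.0, 0.0]])
words = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0]])
keep, relevance = text_conditioned_mask(clips, words)
purified = purify(clips, keep)
query = reconstruct_query(purified, keep)
loss = alignment_loss(query, words.mean(axis=0))
```

In a trained model the mask would come from learned layers and the alignment term would act as an auxiliary training signal, as the abstract describes. Here it is simply computed once to show the data flow.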

International Joint Conferences on Artificial Intelligence Organization

Weijia Liu Jiuxin Cao Bo Miao Zhiheng Fu Xuelin Zhu Jiawei Ge Bo Liu Mehwish Nasim Ajmal Mian

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence

2025

Abstract Cryo-electron microscopy (cryoEM) is becoming the preferred method for resolving protein structures. Low signal-to-noise (SNR) in cryoEM images reduces the confidence and t...

Enhancing bone scan image quality: an improved self-supervised denoising approach

Abstract Objective. Bone scans play an important role in skeletal lesion assessment, but gamma cameras exhibit challenges with low sensitivity and...

E-Press and Oppress

From elephants to ABBA fans, silicon to hormone, the following discussion uses a new research method to look at printed text, motion pictures and a te...

Unconventional Method of Subsea Umbilical Retrieval Using Anchor Handling Vessel

Abstract A deepwater field in West Africa was decommissioned and subsea facilities retrieval operation was carried out as part of the Abandonment and Decommissioning...

Baikal: Unpaired Denoising of Fluorescence Microscopy Images using Diffusion Models

Abstract Fluorescence microscopy is an indispensable tool for biological discovery but image quality is constrained by desired spatial and tempor...

On Flores Island, do "ape-men" still exist? https://www.sapiens.org/biology/flores-island-ape-men/

Combining denoising of RNA-seq data and flux balance analysis for cluster analysis of single cells

Abstract Background Sophisticated methods to properly pre-process and analyze the increasing collection of single-cell RNA sequencing (scRNA-seq) da...

Wavelet Denoising of Well Logs and its Geological Performance

Well logs play a very important role in exploration and even exploitation of energy resources, but they usually contain kinds of noises which affect the results of the geological i...

Related Results