Denoise-then-Retrieve: Text-Conditioned Video Denoising for Video Moment Retrieval

Current text-driven Video Moment Retrieval (VMR) methods encode all video clips, including irrelevant ones, disrupting multimodal alignment and hindering optimization. To this end, we propose a denoise-then-retrieve paradigm that explicitly filters text-irrelevant clips from videos and then retrieves the target moment using purified multimodal representations. Following this paradigm, we introduce the Denoise-then-Retrieve Network (DRNet), comprising Text-Conditioned Denoising (TCD) and Text-Reconstruction Feedback (TRF) modules. TCD integrates cross-attention and structured state space blocks to dynamically identify noisy clips and produce a noise mask to purify multimodal video representations. TRF further distills a single query embedding from purified video representations and aligns it with the text embedding, serving as auxiliary supervision for denoising during training. Finally, we perform conditional retrieval using text embeddings on purified video representations for accurate VMR. Experiments on Charades-STA and QVHighlights demonstrate that our approach surpasses state-of-the-art methods on all metrics. Furthermore, our denoise-then-retrieve paradigm is adaptable and can be seamlessly integrated into advanced VMR models to boost performance.
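The abstract gives no equations, so the following is only a minimal sketch of the denoise-then-retrieve idea, not DRNet itself. It assumes a sigmoid over each clip's best scaled dot-product match with the text tokens as the relevance score, a fixed threshold of 0.5 for the noise mask, mean pooling for the reconstructed query, and a cosine-distance feedback loss. All function names are hypothetical, and the structured state space blocks are omitted entirely.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom > 0 else 0.0

def relevance_scores(clips, text_tokens):
    """Score each clip by its best scaled dot-product match with the text tokens.

    Assumption: this stands in for the cross-attention part of TCD.
    """
    scale = 1.0 / math.sqrt(len(text_tokens[0]))
    scores = []
    for clip in clips:
        best = max(dot(clip, tok) * scale for tok in text_tokens)
        scores.append(1.0 / (1.0 + math.exp(-best)))  # sigmoid
    return scores

def noise_mask(scores, threshold=0.5):
    """Return 1 for clips to keep and 0 for clips treated as noise (assumed threshold)."""
    return [1 if s >= threshold else 0 for s in scores]

def purify(clips, mask):
    """Zero out the features of clips flagged as noise."""
    return [[m * x for x in clip] for clip, m in zip(clips, mask)]

def reconstruct_query(purified, mask):
    """Pool the kept clips into one query embedding (assumed: mean pooling)."""
    kept = sum(mask)
    dim = len(purified[0])
    if kept == 0:
        return [0.0] * dim
    return [sum(clip[d] for clip in purified) / kept for d in range(dim)]

def feedback_loss(query, text_embedding):
    """Cosine distance between the reconstructed query and the text embedding."""
    return 1.0 - cosine(query, text_embedding)

# Toy data: two clips aligned with the text, two that are not.
clips = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.2], [-0.5, -1.0]]
text_tokens = [[2.0, 0.0]]
text_embedding = [2.0, 0.0]

mask = noise_mask(relevance_scores(clips, text_tokens))
purified = purify(clips, mask)
masked_loss = feedback_loss(reconstruct_query(purified, mask), text_embedding)
unmasked_loss = feedback_loss(
    reconstruct_query(clips, [1] * len(clips)), text_embedding
)
print(mask, masked_loss, unmasked_loss)
```

In this toy setting the query pooled from the purified clips lies closer to the text embedding than the query pooled from all clips, which illustrates why a text-alignment signal could serve as auxiliary supervision for the mask. A trainable model would need a differentiable mask and learned projections, neither of which is shown here.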

International Joint Conferences on Artificial Intelligence Organization

Weijia Liu Jiuxin Cao Bo Miao Zhiheng Fu Xuelin Zhu Jiawei Ge Bo Liu Mehwish Nasim Ajmal Mian

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence

2025
