Denoise-then-Retrieve: Text-Conditioned Video Denoising for Video Moment Retrieval
Current text-driven Video Moment Retrieval (VMR) methods encode all video clips, including irrelevant ones, disrupting multimodal alignment and hindering optimization. To address this, we propose a denoise-then-retrieve paradigm that explicitly filters text-irrelevant clips from videos and then retrieves the target moment using purified multimodal representations. Following this paradigm, we introduce the Denoise-then-Retrieve Network (DRNet), comprising Text-Conditioned Denoising (TCD) and Text-Reconstruction Feedback (TRF) modules. TCD integrates cross-attention and structured state space blocks to dynamically identify noisy clips and produce a noise mask to purify multimodal video representations. TRF further distills a single query embedding from purified video representations and aligns it with the text embedding, serving as auxiliary supervision for denoising during training. Finally, we perform conditional retrieval using text embeddings on purified video representations for accurate VMR. Experiments on Charades-STA and QVHighlights demonstrate that our approach surpasses state-of-the-art methods on all metrics. Furthermore, our denoise-then-retrieve paradigm is adaptable and can be seamlessly integrated into advanced VMR models to boost performance.
International Joint Conferences on Artificial Intelligence Organization
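The denoise-then-retrieve flow described in the abstract can be illustrated with a toy sketch. This is not DRNet: cosine similarity with a fixed threshold stands in for the learned cross-attention and structured state space blocks of TCD, mean pooling stands in for TRF's query distillation, and a longest-scoring-run heuristic stands in for the retrieval head. All function names, the threshold value, and the 2-D embeddings are hypothetical and chosen only for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot(a, b) / (na * nb)

def noise_mask(clips, text, threshold=0.5):
    """Assumed stand-in for TCD: 1 keeps a clip, 0 marks it as text-irrelevant."""
    return [1 if cosine(c, text) >= threshold else 0 for c in clips]

def purify(clips, mask):
    """Zero out masked clips so the sequence length stays unchanged."""
    return [[m * x for x in c] for c, m in zip(clips, mask)]

def reconstruct_query(clips, mask):
    """Assumed stand-in for TRF: mean-pool kept clips into one query embedding."""
    kept = sum(mask)
    dim = len(clips[0])
    if kept == 0:
        return [0.0] * dim
    return [sum(m * c[d] for c, m in zip(clips, mask)) / kept for d in range(dim)]

def alignment_loss(query, text):
    """Auxiliary loss: small when the reconstructed query matches the text."""
    return 1.0 - cosine(query, text)

def retrieve(purified, text):
    """Return (start, end) of the contiguous run of clips with the highest
    summed positive similarity to the text, or None if no clip scores > 0."""
    scores = [cosine(c, text) for c in purified]
    best, best_sum = None, 0.0
    start, run = None, 0.0
    for i, s in enumerate(scores + [0.0]):  # trailing 0.0 closes the last run
        if s > 0.0:
            if start is None:
                start, run = i, 0.0
            run += s
        else:
            if start is not None and run > best_sum:
                best, best_sum = (start, i - 1), run
            start = None
    return best

# Toy data: one text embedding and five clip embeddings (hypothetical values).
text = [1.0, 0.0]
clips = [[0.0, 1.0], [0.9, 0.1], [1.0, 0.0], [0.8, 0.6], [-1.0, 0.2]]

mask = noise_mask(clips, text)          # [0, 1, 1, 1, 0]
purified = purify(clips, mask)
span = retrieve(purified, text)         # (1, 3)

# Compare the alignment loss with and without denoising.
loss_purified = alignment_loss(reconstruct_query(purified, mask), text)
loss_raw = alignment_loss(reconstruct_query(clips, [1] * len(clips)), text)
print(mask, span, loss_purified < loss_raw)
```

In this toy setting, pooling only the kept clips yields a query embedding closer to the text than pooling all clips, which mirrors the intuition that irrelevant clips disrupt alignment. In the actual model the mask is learned and the alignment term serves as a training signal, neither of which this sketch attempts.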