Denoise-then-Retrieve: Text-Conditioned Video Denoising for Video Moment Retrieval
Current text-driven Video Moment Retrieval (VMR) methods encode all video clips, including irrelevant ones, which disrupts multimodal alignment and hinders optimization. To address this, we propose a denoise-then-retrieve paradigm that explicitly filters text-irrelevant clips from videos and then retrieves the target moment using purified multimodal representations. Following this paradigm, we introduce the Denoise-then-Retrieve Network (DRNet), comprising Text-Conditioned Denoising (TCD) and Text-Reconstruction Feedback (TRF) modules. TCD integrates cross-attention and structured state space blocks to dynamically identify noisy clips and produce a noise mask that purifies the multimodal video representations. TRF further distills a single query embedding from the purified video representations and aligns it with the text embedding, which serves as auxiliary supervision for denoising during training. Finally, we perform conditional retrieval using text embeddings on the purified video representations for accurate VMR. Experiments on Charades-STA and QVHighlights demonstrate that our approach surpasses state-of-the-art methods on all metrics. Furthermore, our denoise-then-retrieve paradigm is adaptable and can be seamlessly integrated into advanced VMR models to boost performance.
International Joint Conferences on Artificial Intelligence Organization
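The denoise-then-retrieve flow described in the abstract can be illustrated with a small NumPy sketch. This is not the DRNet implementation: the function names, the cosine-threshold mask, the mean-pooled text embedding, and the `linear_scan` recurrence (a toy stand-in for a structured state space block) are all assumptions made for illustration, and nothing here is learned.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cosine(a, b, eps=1e-8):
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return num / den


def cross_attend(clips, tokens):
    """Each clip (query) attends over the text tokens (keys and values)."""
    d = clips.shape[-1]
    weights = softmax(clips @ tokens.T / np.sqrt(d), axis=-1)  # (T, L)
    return weights @ tokens                                    # (T, d)


def linear_scan(x, decay=0.3):
    """Toy stand-in for a state space block (assumption):
    h_t = decay * h_{t-1} + (1 - decay) * x_t along the time axis."""
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = decay * h + (1.0 - decay) * x[t]
        out[t] = h
    return out


def denoise(clips, tokens, threshold=0.5):
    """Text-conditioned denoising sketch: score each clip against the text
    it attends to, then zero out clips whose score falls below threshold."""
    context = linear_scan(clips)               # temporal context per clip
    attended = cross_attend(context, tokens)   # per-clip text summary
    relevance = cosine(context, attended)      # (T,)
    keep = relevance >= threshold              # True = clip is kept
    return clips * keep[:, None], keep, relevance


def reconstruction_loss(video, tokens):
    """Text-reconstruction feedback sketch: pool the video into one query
    embedding and measure 1 - cosine similarity to the text embedding."""
    text_emb = tokens.mean(axis=0)
    d = video.shape[-1]
    weights = softmax(video @ text_emb / np.sqrt(d))  # (T,)
    query = weights @ video                           # (d,)
    return 1.0 - cosine(query, text_emb)


def retrieve_span(purified, tokens):
    """Toy retrieval: first and last clip with positive text similarity."""
    scores = purified @ tokens.mean(axis=0)
    hits = np.flatnonzero(scores > 0)
    return (int(hits[0]), int(hits[-1])) if hits.size else None


if __name__ == "__main__":
    relevant = np.array([1.0, 0.0, 0.0, 0.0])
    noise = np.array([0.0, 1.0, 0.0, 0.0])
    clips = np.stack([noise, noise, relevant, relevant, noise, noise])
    tokens = np.array([[2.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]])

    purified, keep, _ = denoise(clips, tokens)
    print("kept clips:", keep)
    print("span:", retrieve_span(purified, tokens))
    print("loss raw vs purified:",
          reconstruction_loss(clips, tokens),
          reconstruction_loss(purified, tokens))
```

In this toy setting the reconstruction loss is lower on the purified clips than on the raw ones, which is the intuition behind using text reconstruction as auxiliary supervision for the mask. In a trained model the mask and the pooling would be produced by learned layers rather than fixed similarity thresholds.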

