RefCap: Image Captioning with Referent Objects Attributes
Abstract
In recent years, significant progress has been made in visual-linguistic multimodal research, leading to advances in visual comprehension and its applications in computer vision tasks. One fundamental task in visual-linguistic understanding is image captioning, which involves generating a human-understandable textual description of an input image. This paper introduces an end-to-end referring expression image captioning model that incorporates supervision on objects of interest. Our model uses user-specified object keywords as a prefix to generate specific captions that are relevant to the target object. The model consists of three modules: i) visual grounding, ii) referring object selection, and iii) image captioning. To evaluate its performance, we conducted experiments on the RefCOCO and COCO captioning datasets. The experimental results demonstrate that our proposed method effectively generates meaningful captions aligned with users' specific interests.
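The three-module flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`Region`, `ground_objects`, `select_referents`, `caption_with_prefix`, `refcap_sketch`) is hypothetical, the grounding stage simply reads precomputed regions, and the captioning stage is a string template standing in for a learned decoder.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """Hypothetical detected region: a label, a box, and attribute words."""
    label: str
    box: tuple  # (x, y, width, height), illustrative only
    attributes: list = field(default_factory=list)


def ground_objects(image):
    """Stage i (visual grounding): stand-in that returns precomputed regions."""
    return image["regions"]


def select_referents(regions, keyword):
    """Stage ii (referring object selection): keep regions matching the keyword."""
    kw = keyword.strip().lower()
    return [r for r in regions if r.label == kw]


def caption_with_prefix(keyword, referents):
    """Stage iii (captioning): template stand-in for a learned caption decoder.

    The user keyword is used as the caption prefix, as assumed from the abstract.
    """
    kw = keyword.strip().lower()
    if not referents:
        return f"{kw}: no matching object found"
    target = referents[0]
    description = " ".join(target.attributes + [target.label])
    return f"{kw}: a {description}"


def refcap_sketch(image, keyword):
    """Chain the three stages for one image and one user keyword."""
    regions = ground_objects(image)
    referents = select_referents(regions, keyword)
    return caption_with_prefix(keyword, referents)


# Toy input: an "image" represented only by hand-written annotations.
toy_image = {
    "regions": [
        Region("dog", (10, 20, 50, 40), ["brown", "sitting"]),
        Region("ball", (70, 60, 10, 10), ["red"]),
    ]
}
print(refcap_sketch(toy_image, "Dog"))
```

Changing the keyword to `"ball"` selects a different region of the same image, which is the behavior the abstract attributes to keyword-prefixed captioning: the caption follows the user's object of interest rather than the image as a whole.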
Related Results
Image Captioning with External Knowledge
This dissertation is dedicated to image captioning, the task of automatically generating a natural language description of a given image. Most modern automatic caption generators a...
A Comprehensive Survey on Image Captioning for Indian Languages: Techniques, Datasets, and Challenges
In image captioning, we generate visual descriptions from an image. Image captioning requires identifying the key entity, feature, and association in an image. Th...
An Analysis on Recent Approaches for Image Captioning
Image captioning is an interdisciplinary area that uses techniques from computer vision and natural language processing to provide a textual description of a picture. The Image cap...
Double Exposure
I. Happy Endings
Chaplin’s Modern Times features one of the most subtly strange endings in Hollywood history. It concludes with the Tramp (Chaplin) and the Gamin (Paulette Godda...
The Road Map From Artificial Intelligence, Machine Learning, Deep Learning Techniques Towards Image Captioning System.
Image Captioning is the process of generating textual descriptions of an image. These descriptions need to be syntactically and semantically correct. Image Caption...
Better Understanding: Stylized Image Captioning with Style Attention and Adversarial Training
Compared with traditional image captioning technology, stylized image captioning has broader application scenarios, such as a better understanding of images. However, stylized imag...
TAPER-WE: Transformer-Based Model Attention with Relative Position Encoding and Word Embedding for Video Captioning and Summarization in Dense Environment
In the era of burgeoning digital content, the need for automated video captioning and summarization in dense environments has become increasingly critical. This paper introduces TA...

