Text-Anchored Residual Cross-Modal Fusion for Multimodal Sentiment Analysis: A Unified and Protocol-Aware Evaluation on MVSA-Single

Multimodal sentiment analysis aims to infer sentiment polarity by jointly modeling textual and visual information. Despite recent advances in pretrained language and vision encoders, sentiment prediction from social media posts remains challenging because textual and visual modalities are often weakly aligned, semantically noisy, and unevenly informative. Recent studies have emphasized the importance of fine-grained cross-modal fusion, stronger pretrained visual representations, and strategies for reducing modality bias in MVSA-style benchmarks. In this work, we present a systematic implementation-driven study of multimodal sentiment classification on MVSA-Single. We first construct a clean three-class sentiment-consistent subset and then implement a wide set of baselines, including text-only DistilBERT, image-only ResNet18, simple multimodal fusion, gated fusion, residual fusion, multi-task contrastive fusion, DINOv2-based fusion, and attention bottleneck fusion. Building on these experiments, we propose a semantic cross-modal fusion architecture that combines a RoBERTa text encoder with a CLIP vision encoder through cross-attention, allowing textual representations to selectively attend to sentiment-relevant visual signals. On the clean 2592-sample subset, the proposed model achieved the best overall performance, reaching 82.63% validation accuracy, 79.62% test accuracy and 79.42 weighted F1, outperforming all other implemented baselines under the same experimental pipeline and dataset setting. To improve comparability with prior MVSA-Single studies, we additionally reconstructed a broader processed setting from the 4511-sample HDF5 version and aligned 4318 text–image pairs with original image files. On this harder protocol-matched setting, the same model achieved 72.69% test accuracy and 70.66 weighted F1, revealing a substantial performance gap caused by dataset construction and residual multimodal noise. 
These findings show that strong cross-modal semantic alignment contributes more to robust multimodal sentiment prediction than simply increasing architectural complexity and that CLIP-based visual semantics are more beneficial than DINOv2 in our text–image sentiment setting.
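The fusion step described above, in which text representations attend to visual features and the result is added back onto the text stream, can be illustrated with a small sketch. This is not the authors' implementation. It assumes a single attention head, no learned projection matrices, and toy 2-dimensional vectors standing in for RoBERTa token embeddings and CLIP patch embeddings. The names `cross_attend` and `mean_pool` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(text_tokens, image_patches):
    """Text-anchored residual cross-attention (illustrative only).

    text_tokens:   list of d-dim vectors, used as queries
    image_patches: list of d-dim vectors, used as keys and values
    Assumption: identity projections and one head, unlike a trained model.
    Returns (fused_tokens, attention_weights).
    """
    dim = len(text_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused, weights = [], []
    for query in text_tokens:
        # Scaled dot product between one text token and every image patch.
        scores = [scale * sum(q * k for q, k in zip(query, patch))
                  for patch in image_patches]
        attn = softmax(scores)
        # Attention-weighted sum of the image patches (the visual context).
        context = [sum(a * patch[i] for a, patch in zip(attn, image_patches))
                   for i in range(dim)]
        # Residual connection: the text token stays the anchor.
        fused.append([q + c for q, c in zip(query, context)])
        weights.append(attn)
    return fused, weights


def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


# Toy inputs: two "text tokens" and three "image patches".
text = [[1.0, 0.0], [0.0, 1.0]]
image = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
fused, attn = cross_attend(text, image)
pooled = mean_pool(fused)  # would feed a three-class classifier head
```

In this toy input each text token places most of its attention on the patch that points in the same direction, which is the selective behavior the abstract attributes to cross-attention. Because of the residual connection, the text representation is preserved even when the image contributes little.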

MDPI AG

Kosala Natarajan Nirmalrani Vairaperumal

Applied Sciences

2026
