Search engine for discovering works of Art, research articles, and books related to Art and Culture

Study on a Bimodal Emotion Recognition Algorithm Based on Deep Fusion of Speech and Images

View through CrossRef
Emotion recognition aims to identify affective categories by analyzing physiological signals and behavioral characteristics, and is one of the key research directions in Artificial Intelligence (AI). Existing emotion recognition algorithms suffer from limited unimodal representation capability and insufficiently established cross-modal deep associations, which lead to inadequate accuracy and robustness in complex scenarios. To address these issues, this paper proposes a bimodal emotion recognition algorithm based on deep fusion of speech and images, which improves accuracy and robustness by establishing an effective cross-modal interaction and adaptive fusion mechanism.

For the image modality, Bilinear Interpolation (BI) is used for image scale normalization, after which an improved Convolutional Neural Network (CNN) extracts deep spatial features. An improved Sparse Autoencoder (SAE) then compresses the features, reducing redundancy and enhancing fine-grained detail, and an improved Multilayer Perceptron (MLP) performs the emotion classification. For the speech modality, prosodic features, Mel-Frequency Cepstral Coefficients (MFCCs), geometric features, and attribute features are fused into a multidimensional acoustic representation, which is likewise classified by an improved MLP. After unimodal recognition, the outputs of the two modalities are fused at the decision level using a dynamic adaptive weighting strategy to produce the final emotion category and its probability.

Experimental results on a unified test set show that the proposed method improves emotion recognition accuracy by 14% and 18%, respectively, over the two single-modality models. Models using other fusion strategies achieve recognition accuracies concentrated in the 65%–70% range, whereas the proposed deep fusion method reaches an overall recognition accuracy of 81.18%, outperforming the other methods.
This result verifies the effectiveness of the proposed algorithm in improving both emotion recognition accuracy and system robustness.
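The abstract does not give implementation details for the bilinear-interpolation normalization step, but the standard technique it names can be sketched as follows. This is a minimal, illustrative pure-Python version for grayscale images; the function name and the list-of-lists image format are assumptions, not the paper's code.

```python
def bilinear_resize(img, out_h, out_w):
    """Resize a 2D grayscale image (list of rows) with bilinear interpolation."""
    in_h, in_w = len(img), len(img[0])
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Map each output pixel back to fractional source coordinates.
            y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            y0, x0 = int(y), int(x)
            y1, x1 = min(y0 + 1, in_h - 1), min(x0 + 1, in_w - 1)
            dy, dx = y - y0, x - x0
            # Weighted average of the four surrounding source pixels.
            out[i][j] = (img[y0][x0] * (1 - dy) * (1 - dx)
                         + img[y0][x1] * (1 - dy) * dx
                         + img[y1][x0] * dy * (1 - dx)
                         + img[y1][x1] * dy * dx)
    return out
```

In practice a library routine (e.g. a bilinear resampling mode in an image library) would be used; the point is that every face image is mapped to a fixed input size before the CNN.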
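The speech branch concatenates prosodic features, MFCCs, geometric features, and attribute features into one multidimensional vector. As a stand-in for that concatenation, the toy sketch below computes only two simple prosodic descriptors per frame (short-time energy and zero-crossing rate); the function and feature choice are illustrative assumptions, not the paper's feature set.

```python
def acoustic_features(frame):
    """Toy per-frame acoustic feature vector: [energy, zero-crossing rate].

    A real system would append MFCCs, geometric, and attribute features
    to form the full multidimensional representation fed to the MLP.
    """
    n = len(frame)
    energy = sum(s * s for s in frame) / n  # short-time energy
    zcr = sum(1 for a, b in zip(frame, frame[1:])
              if a * b < 0) / (n - 1)       # fraction of sign changes
    return [energy, zcr]
```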
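The abstract does not specify the form of its dynamic adaptive weighting, so the sketch below shows one common confidence-based variant of decision-level fusion: each modality's weight is proportional to the confidence (maximum probability) of its own prediction. Function and variable names are assumptions for illustration.

```python
def fuse_decisions(p_image, p_speech):
    """Fuse two per-class probability vectors at the decision level.

    Weights adapt to each modality's confidence (its max probability),
    so the more certain modality dominates the fused decision.
    Returns (predicted class index, fused probability of that class).
    """
    c_img, c_sp = max(p_image), max(p_speech)
    w_img = c_img / (c_img + c_sp)
    w_sp = 1.0 - w_img
    fused = [w_img * a + w_sp * b for a, b in zip(p_image, p_speech)]
    total = sum(fused)
    fused = [p / total for p in fused]  # guard renormalization
    label = max(range(len(fused)), key=fused.__getitem__)
    return label, fused[label]
```

With a confident image prediction and an uncertain speech prediction, the fused output follows the image modality, which is the qualitative behavior a dynamic weighting scheme is meant to provide.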

Related Results

Multimodal Emotion Recognition and Human Computer Interaction for AI-Driven Mental Health Support (Preprint)
BACKGROUND Mental health has become one of the most urgent global health issues of the twenty-first century. The World Health Organization (WHO) reports tha...
The Nuclear Fusion Award
The Nuclear Fusion Award ceremony for 2009 and 2010 award winners was held during the 23rd IAEA Fusion Energy Conference in Daejeon. This time, both 2009 and 2010 award winners w...
Language Development in Children with Cochlear Implant using Bimodal Approach: SLP Perspective
Background: The development of language skills in children with cochlear implants is a vital area of research, particularly in understanding the impact of the bimodal approach. Thi...
Nonproliferation and fusion power plants
Abstract The world now appears to be on the brink of realizing commercial fusion. As fusion energy progresses towards near-term commercial deployment, the question arises a...
What about males? Exploring sex differences in the relationship between emotion difficulties and eating disorders
Abstract Objective: While eating disorders (ED) are more commonly diagnosed in females, there is growing awareness that men also experience ED and may do so in a different ...
Introduction: Autonomic Psychophysiology
Abstract The autonomic psychophysiology of emotion has a long thought tradition in philosophy but a short empirical tradition in psychological research. Yet the past...
Depth-aware salient object segmentation
Object segmentation is an important task which is widely employed in many computer vision applications such as object detection, tracking, recognition, and ret...
Bimodal SWCC and Bimodal PSD of Soils with Dual-Porosity Structure
The soil–water characteristic curve (SWCC) and pore-size distribution (PSD) are fundamental characteristics of soils that determine many physical and mechanical properties. Recent ...
