Utilizing Dimensional Emotion Representations in Speech Emotion Recognition
Speech is a natural means of communication among humans, and advances in speech emotion recognition (SER) technology can further improve speech-based human-computer interaction (HCI) by recognizing human emotions. SER systems have traditionally focused on categorizing emotions into discrete classes. However, discrete classes often overlook subtle distinctions between emotions, because the way emotions are expressed and perceived varies across individuals and cultures. In this study, we focus on using dimensional emotion values (valence, arousal, and dominance) as the outputs of an SER system instead of traditional categorical classification. An SER model is developed using the large pre-trained models Wav2Vec 2.0 and HuBERT as feature encoders that extract features from raw audio input. The model’s performance is assessed using the mean concordance correlation coefficient (CCC) for models trained on an English-language dataset, Interactive Emotional Dyadic Motion Capture (IEMOCAP), and a Korean-language dataset, Korean Emotion Multimodal Database (KEMDy19). On the IEMOCAP dataset, the Wav2Vec 2.0-based model achieved a mean CCC of 0.3673, with CCC values of 0.3004, 0.4585, and 0.3431 for valence, arousal, and dominance, respectively, when trained on the “anger”, “happy”, “sad”, and “neutral” emotion classes. The HuBERT-based model achieved a mean CCC of 0.3573, with CCC values of 0.2789, 0.3295, and 0.3361 for valence, arousal, and dominance, respectively, on the same set of emotion classes. On the KEMDy19 dataset, the Wav2Vec 2.0-based model achieved a mean CCC of 0.5473, with CCC values of 0.5804 and 0.5142 for valence and arousal, using all available emotion classes in the dataset, and a mean CCC of 0.5580, with CCC values of 0.5941 and 0.5219, using the four emotion classes “anger”, “happy”, “sad”, and “neutral”.
The HuBERT-based model achieved a mean CCC of 0.5271, with CCC values of 0.5429 and 0.5113 for valence and arousal, using all available emotion classes, and a mean CCC of 0.5392, with CCC values of 0.5765 and 0.5019 for valence and arousal, using the four selected emotion classes. The proposed approach outperforms traditional machine learning methods and exceeds CCC values previously reported in the literature. Moreover, dimensional emotion values give finer-grained insight into a user’s emotional state, allowing a deeper understanding of that person’s affective state with reduced dimensionality. Applying such SER technologies to areas such as HCI, affective computing, and psychological research makes it possible to develop more personalized and adaptable user interfaces that suit the emotional needs of each individual. Developing such emotion recognition systems could also advance our understanding of human factors.
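The evaluation metric named in the abstract can be illustrated with a short sketch. It assumes the standard definition of Lin's concordance correlation coefficient, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), computed with population statistics; the abstract does not state which variant the authors used. The function names `ccc` and `mean_ccc` and the dictionary layout are hypothetical, chosen only for illustration.

```python
from statistics import fmean


def ccc(pred, gold):
    """Concordance correlation coefficient between predictions and labels.

    Assumption: population (divide-by-n) variance and covariance.
    """
    if len(pred) != len(gold) or len(pred) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p, mean_g = fmean(pred), fmean(gold)
    var_p = fmean([(p - mean_p) ** 2 for p in pred])
    var_g = fmean([(g - mean_g) ** 2 for g in gold])
    cov = fmean([(p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)])
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0:
        # Both sequences are constant and identical; CCC is undefined here.
        raise ValueError("CCC is undefined for identical constant sequences")
    return 2 * cov / denom


def mean_ccc(pred_by_dim, gold_by_dim):
    """Average the per-dimension CCC over e.g. valence, arousal, dominance."""
    return fmean([ccc(pred_by_dim[d], gold_by_dim[d]) for d in gold_by_dim])


if __name__ == "__main__":
    # Toy values, not data from the study.
    gold = {"valence": [1.0, 2.0, 3.0], "arousal": [1.0, 2.0, 3.0]}
    pred = {"valence": [1.0, 2.0, 3.0], "arousal": [2.0, 3.0, 4.0]}
    print(ccc(pred["valence"], gold["valence"]))  # perfect agreement
    print(ccc(pred["arousal"], gold["arousal"]))  # penalized for the constant offset
    print(mean_ccc(pred, gold))
```

Unlike Pearson correlation, CCC penalizes a constant offset or scale mismatch between predictions and labels, which is why a perfectly correlated but shifted prediction scores below 1. Under this reading of "mean CCC", the reported Wav2Vec 2.0 IEMOCAP value is the average of the three per-dimension scores: (0.3004 + 0.4585 + 0.3431) / 3 ≈ 0.3673.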
AHFE International
Related Results
Multimodal Emotion Recognition and Human Computer Interaction for AI-Driven Mental Health Support (Preprint)
BACKGROUND
Mental health has become one of the most urgent global health issues of the twenty-first century. The World Health Organization (WHO) reports tha...
The impact of binge drinking on emotion recognition
Binge drinking or heavy episodic drinking is variously defined but according to the World Health Organisation (WHO) it is the consumption of at least 60 grams or more of pure alcoh...
AI-Based Emotion Recognition in Education: Progress, Applications, and Open Challenges
AI-based emotion recognition has emerged as a critical component of affect-aware educational technologies, particularly in online, large-scale, and technology-mediated learning env...
The Neural Mechanisms of Private Speech in Second Language Learners’ Oral Production: An fNIRS Study
Background: According to Vygotsky’s sociocultural theory, private speech functions both as a tool for thought regulation and as a transitional form between outer and inner speech. ...
Speech, communication, and neuroimaging in Parkinson's disease : characterisation and intervention outcomes
Most individuals with Parkinson's disease (PD) experience changes in speech, voice or communication. Speech changes often manifest as hypokinetic dysarthria, a m...
Effects of Data Augmentations on Speech Emotion Recognition
Data augmentation techniques have recently gained more adoption in speech processing, including speech emotion recognition. Although more data tend to be more effective, there may ...

