Comparative Analysis of Spectrogram and MFCC Representations for Speech Emotion Recognition Using Machine Learning
Emotion recognition is a key research area within human-computer interaction, addressing the growing need for systems that can respond to human emotional states. While the field has advanced, challenges remain in selecting appropriate datasets, identifying effective audio features, and optimizing classification models. This study examines how two audio feature representations, Mel-Frequency Cepstral Coefficients (MFCC) and spectrograms, influence emotion classification accuracy. Features were extracted from the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and classified with Random Forest (RF) and Support Vector Machine (SVM) models, allowing a comparison of each feature-classifier pairing. Both RF and SVM achieved 50% accuracy with MFCC features, while spectrogram features yielded 45% accuracy with RF and 54% with SVM. These findings suggest that simple models, paired with appropriate features, can deliver promising performance, supporting more responsive and adaptive human-computer interaction applications.
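The MFCC pipeline the abstract relies on (framing, windowing, power spectrum, mel filterbank, log, discrete cosine transform, then time-averaging to a fixed-length vector for a classifier) can be sketched in plain NumPy. This is a minimal illustration, not the study's implementation: the authors' exact parameters (frame size, hop, filter count) are not given in the abstract, so the values below are assumptions, and in practice a library such as librosa would compute MFCCs and scikit-learn would provide the RF/SVM classifiers.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512,
         n_mels=26, n_ceps=13):
    # Frame the signal and apply a Hann window
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(frame_len)
    # Per-frame power spectrum
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Mel filterbank energies -> log -> DCT-II gives cepstral coefficients
    log_mel = np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.arange(n_ceps)[:, None] * (2 * n + 1) / (2 * n_mels))
    ceps = log_mel @ dct.T          # shape: (n_frames, n_ceps)
    # Average over time to get one fixed-length vector per clip,
    # suitable as input to an RF or SVM classifier
    return ceps.mean(axis=0)

# A 1-second synthetic tone stands in for a RAVDESS clip
sr = 16000
t = np.arange(sr) / sr
clip = np.sin(2 * np.pi * 220.0 * t)
features = mfcc(clip, sr)
print(features.shape)  # (13,)
```

The resulting 13-dimensional vectors, one per utterance, would then be fed to `RandomForestClassifier` or `SVC` from scikit-learn with the emotion labels as targets; the spectrogram variant in the study would instead flatten or summarize the power-spectrum frames before classification.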
Centre for Research and Innovation