Analyzing Noise Robustness of Cochleogram and Mel Spectrogram Features in Deep Learning Based Speaker Recognition
Abstract
Speaker recognition performs very well on clean datasets, or when there is no mismatch between the training and test sets. However, performance degrades with noise, channel variation, and physical and behavioral changes in the speaker. Studies have confirmed that features representing speech on the Equivalent Rectangular Bandwidth (ERB) scale are more robust than Mel-scale features at low Signal to Noise Ratio (SNR) levels. Gammatone Frequency Cepstral Coefficients (GFCC), which represent speech on the ERB scale, are widely used in classical machine learning based speaker recognition under noisy conditions. Recently, deep learning models have been widely applied to speaker recognition and show better performance than classical machine learning. Previous deep learning based speaker recognition models used the Mel Spectrogram as input rather than hand-crafted features. However, the performance of the Mel Spectrogram degrades drastically at low SNR levels because it represents speech on the Mel scale. The Cochleogram is another important input for developing deep learning based speaker recognition models. It represents speech on the ERB scale, which is more robust at low SNR levels. However, no previous study has used the Cochleogram feature to develop deep learning based speaker recognition models, and none has analyzed the noise robustness of Cochleogram and Mel Spectrogram features in deep learning based speaker recognition. In this study, we analyze the noise robustness of Cochleogram and Mel Spectrogram features in speaker recognition using additive noises (babble, street, and restaurant noise). The train-clean-100 part of the LibriSpeech dataset, which consists of 251 speakers (126 male and 125 female) and 28,539 utterances, is used for the experiments. A CNN model is used to train on and classify speakers into different classes. The evaluation results show that the Cochleogram is more robust than the Mel Spectrogram at low SNR levels, while both features show approximately equal accuracy at high SNR and without additive noise. In conclusion, the Cochleogram feature improves the performance of deep learning based speaker recognition in noisy conditions.
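To illustrate the two input representations compared above, the sketch below extracts a log-Mel spectrogram with librosa and a cochleogram from a gammatone (ERB-scale) filterbank. The gammatone part assumes the third-party "gammatone" package and its gtgram helper; the channel counts, frame sizes, and minimum frequency are illustrative choices, not the authors' exact settings.

import numpy as np
import librosa
from gammatone import gtgram  # assumption: third-party "gammatone" package is installed

def log_mel_spectrogram(y, sr, n_mels=64, n_fft=400, hop_length=160):
    # Mel-scale filterbank energies over an STFT, compressed to dB.
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                       hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(S, ref=np.max)

def cochleogram(y, sr, channels=64, window_time=0.025, hop_time=0.010, f_min=50):
    # Gammatone filterbank energies with ERB-spaced center frequencies,
    # log-compressed so the dynamic range is comparable to the log-Mel input.
    G = gtgram.gtgram(y, sr, window_time, hop_time, channels, f_min)
    return 10.0 * np.log10(G + 1e-12)

Both functions return (channels x frames) arrays that can be cropped or padded into fixed-size CNN inputs, for example after loading an utterance with librosa.load(path, sr=16000).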
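The additive-noise conditions (babble, street, restaurant) can be simulated by scaling a noise recording so that the clean-to-noise power ratio matches a target SNR in decibels. The helper below is a minimal sketch of that mixing step; the function name and the epsilon guard are our own.

import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Mix a noise recording into a clean utterance at a target SNR (dB)."""
    # Tile or trim the noise so it covers the whole utterance.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # guard against an all-zero noise segment
    # Scale so that 10 * log10(clean_power / scaled_noise_power) equals snr_db.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

For example, mix_at_snr(utterance, babble, snr_db=0) produces a 0 dB condition in which speech and noise have equal power.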
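Finally, a small 2-D CNN over either feature map can classify the 251 LibriSpeech speakers through a softmax output layer. The Keras sketch below is only an illustrative architecture under assumed input dimensions (feature channels x frames x 1); it is not the authors' published model.

import tensorflow as tf

def build_speaker_cnn(input_shape=(64, 300, 1), n_speakers=251):
    # Illustrative 2-D CNN over a Cochleogram or Mel Spectrogram patch;
    # the softmax layer assigns each input to one of the speaker classes.
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(128, 3, padding="same", activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(n_speakers, activation="softmax"),
    ])

model = build_speaker_cnn()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])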


