Speech enhancement augmentation for robust speech recognition in noisy environments

Abstract. Data augmentation has become an important tool for improving the performance of speech recognition systems. To make a system work well in noisy conditions, augmentation is usually used to simulate background noise. However, recognition quality does not improve on samples that have been pre-processed by noise reduction models. This paper proposes a new approach to speech data augmentation for training ASR systems that are intended to be used jointly with speech enhancement models. The approach is based on creating several additional data samples that contain speech processed by the enhancement model. It was tested on the E-Branchformer neural network model using data from the LibriSpeech corpus. The quality of the speech samples was assessed with the DNSMOS metric. On a 100-hour subset of clean speech, the proposed augmentation improved WER by more than 4% absolute compared with the conventional approach of adding noisy speech samples. Experiments on the 960-hour data showed that the approach remains robust as the training set grows.
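The augmentation idea described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (mix_at_snr, placeholder_enhance, build_training_set), the SNR values, and the choice of which copies to enhance are all hypothetical, and the moving-average filter is only a stand-in for a real speech enhancement model, which the abstract does not name.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Waveform = List[float]


def mix_at_snr(speech: Sequence[float], noise: Sequence[float], snr_db: float) -> Waveform:
    """Add noise to speech, scaled so the mixture has the requested SNR in dB."""
    # Repeat the noise so that it covers the whole utterance.
    tiled = [noise[i % len(noise)] for i in range(len(speech))]
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in tiled) / len(tiled)
    if speech_power == 0.0 or noise_power == 0.0:
        return list(speech)  # nothing meaningful to scale
    scale = math.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [s + scale * n for s, n in zip(speech, tiled)]


def placeholder_enhance(wave: Sequence[float]) -> Waveform:
    """Stand-in for an enhancement model (ASSUMPTION): a 3-tap moving average."""
    out = []
    for i in range(len(wave)):
        window = wave[max(0, i - 1): i + 2]
        out.append(sum(window) / len(window))
    return out


def build_training_set(
    utterances: Sequence[Tuple[Waveform, str]],
    noise: Sequence[float],
    enhance: Callable[[Sequence[float]], Waveform],
    snr_choices: Sequence[float],
    rng: random.Random,
) -> List[Tuple[Waveform, str, str]]:
    """Return (waveform, transcript, tag) triples: originals plus augmented copies.

    Which copies are added is an assumption made for illustration only.
    """
    samples = []
    for wave, text in utterances:
        noisy = mix_at_snr(wave, noise, rng.choice(list(snr_choices)))
        samples.append((list(wave), text, "clean"))
        samples.append((noisy, text, "noisy"))                      # conventional noise augmentation
        samples.append((enhance(noisy), text, "enhanced_noisy"))    # enhancement-processed copy
        samples.append((enhance(wave), text, "enhanced_clean"))     # enhancement-processed copy
    return samples


if __name__ == "__main__":
    rng = random.Random(0)
    speech = [math.sin(0.1 * i) for i in range(1000)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(300)]
    data = build_training_set([(speech, "hello world")], noise,
                              placeholder_enhance, [0.0, 10.0, 20.0], rng)
    print(len(data), [tag for _, _, tag in data])
```

The point of the extra copies is that an ASR model used downstream of an enhancement front end then sees, during training, the kind of processed audio it will receive at inference time. In practice enhance would wrap a trained enhancement network rather than a smoothing filter.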

Related Results

Reference-Based Speech Enhancement via Feature Alignment and Fusion Network
Speech enhancement aims at recovering a clean speech from a noisy input, which can be classified into single speech enhancement and personalized speech enhancement. Personalized sp...
Robust speech recognition based on deep learning for sports game review
Abstract To verify the feasibility of robust speech recognition based on deep learning in sports game review. In this paper, a robust speech recognition model is bui...
Enhancing Non-Formal Learning Certificate Classification with Text Augmentation: A Comparison of Character, Token, and Semantic Approaches
Aim/Purpose: The purpose of this paper is to address the gap in the recognition of prior learning (RPL) by automating the classification of non-formal learning certificates using d...
The Effectiveness of Data Augmentation for Bone Suppression in Chest Radiograph using Convolutional Neural Network
Objective: Bone suppression of chest radiograph holds great promise to improve the localization accuracy in Image-Guided Radiation Therapy (IGRT). However, data scarcity has long b...
Identifying Links Between Latent Memory and Speech Recognition Factors
Objectives: The link between memory ability and speech recognition accuracy is often examined by correlating summary measures of performance across various tasks, but i...
Discrete Wavelet Transform and Spectral Subtraction Based Speech Enhancement Algorithm for Hearing Aid Application
Abstract Hearing aids are small electronic devices intended to help those with hearing loss improve their hearing ability with the use of advanced audio signal processing t...
Analyzing Noise Robustness of Cochleogram and Mel Spectrogram Features in Deep Learning Based Speaker Recognition
Abstract The performance of speaker recognition is very well in a clean dataset or without mismatch between training and test set. However, the performance is degraded with...
