SVCGAN: Speaker Voice Conversion Generative Adversarial Network for Children’s Speech Conversion and Recognition
Automatic speech recognition (ASR) is the conversion of spoken language into written text. However, children’s speech differs substantially from adult speech in its acoustics, so ASR systems trained on adult speech recognize children’s speech poorly. To address this, we propose the Speaker Voice Conversion Generative Adversarial Network (SVCGAN), a novel non-parallel voice conversion model that introduces three enhancements: a log-cosh loss, a semantic-similarity loss, and a third adversarial loss. Together, these losses better preserve the semantic information in young children’s speech during voice conversion and improve the quality of the converted speech. In addition, converting children’s speech into adult speech can reduce the character error rate (CER) of children’s speech recognition. Experimental results suggest that SVCGAN outperforms both CycleGAN-VC3 and MaskCycleGAN-VC in training efficiency, semantic information similarity, voice type similarity, and the naturalness and intelligibility of the converted speech, which in turn reduces the CER of speech recognition for young children.
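For readers unfamiliar with two of the quantities named in the abstract, the following is a minimal Python sketch of their standard textbook definitions: the log-cosh loss and the character error rate (CER). It is not the authors' implementation. The abstract does not say which signals the log-cosh loss is applied to or how CER was computed in the experiments, so the function names (`log_cosh_loss`, `edit_distance`, `cer`) and the plain-list inputs are assumptions made purely for illustration.

```python
import math


def log_cosh_loss(pred, target):
    """Mean of log(cosh(p - t)) over paired values (standard definition).

    Uses the identity log(cosh(x)) = |x| + log1p(exp(-2|x|)) - log(2),
    which avoids overflow in cosh for large |x|.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equal length")
    total = 0.0
    for p, t in zip(pred, target):
        x = abs(p - t)
        total += x + math.log1p(math.exp(-2.0 * x)) - math.log(2.0)
    return total / len(pred)


def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under these definitions, a recognizer output that differs from a four-character reference by one substituted character has a CER of 0.25, and the log-cosh loss behaves like a squared error for small differences and like an absolute error for large ones.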
Related Results
Speech, communication, and neuroimaging in Parkinson's disease : characterisation and intervention outcomes
Most individuals with Parkinson's disease (PD) experience changes in speech, voice or communication. Speech changes often manifest as hypokinetic dysarthria, a m...
Speech, communication, and neuroimaging in Parkinson's disease : Characterisation and intervention outcomes
Most individuals with Parkinson's disease (PD) experience changes in speech, voice or communication. Speech changes often manifest as hypokinetic dysarthria, a m...
Quarantine Powers, Biodefense, and Andrew Speaker
In January 2007, Andrew Speaker (Speaker) underwent a chest X-ray and CT scan, which revealed an abnormality in his lungs. However, test results indicated that he did not ha...
Brain mechanism of unfamiliar and familiar voice processing: an activation likelihood estimation meta-analysis
Interpersonal communication through vocal information is very important for human society. During verbal interactions, our vocal cord vibrations convey important information regard...
Analyzing Noise Robustness of Cochleogram and Mel Spectrogram Features in Deep Learning Based Speaker Recognition
Speaker recognition performs very well on a clean dataset or when there is no mismatch between the training and test sets. However, the performance is degraded with...
Fusion of Cochleogram and Mel Spectrogram Features for Deep Learning Based Speaker Recognition
Speaker recognition has crucial applications in forensic science, financial areas, access control, surveillance and law enforcement. The performance of speaker reco...
Cycle-consistent Generative Adversarial Networks (CycleGANs) for the Non-Parallel Creation of Fake Voice Media
The upsurge of Generative Adversarial Networks (GANs) in the previous five years has led to advancements in unsupervised data manipulation, sourced feature translation, and precise...

