
SVCGAN: Speaker Voice Conversion Generative Adversarial Network for Children’s Speech Conversion and Recognition

Automatic speech recognition (ASR) is the process of converting spoken language into written text. However, children’s speech differs substantially from adult speech in its acoustics, so ASR systems trained on adult speech recognize children’s speech poorly. To address this issue, we propose the speaker voice conversion generative adversarial network (SVCGAN), a novel non-parallel voice conversion model with enhancements in three key areas: a log-cosh loss, a semantic-similarity loss, and a third adversarial loss. Incorporating these losses better preserves the semantic information in young children’s speech during voice conversion and improves the quality of the converted speech. In addition, converting children’s speech into adult speech can lower the character error rate (CER) of children’s speech recognition. Experimental results suggest that SVCGAN outperforms both CycleGAN-VC3 and MaskCycleGAN-VC in training efficiency, semantic information similarity, voice type similarity, and sound naturalness and intelligibility, which leads to a lower CER in speech recognition for young children.
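For readers unfamiliar with the log-cosh loss named in the abstract, the following is a minimal sketch of its generic definition: the mean of log(cosh(d)) over element-wise differences d. The abstract does not say which term of the SVCGAN objective uses this loss or on what features it is computed, so the function names and the plain-list inputs here are assumptions made for illustration only.

```python
import math


def log_cosh(x: float) -> float:
    """Return log(cosh(x)) in a numerically stable form.

    Uses the identity log(cosh(x)) = |x| + log(1 + exp(-2|x|)) - log(2),
    which avoids overflow in cosh for large |x|.
    """
    a = abs(x)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)


def log_cosh_loss(predicted, target):
    """Mean log-cosh of the differences between two equal-length sequences.

    Hypothetical helper: the inputs stand in for whatever features a voice
    conversion model compares; the source does not specify them.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    if len(predicted) == 0:
        raise ValueError("inputs must not be empty")
    total = sum(log_cosh(p - t) for p, t in zip(predicted, target))
    return total / len(predicted)
```

The loss behaves roughly like a squared error for small differences and like an absolute error for large ones, which makes it less sensitive to outliers than a pure squared error.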
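The character error rate (CER) reported in the abstract is conventionally the character-level edit distance between the recognizer output and the reference transcript, divided by the reference length. The sketch below follows that standard definition; the abstract does not state which scoring tool or text normalization the authors used, so the function names and the absence of any normalization are assumptions.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the reference prefix processed so
    # far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (0 if ref_char == hyp_char else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = (substitutions + deletions + insertions) / reference length."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, `character_error_rate("abcd", "abxd")` counts one substitution against a four-character reference. Because insertions count as errors, CER can exceed 1.0 when the hypothesis is much longer than the reference.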

Related Results

Makna Voice Over dalam Pemberitaan Feature di Televisi
Abstract. Voice over, also known as VO, is widely discussed, not only as a profession but also in terms of the industry and the various voice-over techniques used. Due...
Speech, communication, and neuroimaging in Parkinson's disease : characterisation and intervention outcomes
Most individuals with Parkinson's disease (PD) experience changes in speech, voice or communication. Speech changes often manifest as hypokinetic dysarthria, a m...
Speaker Verification and Identification
A speaker recognition system verifies or identifies a speaker’s identity based on his or her voice. It is considered one of the most convenient biometric characteristics for human m...
Quarantine Powers, Biodefense, and Andrew Speaker
In January 2007, “Andrew Speaker (“Speaker”) underwent a chest X-ray and CT scan, which revealed an abnormality in his lungs.” However, test results indicated that he did not ha...
Multimodal Emotion Recognition and Human Computer Interaction for AI-Driven Mental Health Support (Preprint)
BACKGROUND: Mental health has become one of the most urgent global health issues of the twenty-first century. The World Health Organization (WHO) reports tha...
Brain mechanism of unfamiliar and familiar voice processing: an activation likelihood estimation meta-analysis
Interpersonal communication through vocal information is very important for human society. During verbal interactions, our vocal cord vibrations convey important information regard...
