
An analysis-by-synthesis approach to vocal tract modeling for robust speech recognition

I. Background
Articulatory modeling is used to incorporate speech production information into automatic speech recognition (ASR) systems. It is believed that solutions to the problems of co-articulation, pronunciation variation, and other speaking-style-related phenomena rest in how accurately we capture the production process.

II. Objective
In this work we present a novel approach to speech recognition that incorporates knowledge of the speech production process. We discuss our contribution in moving from a purely statistical speech recognizer to one that is motivated by the physical generative process of speech.

III. Methods
We follow an analysis-by-synthesis approach. Firstly, we attribute a physical meaning to the inner states of the recognition system, relating them to the configurations the human vocal tract takes over time. We utilize a geometric model of the vocal tract, adapt it to our speakers, and derive realistic vocal tract shapes from electromagnetic articulograph (EMA) measurements in the MOCHA database. Secondly, we synthesize speech from the vocal tract configurations using a physiologically motivated articulatory synthesis model of speech generation. Thirdly, the observation probability of the Hidden Markov Model (HMM), which is used for phone classification, is a function of the distortion between the speech synthesized from the vocal tract configurations and the real speech. The output of each state in the HMM is based on a mixture of density functions. Each density models the distribution of the distortion at the output of each vocal tract configuration. During training, we initialize the model parameters using ground-truth articulatory knowledge. During testing, only the acoustic data is used.

IV. Results and conclusion
We present phone classification results using our novel dynamic articulatory model and following our adaptation procedure. The table below shows phone error rates (PER) for a female and a male speaker. We use a three-state HMM with different observation densities and initialization techniques. We combine the probabilities of the baseline topology with the new ones. Our novel framework provides a 10.9% relative reduction in phone error rate over our baseline, which uses MFCC features. This is achieved using the distortion features with linear discriminant analysis (LDA) and cepstral mean normalization (CMN). We conclude that incorporating articulatory knowledge in the combined statistical framework we devised contributes to lowering the error rates in speech recognition.

| Features (dimension) | Topology | Observation Prob / Initialization | Female PER | Male PER | Both PER | Improvement |
|---|---|---|---|---|---|---|
| Baseline Features MFCC + CMN (13) | 3S-128M-HMM | Gaussian/VQ | 61.6% | 55.9% | 58.8% | |
| Distortion Features (1024) (Prob. Combination with MFCC, α = 0.2) | 3S-1024M-HMM | Exponential/Flat, Sparsity = 21% | 57.6% | 53.7% | 55.7% | 5.3% |
| Distortion Features (1024) (Prob. Combination with MFCC, α = 0.2) | 3S-1024M-HMM | Exponential/EMA, Sparsity = 51% | 58.3% | 53.9% | 56.1% | 4.6% |
| Adapted Distortion Features (1024) (Prob. Combination with MFCC, α = 0.25) | 3S-1024M-HMM | Exponential/EMA, Sparsity = 51% | 58.4% | 53.1% | 55.7% | 5.3% |
| Distortion Features + LDA + CMN (20) (Prob. Combination with MFCC, α = 0.6) | 3S-128M-HMM | Gaussian/VQ, Sparsity = 0% | 54.9% | 49.8% | 52.4% | 10.9% |
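
The Improvement figures appear to be relative reductions of the "Both" PER with respect to the MFCC baseline (58.8%). The short Python sketch below checks that reading against the reported numbers. The function name, the dictionary, and its row labels are illustrative shorthand introduced here; they are not part of the original work.

```python
def relative_reduction(baseline_per: float, system_per: float) -> float:
    """Relative reduction, in percent, of system_per with respect to baseline_per."""
    return 100.0 * (baseline_per - system_per) / baseline_per


# "Both PER" values as reported in the results table.
BASELINE_BOTH_PER = 58.8
SYSTEM_BOTH_PER = {
    "Distortion, Exponential/Flat": 55.7,
    "Distortion, Exponential/EMA": 56.1,
    "Adapted distortion, Exponential/EMA": 55.7,
    "Distortion + LDA + CMN": 52.4,
}

# Round to one decimal place, matching the precision used in the table.
improvements = {
    name: round(relative_reduction(BASELINE_BOTH_PER, per), 1)
    for name, per in SYSTEM_BOTH_PER.items()
}

for name, value in improvements.items():
    print(f"{name}: {value}% relative PER reduction")
```

Under this assumption the computed values reproduce the table: 5.3%, 4.6%, 5.3%, and 10.9%. For the best system, (58.8 - 52.4) / 58.8 is approximately 0.109, which is the 10.9% relative reduction quoted in the text.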
Hamad bin Khalifa University Press (HBKU Press)
