Machine Learning for Non-Intrusive Speech Quality Assessment

This thesis presents two studies on non-intrusive speech quality assessment methods. The first applies supervised learning to speech quality assessment, which is a common approach in machine-learning-based quality assessment. To outperform existing methods, we concentrate on enhancing the feature set. In the second study, we analyse quality assessment from a different point of view, inspired by the biological brain, and present the first unsupervised-learning-based non-intrusive quality assessment method, which removes the need for labelled training data.

Supervised, non-intrusive quality predictors generally involve the development of a regressor that maps signal features to a representation of perceived quality. The performance of the predictor largely depends on 1) how sensitive the features are to the different types of distortion, and 2) how well the model learns the relation between the features and the quality score. We improve the quality estimate by enhancing the feature set and using a contemporary machine learning model suited to this objective. We propose an augmented feature set that includes raw features that are presumably redundant. The speech quality assessment system benefits from this redundancy, as it reduces the impact of unwanted noise in the input. Feature set augmentation generally leads to the inclusion of features that have non-smooth distributions. We therefore introduce a new pre-processing method that re-distributes the features to facilitate training. Evaluation on the ITU-T Supplement 23 database shows that the proposed system outperforms the popular standards and contemporary methods in the literature.

The unsupervised quality assessment approach presented in this thesis is based on a model that is learnt from clean speech signals. Consequently, it does not need to learn the statistics of any corruption that exists in degraded speech signals and is trained only with unlabelled clean speech samples. Quality is given a new definition, based on the divergence between 1) the distribution of the spectrograms of the test signals, and 2) the pre-existing model that represents the distribution of the spectrograms of good quality speech. The distribution of speech spectrograms is complex, and hence comparing two such distributions is not trivial. To tackle this problem, we propose to map the spectrograms of speech signals to a simple latent space. Generative models that map simple latent distributions into complex distributions are excellent platforms for our work. A generative model trained on the spectrograms of clean speech signals learns to map a latent variable $Z$ from a simple distribution $P_Z$ into a spectrogram $X$ from the distribution of good quality speech. An inference model is then developed by inverting the pre-trained generator; it maps the spectrogram of the signal under test, $X_t$, to its corresponding latent variable, $Z_t$, in the latent space. We postulate that the divergence between the distribution of this latent variable and the prior distribution $P_Z$ is a good measure of speech quality. Generative adversarial nets (GANs) are an effective training method and work well in this application. The proposed system is a novel application of a GAN. Experimental results on the TIMIT and NOIZEUS databases show that the proposed measure correlates positively with the objective quality scores.
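
As a rough illustration of the latent-space measure described above, the sketch below scores a signal by how far its inferred latent codes stray from the prior. It rests on assumptions that the abstract does not confirm: the prior $P_Z$ is taken to be a standard normal, the divergence is the closed-form KL divergence of a diagonal Gaussian fitted to the codes, and `latent_divergence` is a hypothetical helper name. The inverted generator that would produce the codes is not shown; synthetic values stand in for $Z_t$.

```python
import math
import random
from statistics import fmean, pvariance


def latent_divergence(latents):
    """KL( N(mu, diag(var)) || N(0, I) ) for a Gaussian fitted to latent codes.

    `latents` is a list of equal-length lists, one latent vector per
    spectrogram segment of the signal under test. The standard-normal prior
    and the Gaussian fit are assumptions made for this sketch only.
    """
    kl = 0.0
    # zip(*latents) transposes the data: one tuple of values per latent dimension.
    for values in zip(*latents):
        mu = fmean(values)
        var = max(pvariance(values, mu), 1e-12)  # guard against log(0)
        kl += 0.5 * (var + mu * mu - 1.0 - math.log(var))
    return kl


# Synthetic stand-ins for latent codes (hypothetical data, not thesis results).
rng = random.Random(0)
clean_like = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(2000)]
degraded_like = [[rng.gauss(1.5, 2.0) for _ in range(8)] for _ in range(2000)]

print(latent_divergence(clean_like))     # close to zero: codes match the prior
print(latent_divergence(degraded_like))  # much larger: codes drift from the prior
```

Under the postulate stated in the abstract, a larger divergence would correspond to lower speech quality; the actual divergence and inference model used in the thesis may differ from this sketch.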

Victoria University of Wellington Library

Mouna Hakami

2021


