Enhancing dysarthric speech recognition through SepFormer and hierarchical attention network models with multistage transfer learning

Abstract: Dysarthria, a motor speech disorder that impairs articulation and speech clarity, presents significant challenges for Automatic Speech Recognition (ASR) systems. This study proposes an approach to improve the accuracy of Dysarthric Speech Recognition (DSR). Its primary innovation is the integration of the SepFormer-Speech Enhancement Generative Adversarial Network (S-SEGAN), a generative adversarial network tailored for Dysarthric Speech Enhancement (DSE), as a front-end processing stage for DSR systems. The S-SEGAN combines SEGAN's adversarial learning with SepFormer's speech separation capabilities. In addition, a multistage transfer learning approach is used to train and assess the DSR models for both word-level and sentence-level DSR: the models are first trained on a large speech dataset (LibriSpeech) and then fine-tuned on dysarthric speech data (both isolated and augmented). Evaluations show that integrating DSE improves DSR accuracy. The Dysarthric Speech (DS)-baseline models (without DSE), Transformer and Conformer, achieved Word Recognition Accuracy (WRA) values of 68.60% and 69.87%, respectively. Adding a Hierarchical Attention Network (HAN) to the Transformer and Conformer architectures improved performance, with T-HAN achieving a WRA of 71.07% and C-HAN reaching 73%. With DSE + DSR on isolated words, the Transformer model achieves a WRA of 73.40% and the Conformer model 74.33%. The T-HAN and C-HAN models with DSE + DSR improve further, with WRAs of 75.73% and 76.87%, respectively. Augmenting the words raises performance again, with the Transformer and Conformer models achieving WRAs of 76.47% and 79.20%, respectively. The T-HAN and C-HAN models with DSE + DSR and augmented words reach WRAs of 82.13% and 84.07%, respectively, making C-HAN the best-performing of all proposed models.
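The WRA figures quoted in the abstract are easier to compare side by side. The following is a minimal sketch that tabulates them and computes absolute gains in percentage points. The dictionary layout, the configuration keys (`baseline`, `dse`, `dse_aug`), and the helper names are illustrative assumptions, not part of the paper. It also assumes that the T-HAN and C-HAN figures of 71.07% and 73% belong to the no-DSE setting, which the abstract implies but does not state outright.

```python
# WRA values (percent) copied from the abstract.
# Keys are hypothetical labels chosen for this sketch:
#   baseline -> no DSE front end (assumed for the HAN variants as well)
#   dse      -> DSE + DSR on isolated words
#   dse_aug  -> DSE + DSR on augmented words
REPORTED_WRA = {
    "Transformer": {"baseline": 68.60, "dse": 73.40, "dse_aug": 76.47},
    "Conformer": {"baseline": 69.87, "dse": 74.33, "dse_aug": 79.20},
    "T-HAN": {"baseline": 71.07, "dse": 75.73, "dse_aug": 82.13},
    "C-HAN": {"baseline": 73.00, "dse": 76.87, "dse_aug": 84.07},
}


def absolute_gain(model, start="baseline", end="dse_aug"):
    """Return the WRA difference (percentage points) between two configurations."""
    scores = REPORTED_WRA[model]
    return round(scores[end] - scores[start], 2)


def best_model(config):
    """Return the model name with the highest reported WRA for a configuration."""
    return max(REPORTED_WRA, key=lambda name: REPORTED_WRA[name][config])


if __name__ == "__main__":
    for name in REPORTED_WRA:
        print(
            f"{name}: +{absolute_gain(name, 'baseline', 'dse')} points with DSE, "
            f"+{absolute_gain(name)} points with DSE and augmentation"
        )
    print("Best with DSE and augmentation:", best_model("dse_aug"))
```

These are differences between reported percentages only; the abstract gives no variance or significance information, so the sketch does not attempt to judge whether any single gap is statistically meaningful.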

Springer Science and Business Media LLC

R. Vinotha D. Hepsiba L. D. Vijay Anand J. Andrew R. Jennifer Eunice

Scientific Reports

2024

Dysarthric speech has several pathological characteristics, such as discontinuous pronunciation, uncontrolled volume, slow speech, explosive pronunciation, improper pauses, excessi...

Recent Advances in Dysarthric Speech Recognition: Approaches and Datasets

Dysarthria is a neuromotor speech disorder that results from physical disability and limits speech intelligibility. Dysarthric speakers can make use of speech recognition systems t...

Multimodal Emotion Recognition and Human Computer Interaction for AI-Driven Mental Health Support (Preprint)

BACKGROUND Mental health has become one of the most urgent global health issues of the twenty-first century. The World Health Organization (WHO) reports tha...

A Comprehensive Survey of Automatic Dysarthric Speech Recognition

Automatic dysarthric speech recognition (DSR) is very crucial for many human computer interaction systems that enables the human to interact with machine in natural way. The object...

AFM signal model for dysarthric speech classification using speech biomarkers

Neurological disorders include various conditions affecting the brain, spinal cord, and nervous system which results in reduced performance in different organs and muscles througho...

Comparative Analysis of Deep Learning Models for Dysarthric Speech Detection

Abstract Dysarthria is a speech communication disorder that is associated with neurological impairments. In order to detect this disorder from speech, we present an experim...

Selection of Injectable Drug Product Composition using Machine Learning Models (Preprint)

BACKGROUND As of July 2020, a Web of Science search of “machine learning (ML)” nested within the search of “pharmacokinetics or pharmacodynamics” yielded over 100...

CREATING LEARNING MEDIA IN TEACHING ENGLISH AT SMP MUHAMMADIYAH 2 PAGELARAN ACADEMIC YEAR 2020/2021

The pandemic Covid-19 currently demands teachers to be able to use technology in teaching and learning process. But in reality there are still many teachers who have not been able ...

Related Results