MDVC corpus: empowering Moroccan Darija speech recognition
Automatic speech recognition (ASR) technology has significantly transformed human-machine interaction, but its coverage of the world's languages and dialects remains limited. Moroccan Darija, the Moroccan Arabic dialect, has long been underrepresented in language technology. To address this gap, we present a novel corpus of audio files accompanied by meticulously transcribed Moroccan Darija speech. The corpus comprises 1,000 hours of diverse content, featuring multiple Moroccan accents, extracted from 80 YouTube channels. To standardize the representation of Moroccan Darija in our corpus, we worked to establish consistent writing norms and conventions. In addition to creating the dataset, we fine-tuned the Wav2Vec2 model on the Moroccan Darija voice corpus (MDVC), achieving a word error rate (WER) of 9%. This article discusses the current state of Moroccan Darija research, highlighting the scarcity of resources and the need for robust ASR systems. Our contribution offers a valuable resource for researchers and developers, and by standardizing the Darija language, we strive to improve ASR systems for this low-resource language.
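For readers unfamiliar with the metric, the sketch below shows how a word error rate such as the 9% reported above is conventionally defined: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the system output, divided by the number of reference words. The abstract does not say which tool or text normalization was used for scoring, so the function name `word_error_rate` and the sample strings are illustrative assumptions, not part of the MDVC work.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (an assumption, not the authors' scoring code).
    Tokenization is plain whitespace splitting, with no normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Placeholder strings, not taken from the corpus:
# one substitution plus one deletion over four reference words.
print(word_error_rate("a b c d", "a x c"))
```

Under this definition, a WER of 9% corresponds to roughly nine word-level errors per hundred reference words. Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.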
Institute of Advanced Engineering and Science