Automatic Diacritics Restoration for Tunisian Dialect
Modern Standard Arabic and the Arabic dialects are usually written without diacritics. The absence of these marks constitutes a real problem for the automatic processing of such text by NLP tools. Indeed, writing Arabic without diacritics introduces several types of ambiguity. First, a word without diacritics could have many possible meanings depending on its diacritization. Second, the undiacritized surface form of an Arabic word might have as many as 200 readings depending on the complexity of its morphology [12]. In fact, the agglutination property of Arabic might produce a problem that can only be resolved using diacritics. Third, without diacritics a word could have many possible parts of speech (POS) instead of one. This is the case with words that have the same spelling and POS tag but a different lexical sense, or words that have the same spelling but different POS tags and lexical senses [8]. Finally, there is ambiguity at the grammatical level (syntactic ambiguity). In this article, we present the first work that investigates the automatic diacritization of Tunisian Dialect texts. We first describe our annotation guidelines and procedure. Then, we propose two major models: a statistical machine translation (SMT) model and a discriminative model that treats diacritization as a sequence classification task based on Conditional Random Fields (CRF). In the second approach, we integrate POS features to influence the generation of diacritics. Diacritics restoration was performed at both the word and the character levels. The CRF system gave the better results, with a Word Error Rate (WER) of 21.44% against 34.6% for SMT.
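To make the character-level formulation and the WER figures more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that each base letter is labeled with the diacritic marks that follow it (the kind of letter/label pairs a sequence classifier such as a CRF could be trained on) and that WER is the share of words whose diacritized form differs from the reference under a one-to-one word alignment. The function names `to_char_labels`, `strip_diacritics`, and `diacritization_wer` are hypothetical, and the abstract does not specify the exact label set or scoring protocol.

```python
# Arabic diacritic code points U+064B..U+0652 (tanwin, short vowels,
# shadda, sukun). Assumption: this range is the label inventory.
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def to_char_labels(word):
    """Split a diacritized word into (base letter, diacritic string) pairs."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS and pairs:
            # Attach the mark to the preceding base letter.
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def strip_diacritics(text):
    """Return the undiacritized surface form, i.e. the system input."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def diacritization_wer(reference, hypothesis):
    """Fraction of words whose diacritized form differs from the reference.

    Assumption: diacritization never changes tokenization, so words are
    compared position by position without an edit-distance alignment.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref or len(ref) != len(hyp):
        raise ValueError("reference and hypothesis must have the same, non-zero word count")
    errors = sum(r != h for r, h in zip(ref, hyp))
    return errors / len(ref)


# "kataba" ('he wrote'): kaf, ta, ba, each followed by a fatha (U+064E).
word = "\u0643\u064E\u062A\u064E\u0628\u064E"
labels = to_char_labels(word)    # three (letter, fatha) pairs
plain = strip_diacritics(word)   # the three bare letters
```

In this framing, the undiacritized letters are the observed sequence and the diacritic strings are the labels to predict; features such as POS tags would be added per position before training.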
Association for Computing Machinery (ACM)
Related Results
A Study of the Chungcheong Dialect as a Literary Dialect in the Pansori Lyrics of Park Dongjin
This paper examines the Chungcheong dialect in Park Dongjin's pansori editorials from the perspective of “Literary Dialect,” focusing on phonological, morphological, and lexical is...
Functions and Translation of Palestinian Dialect in Ibrahim Nasrallah’s Time of White Horses
The problems that translators of fiction, especially novels, face when translating dialects from one language to another vary because dialects are distinct as much as cultures and ...
Using Diacritics in the Arabic Script of Malay to Scaffold Arab Postgraduate Students in Reading Malay Words
Purpose – This study aims to investigate the use of diacritics in the Arabic script of Malay to help Arab postgraduate students at UKM read Malay words accurately. It ...
Muuttuva ja muuttumaton murre (Changing and Unchanging Dialect)
Dialects have developed into cultural heritage and a tool for identity construction as a result of long processes. In the Pori region as well, dialect literature and the use of dialect have ...
Bukovyna dialect of the village Yuzhynets
The article deals with the description of one dialect as a system. The purpose of this study is to describe the main features of the dialect of the village of Yuzhynets, manifested in oral dialect...
A Study on Busan Dialects and Busan Culture Education
The purpose of this study is to identify the language culture that appeared in the Busan dialect and to find a way to use it for cultural education in the Busan dialect. Until now...
A Method for Arabic Handwritten Diacritics Characters
Optical Character Recognition (OCR) is the process of converting an image representation of a document into an editable format. In addition, people have the ability to recognize...
Pronunciation Errors in Arabic YouTube Videos Narrated by AI
Arabic has three long vowels /a:/, /u:/, /i:/ and three short vowels /a/, /u/, /i/, which are represented by diacritics marked over and under consonant letters. In words that have sh...

