
Urdu Toxicity Detection: A Multi-Stage and Multi-Label Classification Approach

Social media enables freedom of expression but is often misused for abuse and hate. Detecting such content is crucial, especially in under-resourced languages such as Urdu. To address this challenge, this paper first builds a comprehensive multi-label dataset, the Urdu Toxicity Corpus (UTC). Second, it develops an Urdu toxicity detection model that identifies toxic content in Urdu text written in the Nastaliq script. The proposed framework preprocesses the gathered data and then applies feature engineering using term frequency-inverse document frequency (TF-IDF), bag-of-words, and n-gram techniques. The synthetic minority over-sampling technique (SMOTE) is used to address class imbalance, and manual annotation is performed to ensure label accuracy. Four machine learning models, namely logistic regression, support vector machine, random forest (RF), and gradient boosting, are applied to the preprocessed data. Deep learning models, including long short-term memory (LSTM), bidirectional LSTM, and gated recurrent unit, are also applied to UTC for classification. RF outperforms the other models on all evaluation metrics, achieving a precision, recall, F1-score, and accuracy of 0.97, 0.99, 0.98, and 0.99, respectively. The proposed model shows strong potential for detecting rude, offensive, abusive, and hate speech content in user comments written in Urdu Nastaliq.
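The feature-engineering step named in the abstract (TF-IDF over bag-of-words and n-gram counts) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the abstract does not give the tokenizer, the n-gram range, or the exact weighting formula, so whitespace tokenization, unigrams plus bigrams, and a smoothed IDF are all assumptions here, and the function names `word_ngrams` and `tfidf_vectors` are hypothetical.

```python
import math
from collections import Counter


def word_ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max.

    Assumption: whitespace tokenization. str.split() splits on any
    Unicode whitespace, so it also works on Urdu text.
    """
    tokens = text.split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf_vectors(docs, n_max=2):
    """Return one {ngram: weight} dict per document.

    tf  = n-gram count in the document / total n-grams in the document
    idf = ln((1 + N) / (1 + df)) + 1   (smoothed variant; an assumption)
    """
    # Bag-of-n-grams counts for each document.
    bags = [Counter(word_ngrams(d, n_max)) for d in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents each n-gram occurs.
    df = Counter()
    for bag in bags:
        df.update(bag.keys())

    vectors = []
    for bag in bags:
        total = sum(bag.values())
        vectors.append({
            g: (c / total) * (math.log((1 + n_docs) / (1 + df[g])) + 1)
            for g, c in bag.items()
        })
    return vectors


# Toy placeholder comments (not taken from UTC).
docs = ["good day friend", "bad day"]
vectors = tfidf_vectors(docs)
```

In this toy example, "day" occurs in both comments and therefore receives a lower weight in the first comment than "good", which occurs in only one. The resulting sparse vectors are the kind of input that a classifier such as a random forest would consume after class rebalancing.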

