
Detection of Language from Roman Urdu and English Multilingual Corpus

PURPOSE: This study proposes and validates a model for identifying the language of text in a mixed Roman Urdu and English multilingual corpus collected from social media sites.

BACKGROUND: Language identification (or language detection) is the problem of determining which languages are present in a corpus of written text that includes two or more languages. Identifying the language of social media text is a common requirement, with numerous applications in natural language processing and computational linguistics, such as word embedding generation, emotion analysis, and part-of-speech tagging.

METHODOLOGY: The English and Roman Urdu corpus was obtained from social media websites and cross-media platforms such as Facebook, Twitter, Google+, Instagram, WhatsApp, and Messenger. A dictionary-based baseline with SVM and a Bi-directional LSTM were used to identify languages in the collected corpus.

RESULTS: The Bi-directional LSTM model performed better, with an accuracy of 97.98%.

CONCLUSION: The corpus of English and Roman Urdu text was collected from social media websites. The training text was fed to a Bi-directional LSTM to determine whether it is English or Roman Urdu. Word-level language recognition with the Bi-directional LSTM showed improved results on the Roman Urdu and English corpus.
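The dictionary-based baseline named in the methodology can be illustrated with a minimal sketch. This is not the authors' implementation: the word lists, the tag names ("en", "ur", "unk"), and the helper functions below are all hypothetical and chosen only to show the idea of word-level lookup followed by an accuracy check against gold labels.

```python
# Minimal sketch of a dictionary-based word-level language tagger.
# ASSUMPTION: these toy word lists are hypothetical, not the study's dictionaries.
ENGLISH_WORDS = {"i", "am", "going", "to", "the", "market", "today", "good", "main"}
ROMAN_URDU_WORDS = {"main", "aaj", "bazaar", "ja", "raha", "hoon", "acha", "hai"}


def tag_word(word):
    """Return 'en', 'ur', or 'unk' for a single token using dictionary lookup."""
    token = word.lower().strip(".,!?\"'")
    in_english = token in ENGLISH_WORDS
    in_roman_urdu = token in ROMAN_URDU_WORDS
    if in_english and not in_roman_urdu:
        return "en"
    if in_roman_urdu and not in_english:
        return "ur"
    # Either missing from both lists or present in both (ambiguous).
    return "unk"


def tag_sentence(sentence):
    """Tag every whitespace-separated token in a sentence."""
    return [(word, tag_word(word)) for word in sentence.split()]


def accuracy(predicted, gold):
    """Fraction of positions where the predicted tag matches the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


if __name__ == "__main__":
    for word, tag in tag_sentence("Main aaj market ja raha hoon"):
        print(word, tag)
```

In this sketch a token such as "main", which appears in both toy lists, is tagged "unk" because lookup alone cannot decide between the two languages. Resolving such cases requires surrounding context, which is one plausible reason a sequence model like the Bi-directional LSTM would outperform a lookup baseline.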

Qeios Ltd

Syed Immamul Ansarullah, Sajadul Hassan Kumhar, Sami Alshmrany

2024



