
Improving Indic code-mixed to monolingual translation using Mixed Script Augmentation, Generation & Transfer Learning

Code-mixed language, written in Roman script, is prevalent on social media platforms in multilingual nations. Translating code-mixed text into a single language is necessary for social media analysis, content filtering, and targeted advertising. Training translation models from scratch is difficult because code-mixed resources are scarce and real-time code-mixed sentences are extremely noisy. Multilingual state-of-the-art language models are now routinely used for multilingual applications. However, these models handle code-mixed sentences poorly, because such sentences are usually written entirely in Roman script yet contain words from at least two languages. The paper proposes two data augmentation techniques to improve code-mixed to monolingual translation, one based on script augmentation and the other on code-mixed sentence generation. The proposed approach converts code-mixed sentences into a ‘Mixed Script form’, in which the native-language words are restored to their native script. The novelty of the work is that it lets the multilingual language models apply their linguistic competence in each language, preserving context in the monolingual sentences, which was not possible with earlier models. One mT5 model performs denoising and mixed-script switching, and a second mT5 model then performs the monolingual translation. Additional code-mixed sentences are generated from monolingual parallel inputs using a simple generation technique. The approach is evaluated on two Indic language pairs, Hindi-English and Bengali-English, and in each case it outperforms direct uni-script (Roman) code-mixed to monolingual translation.
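To make the ‘Mixed Script form’ concrete, here is a minimal sketch of the idea for a Hindi-English sentence. It is not the authors' method: the paper performs denoising and script switching with an mT5 model, whereas this sketch substitutes a plain dictionary lookup. The lexicon `ROMAN_TO_DEVANAGARI`, the function name `to_mixed_script`, and the example sentence are all hypothetical and exist only for illustration.

```python
# Toy lexicon of romanized Hindi words (hypothetical, illustration only).
# The paper uses an mT5 model for this step; a lookup table stands in here.
ROMAN_TO_DEVANAGARI = {
    "mujhe": "मुझे",
    "yeh": "यह",
    "bahut": "बहुत",
    "pasand": "पसंद",
    "hai": "है",
}


def to_mixed_script(sentence: str, lexicon: dict = ROMAN_TO_DEVANAGARI) -> str:
    """Restore native script for known romanized words; keep other tokens as-is."""
    tokens = sentence.split()
    # Words found in the lexicon are switched to Devanagari; anything else
    # (here, the English word "movie") stays in Roman script.
    return " ".join(lexicon.get(token.lower(), token) for token in tokens)


if __name__ == "__main__":
    print(to_mixed_script("mujhe yeh movie bahut pasand hai"))
```

The output keeps the English token in Roman script and renders the Hindi tokens in Devanagari, which is the kind of input the abstract says the second translation model receives. A lookup table cannot handle spelling variation or noise in real code-mixed text, which is presumably why the paper relies on a learned model for this step.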
