Urdu Short Paraphrase Detection at Sentence Level
Paraphrase detection systems examine the relationship between two text fragments and classify them as paraphrased when they convey the same idea, and as non-paraphrased otherwise. Researchers have so far focused mainly on developing paraphrase detection resources for English, and there have been very few efforts for South Asian languages. In particular, no research has been conducted on sentence-level paraphrase detection in Urdu, a low-resource language. This is mainly due to the lack of corpora that focus on the sentence level: the available studies on Urdu address only text reuse detection at the passage and document levels. Therefore, this study aims to develop a large-scale, manually annotated benchmark corpus for Urdu paraphrase detection at the sentence level, based on real cases from journalism. The proposed Urdu Sentential Paraphrases (USP) corpus contains 4,900 sentences (2,941 paraphrased and 1,959 non-paraphrased), manually collected from Urdu newspapers. As a secondary contribution, several techniques were proposed, developed, and compared, including Word Embedding (WE), Sentence Transformers (ST), and feature-fusion techniques, with N-gram treated as the baseline. The experimental results indicate that the proposed feature-fusion technique is the most suitable for the Urdu paraphrase detection task, and that performance increases when the features of the proposed technique (ST) and the baseline (N-gram) are combined for classification. In addition, the proposed techniques were applied to the UPPC corpus to check their performance at the document level. The best result was obtained with the feature-fusion technique (F1 = 0.855). Our corpus is available and free to download for research purposes.
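The abstract names an N-gram baseline and a feature-fusion technique but does not specify the exact features, so the following is only a minimal sketch under stated assumptions: word n-grams over whitespace tokens, Jaccard overlap as the n-gram score, and fusion by simple concatenation with one embedding-based similarity value. All function names and the `embedding_similarity` parameter are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Set, Tuple


def word_ngrams(tokens: Sequence[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set, b: Set) -> float:
    """Jaccard overlap of two sets; defined here as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ngram_features(s1: str, s2: str, max_n: int = 3) -> List[float]:
    """One Jaccard score per n-gram order 1..max_n (assumed baseline features)."""
    t1, t2 = s1.split(), s2.split()  # assumption: whitespace tokenization
    return [jaccard(word_ngrams(t1, n), word_ngrams(t2, n))
            for n in range(1, max_n + 1)]


def fuse(ngram_feats: List[float], embedding_similarity: float) -> List[float]:
    """Assumed fusion scheme: append one embedding similarity to the n-gram scores."""
    return ngram_feats + [embedding_similarity]
```

In such a setup, `embedding_similarity` would typically be the cosine similarity between two sentence vectors from a sentence encoder, and the fused vector would be passed to any standard classifier that labels the pair as paraphrased or non-paraphrased.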
Association for Computing Machinery (ACM)
Related Results
Study on Electromagnetic Shielding of Infrared /Visible Optical Window
In allusion to electromagnetic radiation damage that existed in daily life, social safety and military field, electromagnetic shielding technology of infrared and infrared optical ...
Thematic Roles of Sentence Elements Found in "Me Before You" Movie
Sentence is very important in learning language. Sentence is used in every language activity. For understanding sentence, we must study structure of the sentence, elements that for...
Identification of Boosters as Metadiscourse across Punjabi and Urdu Languages: A Machine Translation Approach
Boosters are said to function appropriately as metadiscourse features across languages. This study, therefore, aimed to investigate the functions and appropriateness of the metadis...
MUTUAL TRANSLATIONS OF URDU AND PUNJABI
Human being uses language to convey their messages, emotions, feelings, observations and experiences to others. For this, language was used as spoken and written language, and diff...
“Mir Taqi Mir”. A fragment from the History of Urdu Poetry “Water of Life” of Muhammad Husayn Azad
The article is a translation into Russian of the chapter from the “Water of Life” by Muhammad Husain Azad (1830–1910). This is the chapter about the greatest Urdu poet Mir Taki Mir...
Identifying Links Between Latent Memory and Speech Recognition Factors
Objectives: The link between memory ability and speech recognition accuracy is often examined by correlating summary measures of performance across various tasks, but i...
Literary Paradigms in the Conception of South Asian Muslim Identity: Muhammad Iqbal and Muhammad Hasan Askari
Literature has always played a synthesizing role in the history of Islam. The same can be said for the Urdu language and literature. Urdu was produced as a result of the mingling o...
An Analytical Study of Sociolinguistic Variations in Urdu Language
This paper will talk about the sociolinguistic variations in Urdu Language. The study uses the qualitative method it takes account of the interviews conducted by the various senior...


