Urdu Short Paraphrase Detection at Sentence Level

Paraphrase detection systems examine the relationship between two text fragments and classify them as paraphrased when they convey the same idea, and as non-paraphrased otherwise. Previous research has mainly focused on developing paraphrase detection resources for English. There have been very few efforts on paraphrase detection in South Asian languages, and no research has been conducted on sentence-level paraphrase detection in Urdu, a low-resource language. This is mainly due to the unavailability of corpora that focus on the sentence level; the available related studies on Urdu address only text reuse detection at the passage and document levels. Therefore, this study aims to develop a large-scale, manually annotated benchmark corpus for Urdu paraphrase detection at the sentence level, based on real cases from journalism. The proposed Urdu Sentential Paraphrases (USP) corpus contains 4,900 sentences (2,941 paraphrased and 1,959 non-paraphrased), manually collected from Urdu newspapers. As a secondary contribution, several techniques were proposed, developed, and compared, including Word Embedding (WE), Sentence Transformers (ST), and feature-fusion techniques. N-gram is treated as the baseline technique. The experimental results indicate that the proposed feature-fusion technique is the most suitable for the Urdu paraphrase detection task. Furthermore, performance increases when the features of the proposed technique (ST) and the baseline (N-gram) are combined for the classification task. In addition, the proposed techniques were applied to the UPPC corpus to check their performance at the document level. The best result was obtained using the feature-fusion technique (F1 = 0.855). Our corpus is available and free to download for research purposes.
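The abstract does not spell out how the N-gram and Sentence Transformer features are combined, so the following is only a minimal sketch of what feature fusion for a sentence pair could look like. It assumes word n-gram Jaccard overlap as the lexical feature and cosine similarity between sentence vectors as the semantic feature. The names `ngram_overlap`, `toy_embed`, and `fused_features` are hypothetical, and `toy_embed` is a character-trigram stand-in for a real sentence-transformer model, not the authors' method.

```python
import math
from collections import Counter


def word_ngrams(tokens, n):
    """Return the set of word n-grams in a list of tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(a, b, n):
    """Jaccard overlap of word n-grams between two sentences.

    Returns 0.0 when neither sentence has any n-gram of this order.
    """
    grams_a = word_ngrams(a.split(), n)
    grams_b = word_ngrams(b.split(), n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def toy_embed(sentence):
    """ASSUMPTION: stand-in for a sentence-transformer embedding.

    Builds a sparse count vector of character trigrams so the sketch
    runs without any external model.
    """
    text = f" {sentence} "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def fused_features(a, b, embed=toy_embed, max_n=3):
    """Concatenate lexical (n-gram) and semantic (embedding) features."""
    lexical = [ngram_overlap(a, b, n) for n in range(1, max_n + 1)]
    semantic = [cosine(embed(a), embed(b))]
    return lexical + semantic
```

In this sketch each sentence pair becomes a four-value vector (unigram, bigram, and trigram overlap plus one similarity score) that could be passed to any standard classifier. Swapping `toy_embed` for a real multilingual sentence encoder would require that encoder to return vectors compatible with the `cosine` helper.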

Related Results

Services of Radio Pakistan in the Promotion of Urdu Language & Literature
Radio is one of the most amazing and effective inventions of the last century. Radio Pakistan came into being with the independence of Pakistan in 1947. From the very beginning, Ra...
Study on Electromagnetic Shielding of Infrared /Visible Optical Window
In allusion to electromagnetic radiation damage that existed in daily life, social safety and military field, electromagnetic shielding technology of infrared and infrared optical ...
Thematic Roles of Sentence Elements Found in "Me Before You" Movie
Sentence is very important in learning language. Sentence is used in every language activity. For understanding sentence, we must study structure of the sentence, elements that for...
Identification of Boosters as Metadiscourse across Punjabi and Urdu Languages: A Machine Translation Approach
Boosters are said to function appropriately as metadiscourse features across languages. This study, therefore, aimed to investigate the functions and appropriateness of the metadis...
Translation and Cross-Cultural Adaptation of the International Duke Activity Status Index in the Urdu Version
Background: Cardiovascular diseases are a leading cause of morbidity worldwide, necessitating effective tools for functional capacity assessment. The Duke Activity Status Index is ...
MUTUAL TRANSLATIONS OF URDU AND PUNJABI
Human being uses language to convey their messages, emotions, feelings, observations and experiences to others. For this, language was used as spoken and written language, and diff...
“Mir Taqi Mir”. A fragment from the History of Urdu Poetry “Water of Life” of Muhammad Husayn Azad
The article is a translation into Russian of the chapter from the “Water of Life” by Muhammad Husain Azad (1830–1910). This is the chapter about the greatest Urdu poet Mir Taki Mir...
Identifying Links Between Latent Memory and Speech Recognition Factors
Objectives: The link between memory ability and speech recognition accuracy is often examined by correlating summary measures of performance across various tasks, but i...