Urdu Short Paraphrase Detection at Sentence Level

Paraphrase detection systems examine the relationship between two text fragments and classify them as paraphrased when they convey the same idea, and as non-paraphrased otherwise. Previous research has focused mainly on developing paraphrase detection resources for English. There have been very few efforts on paraphrase detection in South Asian languages, and no research has been conducted on sentence-level paraphrase detection in Urdu, a low-resource language. This is mainly due to the unavailability of corpora that focus on the sentence level. The available related studies on Urdu focus only on text reuse detection at the passage and document levels. Therefore, this study aims to develop a large-scale, manually annotated benchmark corpus for Urdu paraphrase detection at the sentence level, based on real cases from journalism. The proposed Urdu Sentential Paraphrases (USP) corpus contains 4,900 sentences (2,941 paraphrased and 1,959 non-paraphrased), manually collected from Urdu newspapers. As a secondary contribution, several techniques were proposed, developed, and compared, including Word Embedding (WE), Sentence Transformers (ST), and feature-fusion techniques. N-gram is treated as the baseline technique. The experimental results indicate that the proposed feature-fusion technique is the most suitable for the Urdu paraphrase detection task. Furthermore, performance increases when the features of the proposed technique (ST) and the baseline (N-gram) are combined for the classification task. In addition, the proposed techniques have been applied to the UPPC corpus to check their performance at the document level. The best result was obtained using the feature-fusion technique (F1 = 0.855). The corpus is freely available to download for research purposes.
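The abstract does not give implementation details, so the following is only a minimal sketch of how an N-gram baseline and a feature-fusion step could look. Everything in it is an assumption made for illustration: the Jaccard overlap measure, the choice of n = 1 to 3, whitespace tokenization, and the function names `word_ngrams`, `ngram_overlap`, `fuse_features`, and `f1_score` are hypothetical and not taken from the paper. The sentence-embedding similarity is passed in as a plain number rather than computed, since the abstract does not name the model used.

```python
from typing import List, Sequence, Set, Tuple


def word_ngrams(tokens: Sequence[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(a: str, b: str, n: int) -> float:
    """Jaccard overlap of word n-grams (an assumed similarity measure)."""
    grams_a = word_ngrams(a.split(), n)
    grams_b = word_ngrams(b.split(), n)
    union = grams_a | grams_b
    if not union:
        # Both sentences are shorter than n tokens: no n-grams to compare.
        return 0.0
    return len(grams_a & grams_b) / len(union)


def fuse_features(a: str, b: str, embedding_similarity: float) -> List[float]:
    """Concatenate n-gram overlaps (n = 1..3) with an embedding similarity.

    `embedding_similarity` stands in for a score from a sentence-embedding
    model; it is supplied by the caller, not computed here.
    """
    lexical = [ngram_overlap(a, b, n) for n in (1, 2, 3)]
    return lexical + [embedding_similarity]


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

The fused vector returned by `fuse_features` would then be passed to a classifier of one's choice. Concatenation is used here because it is the simplest way to combine lexical and semantic evidence; the paper may combine features differently.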

Association for Computing Machinery (ACM)

Hamza Hafeez, Iqra Muneer, Muhammad Sharjeel, Muhammad Adnan Ashraf, Rao Muhammad Adeel Nawab

ACM Transactions on Asian and Low-Resource Language Information Processing

2023
