
Learned Text Representation for Amharic Information Retrieval and Natural Language Processing

Over the past few years, word embeddings and Bidirectional Encoder Representations from Transformers (BERT) models have brought better solutions for learning text representations in natural language processing (NLP) and other tasks. Many NLP applications rely on pre-trained text representations, which has led to the development of neural network language models for a variety of languages. This is not yet the case for Amharic, a morphologically complex and under-resourced language for which usable pre-trained models are not available. This paper investigates learned text representations for information retrieval and NLP tasks using word embeddings and BERT language models. We explored the most commonly used word embedding methods, namely word2vec, GloVe, and fastText, as well as the BERT model. We evaluated the performance of query expansion based on word embeddings and analyzed the use of a pre-trained Amharic BERT model for masked language modeling, next sentence prediction, and text classification. Amharic ad hoc information retrieval test collections containing word-based, stem-based, and root-based text representations were used for evaluation, and we conducted a detailed empirical analysis of the usability of word embeddings and BERT models on each of these representations. Experimental results show that word-based query expansion and language modeling perform better than their stem-based and root-based counterparts, and that fastText outperforms the other word embeddings on the word-based corpus.
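The abstract gives no implementation details, but the embedding-based query expansion it describes is typically done by training embeddings on the retrieval corpus and adding each query term's nearest neighbours to the query. The sketch below illustrates that setup with gensim's FastText; the corpus file name, hyperparameters, and the expand_query helper are assumptions for demonstration, not the configuration used in the paper.

# Illustrative sketch of embedding-based query expansion; corpus path,
# hyperparameters, and the expansion rule are assumptions, not the
# paper's actual pipeline.
from gensim.models import FastText

# One whitespace-tokenized Amharic document per line (hypothetical file).
with open("amharic_corpus.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f if line.strip()]

model = FastText(
    sentences,
    vector_size=300,  # embedding dimensionality
    window=5,         # context window size
    min_count=5,      # drop very rare tokens
    sg=1,             # skip-gram objective
    epochs=10,
)

def expand_query(query_terms, k=3):
    """Append the k nearest neighbours of each query term to the query."""
    expanded = list(query_terms)
    for term in query_terms:
        # fastText builds vectors from character n-grams, so it can embed
        # unseen surface forms, which matters for Amharic's rich morphology.
        for neighbour, _score in model.wv.most_similar(term, topn=k):
            if neighbour not in expanded:
                expanded.append(neighbour)
    return expanded

# The expanded term list would then be passed to the retrieval engine.
print(expand_query(["ትምህርት", "ቤት"]))  # a two-term example query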
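Similarly, the masked language modeling and text classification experiments can be illustrated with Hugging Face transformers pipelines. The model identifiers and example sentences below are placeholders, not the checkpoints or data used in the paper.

# Illustrative use of a pre-trained BERT-style Amharic model; the model
# identifiers are placeholders, not the checkpoints trained for the paper.
from transformers import pipeline

# Masked language modeling: predict the token hidden behind [MASK].
fill_mask = pipeline("fill-mask", model="path/to/amharic-bert")
for candidate in fill_mask("አዲስ አበባ የኢትዮጵያ [MASK] ናት።"):  # "Addis Ababa is Ethiopia's [MASK]."
    print(candidate["token_str"], round(candidate["score"], 3))

# Text classification: a fine-tuned head (e.g. news topics) on the same encoder.
classifier = pipeline("text-classification", model="path/to/amharic-bert-news")
print(classifier("አዲስ አበባ የኢትዮጵያ ዋና ከተማ ናት።"))  # classify an example sentence

# Next sentence prediction would use BertForNextSentencePrediction
# from the same library with a pair of input sentences.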

Related Results

Hubungan Perilaku Pola Makan dengan Kejadian Anak Obesitas (The Relationship between Eating Pattern Behavior and the Incidence of Childhood Obesity)
Amharic Adhoc Information Retrieval System Based on Morphological Features
Information retrieval (IR) is one of the most important research and development areas due to the explosion of digital data and the need to access relevant information from huge...
PRACTICALITY OF ALTERNATIVE ASSESSMENTS: FROM AMHARIC LANGUAGE INSTRUCTORS’ VIEW POINTS
The purpose of this study was to examine the practicality of alternative assessment in the Ethiopian higher-education Amharic language context. The study also endeavors to ...
On Flores Island, do "ape-men" still exist? https://www.sapiens.org/biology/flores-island-ape-men/
E-Press and Oppress
From elephants to ABBA fans, silicon to hormone, the following discussion uses a new research method to look at printed text, motion pictures and a te...
Syntax of Amharic ideophones
This study is on Amharic ideophones, a subject that has not been described well in the syntax of Amharic. The data used for the analysis are collected from natural settings of the ...
Rodnoosjetljiv jezik na primjeru njemačkih časopisa Brigitte i Der Spiegel (Gender-Sensitive Language as Exemplified by the German Magazines Brigitte and Der Spiegel)
On the basis of the comparative analysis of texts of the German biweekly magazine Brigitte and the weekly magazine Der Spiegel and under the presumption that gender-sensitive langu...
Evaluation of an Amharic-Language translation of Continuity of Care Satisfaction Tool among Postnatal Mothers in Ethiopia
Abstract Background: Beginning in the 1990s, women’s dissatisfaction with maternity services has been widely reported in the literature. However, there is a lack of consist...
