Learned Text Representation for Amharic Information Retrieval and Natural Language Processing

Over the past few years, word embeddings and bidirectional encoder representations from transformers (BERT) models have brought better solutions to learning text representations for natural language processing (NLP) and other tasks. Many NLP applications rely on pre-trained text representations, which has led to the development of a number of neural network language models for various languages. However, this is not the case for Amharic, which is known to be a morphologically complex and under-resourced language. Usable pre-trained models for automatic Amharic text processing are not available. This paper investigates learned text representations for information retrieval and NLP tasks using word embeddings and BERT language models. We explored the most commonly used methods for word embeddings, including word2vec, GloVe, and fastText, as well as the BERT model. We investigated the performance of query expansion using word embeddings. We also analyzed the use of a pre-trained Amharic BERT model for masked language modeling, next sentence prediction, and text classification tasks. Amharic ad hoc information retrieval test collections that contain word-based, stem-based, and root-based text representations were used for evaluation. We conducted a detailed empirical analysis of the usability of word embeddings and BERT models on word-based, stem-based, and root-based corpora. Experimental results show that word-based query expansion and language modeling perform better than their stem-based and root-based counterparts, and that fastText outperforms the other word embeddings on the word-based corpus.
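The abstract mentions query expansion using word embeddings without giving the procedure. As a rough illustration only, the sketch below shows one common way such expansion can work: each query term is supplemented with its nearest neighbours by cosine similarity. The function names (`cosine`, `expand_query`), the parameter `k`, and the toy vectors are all hypothetical and are not taken from the paper; the English tokens are placeholders for vectors that would, in practice, come from a word2vec, GloVe, or fastText model trained on an Amharic corpus.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_query(query_terms, embeddings, k=2):
    """Append up to k nearest-neighbour terms for each query term.

    Assumption: `embeddings` maps a term to its vector. Terms without
    a vector are kept in the query but not expanded.
    """
    expanded = list(query_terms)
    for term in query_terms:
        if term not in embeddings:
            continue
        scored = [
            (cosine(embeddings[term], vec), cand)
            for cand, vec in embeddings.items()
            if cand != term and cand not in expanded
        ]
        scored.sort(reverse=True)  # most similar candidates first
        expanded.extend(cand for _, cand in scored[:k])
    return expanded


# Made-up 3-dimensional vectors, for illustration only.
toy_embeddings = {
    "coffee": [0.9, 0.1, 0.0],
    "tea": [0.8, 0.2, 0.1],
    "export": [0.1, 0.9, 0.2],
    "trade": [0.2, 0.8, 0.3],
    "river": [0.0, 0.1, 0.9],
}

expanded = expand_query(["coffee", "export"], toy_embeddings, k=1)
print(expanded)
```

Under this sketch, the same routine could be run over word-based, stem-based, or root-based vocabularies simply by swapping the embedding table, which is the kind of comparison the abstract describes.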

MDPI AG

Tilahun Yeshambel Josiane Mothe Yaregal Assabie

Information

2023

Developing an audio search engine for Amharic speech web resources

Abstract While general-purpose search engines primarily serve English-language content, the web has seen enormous growth in non-resource-rich languages like Amhar...

Amharic Adhoc Information Retrieval System Based on Morphological Features

Information retrieval (IR) is one of the most important research and development areas due to the explosion of digital data and the need of accessing relevant information from huge...

Sleep Habits and Occurrence of Lowback Pain among Craftsmen

Developing Amharic Sign Language Recognition Model for Amharic Characters Using Deep Learning Approach

Abstract Hearing-impaired people use Sign Language to communicate with each other as well as with other communities. Usually, they are unable to communicate with normal peo...

Učinak poučavanja razrednomu jeziku u izobrazbi nastavnika njemačkoga

The actual use of classroom language is principally limited to the classroom environment. As far as foreign language learning is concerned, the classroom often turns out to be the ...

Coreference Resolution for Amharic Text using Bidirectional Encoder Representation from Transformer (BERT)

Abstract Coreference resolution is the process of finding an entity which refers to the same entity in a text. In coreference resolution similar entities are men...

Related Results