Incremental Learning Static Word Embeddings for Low-Resource NLP

Developing Natural Language Processing (NLP) for Low-Resource Languages (LRL) remains challenging because of limited data availability, linguistic diversity, and computational constraints. Many NLP solutions rely on complex models and on high-volume, high-quality data, which makes them difficult to apply in low-resource settings. Motivated by the challenges and insights reported in previous work, this paper proposes an underexplored approach, an Incremental Learning (IL) Static Word Embedding (SWE) system, and applies it to the low-resource case of Indonesia’s local languages. Using basic models and hyperparameter sweeps, the system is tested in a scenario in which 10 different local languages are incorporated into the model incrementally. The simulations indicate that this type of model resists Catastrophic Forgetting (CF) very well and delivers competitive performance on the downstream task of sentiment analysis. In terms of F1 score, the proposed model exceeds the other baseline models and even rivals heavy Transformer models. It can therefore be considered a promising holistic solution for low-resource NLP. Future work could explore the model’s behavior on finer-grained NLP tasks, under different IL settings, or with more advanced models.
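The abstract does not state how resistance to Catastrophic Forgetting was quantified. As an illustration only, the sketch below computes average forgetting, a measure commonly used in the continual-learning literature: for each previously learned language, the best score it ever reached before the final step minus its score after the final step. The function name `average_forgetting`, the `scores` layout, and all numbers are hypothetical and are not taken from the paper.

```python
from typing import List, Optional


def average_forgetting(scores: List[List[Optional[float]]]) -> float:
    """Average forgetting over incrementally learned languages.

    scores[t][j] is the evaluation score (for example F1) on language j
    after the model has been trained on languages 0..t. Entries with
    j > t are None because language j has not been seen yet.
    This layout is an assumption made for this sketch.
    """
    n_steps = len(scores)
    if n_steps < 2:
        # With a single language there is nothing to forget.
        return 0.0

    final = scores[-1]
    drops = []
    # The most recently added language cannot have been forgotten yet,
    # so only languages 0..n_steps-2 are considered.
    for j in range(n_steps - 1):
        best_earlier = max(scores[t][j] for t in range(j, n_steps - 1))
        drops.append(best_earlier - final[j])
    return sum(drops) / len(drops)


# Made-up scores for three languages added one at a time.
scores = [
    [0.80, None, None],
    [0.78, 0.75, None],
    [0.77, 0.74, 0.70],
]
print(round(average_forgetting(scores), 4))
```

A value near zero means earlier languages keep their performance as new ones are added; a negative value would mean earlier languages actually improved.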

Science Research Society

Nathan J. Lee

Journal of Information Systems Engineering and Management

2025
