RAG Based QA for Low Resource Languages
Abstract
Question Answering (QA) has long been an important research direction in Natural Language Processing (NLP) and artificial intelligence. Most current large language models (LLMs) are fine-tuned to improve performance on specific NLP tasks, including question answering, using new datasets, because the base models do not give sufficiently accurate results. Fine-tuning is an effective way to adapt an LLM to a specific domain; however, it is limited by the availability of labeled data. To address this problem, we combine Retrieval-Augmented Generation (RAG) with LLMs for question answering over different documents. We use Samuael/llama-2-7b-tebot-amharic, a publicly available autoregressive LLM from Hugging Face that has been fine-tuned for question answering, as our base model. We use RAG because it augments the LLM with knowledge from external documents in formats such as text, DOC, XML, HTML, and PDF. We also fine-tune the model with the LoRA method on the Amharic instruction dataset (AmharicInstructiondataset) from Hugging Face, which contains more than 100,000 records across different domains. Fine-tuning in AI is the process of adjusting the weights and parameters of a pre-trained model on new data to improve its performance on a specific task [2]. Experimental results on a 50-example test set for named entity recognition and question answering show superior performance compared to general-purpose LLMs. We term the fine-tuned version of Samuael/llama-2-7b-tebot-amharic llama-2-AmLLM; it is optimized for question answering. After fine-tuning, the model achieves a BLEU score of 0.4432 on the given test set, significantly exceeding the previous state of the art for this task.
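As an illustration of the retrieve-then-generate pattern the abstract describes, the sketch below shows one hypothetical RAG step: rank document chunks against the question, then prepend the best match to the prompt before generation. The bag-of-words cosine retriever, the documents, and all function names here are illustrative assumptions, not the paper's implementation (which would use a real retriever and the llama-2-AmLLM model for generation).

```python
import math
import re
from collections import Counter

def bow(text):
    # Bag-of-words term counts over lowercase word tokens.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(n * b.get(t, 0) for t, n in a.items())
    norm_a = math.sqrt(sum(n * n for n in a.values()))
    norm_b = math.sqrt(sum(n * n for n in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, documents, k=1):
    # Rank document chunks by similarity to the question; a production
    # system would use dense embeddings and a vector index instead.
    q = bow(question)
    return sorted(documents, key=lambda d: cosine(q, bow(d)), reverse=True)[:k]

def build_prompt(question, documents):
    # Prepend the retrieved context to the question; the resulting prompt
    # would then be passed to the LLM (the generation call is omitted here).
    context = "\n".join(retrieve(question, documents))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

docs = [
    "Addis Ababa is the capital city of Ethiopia.",
    "Amharic is written in the Ge'ez script.",
]
prompt = build_prompt("What is the capital of Ethiopia?", docs)
```

Because the retrieved context is injected at inference time, this pattern adds document knowledge without requiring any additional labeled training data, which is the motivation the abstract gives for pairing RAG with fine-tuning.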
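For context on why LoRA suits the limited-labeled-data setting, the arithmetic sketch below shows how low-rank adapters shrink the number of trainable parameters: instead of updating a full d x k weight matrix, LoRA trains factors B (d x r) and A (r x k). The 4096 x 4096 projection size is the commonly cited shape of Llama-2-7B attention layers, and the rank r = 8 is an assumed value; the paper does not specify its LoRA configuration.

```python
def lora_trainable_params(d, k, r):
    # Full fine-tuning updates every entry of the d x k weight matrix.
    full = d * k
    # LoRA trains only the low-rank factors B (d x r) and A (r x k),
    # so the trainable count is r * (d + k).
    lora = r * (d + k)
    return full, lora

# Assumed shapes: a 4096 x 4096 attention projection, hypothetical rank 8.
full, lora = lora_trainable_params(4096, 4096, 8)
print(full, lora, f"{lora / full:.4%}")  # 16777216 65536 0.3906%
```

At these assumed sizes, the adapter holds well under one percent of the parameters of the layer it adapts, which is why LoRA fine-tuning remains feasible on a 7B-parameter model with a modest instruction dataset.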
Springer Science and Business Media LLC