
Albanian Text Classification: Bag of Words Model and Word Analogies

Abstract

Background: Text classification is an important task in information retrieval. Its objective is to assign new text documents to a set of predefined classes using supervised algorithms.

Objectives: We focus on text classification of Albanian news articles using two approaches.

Methods/Approach: In the first approach, the words in a collection are treated as independent components, each assigned a corresponding vector in the vector space. We used nine classifiers from the scikit-learn package, training them on 80% of the news articles and testing accuracy on the remaining 20%. In the second approach, words are treated according to their semantic and syntactic similarities, assuming that a word is formed by character n-grams. For this we used fastText, a hierarchical classifier that considers local word order as well as sub-word information. We measured the accuracy of each classifier separately and also analyzed training and testing times.

Results: The bag-of-words model outperforms fastText when the dataset is not large. fastText performs better when classifying multi-label text.

Conclusions: News articles can serve as a benchmark for testing classification algorithms on Albanian texts. The best results are achieved with a bag-of-words model, with an accuracy of 94%.
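The first approach in the abstract (bag-of-words vectors, scikit-learn classifiers, an 80/20 train/test split, accuracy and timing) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the toy corpus is invented, and the two classifiers shown are assumptions, since the abstract does not name the nine it used.

```python
import time

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy corpus for illustration only (not the paper's dataset).
texts = [
    "skuadra fitoi ndeshjen me tre gola",
    "futbollisti shënoi gol në ndeshje",
    "trajneri ndryshoi skuadrën para ndeshjes",
    "kampionati i futbollit fillon sot",
    "lojtari u dëmtua gjatë ndeshjes",
    "qeveria miratoi ligjin e ri",
    "parlamenti votoi për buxhetin",
    "ministri foli për zgjedhjet",
    "zgjedhjet do të mbahen në qershor",
    "kryeministri takoi deputetët në parlament",
]
labels = ["sport"] * 5 + ["politike"] * 5

# 80% of the documents for training, 20% for testing, as in the abstract.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)

# Assumption: two common scikit-learn classifiers stand in for the nine
# classifiers mentioned in the abstract, which are not listed there.
candidates = {
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, clf in candidates.items():
    # CountVectorizer builds the bag-of-words matrix: one column per word,
    # ignoring word order.
    model = make_pipeline(CountVectorizer(), clf)

    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = model.predict(X_test)
    test_seconds = time.perf_counter() - start

    scores[name] = accuracy_score(y_test, predictions)
    print(f"{name}: accuracy={scores[name]:.2f} "
          f"train={train_seconds:.4f}s test={test_seconds:.4f}s")
```

On a corpus this small the accuracy figures are not meaningful; the sketch only shows the shape of the experiment (vectorize, split, fit, score, time).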
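The second approach assumes that a word is formed by character n-grams. A small pure-Python sketch of that decomposition is given below. It follows the convention fastText uses of wrapping each word in `<` and `>` boundary markers before slicing; the function name `char_ngrams` and the n-gram lengths are illustrative choices, not values taken from the paper, and fastText additionally keeps the whole word as its own unit, which is omitted here.

```python
def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list[str]:
    """Return all character n-grams of `word` with lengths min_n..max_n.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    produce n-grams distinct from word-internal ones.
    """
    marked = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


# Trigrams of the English word "where".
print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# Two inflected forms of the same Albanian noun share most of their
# trigrams, which is how sub-word information links related word forms.
shared = set(char_ngrams("shkolla", 3, 3)) & set(char_ngrams("shkollë", 3, 3))
print(sorted(shared))
```

Because inflected forms overlap in their n-grams, a model built on these units can relate word forms it has rarely or never seen as whole words, which is relevant for a morphologically rich language such as Albanian.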
