
TF-IDF-Based Classification of Uzbek Educational Texts

This paper presents a baseline study on automatic Uzbek text classification. Uzbek is a morphologically rich and low-resource language, which makes reliable preprocessing and evaluation challenging. The approach integrates Term Frequency–Inverse Document Frequency (TF–IDF) representation with three conventional methods: logistic regression (LR), k-Nearest Neighbors (k-NN), and cosine similarity (CS, implemented as a 1-NN retrieval model). The objective is to categorize school learning materials by grade level (grades 5–11) to support improved alignment between curricular texts and students’ intellectual development. A balanced dataset of Uzbek school textbooks across different subjects was constructed, preprocessed with standard NLP tools, and converted into TF–IDF vectors. Experimental results on the internal test set of 70 files show that LR achieved 92.9% accuracy (precision = 0.94, recall = 0.93, F1 = 0.93), while CS performed comparably with 91.4% accuracy (precision = 0.92, recall = 0.91, F1 = 0.92). In contrast, k-NN obtained only 28.6% accuracy, confirming its weakness in high-dimensional sparse feature spaces. External evaluation on seven Uzbek literary works further demonstrated that LR and CS yielded consistent and interpretable grade-level mappings, whereas k-NN results were unstable. Overall, the findings establish reliable baselines for Uzbek educational text classification and highlight the potential of extending beyond lexical overlap toward semantically richer models in future work.
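To make the cosine-similarity baseline concrete, here is a minimal sketch of TF–IDF vectors combined with 1-NN retrieval, written with the Python standard library only. It is an illustration under stated assumptions, not the authors' implementation: the English toy sentences, the grade labels attached to them, whitespace tokenization, the smoothed IDF formula, and all function names are hypothetical, and the paper's actual preprocessing is not described in the abstract.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for real Uzbek preprocessing.
    return text.lower().split()


def fit_idf(docs_tokens):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1 (an assumed variant).
    n = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(tokens, idf):
    # Sparse, L2-normalised TF-IDF vector; terms unseen in training are dropped.
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def predict_1nn(text, train_vecs, train_labels, idf):
    # 1-NN retrieval: return the label of the most similar training document.
    query = tfidf_vector(tokenize(text), idf)
    best = max(range(len(train_vecs)), key=lambda i: cosine(query, train_vecs[i]))
    return train_labels[best]


# Hypothetical toy corpus with grade labels (not data from the paper).
train_texts = [
    "plants need water and sunlight to grow",
    "animals live in forests rivers and deserts",
    "the derivative measures the rate of change of a function",
    "integrals compute the area under a curve of a function",
]
train_labels = [5, 5, 11, 11]

idf = fit_idf([tokenize(t) for t in train_texts])
train_vecs = [tfidf_vector(tokenize(t), idf) for t in train_texts]

test_texts = ["plants grow with water", "the area under a function curve"]
test_labels = [5, 11]

preds = [predict_1nn(t, train_vecs, train_labels, idf) for t in test_texts]
accuracy = sum(p == y for p, y in zip(preds, test_labels)) / len(test_labels)
print(preds, accuracy)  # [5, 11] 1.0

# The reported accuracies are consistent with 65, 64, and 20 correct files out of 70.
reported = [round(100 * correct / 70, 1) for correct in (65, 64, 20)]
print(reported)  # [92.9, 91.4, 28.6]
```

Because the prediction depends only on shared vocabulary, a classifier of this kind rewards lexical overlap with the training texts, which is the limitation the abstract points to when it mentions semantically richer models.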

