
TF-IDF-Based Classification of Uzbek Educational Texts

This paper presents a baseline study on automatic Uzbek text classification. Uzbek is a morphologically rich, low-resource language, which makes reliable preprocessing and evaluation challenging. The approach combines Term Frequency–Inverse Document Frequency (TF–IDF) representations with three conventional methods: logistic regression (LR), k-Nearest Neighbors (k-NN), and cosine similarity (CS, implemented as a 1-NN retrieval model). The objective is to categorize school learning materials by grade level (grades 5–11) to support better alignment between curricular texts and students’ intellectual development. A balanced dataset of Uzbek school textbooks across different subjects was constructed, preprocessed with standard NLP tools, and converted into TF–IDF vectors. On the internal test set of 70 files, LR achieved 92.9% accuracy (precision = 0.94, recall = 0.93, F1 = 0.93), while CS performed comparably with 91.4% accuracy (precision = 0.92, recall = 0.91, F1 = 0.92). In contrast, k-NN obtained only 28.6% accuracy, confirming its weakness in high-dimensional sparse feature spaces. External evaluation on seven Uzbek literary works further showed that LR and CS yielded consistent and interpretable grade-level mappings, whereas k-NN results were unstable. Overall, the findings establish reliable baselines for Uzbek educational text classification and highlight the potential of moving beyond lexical overlap toward semantically richer models in future work.
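To make the cosine-similarity baseline concrete, the following is a minimal sketch of TF–IDF vectorization followed by 1-NN retrieval, written with the Python standard library only. It is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the function names, and the tiny English toy corpus with grade labels 5 and 9 are all assumptions chosen for illustration. The paper's Uzbek preprocessing pipeline is not described here and is not reproduced.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for real Uzbek preprocessing.
    return text.lower().split()


def fit_idf(docs_tokens):
    # Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1.
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    # Sparse, L2-normalized TF-IDF vector; out-of-vocabulary tokens are dropped.
    term_freq = Counter(t for t in tokens if t in idf)
    vec = {t: count * idf[t] for t, count in term_freq.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def cosine(a, b):
    # Both vectors are unit length, so the dot product equals cosine similarity.
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(t, 0.0) for t, v in a.items())


def predict_1nn(query_vec, train_vecs, train_labels):
    # Return the label of the single most similar training document.
    best = max(range(len(train_vecs)), key=lambda i: cosine(query_vec, train_vecs[i]))
    return train_labels[best]


def accuracy(predicted, expected):
    # Fraction of predictions that match the expected labels.
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


# Hypothetical toy corpus (not the paper's data): two "grade 5" and two "grade 9" texts.
train_texts = [
    "the cat sat on the mat",
    "a dog and a cat play",
    "photosynthesis converts light energy into chemical energy",
    "cells use energy from chemical reactions",
]
train_labels = [5, 5, 9, 9]

train_tokens = [tokenize(t) for t in train_texts]
idf = fit_idf(train_tokens)
train_vecs = [tfidf_vector(tokens, idf) for tokens in train_tokens]

test_texts = ["the dog sat", "chemical energy in cells"]
test_labels = [5, 9]
predictions = [
    predict_1nn(tfidf_vector(tokenize(t), idf), train_vecs, train_labels)
    for t in test_texts
]
toy_accuracy = accuracy(predictions, test_labels)
```

In this sketch each test text is assigned the grade of its nearest training document, which is the sense in which CS acts as a 1-NN retrieval model. As a side check on the reported figures, 92.9%, 91.4%, and 28.6% on 70 files would correspond to 65, 64, and 20 correct predictions respectively.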

