TF-IDF Based Classification of Uzbek Educational Texts

This paper presents an approach to automatic classification of Uzbek texts. Uzbek is a morphologically rich, low-resource language. The approach combines Term Frequency–Inverse Document Frequency (TF-IDF) representations with conventional machine learning and similarity-based methods. The aim is to categorize learning materials by school grade level, so that materials can be better aligned with student learning outcomes. For this research, a dataset of school textbooks for grades 5 to 11, covering different subjects, was collected. The texts were preprocessed using standard natural language processing (NLP) tools and transformed into TF-IDF vectors. These vectors were used to build three common classification models: Logistic Regression (LR), k-Nearest Neighbors (k-NN), and Cosine Similarity (CS). Each new input text is compared with the grade-level textbook corpus, and the grade with the highest similarity is selected. This provides an estimate of the appropriate intellectual level for the material. The experimental findings indicate that Logistic Regression achieved 82% accuracy, and Cosine Similarity performed slightly better at 85.7%. In contrast, the k-NN method achieved only 22% accuracy, indicating its low applicability for Uzbek text classification in this setting. Overall, the proposed approach demonstrates practical value for pedagogical purposes and potential applicability to wider document analysis tasks.
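The similarity-based step described in the abstract (compare a new text with each grade-level corpus and pick the grade with the highest cosine similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the function names, and the two-grade toy corpus are all assumptions made for illustration, and the real study used full textbooks and standard NLP preprocessing.

```python
import math
from collections import Counter

# Hypothetical toy corpus: one short string per grade (NOT the paper's dataset).
TOY_CORPUS = {
    5: "son qo'shish ayirish son kasr",
    11: "funksiya hosila integral funksiya limit",
}

def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for real preprocessing.
    return text.lower().split()

def build_idf(docs_tokens):
    # Smoothed IDF: ln((1 + N) / (1 + df)) + 1, one common variant.
    n = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf_vector(tokens, idf):
    # Raw term counts weighted by IDF; terms outside the vocabulary are dropped.
    counts = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def predict_grade(text, grade_corpus):
    # Score the input against every grade corpus and return the best match.
    grades = list(grade_corpus)
    docs = [tokenize(grade_corpus[g]) for g in grades]
    idf = build_idf(docs)
    query = tfidf_vector(tokenize(text), idf)
    scores = {g: cosine(query, tfidf_vector(d, idf)) for g, d in zip(grades, docs)}
    return max(scores, key=scores.get), scores

if __name__ == "__main__":
    grade, scores = predict_grade("hosila va integral", TOY_CORPUS)
    print(grade, scores)
```

In this sketch each grade is represented by a single concatenated document, so classification reduces to one cosine comparison per grade. Whether the paper aggregates textbooks this way or compares against individual documents is not stated in the abstract.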

MDPI AG

Khabibulla Madatov, Sapura Sattarova, Jernej Vičič

2025
