
XGBoost for Educational Performance: Comparing SMOTE and SMOTE-TOMEK on Imbalanced Data

Class imbalance poses a critical challenge in educational performance prediction, particularly in accurately identifying at-risk students within small datasets. This study evaluates three data balancing strategies integrated with the XGBoost classifier: no resampling (the imbalanced baseline), SMOTE (Synthetic Minority Over-sampling Technique), and SMOTE-TOMEK. The data are academic records from 161 Indonesian junior high school students. The objective is to assess how well each strategy improves minority-class recognition and overall model reliability. The results show that SMOTE-TOMEK significantly outperforms the other methods, achieving 75% recall for the minority class, a gain of 25 percentage points (a 50% relative improvement) over the 50% recall of both SMOTE and the baseline. It also recorded the highest scores across key metrics: AUC-PR (0.9874), Matthews Correlation Coefficient (0.6786), and G-mean (0.8345). Notably, SMOTE-TOMEK identified one additional at-risk student for every four cases without compromising majority-class precision (93%), highlighting its practical utility in real-world educational interventions. In contrast, while SMOTE improved probabilistic metrics such as AUC-ROC (0.9286), it failed to reduce false negatives and still missed 50% of at-risk students, the same rate as the baseline. The optimal SMOTE-TOMEK configuration used shallower decision trees and stronger regularization, which is consistent with its role in reducing noise and improving generalization. Statistical significance of the results was confirmed using Wilcoxon signed-rank tests at a 0.01 significance level. These findings underscore the importance of hybrid resampling techniques in educational AI pipelines. SMOTE-TOMEK not only enhances predictive accuracy but also translates model performance into actionable insights for supporting marginalized learners. 
The study advocates for its prioritization in future educational data science applications, especially where early identification of vulnerable students is essential for targeted academic support and policy formulation.
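The minority-class recall, Matthews Correlation Coefficient, G-mean, and majority-class precision quoted above can all be derived from a single confusion matrix. The sketch below shows the arithmetic. The abstract does not publish the confusion-matrix counts, so the counts used here (3 true positives, 1 false negative, 13 true negatives, 1 false positive) are an assumption: a hypothetical test split picked only because it reproduces the reported SMOTE-TOMEK figures. The function name `imbalance_metrics` is likewise illustrative and not taken from the study.

```python
import math


def imbalance_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Compute imbalance-aware metrics from confusion-matrix counts.

    The positive class is the minority (at-risk) class.
    """
    recall = tp / (tp + fn)            # share of at-risk students found
    specificity = tn / (tn + fp)       # share of not-at-risk students kept
    majority_precision = tn / (tn + fn)  # precision for the majority class
    g_mean = math.sqrt(recall * specificity)
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0
    return {
        "recall": recall,
        "specificity": specificity,
        "majority_precision": majority_precision,
        "g_mean": g_mean,
        "mcc": mcc,
    }


# Hypothetical counts (NOT reported in the abstract), chosen to be
# consistent with the SMOTE-TOMEK figures it quotes.
smote_tomek = imbalance_metrics(tp=3, fn=1, tn=13, fp=1)

print(f"recall             = {smote_tomek['recall']:.2f}")              # 0.75
print(f"majority precision = {smote_tomek['majority_precision']:.2f}")  # 0.93
print(f"G-mean             = {smote_tomek['g_mean']:.4f}")              # 0.8345
print(f"MCC                = {smote_tomek['mcc']:.4f}")                 # 0.6786
```

With only four minority-class cases in such a split, each additional at-risk student identified moves recall by 25 percentage points, which is why small test sets make these metrics sensitive to single predictions.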

Related Results

Advanced Re-Sampling Techniques for Multi-Class Imbalanced Classification
Imbalanced classification is a common problem in machine learning, where one class significantly outnumbers the others. This imbalance leads to biased model performance, where the ...
Ensemble learning with imbalanced data handling in the early detection of capital markets
Research aims: This study aims to create an early detection model to predict events in the Indonesian capital market. Design/Methodology/Approach: A quantitative study comparing ens...
Comparative analysis of resampling algorithms in the prediction of stroke diseases
Stroke disease is a serious cause of death globally. Early predictions of the disease will save a lot of lives but most of the clinical datasets are imbalanced in nature including ...
Comparative Analysis of Resampling Techniques for Class Imbalance in Financial Distress Prediction Using XGBoost
One of the key challenges in financial distress data is class imbalance, where the data are characterized by a highly imbalanced ratio between the number of distressed and non-dist...
Application of Machine Learning Techniques for Customer Churn Prediction in the Banking Sector
Aim/Purpose: Previous studies have primarily focused on comparing predictive models without considering the impact of data preprocessing on model performance. Therefore, this study...
Integration of the Decision Tree Method and SMOTE for Classifying Traffic Accident Data (Integrasi Metode Decision Tree dan SMOTE untuk Klasifikasi Data Kecelakaan Lalu Lintas)
Traffic accidents are events that cannot be predicted with certainty and can result in fatalities, minor injuries, serious injuries, or losses ...
Classification of Village Development Index Status in West Java Using the XGBoost Algorithm (Klasifikasi Status Indeks Desa Membangun Jawa Barat Menggunakan Algoritma XGBoost)
Abstract. Data from Statistics Indonesia (2020) show that rural areas in West Java have an average poverty rate of 10.64%, which is higher than the 7.79% rate in urban areas. To est...
Investigating Data Balancing Effects for Enhanced Behavioural Risk Detection in Cervical Cancer Using BiGRU: A Pilot Study
Cervical cancer is a growth of cells that starts in the cervix, the part of the uterus that connects to the vagina, and it is most commonly associated with the human papillomavirus. It is an easily treata...
