
XGBoost for Educational Performance: Comparing SMOTE and SMOTE-TOMEK on Imbalanced Data

Class imbalance poses a critical challenge in educational performance prediction, particularly when at-risk students must be identified from small datasets. This study evaluates three data balancing strategies integrated with the XGBoost classifier: a baseline trained on the imbalanced data, SMOTE (Synthetic Minority Over-sampling Technique), and SMOTE-TOMEK. The data are academic records from 161 Indonesian junior high school students. The objective is to assess how effectively each strategy improves minority-class recognition and overall model reliability. The results show that SMOTE-TOMEK significantly outperforms the other methods, achieving 75% recall for the minority class, a gain of 25 percentage points (a 50% relative improvement) over both SMOTE and the baseline, each of which reached 50%. It also recorded the highest scores across key metrics: AUC-PR (0.9874), Matthews Correlation Coefficient (0.6786), and G-mean (0.8345). Notably, SMOTE-TOMEK identified one additional at-risk student for every four cases without compromising majority-class precision (93%), highlighting its practical utility in real-world educational interventions. In contrast, while SMOTE improved probabilistic metrics such as AUC-ROC (0.9286), it failed to reduce false negatives, retaining the baseline's 50% miss rate for at-risk students. The optimal SMOTE-TOMEK configuration used shallower decision trees and stronger regularization, supporting its effectiveness in reducing noise and enhancing generalization. Statistical significance of the results was confirmed using Wilcoxon signed-rank tests at a 0.01 significance level. These findings underscore the importance of hybrid resampling techniques in educational AI pipelines. SMOTE-TOMEK not only enhances predictive accuracy but also translates model performance into actionable insights for supporting marginalized learners.
The study advocates for its prioritization in future educational data science applications, especially where early identification of vulnerable students is essential for targeted academic support and policy formulation.
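
As a rough check on how the reported metrics relate to one another, the sketch below recomputes them from a small confusion matrix. The four counts are an assumption: the abstract does not give them, and they were chosen only because they are consistent with the reported SMOTE-TOMEK figures (75% minority recall, 93% majority precision, MCC 0.6786, G-mean 0.8345). The variable names are illustrative as well.

```python
import math

# Hypothetical test-set counts, with "at-risk" as the positive (minority) class.
# These are NOT stated in the abstract; they are an assumed confusion matrix
# that happens to be consistent with the reported SMOTE-TOMEK metrics.
tp, fn = 3, 1    # at-risk students: correctly flagged / missed
tn, fp = 13, 1   # not-at-risk students: correctly cleared / wrongly flagged

# Per-class recall, and precision of the majority (not-at-risk) predictions.
recall_minority = tp / (tp + fn)
recall_majority = tn / (tn + fp)
precision_majority = tn / (tn + fn)

# G-mean: geometric mean of the two class recalls.
g_mean = math.sqrt(recall_minority * recall_majority)

# Matthews Correlation Coefficient from the four counts.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

# The abstract states that the baseline and SMOTE miss half of the
# at-risk students, i.e. a minority recall of 50%.
baseline_recall = 0.50
absolute_gain = recall_minority - baseline_recall   # in recall points
relative_gain = absolute_gain / baseline_recall     # relative to baseline

print(f"minority recall    = {recall_minority:.2%}")
print(f"majority precision = {precision_majority:.2%}")
print(f"G-mean             = {g_mean:.4f}")
print(f"MCC                = {mcc:.4f}")
print(f"recall gain        = {absolute_gain:.0%} absolute, {relative_gain:.0%} relative")
```

Under these assumed counts, moving from 50% to 75% recall means catching three of four at-risk students instead of two, which matches the "one additional at-risk student for every four cases" in the abstract.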

International Forum of Researchers and Lecturers

Ucta Pradema Sanjaya Purwosiwi Pandansari Ade Pratama Rina Purwanti Noor Laila Ramadhani Ari Eko Budiyanto Kristanti Kristanti Iwan Setiawan Wibisono Kustiyono Kustiyono Agung Wibowo Sri Mujiyono

Proceeding International Collaborative Conference on Multidisciplinary Science

2026


Related Results

Imbalanced classification is a common problem in machine learning, where one class significantly outnumbers the others. This imbalance leads to biased model performance, where the ...

Ensemble learning with imbalanced data handling in the early detection of capital markets

Research aims: This study aims to create an early detection model to predict events in the Indonesian capital market. Design/Methodology/Approach: A quantitative study comparing ens...

Comparative analysis of resampling algorithms in the prediction of stroke diseases

Stroke is a serious cause of death globally. Early prediction of the disease can save many lives, but most clinical datasets are imbalanced in nature, including ...

Comparative Analysis of Resampling Techniques for Class Imbalance in Financial Distress Prediction Using XGBoost

One of the key challenges in financial distress data is class imbalance, where the data are characterized by a highly imbalanced ratio between the number of distressed and non-dist...

Application of Machine Learning Techniques for Customer Churn Prediction in the Banking Sector

Aim/Purpose: Previous studies have primarily focused on comparing predictive models without considering the impact of data preprocessing on model performance. Therefore, this study...

Integrasi Metode Decision Tree dan SMOTE untuk Klasifikasi Data Kecelakaan Lalu Lintas (Integration of the Decision Tree Method and SMOTE for Classifying Traffic Accident Data)

Traffic accidents are events that cannot be predicted with certainty and can result in fatalities, minor injuries, serious injuries, or losses...

Klasifikasi Status Indeks Desa Membangun Jawa Barat Menggunakan Algoritma XGBoost (Classification of West Java Village Development Index Status Using the XGBoost Algorithm)

Abstract. Data from Statistics Indonesia (2020) show that rural areas in West Java have an average poverty rate of 10.64%, which is higher than the 7.79% in urban areas. To est...

Investigating Data Balancing Effects for Enhanced Behavioural Risk Detection in Cervical Cancer Using BiGRU: A Pilot Study

Cervical cancer is a growth of cells that starts in the cervix, the part of the uterus that connects to the vagina; its most common cause is the human papillomavirus. It is an easily treata...
