Early detection of university dropout with cluster analysis and machine learning classification techniques

This study addresses student dropout in the Faculty of Sciences and Technologies of the National University of Caaguazú, Paraguay, by building an early warning model based on academic factors. Following a data science methodology, academic records were characterized and analyzed with cluster analysis, using the elbow method to choose the number of student segments. Several predictive machine learning models were fitted, including logistic regression, decision trees, and k-nearest neighbors, and evaluated with precision, recall, and F1 score to determine how well they classify academic status. The cluster analysis identified four well-defined clusters: early dropout, late dropout, thesis stage, and graduate. The models averaged 88% accuracy. They were trained only on academic data (grades obtained in courses). The data cover four degree programs from 2012 to 2021: Computer Engineering, Civil Engineering, Electronics Engineering, and Electrical Engineering. The resulting early warning model for student dropout uses estimates based on relevant academic factors extracted from the faculty's academic database.

The academic database was characterized using data science techniques. The elbow method was first used to determine the optimal number of clusters, which identified four groups. The student population was segmented into graduates, early dropouts (students who left before completing five years of their degree), late dropouts (students who left after five years), and students who completed the curriculum but have yet to present their End of Degree Project.
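The elbow-method step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the source does not say which clustering algorithm, features, or elbow criterion were used. The sketch assumes k-means with a deterministic farthest-first initialization, synthetic two-feature points arranged in four groups, and a hypothetical rule that stops at the first k where adding one more cluster reduces the within-cluster sum of squares by less than 5% of its value at k = 1.

```python
# Synthetic, hypothetical data: four tight groups of 2-D points.
# These are NOT the study's features; they only illustrate the technique.
CENTERS = [(1.0, 1.0), (2.0, 9.0), (8.0, 2.0), (10.0, 7.0)]
OFFSETS = [(0.0, 0.0), (0.3, 0.0), (-0.3, 0.0), (0.0, 0.3), (0.0, -0.3)]
points = [(cx + dx, cy + dy) for cx, cy in CENTERS for dx, dy in OFFSETS]


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first_init(data, k):
    """Deterministic seeding: start at the first point, then repeatedly
    add the point that is farthest from all centers chosen so far."""
    centers = [data[0]]
    while len(centers) < k:
        centers.append(
            max(data, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    return centers


def kmeans_inertia(data, k, max_iter=100):
    """Run Lloyd's k-means and return the within-cluster sum of squares."""
    centers = farthest_first_init(data, k)
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in data:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group; an empty group
        # keeps its previous center.
        new_centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in data)


def elbow(inertia_by_k, min_gain=0.05):
    """Assumed elbow rule: return the first k for which moving to the next
    k lowers inertia by less than min_gain times the inertia at the
    smallest k."""
    ks = sorted(inertia_by_k)
    total = inertia_by_k[ks[0]]
    for k, nxt in zip(ks, ks[1:]):
        if (inertia_by_k[k] - inertia_by_k[nxt]) / total < min_gain:
            return k
    return ks[-1]


inertias = {k: kmeans_inertia(points, k) for k in range(1, 7)}
elbow_k = elbow(inertias)
print(inertias)
print("chosen number of clusters:", elbow_k)
```

On this synthetic data the rule selects four clusters. On real academic records the inertia curve is usually less sharp, so the 5% threshold would need tuning or replacing with visual inspection of the curve.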
The cluster analysis gave a clearer picture of how students are distributed across academic outcomes.

Predictive machine learning models were tuned and evaluated on four different data configurations in a series of training and prediction experiments. The third experiment, which combined data from students in states 2 and 5 through the third year, was the most effective. This combination produced a more homogeneous data set that better represents academic success, allowing the models to identify more accurately the patterns and key factors that predict successful academic outcomes.

Within the third experiment, the best model varied by degree program:

For Computer Engineering, K-Nearest Neighbors (KNN) performed best, with accuracy, precision, and recall of 0.896 and F1 of 0.895. Decision Tree (DT) performed worst, with accuracy and recall of 0.793, precision of 0.853, and F1 of 0.814.

For Electrical Engineering, DT performed best, with accuracy and recall of 0.980, precision of 0.981, and F1 of 0.979. KNN performed worst, with accuracy and recall of 0.823, precision of 0.805, and F1 of 0.814.

For Civil Engineering, DT also performed best, with accuracy and recall of 0.968, precision of 0.976, and F1 of 0.970. KNN performed worst, with accuracy, recall, and F1 of 0.843 and precision of 0.850.

For Electronics Engineering, Logistic Regression (LR) and KNN performed best, with accuracy and recall of 0.888, precision of 0.898, and F1 of 0.882. DT had lower accuracy and recall (0.777), although its precision of 0.809 was somewhat higher than that of the RF and SVM models (0.740).
The conclusions highlight the performance of different models in early identification of at-risk students, further proposing the integration of socioeconomic and psychological factors for future research in the field.
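In every result reported above, accuracy and recall are equal. That is what support-weighted averaging of per-class recall produces, although the source does not state which averaging was used. The sketch below assumes weighted averaging and uses made-up labels and predictions (the class names `dropout` and `graduate` are hypothetical) to show how the four metrics could be computed and why the two values coincide.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1.

    Each per-class score is weighted by the share of true samples that
    belong to that class (its support)."""
    n = len(y_true)
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n

    precision = recall = f1 = 0.0
    for label in sorted(support):
        tp = sum(t == label and p == label for t, p in pairs)
        predicted = sum(p == label for p in y_pred)
        p_l = tp / predicted if predicted else 0.0
        r_l = tp / support[label]
        f_l = 2 * p_l * r_l / (p_l + r_l) if (p_l + r_l) else 0.0
        weight = support[label] / n
        precision += weight * p_l
        recall += weight * r_l
        f1 += weight * f_l

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up example: 6 true dropouts (5 caught), 4 true graduates (2 caught).
y_true = ["dropout"] * 6 + ["graduate"] * 4
y_pred = ["dropout"] * 5 + ["graduate"] + ["graduate"] * 2 + ["dropout"] * 2

scores = weighted_scores(y_true, y_pred)
print(scores)
```

Weighted recall sums the correctly classified samples of each class and divides by the total sample count, which is the definition of accuracy. Precision and F1 do not share this property, which matches the pattern in the reported figures.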

Softaliza Tecnologias LTDA

Juan Vicente Bogado Machuca, Nathalia Romina González Duarte

Ibero-Latin American Congress on Computational Methods in Engineering (CILAMCE)

2025


Abstract: In this manuscript, the combination of IoT and Multilayer Hybrid Dropout Deep-learning Model for waste image categorization is proposed to categorize the wastes as bio wast...

Constructing a VANET based on cluster chains

Summary: The paper proposes a scheme on constructing a vehicular ad‐hoc network based on cluster chains. In the cluster construction algorithm, the distance from a potential cluster ...

Enhancing Non-Formal Learning Certificate Classification with Text Augmentation: A Comparison of Character, Token, and Semantic Approaches

Aim/Purpose: The purpose of this paper is to address the gap in the recognition of prior learning (RPL) by automating the classification of non-formal learning certificates using d...

Regional directions of the cluster development strategy in the field of tourism and hospitality

The monograph consists of an introduction, 5 chapters, lists of used sources for each chapter separately; contains 31 tables and 37 figures. The monograph examines the theoretical ...

Antenatal care dropout and associated factors among mothers delivering in public health facilities of Dire Dawa Town, Eastern Ethiopia

Abstract. Introduction: More than two-thirds of the pregnant women in Africa have at least one antenatal care contact with a health care provider. However, to achieve the full life-sav...

Predictive Model for School Dropout in Chimborazo Province, Ecuador

Introduction: School dropout is a complex problem influenced by various factors, including disparities in educational quality, inadequate infrastructure, and adverse socio-cultural...

Dropout Prediction Model for College Students in MOOCs Based on Weighted Multi-feature and SVM

Due to the COVID-19 pandemic, MOOCs have become a popular form of learning for college students. However, unlike traditional face-to-face courses, MOOCs offer little faculty super...

Advanced frameworks for fraud detection leveraging quantum machine learning and data science in fintech ecosystems

The rapid expansion of the fintech sector has brought with it an increasing demand for robust and sophisticated fraud detection systems capable of managing large volumes of financi...
