Resampling Techniques for Imbalanced Water Quality Classification

Early detection of water quality status is crucial for preventing water pollution that could harm community health. However, pollution events are infrequent in water quality data, which creates a class imbalance: one class has significantly more observations than the other. Training machine learning models on imbalanced data can result in overfitting, which reduces sensitivity and impairs predictive performance. This study therefore applies several oversampling techniques, including Synthetic Minority Oversampling Technique (SMOTE), Random Oversampling (ROS), Rapidly Converging Gibbs Sampler (RACOG), and Adaptive Synthetic Oversampling (ADASYN), as well as under-sampling techniques such as Random Under-Sampling (RUS), to balance the data before fitting the machine learning models. Secondary data on multiple water quality parameters from the Department of Environment Malaysia were used. The dataset consisted of a binary target variable, the water quality classification (WQC), and nine physicochemical parameters. The performance of artificial neural network (ANN), support vector machine (SVM), and gradient boosting (GB) classifiers trained on the balanced datasets was assessed using balanced accuracy, sensitivity, F-measure, and Area Under the Curve (AUC). Gradient boosting performed best with ROS samples, yielding a balanced accuracy of 89.65%, sensitivity of 82.50%, F-measure of 84.62%, and AUC of 98.35%. In contrast, ANN performed best with RACOG samples, and SVM performed best with ADASYN samples. The best sampling technique differed across models because each machine learning algorithm learns patterns from data in its own way, and each resampling technique addresses class imbalance differently.

Simo Excel PLT

Nur Hanisah Abdul Malek Nur Idalisa Norddin Nor Shahirul Umirah Idris Nurul Syazana Abdul Halim Amal Najihah Muhamad Nor

PaperASIA

2025
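
The abstract above names random oversampling (ROS), random under-sampling (RUS), and three of its evaluation metrics. The sketch below illustrates those ideas in plain Python under stated assumptions and is not the authors' implementation. The function names, the toy data, and the encoding of the polluted class as label 1 are all hypothetical. SMOTE, ADASYN, and RACOG are not shown because they synthesize new samples rather than copying or dropping existing ones. In common practice, resampling is applied to the training split only.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """ROS: copy randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))  # sampling with replacement
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """RUS: keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        rows = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(rows, target):  # sampling without replacement
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


def binary_metrics(y_true, y_pred, positive=1):
    """Sensitivity, specificity, balanced accuracy, and F-measure from a 2x2 confusion matrix."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f_measure = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f_measure": f_measure,
    }


# Toy data (hypothetical): 8 "clean" samples (label 0) and 2 "polluted" samples (label 1).
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_ros, y_ros = random_oversample(X, y)
X_rus, y_rus = random_undersample(X, y)
print("original:", Counter(y))
print("after ROS:", Counter(y_ros))
print("after RUS:", Counter(y_rus))
```

Balanced accuracy is the mean of sensitivity and specificity, so a classifier that predicts only the majority class scores 0.5 on it rather than the inflated value that plain accuracy would report on imbalanced data.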

Related Results

Imbalanced classification is a common problem in machine learning, where one class significantly outnumbers the others. This imbalance leads to biased model performance, where the ...

REBALANCING DATA FOR CANCER-ASSOCIATED THROMBOSIS: COMPARISON OF DIFFERENT RESAMPLING APPROACH

Objective: Cancer-associated thrombosis (CAT) presents a complex challenge in oncology, exacerbated by data imbalances in related datasets that often lead to suboptimal outcomes in...

Measuring Resampling Methods on Imbalanced Educational Dataset’s Classification Performance

Imbalanced data refers to a condition that there is a different size of samples between one class with another class(es). It made the term “majority” class that represents the clas...

Application of Machine Learning Techniques for Customer Churn Prediction in the Banking Sector

Aim/Purpose: Previous studies have primarily focused on comparing predictive models without considering the impact of data preprocessing on model performance. Therefore, this study...

Integrated hydrological modelling for sustainable water allocation planning : Mkomazi Basin, South Africa case study

Allocation of freshwater resources between societal needs and natural ecological systems is of great concern for water managers. This development has challenged decision-makers reg...

Use of Formation Water and Associated Gases and their Simultaneous Utilization for Obtaining Microelement Concentrates Fresh Water and Drinking Water

Abstract Purpose: The invention relates to the oil industry, inorganic chemistry, in particular, to the methods of complex processing of formation water, using flare gas of oil and...

An automated approach for binary classification on imbalanced data

Abstract Imbalanced data is present in various business areas and must be dealt with the appropriate resampling techniques and classification algorithms. However, there is ...

Machine Learning Based on Resampling Approaches and Deep Reinforcement Learning for Credit Card Fraud Detection Systems

The problem of imbalanced datasets is a significant concern when creating reliable credit card fraud (CCF) detection systems. In this work, we study and evaluate recent advances in...
