An Oversampling Technique with Descriptive Statistics
Oversampling is often applied as a means of obtaining a better knowledge model. Several oversampling methods based on synthetic instances have been proposed, and SMOTE is one of the representative methods that can generate synthetic instances of the minority class. Until now, oversampled data has conventionally been used to train machine learning models without statistical analysis, so it is not certain that those models will perform well on unseen cases. Because such synthetic data differs from the original data, we may ask how closely it resembles the original data and whether the oversampled data is worth using to train machine learning models. For this purpose, I conducted this study on the wine dataset from the UCI machine learning repository, one of the datasets most frequently used in research on knowledge discovery models. I generated synthetic data iteratively using SMOTE and compared it with the original wine data using box plots and t-tests to see whether it was statistically reliable. Moreover, since supplying more high-quality training instances increases the probability of obtaining a more accurate model, I also checked whether a better random forest model can be obtained by generating much more synthetic data than the original data and using it for training. The experiments showed that small-scale oversampling produced synthetic data whose statistical characteristics differed slightly from the original data, whereas a relatively high oversampling rate produced data with statistical characteristics similar to the original data, that is, high-quality training data. Training random forests on this data raised accuracy from 97.75%, obtained with the original data alone, to 100%. Therefore, by supplying additional statistically reliable synthetic data through oversampling, it was possible to create a machine learning model with a higher predictive rate.
World Scientific and Engineering Academy and Society (WSEAS)
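
The abstract outlines a three-step procedure: generate synthetic instances with SMOTE, check them against the original wine data with box plots and t-tests, and retrain random forests on the enlarged data. The following is a minimal Python sketch of that procedure, assuming scikit-learn, SciPy, and imbalanced-learn; the oversampling targets, SMOTE neighborhood size, and random forest settings are illustrative assumptions, not the paper's exact configuration.

# A minimal sketch of the pipeline described in the abstract; parameter
# choices below are assumptions, not the paper's reported settings.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE

# UCI wine data: 178 instances, 13 features, 3 classes.
X, y = load_wine(return_X_y=True)

# Oversample every class to an assumed target size (here 5x the largest
# class) so that the synthetic portion greatly outnumbers the original data.
target = {c: 5 * np.bincount(y).max() for c in np.unique(y)}
X_res, y_res = SMOTE(sampling_strategy=target, k_neighbors=5,
                     random_state=0).fit_resample(X, y)

# In imbalanced-learn the resampled array begins with the original rows,
# so the synthetic portion is everything after the first len(X) rows.
X_syn = X_res[len(X):]

# Compare each feature of the synthetic data with the original data using
# Welch's t-test; large p-values indicate no detectable difference in means.
for j in range(X.shape[1]):
    t, p = ttest_ind(X[:, j], X_syn[:, j], equal_var=False)
    print(f"feature {j:2d}: t = {t:6.2f}, p = {p:.3f}")

# Train random forests on the original vs. the oversampled data and
# compare cross-validated accuracy.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
print("original   :", cross_val_score(rf, X, y, cv=5).mean())
print("oversampled:", cross_val_score(rf, X_res, y_res, cv=5).mean())

A box plot of each feature (for example via matplotlib's boxplot over X[:, j] and X_syn[:, j]) would complement the t-tests by showing whether the spread and quartiles of the synthetic data also match the original distribution.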

