An Oversampling Technique with Descriptive Statistics
Oversampling is often applied as a means of obtaining a better knowledge model. Several oversampling methods based on synthetic instances have been proposed, and SMOTE is one of the representative methods that can generate synthetic instances of the minority class. Until now, oversampled data has conventionally been used to train machine learning models without statistical analysis, so it is not certain that those models will perform well on unseen cases. Because such synthetic data differs from the original data, we may ask how closely it resembles the original data and whether the oversampled data is worth using to train machine learning models. For this purpose, I conducted this study on the wine dataset from the UCI machine learning repository, one of the datasets most frequently used in research on knowledge discovery models. I generated synthetic data iteratively using SMOTE and compared it with the original wine data using box plots and t-tests to see whether it was statistically reliable. Moreover, since supplying more high-quality training instances increases the probability of obtaining a more accurate model, I also checked whether a better random forest model can be obtained by generating much more synthetic data than the original data and using it for training. The experiments showed that small-scale oversampling produced synthetic data whose statistical characteristics differed slightly from the original data, whereas a relatively high oversampling rate produced data with statistical characteristics similar to the original data, that is, high-quality training data. Training random forests on this data raised accuracy from 97.75%, obtained with the original data alone, to 100%. Therefore, by supplying additional statistically reliable synthetic data through oversampling, it was possible to create a machine learning model with a higher predictive rate.
World Scientific and Engineering Academy and Society (WSEAS)
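
The abstract outlines a three-step procedure: generate synthetic instances with SMOTE, check them against the original wine data with box plots and t-tests, and retrain random forests on the enlarged data. The following is a minimal Python sketch of that procedure, assuming scikit-learn, SciPy, and imbalanced-learn; the oversampling targets, SMOTE neighborhood size, and random forest settings are illustrative assumptions, not the paper's exact configuration.

# A minimal sketch of the pipeline described in the abstract; parameter
# choices below are assumptions, not the paper's reported settings.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE

# UCI wine data: 178 instances, 13 features, 3 classes.
X, y = load_wine(return_X_y=True)

# Oversample every class to an assumed target size (here 5x the largest
# class) so that the synthetic portion greatly outnumbers the original data.
target = {c: 5 * np.bincount(y).max() for c in np.unique(y)}
X_res, y_res = SMOTE(sampling_strategy=target, k_neighbors=5,
                     random_state=0).fit_resample(X, y)

# In imbalanced-learn the resampled array begins with the original rows,
# so the synthetic portion is everything after the first len(X) rows.
X_syn = X_res[len(X):]

# Compare each feature of the synthetic data with the original data using
# Welch's t-test; large p-values indicate no detectable difference in means.
for j in range(X.shape[1]):
    t, p = ttest_ind(X[:, j], X_syn[:, j], equal_var=False)
    print(f"feature {j:2d}: t = {t:6.2f}, p = {p:.3f}")

# Train random forests on the original vs. the oversampled data and
# compare cross-validated accuracy.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
print("original   :", cross_val_score(rf, X, y, cv=5).mean())
print("oversampled:", cross_val_score(rf, X_res, y_res, cv=5).mean())

A box plot of each feature (for example via matplotlib's boxplot over X[:, j] and X_syn[:, j]) would complement the t-tests by showing whether the spread and quartiles of the synthetic data also match the original distribution.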

