Effect of separate sampling on classification accuracy
Abstract
Motivation: Measurements are commonly taken from two phenotypes to build a classifier, where the number of data points from each class is predetermined, not random. In this ‘separate sampling’ scenario, the data cannot be used to estimate the class prior probabilities. Moreover, predetermined class sizes can severely degrade classifier performance, even for large samples.
Results: We employ simulations using both synthetic and real data to show the detrimental effect of separate sampling on a variety of classification rules. We establish propositions related to the effect on the expected classifier error owing to a sampling ratio different from the population class ratio. From these we derive a sample-based minimax sampling ratio and provide an algorithm for approximating it from the data. We also extend to arbitrary distributions the classical population-based Anderson linear discriminant analysis minimax sampling ratio derived from the discriminant form of the Bayes classifier.
Availability: All code for the synthetic and real data examples is written in MATLAB, including a function called mmratio, which returns an approximation of the minimax sampling ratio of a given dataset. The code is available at: http://gsp.tamu.edu/Publications/supplementary/shahrokh13b.
Contact: edward@ece.tamu.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
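
The effect described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' MATLAB code and does not reproduce mmratio; it is a minimal Python example under assumed settings: two univariate Gaussian classes with unit variance, class means 0 and delta, a population prior c = P(Y = 0), and a one-dimensional LDA rule that (incorrectly, under separate sampling) uses the sampling ratio r = n0/n as its prior estimate. The function names and parameter values are hypothetical. Because the model is known, the true error of each trained classifier is computed exactly as c times the class-0 error plus (1 - c) times the class-1 error, then averaged over repetitions.

```python
import math
import random


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def lda_true_error(n0, n1, c, delta, rng):
    """Train 1-D LDA on a separately sampled set and return its true error.

    Assumed model: class 0 ~ N(0, 1), class 1 ~ N(delta, 1), c = P(Y = 0).
    The rule plugs in n0 / (n0 + n1) as the prior of class 0, which is what
    happens when the predetermined class sizes are treated as random.
    Requires n0 >= 1, n1 >= 1 and n0 + n1 >= 3.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in range(n0)]
    x1 = [rng.gauss(delta, 1.0) for _ in range(n1)]
    m0 = sum(x0) / n0
    m1 = sum(x1) / n1
    # Pooled variance estimate shared by both classes.
    ss = sum((x - m0) ** 2 for x in x0) + sum((x - m1) ** 2 for x in x1)
    var = ss / (n0 + n1 - 2)
    r = n0 / (n0 + n1)
    # Discriminant: assign class 1 when a * x + b > 0.
    a = (m1 - m0) / var
    b = math.log((1.0 - r) / r) - a * (m0 + m1) / 2.0
    if a == 0.0:
        # Degenerate case: the rule is constant.
        return c if b > 0.0 else 1.0 - c
    t = -b / a  # decision threshold
    if a > 0.0:
        # Class 1 is assigned for x > t.
        err0 = 1.0 - normal_cdf(t)      # class-0 points above t
        err1 = normal_cdf(t - delta)    # class-1 points below t
    else:
        # Class 1 is assigned for x < t.
        err0 = normal_cdf(t)
        err1 = 1.0 - normal_cdf(t - delta)
    return c * err0 + (1.0 - c) * err1


def expected_error(r, c, n=200, delta=1.0, reps=100, seed=0):
    """Monte Carlo estimate of the expected true error for sampling ratio r."""
    rng = random.Random(seed)
    n0 = min(max(round(r * n), 1), n - 1)
    n1 = n - n0
    total = sum(lda_true_error(n0, n1, c, delta, rng) for _ in range(reps))
    return total / reps


if __name__ == "__main__":
    c = 0.8  # assumed population prior of class 0
    for r in (0.2, 0.5, 0.8):
        print(f"r = {r:.1f}: expected error ~ {expected_error(r, c):.3f}")
```

In this toy setting the estimated expected error is lowest when r matches c and grows as r moves away from it, even at a sample size of 200, which is consistent with the abstract's point that the problem does not vanish for large samples.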