Unbiased Precision Estimation under Separate Sampling
Abstract
Motivation
Precision and recall have become very popular classification accuracy metrics in the statistical learning literature. These metrics are ordinarily defined under the assumption that the data are sampled randomly from the mixture of the populations. However, observational case-control studies for biomarker discovery often collect data that are sampled separately from the case and control populations, particularly in the case of rare diseases. This discrepancy may introduce severe bias in classifier accuracy estimation.
Results
We demonstrate, using both analytical and numerical methods, that classifier precision estimates can display strong bias under separate sampling, with the bias magnitude depending on the difference between the case prevalence in the data and the case prevalence in the actual population. We show that this bias is systematic in the sense that it cannot be reduced by increasing sample size. For the case where the true case prevalence is available from public health records, we propose a modified precision estimator that displays smaller bias; under regularity conditions on the classification algorithm, this bias tends to zero as sample size increases. The accuracy of the theoretical analysis and the performance of the proposed precision estimator under separate sampling are investigated using synthetic and real data from observational case-control studies. The results confirm that the proposed precision estimator becomes unbiased as sample size increases, while the ordinary precision estimator may display large bias, particularly in the case of rare diseases.
Availability
Extra plots are available as Supplementary Materials.
Author summary
Biomedical data are often sampled separately from the case and control populations, particularly in the case of rare diseases. Precision, a popular classification accuracy metric in the statistical learning literature, implicitly assumes that the data are sampled randomly from the mixture of the populations. In this paper we study the bias of precision under separate sampling using theoretical and numerical methods. We also propose a precision estimator for separate sampling for the case in which the prevalence is known from public health records. The results confirm that the proposed precision estimator becomes unbiased as sample size increases, while the ordinary precision estimator may display large bias, particularly in the case of rare diseases. In the absence of any knowledge about disease prevalence, precision estimates should be avoided under separate sampling.
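To make the role of prevalence concrete, the following is a minimal sketch, not the estimator as defined in the paper. It assumes the prevalence-adjusted estimate takes the standard plug-in form given by Bayes' rule, precision = TPR * p / (TPR * p + FPR * (1 - p)), where TPR and FPR are estimated separately from the case and control samples and p is the externally known prevalence. The function names and the confusion-matrix counts are hypothetical and chosen only for illustration.

```python
def naive_precision(tp, fp):
    """Ordinary precision: true positives over all predicted positives."""
    return tp / (tp + fp)


def prevalence_adjusted_precision(tp, fn, fp, tn, prevalence):
    """Plug-in precision via Bayes' rule (assumed form, see lead-in).

    tp, fn come from the case sample; fp, tn from the control sample.
    prevalence is the population case prevalence supplied externally.
    """
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    tpr = tp / (tp + fn)  # sensitivity, estimated from cases only
    fpr = fp / (fp + tn)  # false positive rate, estimated from controls only
    numerator = tpr * prevalence
    return numerator / (numerator + fpr * (1.0 - prevalence))


# Hypothetical balanced case-control study: 100 cases and 100 controls,
# so the case proportion in the data is 0.5.
tp, fn = 90, 10
fp, tn = 10, 90

naive = naive_precision(tp, fp)
adjusted = prevalence_adjusted_precision(tp, fn, fp, tn, prevalence=0.01)

print(f"naive precision:    {naive:.3f}")
print(f"adjusted precision: {adjusted:.3f}")
```

With these illustrative counts, the ordinary estimate is 0.9, while the adjusted estimate for an assumed 1% prevalence is about 0.083. Collecting more data at the same 50/50 ratio would not close this gap, because the ordinary estimate reflects the case proportion in the sample rather than in the population. When the supplied prevalence equals the sample case proportion (0.5 here), the two estimates coincide.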

