Fast Probabilistic Whitening Transformation for Ultra-High Dimensional Genetic Data
Statistical methods often assume independence between the samples or features of a dataset. Yet correlation structure is ubiquitous in real data, so these assumptions are often not met in practice. Whitening transformations are widely applied to remove this correlation structure. Existing approaches to whitening are based on standard linear algebra rather than a probabilistic model, and applying them to high-dimensional datasets with n samples and p features is problematic as p approaches or exceeds n. Moreover, the computational time becomes prohibitive because the naive transform is cubic in p. Here we propose a probabilistic model for data whitening and examine its properties from first principles as p increases. We demonstrate the statistical properties of the probabilistic model and derive an efficient algorithm whose running time is linear, rather than cubic, in the number of features. We examine the out-of-sample performance of the probabilistic whitening model on simulated data, as well as on real gene expression and genotype data. In an application to imputing z-statistics for unobserved genetic variants in a genome-wide association study of schizophrenia, the probabilistic whitening transformation, implemented in our open-source R package decorrelate, had the lowest mean squared error while being up to an order of magnitude faster than other methods.
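To make the complexity claim concrete, the following is a minimal sketch of how a whitening transform can be applied in time linear in p by working with a thin SVD of the data instead of forming the p x p covariance matrix. It is an illustration under stated assumptions, not the algorithm or API of the decorrelate package: the function names `fit_whitener` and `whiten`, the fixed shrinkage weight `lam`, and the identity shrinkage target with variance `nu = 1` are all hypothetical choices made here. The sketch assumes the regularized covariance (1 - lam) * S + lam * I, where S is the sample covariance.

```python
import numpy as np


def fit_whitener(X, lam=0.1):
    """Fit a shrinkage whitening model from an n x p data matrix.

    Assumed covariance model (illustrative, not taken from the paper):
        Sigma = (1 - lam) * S + lam * nu * I,  with nu = 1.
    Only the thin SVD of the centered data is stored, so the p x p
    matrix Sigma is never formed.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    mu = X.mean(axis=0)
    Xc = (X - mu) / np.sqrt(n - 1)
    # Thin SVD: Xc = U diag(d) Vt, so S = Vt.T diag(d**2) Vt.
    # Cost is O(n^2 p) when p > n, i.e. linear in p for fixed n.
    _, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    nu = 1.0  # assumed variance of the identity shrinkage target
    eig = (1.0 - lam) * d**2 + lam * nu  # eigenvalues of Sigma on span(V)
    base = lam * nu  # eigenvalue of Sigma on the orthogonal complement
    return mu, Vt, eig, base


def whiten(Y, model):
    """Return (Y - mu) @ Sigma^{-1/2} without forming Sigma.

    Uses Sigma^{-1/2} = I / sqrt(base)
                        + V diag(1/sqrt(eig) - 1/sqrt(base)) V^T.
    Cost is O(m k p) for m rows and k retained singular vectors.
    """
    mu, Vt, eig, base = model
    Yc = np.asarray(Y, dtype=float) - mu
    scale = 1.0 / np.sqrt(eig) - 1.0 / np.sqrt(base)
    return Yc / np.sqrt(base) + ((Yc @ Vt.T) * scale) @ Vt


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(20, 500))  # n = 20 samples, p = 500 features
    X_test = rng.normal(size=(5, 500))
    model = fit_whitener(X_train, lam=0.2)
    print(whiten(X_test, model).shape)
```

Fitting on one dataset and applying `whiten` to held-out rows mirrors the out-of-sample setting described above. In practice the shrinkage weight would be estimated from the data rather than fixed by hand.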

