
Fast Probabilistic Whitening Transformation for Ultra-High Dimensional Genetic Data

Statistical methods often make assumptions about independence between the samples or features of a dataset. Yet correlation structure is ubiquitous in real data, so these assumptions are often not met in practice. Whitening transformations are widely applied to remove this correlation structure. Existing approaches to whitening are based on standard linear algebra, rather than a probabilistic model, and application to high dimensional datasets with n samples and p features is problematic as p approaches or exceeds n. Moreover, the computational time becomes prohibitive since the naive transform is cubic in p. Here we propose a probabilistic model for data whitening and examine its properties based on first principles as p increases. We demonstrate the statistical properties of the probabilistic model and derive a remarkably efficient algorithm that is linear instead of cubic time in the number of features. We examine the out-of-sample performance of the probabilistic whitening model on simulated data, as well as real gene expression and genotype data. In an application to impute z-statistics from unobserved genetic variants from a genome-wide association study of schizophrenia, the probabilistic whitening transformation, implemented in our open source R package decorrelate, had the lowest mean square error while being up to an order of magnitude faster than other methods.
Cold Spring Harbor Laboratory
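For orientation, the following is a minimal sketch of a conventional, linear-algebra-based whitening transform of the kind the abstract contrasts against. It is not the probabilistic method of the paper and not the API of the decorrelate package. It assumes Cholesky whitening (one common choice among several), and the helper names (`mean_center`, `covariance`, `cholesky`, `forward_solve`, `whiten`) and the toy data are hypothetical. The Cholesky factorization costs on the order of p^3 operations, and it fails once the sample covariance becomes singular, which happens when p reaches or exceeds n.

```python
import math
import random

def mean_center(X):
    """Subtract each feature's mean from the rows of X (list of n rows, p columns)."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]

def covariance(Xc):
    """Sample covariance (p x p) of already centered data; costs O(n * p^2)."""
    n, p = len(Xc), len(Xc[0])
    return [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(p)]
            for i in range(p)]

def cholesky(S):
    """Lower-triangular L with S = L L^T; costs O(p^3).
    Requires S to be positive definite, so it breaks down when p >= n."""
    p = len(S)
    L = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def forward_solve(L, b):
    """Solve L z = b for z by forward substitution."""
    z = []
    for i in range(len(b)):
        z.append((b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    return z

def whiten(X):
    """Cholesky whitening: z = L^{-1} (x - mean), so the sample covariance of z is I."""
    Xc = mean_center(X)
    L = cholesky(covariance(Xc))
    return [forward_solve(L, row) for row in Xc]

# Toy data (illustrative only): n = 200 samples, p = 3 correlated features.
rng = random.Random(0)
X = []
for _ in range(200):
    a, b, c = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
    X.append([a, 0.8 * a + 0.6 * b, 0.5 * b + c])

Z = whiten(X)
C = covariance(Z)  # Z is already centered, so this is its sample covariance
for row in C:
    print([round(v, 6) for v in row])
```

After the transform, the sample covariance of `Z` is the identity matrix up to floating-point error, which is what removing the correlation structure means in this setting.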
