Fast Probabilistic Whitening Transformation for Ultra-High Dimensional Genetic Data

Statistical methods often make assumptions about independence between the samples or features of a dataset. Yet correlation structure is ubiquitous in real data, so these assumptions are often not met in practice. Whitening transformations are widely applied to remove this correlation structure. Existing approaches to whitening are based on standard linear algebra, rather than a probabilistic model, and application to high dimensional datasets with n samples and p features is problematic as p approaches or exceeds n. Moreover, the computational time becomes prohibitive since the naive transform is cubic in p. Here we propose a probabilistic model for data whitening and examine its properties based on first principles as p increases. We demonstrate the statistical properties of the probabilistic model and derive a remarkably efficient algorithm that is linear instead of cubic time in the number of features. We examine the out-of-sample performance of the probabilistic whitening model on simulated data, as well as real gene expression and genotype data. In an application to impute z-statistics from unobserved genetic variants from a genome-wide association study of schizophrenia, the probabilistic whitening transformation, implemented in our open source R package decorrelate, had the lowest mean square error while being up to an order of magnitude faster than other methods.
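To make the cost argument concrete, the following is a minimal sketch of how a shrinkage-based whitening transform can be applied without ever forming a p × p matrix. It is not taken from the paper or from the decorrelate package. It assumes the covariance is estimated as a convex combination of the sample covariance and the identity, with a fixed weight `lam` that is purely hypothetical (the abstract does not say how the model sets it). The function names `fit_whitener` and `whiten` are likewise illustrative. Given the thin SVD of the centered training data, applying the inverse square root of this covariance costs time linear in p per sample.

```python
import numpy as np

def fit_whitener(X):
    """Center the training data and take its thin SVD (cost O(n^2 p) when p > n)."""
    n = X.shape[0]
    mu = X.mean(axis=0)
    Xc = (X - mu) / np.sqrt(n - 1)           # so that Xc.T @ Xc is the sample covariance
    _, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    return mu, d, Vt

def whiten(Y, mu, d, Vt, lam):
    """Multiply centered Y by Sigma^{-1/2}, where
    Sigma = (1 - lam) * S + lam * I   (assumed shrinkage form, 0 < lam <= 1)
    and S is the sample covariance of the training data.
    Only n x p and m x n products are used, never a p x p matrix."""
    Yc = Y - mu
    inv_sqrt_lam = 1.0 / np.sqrt(lam)
    # Inverse square-root eigenvalues inside the span of the training data.
    scale = 1.0 / np.sqrt((1.0 - lam) * d**2 + lam)
    proj = Yc @ Vt.T                          # coordinates in the low-rank basis
    # Directions outside the span are scaled by 1/sqrt(lam); inside, by `scale`.
    return Yc * inv_sqrt_lam + (proj * (scale - inv_sqrt_lam)) @ Vt

# Small usage example with more features than samples.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 50))
X_new = rng.normal(size=(5, 50))
mu, d, Vt = fit_whitener(X_train)
X_white = whiten(X_new, mu, d, Vt, lam=0.1)
```

For small p the result can be checked against the dense route (eigendecompose the p × p shrinkage covariance and multiply by its inverse square root), which is the cubic-time computation the abstract refers to.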

Cold Spring Harbor Laboratory

Gabriel E. Hoffman, Panos Roussos

2025
