ChIPWig: A Random Access-Enabling Lossless and Lossy Compression Method for ChIP-seq Data
Abstract
Motivation
The past decade has witnessed a rapid development of data acquisition technologies that enable integrative genomic and proteomic analysis. One such technology is chromatin immunoprecipitation sequencing (ChIP-seq), developed for analyzing interactions between proteins and DNA via next-generation sequencing technologies. As ChIP-seq experiments are inexpensive and time-efficient, massive datasets from this domain have been acquired, introducing significant storage and maintenance challenges. To address the resulting Big Data problems, we propose a state-of-the-art lossless and lossy compression framework specifically designed for ChIP-seq Wig data, termed ChIPWig. Wig is a standard file format, which in this setting contains relevant read density information crucial for visualization and downstream processing. ChIPWig may be executed in two different modes: lossless and lossy. Lossless ChIPWig compression allows for random access and fast queries in the file through careful variable-length block-wise encoding. ChIPWig also stores the summary statistics of each block needed for guided access. Lossy ChIPWig, in contrast, performs quantization of the read density values before feeding them into the lossless ChIPWig compressor. Nonuniform lossy quantization leads to further reductions in the file size, while maintaining the same accuracy of the ChIP-seq peak calling and motif discovery pipeline based on the NarrowPeaks method tailor-made for Wig files. The compressors are designed using new statistical modeling approaches coupled with delta and arithmetic encoding.
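The block-wise idea described above can be illustrated with a minimal sketch. It is not the ChIPWig implementation: the function names are hypothetical, blocks here hold a fixed number of values (the abstract describes variable-length blocks), and `zlib` stands in for the arithmetic coder. The sketch only shows how per-block delta coding plus a small index of summary statistics lets a query decode just the blocks it overlaps.

```python
import struct
import zlib


def encode_blocks(values, block_size=4):
    """Delta-encode integer read densities block by block.

    Each block is compressed independently so it can be decoded alone.
    Assumption: fixed-size blocks and zlib replace the paper's
    variable-length blocks and arithmetic coding.
    """
    blocks, index = [], []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        # Store the first value as-is and the rest as differences.
        deltas = [chunk[0]] + [b - a for a, b in zip(chunk, chunk[1:])]
        blocks.append(zlib.compress(struct.pack(f"<{len(deltas)}i", *deltas)))
        # Summary statistics kept outside the compressed payload.
        index.append({
            "start": start,
            "count": len(chunk),
            "min": min(chunk),
            "max": max(chunk),
            "sum": sum(chunk),
        })
    return blocks, index


def decode_block(payload, count):
    """Invert the delta coding of a single block."""
    deltas = struct.unpack(f"<{count}i", zlib.decompress(payload))
    out, running = [], 0
    for d in deltas:
        running += d
        out.append(running)
    return out


def query(blocks, index, lo, hi):
    """Return values[lo:hi], decoding only the blocks that overlap the range."""
    result = []
    for payload, meta in zip(blocks, index):
        b_lo, b_hi = meta["start"], meta["start"] + meta["count"]
        if b_hi <= lo or b_lo >= hi:
            continue  # block lies outside the query; skip without decoding
        vals = decode_block(payload, meta["count"])
        result.extend(vals[max(lo, b_lo) - b_lo:min(hi, b_hi) - b_lo])
    return result
```

Because the index records each block's minimum, maximum, and sum, a caller could also answer coarse questions (for example, whether any value in a region exceeds a threshold) without decompressing anything.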
Results
We tested the ChIPWig compressor on a number of ChIP-seq datasets generated by the ENCODE project. Lossless ChIPWig reduces the file sizes to merely 6% of the original, and offers an average 6-fold compression rate improvement compared to bigWig. The running times for compression and decompression are comparable to those of bigWig. The compression and decompression speeds are on the order of 0.2 MB/sec using general purpose computers. Enabling random access only slightly degrades the compression performance and running time of ChIPWig when compared to the standard mode. In the lossy mode, the average file size is reduced 2-fold compared to the lossless mode. Most importantly, near-optimal nonuniform quantization with respect to mean-square distortion does not affect peak calling and motif discovery results on the data tested.
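To make "nonuniform quantization with respect to mean-square distortion" concrete, the sketch below uses Lloyd's algorithm (one-dimensional k-means), a standard way to design such a quantizer. This is an assumption made for illustration: the abstract does not say which design procedure ChIPWig uses, and all function names here are hypothetical.

```python
def lloyd_max(samples, levels, iterations=50):
    """Design reconstruction levels that locally minimize mean-square error."""
    lo, hi = min(samples), max(samples)
    # Start from uniformly spaced levels, then refine them.
    centers = [lo + (hi - lo) * (i + 0.5) / levels for i in range(levels)]
    for _ in range(iterations):
        buckets = [[] for _ in centers]
        for x in samples:
            # Assign each sample to its nearest reconstruction level.
            nearest = min(range(levels), key=lambda i: (x - centers[i]) ** 2)
            buckets[nearest].append(x)
        # Move each level to the mean of its samples; keep empty levels in place.
        centers = [sum(b) / len(b) if b else c for b, c in zip(buckets, centers)]
    return sorted(centers)


def quantize(samples, centers):
    """Map every sample to its nearest reconstruction level."""
    return [min(centers, key=lambda c: (x - c) ** 2) for x in samples]


def mse(original, reconstructed):
    """Mean-square distortion between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
```

On skewed data such as read densities, where most values are small and a few peaks are large, the refined levels cluster where the samples are dense, so the distortion is no worse than that of the uniform levels the procedure starts from.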
Availability and Implementation
Source code and binaries are freely available for download at
https://github.com/vidarmehr/ChIPWig
Contact
milenkov@illinois.edu
Supplementary information
Supplementary data are available on bioRxiv.
Related Results
Practical notes on lossy compression of scientific data
Lossy compression methods are extremely efficient in terms of space and performance and allow for reduction of network bandwidth and disk space needed to store dat...
Data Compression
Abstract: Lossless compression systems and lossy compression systems are the two types of data compression systems. In a lossless compression system, a lossless code is designed to e...
Effective lossy and lossless color image compression with Multilayer Perceptron
This paper presents the effective lossy and lossless color image compression algorithm with Multilayer perceptron. The parallel structure of neural network and the concept of image...
Comparing genome-wide chromatin profiles using ChIP-chip or ChIP-seq
Abstract: Motivation: ChIP-chip and ChIP-seq technologies provide genome-wide measurements of various types of chromatin marks at an unprecedented resolution. With ChIP samples colle...
Two-level fusion big data compression and reconstruction framework combining second-generation wavelet and lossless compression
Abstract: In view of the characteristics of big data, fuzziness, and real time of data acquisition and transmission in the fuzzy information system faced by aircraft health managemen...
Lossless Compression Method for Medical Image Sequences Using Super-Spatial Structure Prediction and Inter-frame Coding
Space research organizations, hospitals and military air surveillance activities, among others, produce a huge amount of data in the form of images hence a large storage space is r...
Unmeasured human transcription factor ChIP-seq data shape functional genomics and demand strategic prioritization
Abstract: Transcription factor (TF) chromatin immunoprecipitation followed by sequencing (ChIP-seq) is essential for identifying genome-wide TF-binding sites (TFBSs),...
Method and VLSI implementation of lossy‐to‐lossless LTM ECG compression framework
A lossy‐to‐lossless compression framework for electrocardiogram (ECG) signals is proposed for wearable monitoring devices. In this framework, the tail bits of the coefficients gene...

