
Disq, a library for manipulating bioinformatics sequencing formats in Apache Spark

ADAM and GATK have independently developed parallel and distributed genomic applications on Apache Spark. To access flat file formats such as BAM, CRAM, SAM, and VCF, both depend on the htsjdk library, which provides low-level codecs, and the Hadoop-BAM library, which extends these for parallel and distributed access. Hadoop-BAM was found to have correctness issues (invalid BAM file splits, leading to corrupt read data) and performance issues (sequential implementation of some parallelizable tasks). The Spark-BAM project demonstrated that these issues could be addressed, and developed a comprehensive benchmark. Members of the ADAM, Hadoop-BAM, htsjdk, GATK, Spark-BAM, and ViraPipe projects therefore identified an opportunity to collaborate on a replacement library. Discussion between collaborators began virtually, then continued in person at OpenBio Winter Codefest 2018 in Boston and at GCCBOSC Collaboration Fest 2018 in Portland. A new project, Disq, was started in 2018, and has since made at least three releases (most recently version 0.3.0, released 19 March 2019). Benchmarks show that Disq is faster and more accurate than Hadoop-BAM, and at least as fast as Spark-BAM. Disq also adds significant new features, such as support for writing sharded files for efficiency, for taking advantage of index files while reading (e.g. .sbi index files to find splits between BAM records, .crai index files to find record boundaries in CRAM files), and for writing index files where appropriate. In addition to unit tests, Disq includes integration tests that run against real-world files (multi-GB in size). SAMtools and BCFtools are used to verify that files written with Disq can be read successfully. Disq has been incorporated into ADAM and GATK, and will provide a convenient venue for further collaboration between those project teams. We also welcome new collaborators seeking correct and performant access to flat file formats on Apache Spark.
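
The role of an index in producing valid splits can be illustrated with a small, self-contained sketch. It is not Disq's API and does not read the real .sbi format. It assumes a simplified, uncompressed stream in which each record carries a 4-byte little-endian length prefix (as BAM records do), and the function names `encode_records`, `build_index`, and `read_split` are hypothetical. The idea shown is that a worker assigned an arbitrary byte range looks up the first indexed record offset at or after each range boundary instead of guessing where a record begins.

```python
import bisect
import struct


def encode_records(payloads):
    """Concatenate records, each prefixed by a 4-byte little-endian length."""
    return b"".join(struct.pack("<I", len(p)) + p for p in payloads)


def build_index(data, granularity=1):
    """Store the byte offset of every `granularity`-th record.

    The total length is appended as a sentinel so every lookup succeeds.
    (Hypothetical stand-in for an index file; not the .sbi layout.)
    """
    offsets, pos, n = [], 0, 0
    while pos < len(data):
        if n % granularity == 0:
            offsets.append(pos)
        (size,) = struct.unpack_from("<I", data, pos)
        pos += 4 + size
        n += 1
    offsets.append(len(data))
    return offsets


def read_split(data, index, start, end):
    """Return the records owned by the byte range [start, end).

    Both boundaries are snapped forward to the next indexed offset, so
    parsing always begins and ends on a true record boundary.
    """
    pos = index[bisect.bisect_left(index, start)]
    stop = index[bisect.bisect_left(index, end)]
    records = []
    while pos < stop:
        (size,) = struct.unpack_from("<I", data, pos)
        records.append(data[pos + 4 : pos + 4 + size])
        pos += 4 + size
    return records


payloads = [b"read-%d" % i for i in range(10)]
data = encode_records(payloads)
index = build_index(data, granularity=3)

# Byte ranges chosen without regard to record boundaries.
bounds = [0, 33, 66, len(data)]
recovered = []
for start, end in zip(bounds, bounds[1:]):
    recovered.extend(read_split(data, index, start, end))

# Every record is read exactly once, in order.
assert recovered == payloads
```

Because adjacent ranges snap to the same indexed offset, no record is dropped or read twice, even when the index covers only every third record. Without an index, a reader starting mid-stream has to guess at a record start, which is the kind of heuristic the abstract associates with invalid splits.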
