Disq, a library for manipulating bioinformatics sequencing formats in Apache Spark

ADAM and GATK have independently developed parallel and distributed genomic applications on Apache Spark. To access flat file formats such as BAM, CRAM, SAM, and VCF, both depend on the htsjdk library, which provides low-level codecs, and the Hadoop-BAM library, which extends these for parallel and distributed access. Hadoop-BAM was found to have correctness issues (invalid BAM file splits, leading to corrupt read data) and performance issues (sequential implementation of some parallelizable tasks). The Spark-BAM project demonstrated that these issues could be addressed, and developed a comprehensive benchmark. Thus, members of the ADAM, Hadoop-BAM, htsjdk, GATK, Spark-BAM, and ViraPipe projects identified an opportunity to collaborate on a replacement library. Discussion between collaborators began virtually, then in person at OpenBio Winter Codefest 2018 in Boston, and continued at GCCBOSC Collaboration Fest 2018 in Portland. A new project, Disq, was started in 2018, and has since made at least three releases (most recently version 0.3.0, released 19 March 2019). Benchmarks show that Disq is faster and more accurate than Hadoop-BAM, and at least as fast as Spark-BAM. Disq also adds significant new features, such as support for writing sharded files for efficiency, for taking advantage of index files while reading (e.g. .sbi index files to find splits between BAM records, .crai index files to find record boundaries in CRAM files), and for writing index files where appropriate. In addition to unit tests, Disq includes integration tests that run against real-world files (multi-GB in size). SAMtools and BCFtools are used to verify that files written with Disq can be read successfully. Disq has been incorporated into ADAM and GATK, and will provide a convenient venue for further collaboration between those project teams. We also welcome new collaborators seeking correct and performant access to flat file formats on Apache Spark.
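To illustrate why splitting a BAM file is delicate, here is a minimal sketch in Python. It is not code from Disq, Hadoop-BAM, or Spark-BAM. It relies only on the fact that BAM data is stored in BGZF blocks, which are gzip members carrying a `BC` extra subfield that records the block size. The helper names (`make_bgzf_block`, `looks_like_block`, `next_block_start`, `read_block`) are hypothetical. The sketch assumes `BC` is the only extra subfield, and it works on toy payloads rather than real BAM records. A worker handed an arbitrary byte offset scans forward for a candidate header, then accepts it only if the block it describes is followed by another header or the end of the data.

```python
import gzip
import struct
import zlib

BGZF_MAGIC = b"\x1f\x8b\x08\x04"  # gzip magic, deflate method, FEXTRA flag set
HEADER_LEN = 18                   # 12-byte gzip header + 6-byte 'BC' subfield
TRAILER_LEN = 8                   # CRC32 + uncompressed size


def make_bgzf_block(payload: bytes) -> bytes:
    """Build one BGZF block around a small payload (toy writer)."""
    raw = zlib.compressobj(6, zlib.DEFLATED, -15)  # negative wbits: raw deflate
    data = raw.compress(payload) + raw.flush()
    block_size = HEADER_LEN + len(data) + TRAILER_LEN
    header = BGZF_MAGIC + struct.pack("<IBBH", 0, 0, 255, 6)  # MTIME, XFL, OS, XLEN
    extra = b"BC" + struct.pack("<HH", 2, block_size - 1)     # SLEN, BSIZE
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + extra + data + trailer


def looks_like_block(buf: bytes, pos: int) -> bool:
    """Accept pos only if its header is well formed and the next block lines up."""
    if pos + HEADER_LEN > len(buf) or buf[pos:pos + 4] != BGZF_MAGIC:
        return False
    xlen = struct.unpack_from("<H", buf, pos + 10)[0]
    if xlen != 6 or buf[pos + 12:pos + 14] != b"BC":  # assumption: single subfield
        return False
    end = pos + struct.unpack_from("<H", buf, pos + 16)[0] + 1
    return end == len(buf) or buf[end:end + 4] == BGZF_MAGIC


def next_block_start(buf: bytes, offset: int) -> int:
    """Return the first validated block start at or after offset, or -1."""
    pos = buf.find(BGZF_MAGIC, offset)
    while pos != -1:
        if looks_like_block(buf, pos):
            return pos
        pos = buf.find(BGZF_MAGIC, pos + 1)
    return -1


def read_block(buf: bytes, pos: int):
    """Decompress the block at pos; return (payload, offset of the next block)."""
    end = pos + struct.unpack_from("<H", buf, pos + 16)[0] + 1
    return zlib.decompress(buf[pos + HEADER_LEN:end - TRAILER_LEN], -15), end


# Toy data: four blocks, then a split that lands in the middle of block 1.
payloads = [f"record-{i};".encode() * 20 for i in range(4)]
blocks = [make_bgzf_block(p) for p in payloads]
buf = b"".join(blocks)
offsets = [sum(len(b) for b in blocks[:i]) for i in range(len(blocks))]

split = offsets[1] + 5
start = next_block_start(buf, split)
payload, _ = read_block(buf, start)

# The blocks are ordinary gzip members, so the gzip module can read them too.
print(gzip.decompress(buf) == b"".join(payloads))
print(start == offsets[2], payload == payloads[2])
```

Finding a block boundary is only the first step: a reader must still locate the first complete BAM record inside the decompressed data, which is where a heuristic can guess wrong. An index of record offsets, such as the .sbi files mentioned above, would let a reader look up a valid split position instead of guessing.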

F1000 Research Ltd

Louis Bergelson, Michael L Heuer, Chris Norman, Tom White, Ryan Williams

2025
