Minimizer-space de Bruijn graphs

Abstract: DNA sequencing data continues to progress towards longer reads with increasingly lower sequencing error rates. We focus on the problem of assembling such reads into genomes, which poses challenges in terms of accuracy and computational resources when using cutting-edge assembly approaches, e.g. those based on overlapping reads using minimizer sketches. Here, we introduce the concept of minimizer-space sequencing data analysis, where the minimizers rather than DNA nucleotides are the atomic tokens of the alphabet. By projecting DNA sequences into ordered lists of minimizers, our key idea is to enumerate what we call k-min-mers, which are k-mers over a larger alphabet consisting of minimizer tokens. Our approach, mdBG or minimizer-dBG, achieves orders-of-magnitude improvement in both speed and memory usage over existing methods without much loss of accuracy. We demonstrate three use cases of mdBG: human genome assembly, metagenome assembly, and the representation of large pangenomes. For assembly, we implemented mdBG in software we call rust-mdbg, resulting in ultra-fast, low-memory and highly contiguous assembly of PacBio HiFi reads. A human genome is assembled in under 10 minutes using 8 cores and 10 GB RAM, and 60 Gbp of metagenome reads are assembled in 4 minutes using 1 GB RAM. For pangenome graphs, we newly allow a graphical representation of a collection of 661,405 bacterial genomes as an mdBG and successfully search it (in minimizer-space) for anti-microbial resistance (AMR) genes. We expect our advances to be essential to sequence analysis, given the rise of long-read sequencing in genomics, metagenomics and pangenomics.
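The projection into minimizer space described in the abstract can be illustrated with a minimal Python sketch. It is not the rust-mdbg implementation. The abstract does not say how minimizers are chosen, so the sketch assumes a density-based rule: an l-mer is kept when its hash falls below a chosen fraction of the hash range. The function names (`lmer_hash`, `to_minimizer_space`, `k_min_mers`, `mdbg_edges`) and the parameters `l`, `density` and `k` are hypothetical, and reverse complements are ignored for brevity.

```python
import hashlib

HASH_RANGE = 2**64  # hashes are 64-bit unsigned integers


def lmer_hash(lmer: str) -> int:
    """Map an l-mer to a pseudo-random 64-bit integer (assumed hash choice)."""
    digest = hashlib.sha256(lmer.encode()).digest()
    return int.from_bytes(digest[:8], "big")


def to_minimizer_space(seq: str, l: int, density: float) -> list[str]:
    """Project a DNA sequence onto its ordered list of selected l-mers.

    Assumption: an l-mer counts as a minimizer when its hash is below
    density * HASH_RANGE, so roughly a `density` fraction of l-mers is kept.
    """
    threshold = density * HASH_RANGE
    return [
        seq[i:i + l]
        for i in range(len(seq) - l + 1)
        if lmer_hash(seq[i:i + l]) < threshold
    ]


def k_min_mers(minimizers: list[str], k: int) -> list[tuple[str, ...]]:
    """Enumerate k-min-mers: windows of k consecutive minimizer tokens."""
    return [tuple(minimizers[i:i + k]) for i in range(len(minimizers) - k + 1)]


def mdbg_edges(kmms: list[tuple[str, ...]]) -> set:
    """Link k-min-mers whose (k-1)-token suffix and prefix match,
    as in an ordinary de Bruijn graph but over minimizer tokens."""
    nodes = set(kmms)
    by_prefix: dict = {}
    for node in nodes:
        by_prefix.setdefault(node[:-1], []).append(node)
    return {(u, v) for u in nodes for v in by_prefix.get(u[1:], [])}
```

For a read `r`, a call such as `mdbg_edges(k_min_mers(to_minimizer_space(r, l=4, density=0.1), k=3))` would yield the graph edges contributed by that read. The parameter values here are arbitrary, and a lower density gives a sparser, more compact representation.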

openRxiv

Barış Ekim, Bonnie Berger, Rayan Chikhi

2021

Related Results

Abstract Motivation In this era of exponential data growth, minimizer sampling has become a standard algor...

10-minimizers: a promising class of constant-space minimizers

Abstract Minimizers are sampling schemes which are ubiquitous in almost any high-throughput sequencing analysis. Assuming a fixed alphabet of siz...

Multi de Bruijn Sequences and the Cross-Join Method

We show a method to construct binary multi de Bruijn sequences using the cross-join method. We extend the proof given by Alhakim for ordinary de Bruijn sequences to the case of mul...

MBG: Minimizer-based Sparse de Bruijn Graph Construction

Motivation De Bruijn graphs can be constructed from short reads efficiently and have been used for many purposes. Traditionally long read sequencing technologies ...

Phased Multi de Bruijn Sequences

We introduce phased multi de Bruijn sequences, a generalization of de Bruijn sequences. A phased string is a string whose positions sequentially rotate through several alphabets; e...

Building Large Updatable Colored de Bruijn Graphs via Merging

MOTIVATION: There exist several massive genomic and metagenomic data collection efforts, including GenomeTrakr and MetaSub, which are routinely updated with new data. To analyze s...

Independent Set in Neutrosophic Graphs

New setting is introduced to study neutrosophic independent number and independent neutrosophic-number arising neighborhood of different vertices. Neighbor is a key term to have th...

Failed Independent Number in Neutrosophic Graphs

New setting is introduced to study neutrosophic failed-independent number and failed independent neutrosophic-number arising neighborhood of different vertices. Neighbor is a key t...
