Minimizer-space de Bruijn graphs
Abstract
DNA sequencing data continues to progress towards longer reads with increasingly lower sequencing error rates. We focus on the problem of assembling such reads into genomes, which poses challenges in terms of accuracy and computational resources when using cutting-edge assembly approaches, e.g. those based on overlapping reads using minimizer sketches. Here, we introduce the concept of minimizer-space sequencing data analysis, where minimizers rather than DNA nucleotides are the atomic tokens of the alphabet. By projecting DNA sequences into ordered lists of minimizers, our key idea is to enumerate what we call k-min-mers, which are k-mers over a larger alphabet consisting of minimizer tokens. Our approach, mdBG or minimizer-dBG, achieves orders-of-magnitude improvement in both speed and memory usage over existing methods without much loss of accuracy. We demonstrate three use cases of mdBG: human genome assembly, metagenome assembly, and the representation of large pangenomes. For assembly, we implemented mdBG in software we call rust-mdbg, resulting in ultra-fast, low-memory and highly contiguous assembly of PacBio HiFi reads. A human genome is assembled in under 10 minutes using 8 cores and 10 GB RAM, and 60 Gbp of metagenome reads are assembled in 4 minutes using 1 GB RAM. For pangenome graphs, we newly enable the graph representation of a collection of 661,405 bacterial genomes as an mdBG and successfully search it (in minimizer-space) for anti-microbial resistance (AMR) genes. We expect our advances to be essential to sequence analysis, given the rise of long-read sequencing in genomics, metagenomics and pangenomics.
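The projection into minimizer-space described in the abstract can be illustrated with a minimal Python sketch. It is not the rust-mdbg implementation: the hash function, the density-threshold selection rule, and the names `lmer_hash`, `to_minimizer_space`, `k_min_mers` and `build_mdbg` are all assumptions made for illustration, and reverse complements are ignored. The sketch shows a read being reduced to an ordered list of minimizer tokens, k-min-mers being enumerated over that list, and a de Bruijn graph being built in which two k-min-mers are connected when they overlap by k-1 tokens.

```python
import hashlib


def lmer_hash(lmer: str) -> int:
    """Deterministic 64-bit hash of an l-mer (the hash choice is an assumption)."""
    digest = hashlib.blake2b(lmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def to_minimizer_space(seq: str, l: int, density: float) -> list:
    """Project a DNA sequence into an ordered list of minimizer tokens.

    Assumed selection rule: an l-mer is kept when its hash falls below
    density * 2**64, so roughly a `density` fraction of l-mers is selected.
    """
    threshold = int(density * 2**64)
    lmers = (seq[i:i + l] for i in range(len(seq) - l + 1))
    return [m for m in lmers if lmer_hash(m) < threshold]


def k_min_mers(minimizers: list, k: int) -> list:
    """Enumerate k-min-mers: windows of k consecutive minimizer tokens."""
    return [tuple(minimizers[i:i + k]) for i in range(len(minimizers) - k + 1)]


def build_mdbg(reads: list, l: int, density: float, k: int) -> dict:
    """Build a minimizer-space de Bruijn graph as an adjacency dict (k >= 2).

    Nodes are k-min-mers; an edge a -> b exists when the last k-1 tokens
    of a equal the first k-1 tokens of b.
    """
    nodes = set()
    for read in reads:
        nodes.update(k_min_mers(to_minimizer_space(read, l, density), k))
    by_prefix = {}
    for node in nodes:
        by_prefix.setdefault(node[:-1], []).append(node)
    return {node: sorted(by_prefix.get(node[1:], [])) for node in nodes}
```

With a small `density`, each read of tens of thousands of nucleotides collapses to a much shorter list of tokens, which is the intuition behind the reported savings in time and memory. Setting `density=1.0` keeps every l-mer and is only useful for checking the logic on toy inputs.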
Related Results
Weighted minimizer sampling improves long read mapping
Abstract
Motivation
In this era of exponential data growth, minimizer sampling has become a standard algor...
10-minimizers: a promising class of constant-space minimizers
Abstract
Minimizers are sampling schemes which are ubiquitous in almost any high-throughput sequencing analysis. Assuming a fixed alphabet of siz...
Multi de Bruijn Sequences and the Cross-Join Method
We show a method to construct binary multi de Bruijn sequences using the cross-join method. We extend the proof given by Alhakim for ordinary de Bruijn sequences to the case of mul...
MBG: Minimizer-based Sparse de Bruijn Graph Construction
Motivation
De Bruijn graphs can be constructed from short reads efficiently and have been used for many purposes. Traditionally long read sequencing technologies ...
Phased Multi de Bruijn Sequences
We introduce phased multi de Bruijn sequences, a generalization of de Bruijn sequences. A phased string is a string whose positions sequentially rotate through several alphabets; e...
Building Large Updatable Colored de Bruijn Graphs via Merging
MOTIVATION: There exist several massive genomic and metagenomic data collection efforts, including GenomeTrakr and MetaSub, which are routinely updated with new data. To analyze s...
Distribution of Variables in Lambda-Terms with Restrictions on De Bruijn Indices and De Bruijn Levels
We consider two special subclasses of lambda-terms that are restricted by a bound on the number of abstractions between a variable and its binding lambda, the so-called De-Bruijn i...
Buffering Updates Enables Efficient Dynamic de Bruijn Graphs
Abstract
Motivation
The de Bruijn graph has become a ubiquitous graph model for biological data ever since its initial introduc...

