Search engine for discovering works of Art, research articles, and books related to Art and Culture
ShareThis
Javascript must be enabled to continue!

Building Large Updatable Colored de Bruijn Graphs via Merging

View through CrossRef
MOTIVATION: There exists several massive genomic and metagenomic data collection efforts, including GenomeTrakr and MetaSub, which are routinely updated with new data. To analyze such datasets, memory-efficient methods to construct and store the colored de Bruijn graph have been developed. Yet, a problem that has not been considered is constructing the colored de Bruijn graph in a scalable manner that allows new data to be added without reconstruction. This problem is important for large public datasets as scalability is needed but also the ability to update the construction is also needed. RESULTS: We create a method for constructing and updating the colored de Bruijn graph on a very-large dataset through partitioning the data into smaller subsets, building the colored de Bruijn graph using a FM-index based representation, and succinctly merging these representations to build a single graph. The last step, merging succinctly, is the algorithmic challenge which we solve in this paper. We refer to the resulting method as VariMerge. We validate our approach, and show it produces a three-fold reduction in working space when constructing a colored de Bruijn graph for 8,000 strains. Lastly, we compare VariMerge to other competing methods --- including Vari, Rainbowfish, Mantis, Bloom Filter Trie, the method by Almodaresi(2019) and Multi-BRWT --- and illustrate that VariMerge is the only method that is capable of building the colored de Bruijn graph for 16,000 strains in a manner that allows additional samples to be added. Competing methods either did not scale to this large of a dataset or cannot allow for additions without reconstruction. AVAILABILITY: Our software is available under GPLv3 at https://github.com/cosmo-team/cosmo/tree/VARI-merge.
Title: Building Large Updatable Colored de Bruijn Graphs via Merging
Description:
MOTIVATION: There exists several massive genomic and metagenomic data collection efforts, including GenomeTrakr and MetaSub, which are routinely updated with new data.
To analyze such datasets, memory-efficient methods to construct and store the colored de Bruijn graph have been developed.
Yet, a problem that has not been considered is constructing the colored de Bruijn graph in a scalable manner that allows new data to be added without reconstruction.
This problem is important for large public datasets as scalability is needed but also the ability to update the construction is also needed.
RESULTS: We create a method for constructing and updating the colored de Bruijn graph on a very-large dataset through partitioning the data into smaller subsets, building the colored de Bruijn graph using a FM-index based representation, and succinctly merging these representations to build a single graph.
The last step, merging succinctly, is the algorithmic challenge which we solve in this paper.
We refer to the resulting method as VariMerge.
We validate our approach, and show it produces a three-fold reduction in working space when constructing a colored de Bruijn graph for 8,000 strains.
Lastly, we compare VariMerge to other competing methods --- including Vari, Rainbowfish, Mantis, Bloom Filter Trie, the method by Almodaresi(2019) and Multi-BRWT --- and illustrate that VariMerge is the only method that is capable of building the colored de Bruijn graph for 16,000 strains in a manner that allows additional samples to be added.
Competing methods either did not scale to this large of a dataset or cannot allow for additions without reconstruction.
AVAILABILITY: Our software is available under GPLv3 at https://github.
com/cosmo-team/cosmo/tree/VARI-merge.

Related Results

Multi de Bruijn Sequences and the Cross-Join Method
Multi de Bruijn Sequences and the Cross-Join Method
We show a method to construct binary multi de Bruijn sequences using the cross-join method. We extend the proof given by Alhakim for ordinary de Bruijn sequences to the case of mul...
Phased Multi de Bruijn Sequences
Phased Multi de Bruijn Sequences
We introduce phased multi de Bruijn sequences, a generalization of de Bruijn sequences. A phased string is a string whose positions sequentially rotate through several alphabets; e...
MBG: Minimizer-based Sparse de Bruijn Graph Construction
MBG: Minimizer-based Sparse de Bruijn Graph Construction
Motivation De Bruijn graphs can be constructed from short reads efficiently and have been used for many purposes. Traditionally long read sequencing technologies ...
Distribution of Variables in Lambda-Terms with Restrictions on De Bruijn Indices and De Bruijn Levels
Distribution of Variables in Lambda-Terms with Restrictions on De Bruijn Indices and De Bruijn Levels
We consider two special subclasses of lambda-terms that are restricted by a bound on the number of abstractions between a variable and its binding lambda, the so-called De-Bruijn i...
Buffering Updates Enables Efficient Dynamic de Bruijn Graphs
Buffering Updates Enables Efficient Dynamic de Bruijn Graphs
Abstract Motivation The de Bruijn graph has become a ubiquitous graph model for biological data ever since its initial introduc...
Independent Set in Neutrosophic Graphs
Independent Set in Neutrosophic Graphs
New setting is introduced to study neutrosophic independent number and independent neutrosophic-number arising neighborhood of different vertices. Neighbor is a key term to have th...
Failed Independent Number in Neutrosophic Graphs
Failed Independent Number in Neutrosophic Graphs
New setting is introduced to study neutrosophic failed-independent number and failed independent neutrosophic-number arising neighborhood of different vertices. Neighbor is a key t...
Computing a Minimum Subset Feedback Vertex Set on Chordal Graphs Parameterized by Leafage
Computing a Minimum Subset Feedback Vertex Set on Chordal Graphs Parameterized by Leafage
Abstract Chordal graphs are characterized as the intersection graphs of subtrees in a tree and such a representation is known as the tree model. Restricting the characteriz...

Back to Top