Javascript must be enabled to continue!
Haplotype Matching with GBWT for Pangenome Graphs
View through CrossRef
Traditionally, variations from a linear reference genome were used to represent large sets of haplotypes compactly. In the linear reference genome based paradigm, the positional Burrows-Wheeler transform (PBWT) has traditionally been used to perform efficient haplotype matching. Pangenome graphs have recently been proposed as an alternative to linear reference genomes for representing the full spectrum of variations in the human genome. However, haplotype matches in pangenome graph based haplotype sets are not trivially generalizable from haplotype matches in the linear reference genome based haplotype sets. Work has been done to represent large sets of haplotypes as paths through a pangenome graph. The graph Burrows-Wheeler transform (GBWT) is one such work. The GBWT essentially stores the haplotype paths in a run length compressed BWT with compressed local alphabets. Although efficient in practice count and locate queries on the GBWT were provided by the original authors, the efficient haplotype matching capabilities of the PBWT have never been shown on the GBWT. In this paper, we formally define the notion of haplotype matches in pangenome graph-based haplotype sets by generalizing from haplotype matches in linear reference genome-based haplotype sets. We also describe the relationship between set maximal matches, long matches, locally maximal matches, and text maximal matches on the GBWT, PBWT, and the BWT. We provide algorithms for outputting some of these matches by applying the data structures of the r-index (introduced by Gagie et al.) to the GBWT. We show that these structures enable set maximal match and long match queries on the GBWT in almost linear time and in space close to linear in the number of runs in the GBWT. We also provide multiple versions of the query algorithms for different combinations of the available data structures. The long match query algorithms presented here even run on the BWT in the same time complexity as the GBWT due to their similarity.
Title: Haplotype Matching with GBWT for Pangenome Graphs
Description:
Traditionally, variations from a linear reference genome were used to represent large sets of haplotypes compactly.
In the linear reference genome based paradigm, the positional Burrows-Wheeler transform (PBWT) has traditionally been used to perform efficient haplotype matching.
Pangenome graphs have recently been proposed as an alternative to linear reference genomes for representing the full spectrum of variations in the human genome.
However, haplotype matches in pangenome graph based haplotype sets are not trivially generalizable from haplotype matches in the linear reference genome based haplotype sets.
Work has been done to represent large sets of haplotypes as paths through a pangenome graph.
The graph Burrows-Wheeler transform (GBWT) is one such work.
The GBWT essentially stores the haplotype paths in a run length compressed BWT with compressed local alphabets.
Although efficient in practice count and locate queries on the GBWT were provided by the original authors, the efficient haplotype matching capabilities of the PBWT have never been shown on the GBWT.
In this paper, we formally define the notion of haplotype matches in pangenome graph-based haplotype sets by generalizing from haplotype matches in linear reference genome-based haplotype sets.
We also describe the relationship between set maximal matches, long matches, locally maximal matches, and text maximal matches on the GBWT, PBWT, and the BWT.
We provide algorithms for outputting some of these matches by applying the data structures of the r-index (introduced by Gagie et al.
) to the GBWT.
We show that these structures enable set maximal match and long match queries on the GBWT in almost linear time and in space close to linear in the number of runs in the GBWT.
We also provide multiple versions of the query algorithms for different combinations of the available data structures.
The long match query algorithms presented here even run on the BWT in the same time complexity as the GBWT due to their similarity.
Related Results
Disentangling the impacts of abiotic and biotic environmental factors and dispersal dynamics on the pangenome fluidity of bacterial pathogens
Disentangling the impacts of abiotic and biotic environmental factors and dispersal dynamics on the pangenome fluidity of bacterial pathogens
ABSTRACT
Understanding how pangenomes originate and evolve is crucial for predicting evolutionary trajectories and uncovering ecological interact...
ODGI: understanding pangenome graphs
ODGI: understanding pangenome graphs
Abstract
Motivation
Pangenome graphs provide a complete representation of the mutual alignment of collections of genomes. These...
Cluster-efficient pangenome graph construction with nf-core/pangenome
Cluster-efficient pangenome graph construction with nf-core/pangenome
Abstract
Motivation
Pangenome graphs offer a comprehensive way of capturing genomic variability across multiple genomes. ...
Cluster efficient pangenome graph construction with nf-core/pangenome
Cluster efficient pangenome graph construction with nf-core/pangenome
Abstract
Motivation
Pangenome graphs offer a comprehensive way of capturing genomic variability across multiple genomes. Howeve...
SVPG: A pangenome-based structural variant detection approach and rapid augmentation of pangenome graphs with new samples
SVPG: A pangenome-based structural variant detection approach and rapid augmentation of pangenome graphs with new samples
Abstract
Breakthrough advances in long-read sequencing technologies have opened unprecedented opportunities to study genetic variations through comprehensive pangen...
Genomic characterization of the
C. tuberculostearicum
species complex, a ubiquitous member of the human skin microbiome
Genomic characterization of the
C. tuberculostearicum
species complex, a ubiquitous member of the human skin microbiome
ABSTRACT
Corynebacterium
is a predominant genus in the skin microbiome, yet its genetic diversity on skin is incompletely chara...
PanGraphViewer: A Versatile Tool to Visualize Pangenome Graphs
PanGraphViewer: A Versatile Tool to Visualize Pangenome Graphs
AbstractPangenome graphs provide a powerful way to present both sequence and structural features in a given genome relative to the typical features of a population. There are diffe...
Independent Set in Neutrosophic Graphs
Independent Set in Neutrosophic Graphs
New setting is introduced to study neutrosophic independent number and independent neutrosophic-number arising neighborhood of different vertices. Neighbor is a key term to have th...

