Search engine for discovering works of Art, research articles, and books related to Art and Culture
ShareThis
Javascript must be enabled to continue!

Haplotype Matching with GBWT for Pangenome Graphs

View through CrossRef
Traditionally, variations from a linear reference genome were used to represent large sets of haplotypes compactly. In the linear reference genome based paradigm, the positional Burrows-Wheeler transform (PBWT) has traditionally been used to perform efficient haplotype matching. Pangenome graphs have recently been proposed as an alternative to linear reference genomes for representing the full spectrum of variations in the human genome. However, haplotype matches in pangenome graph based haplotype sets are not trivially generalizable from haplotype matches in the linear reference genome based haplotype sets. Work has been done to represent large sets of haplotypes as paths through a pangenome graph. The graph Burrows-Wheeler transform (GBWT) is one such work. The GBWT essentially stores the haplotype paths in a run length compressed BWT with compressed local alphabets. Although efficient in practice count and locate queries on the GBWT were provided by the original authors, the efficient haplotype matching capabilities of the PBWT have never been shown on the GBWT. In this paper, we formally define the notion of haplotype matches in pangenome graph-based haplotype sets by generalizing from haplotype matches in linear reference genome-based haplotype sets. We also describe the relationship between set maximal matches, long matches, locally maximal matches, and text maximal matches on the GBWT, PBWT, and the BWT. We provide algorithms for outputting some of these matches by applying the data structures of the r-index (introduced by Gagie et al.) to the GBWT. We show that these structures enable set maximal match and long match queries on the GBWT in almost linear time and in space close to linear in the number of runs in the GBWT. We also provide multiple versions of the query algorithms for different combinations of the available data structures. The long match query algorithms presented here even run on the BWT in the same time complexity as the GBWT due to their similarity.
Title: Haplotype Matching with GBWT for Pangenome Graphs
Description:
Traditionally, variations from a linear reference genome were used to represent large sets of haplotypes compactly.
In the linear reference genome based paradigm, the positional Burrows-Wheeler transform (PBWT) has traditionally been used to perform efficient haplotype matching.
Pangenome graphs have recently been proposed as an alternative to linear reference genomes for representing the full spectrum of variations in the human genome.
However, haplotype matches in pangenome graph based haplotype sets are not trivially generalizable from haplotype matches in the linear reference genome based haplotype sets.
Work has been done to represent large sets of haplotypes as paths through a pangenome graph.
The graph Burrows-Wheeler transform (GBWT) is one such work.
The GBWT essentially stores the haplotype paths in a run length compressed BWT with compressed local alphabets.
Although efficient in practice count and locate queries on the GBWT were provided by the original authors, the efficient haplotype matching capabilities of the PBWT have never been shown on the GBWT.
In this paper, we formally define the notion of haplotype matches in pangenome graph-based haplotype sets by generalizing from haplotype matches in linear reference genome-based haplotype sets.
We also describe the relationship between set maximal matches, long matches, locally maximal matches, and text maximal matches on the GBWT, PBWT, and the BWT.
We provide algorithms for outputting some of these matches by applying the data structures of the r-index (introduced by Gagie et al.
) to the GBWT.
We show that these structures enable set maximal match and long match queries on the GBWT in almost linear time and in space close to linear in the number of runs in the GBWT.
We also provide multiple versions of the query algorithms for different combinations of the available data structures.
The long match query algorithms presented here even run on the BWT in the same time complexity as the GBWT due to their similarity.

Related Results

PanGraphViewer: A Versatile Tool to Visualize Pangenome Graphs
PanGraphViewer: A Versatile Tool to Visualize Pangenome Graphs
AbstractPangenome graphs provide a powerful way to present both sequence and structural features in a given genome relative to the typical features of a population. There are diffe...
2021 Census to Census Coverage Survey Matching Results.
2021 Census to Census Coverage Survey Matching Results.
The 2021 England and Wales Census was matched to the Census Coverage Survey (CCS). This was an essential requisite for estimating undercount in the Census. To ensure outputs could ...
Minimal Positional Substring Cover: A Haplotype Threading Alternative to Li & Stephens Model
Minimal Positional Substring Cover: A Haplotype Threading Alternative to Li & Stephens Model
AbstractThe Li & Stephens (LS) hidden Markov model (HMM) models the process of reconstructing a haplotype as a mosaic copy of haplotypes in a reference panel (haplotype threadi...
mtDNA haplotype network analysis: Exploring genetic relationships and diversity in dog haplogroups
mtDNA haplotype network analysis: Exploring genetic relationships and diversity in dog haplogroups
The genetic diversity and relationships of dog haplogroups were studied by analyzing the HV1 region of mitochondrial DNA. Previous studies have found six distinct haplogroups (A, B...
Efficient inference of large pangenomes with PanTA
Efficient inference of large pangenomes with PanTA
AbstractPangenome analysis is an indispensable step in bacterial genomics to address the high variability of bacteria genomes. However, speed and scalability remain a challenge for...
Pangenome – its Aspect and Prospect
Pangenome – its Aspect and Prospect
Pangenome is a very new discipline showing a collection of unique and variable genomes of any species in one model. This discipline is a combination of three subjects of Biology li...
Twilight graphs
Twilight graphs
AbstractThis paper deals primarily with countable, simple, connected graphs and the following two conditions which are trivially satisfied if the graphs are finite:(a) there is an ...

Back to Top