Javascript must be enabled to continue!

Haplotype Matching with GBWT for Pangenome Graphs

Traditionally, variations from a linear reference genome were used to represent large sets of haplotypes compactly. In the linear reference genome based paradigm, the positional Burrows-Wheeler transform (PBWT) has traditionally been used to perform efficient haplotype matching. Pangenome graphs have recently been proposed as an alternative to linear reference genomes for representing the full spectrum of variations in the human genome. However, haplotype matches in pangenome graph based haplotype sets are not trivially generalizable from haplotype matches in the linear reference genome based haplotype sets. Work has been done to represent large sets of haplotypes as paths through a pangenome graph. The graph Burrows-Wheeler transform (GBWT) is one such work. The GBWT essentially stores the haplotype paths in a run length compressed BWT with compressed local alphabets. Although efficient in practice count and locate queries on the GBWT were provided by the original authors, the efficient haplotype matching capabilities of the PBWT have never been shown on the GBWT. In this paper, we formally define the notion of haplotype matches in pangenome graph-based haplotype sets by generalizing from haplotype matches in linear reference genome-based haplotype sets. We also describe the relationship between set maximal matches, long matches, locally maximal matches, and text maximal matches on the GBWT, PBWT, and the BWT. We provide algorithms for outputting some of these matches by applying the data structures of the r-index (introduced by Gagie et al.) to the GBWT. We show that these structures enable set maximal match and long match queries on the GBWT in almost linear time and in space close to linear in the number of runs in the GBWT. We also provide multiple versions of the query algorithms for different combinations of the available data structures. The long match query algorithms presented here even run on the BWT in the same time complexity as the GBWT due to their similarity.

openRxiv

Ahsan Sanaullah Seba Villalobos Degui Zhi Shaojie Zhang

2025

Title: Haplotype Matching with GBWT for Pangenome Graphs

Description:

Traditionally, variations from a linear reference genome were used to represent large sets of haplotypes compactly.

In the linear reference genome based paradigm, the positional Burrows-Wheeler transform (PBWT) has traditionally been used to perform efficient haplotype matching.

Pangenome graphs have recently been proposed as an alternative to linear reference genomes for representing the full spectrum of variations in the human genome.

However, haplotype matches in pangenome graph based haplotype sets are not trivially generalizable from haplotype matches in the linear reference genome based haplotype sets.

Work has been done to represent large sets of haplotypes as paths through a pangenome graph.

The graph Burrows-Wheeler transform (GBWT) is one such work.

The GBWT essentially stores the haplotype paths in a run length compressed BWT with compressed local alphabets.

Although efficient in practice count and locate queries on the GBWT were provided by the original authors, the efficient haplotype matching capabilities of the PBWT have never been shown on the GBWT.

In this paper, we formally define the notion of haplotype matches in pangenome graph-based haplotype sets by generalizing from haplotype matches in linear reference genome-based haplotype sets.

We also describe the relationship between set maximal matches, long matches, locally maximal matches, and text maximal matches on the GBWT, PBWT, and the BWT.

We provide algorithms for outputting some of these matches by applying the data structures of the r-index (introduced by Gagie et al.

) to the GBWT.

We show that these structures enable set maximal match and long match queries on the GBWT in almost linear time and in space close to linear in the number of runs in the GBWT.

We also provide multiple versions of the query algorithms for different combinations of the available data structures.

The long match query algorithms presented here even run on the BWT in the same time complexity as the GBWT due to their similarity.

Back

Related Results

ODGI: understanding pangenome graphs

Abstract Motivation Pangenome graphs provide a complete representation of the mutual alignment of collections of genomes. These...

Cluster-efficient pangenome graph construction with nf-core/pangenome

Abstract Motivation Pangenome graphs offer a comprehensive way of capturing genomic variability across multiple genomes. ...

Cluster efficient pangenome graph construction with nf-core/pangenome

Abstract Motivation Pangenome graphs offer a comprehensive way of capturing genomic variability across multiple genomes. Howeve...

PanGraphViewer: A Versatile Tool to Visualize Pangenome Graphs

AbstractPangenome graphs provide a powerful way to present both sequence and structural features in a given genome relative to the typical features of a population. There are diffe...

Pangenome graph layout by Path-Guided Stochastic Gradient Descent

Abstract Motivation The increasing availability of complete genomes demands for models to study genomic variability within enti...

2021 Census to Census Coverage Survey Matching Results.

The 2021 England and Wales Census was matched to the Census Coverage Survey (CCS). This was an essential requisite for estimating undercount in the Census. To ensure outputs could ...

Efficient inference of large pangenomes with PanTA

Abstract Pangenome analysis is an indispensable step in bacterial genomics to address the high variability of bacteria genomes. However, speed an...

Minimal Positional Substring Cover: A Haplotype Threading Alternative to Li & Stephens Model

Abstract The Li & Stephens (LS) hidden Markov model (HMM) models the process of reconstructing a haplotype as a mosaic copy of haplotypes in ...

Email:
Password:

Email: