Search engine for discovering works of Art, research articles, and books related to Art and Culture
ShareThis
Javascript must be enabled to continue!

Syllable-PBWT for space-efficient haplotype long-match query

View through CrossRef
AbstractThe positional Burrows-Wheeler transform (PBWT) has led to tremendous strides in haplotype matching on biobank-scale data. For genetic genealogical search, PBWT-based methods have optimized the asymptotic runtime of finding long matches between a query haplotype and a predefined panel of haplotypes. However, to enable fast query searches, the full-sized panel and PBWT data structures must be kept in memory, preventing existing algorithms from scaling up to modern biobank panels consisting of millions of haplotypes. In this work, we propose a space-efficient variation of PBWT named Syllable-PBWT, which divides every haplotype into syllables, builds the PBWT positional prefix arrays on the compressed syllabic panel, and leverages the polynomial rolling hash function for positional substring comparison. With the Syllable-PBWT data structures, we then present a long match query algorithm named Syllable-Query. Compared to Algorithm 3 of Sanaullah et al. (2021), the most time- and space-efficient previously published solution to the long match query problem, Syllable-Query reduced the memory use by a factor of over 100 on both the UK Biobank genotype data and the 1000 Genomes Project sequence data. Surprisingly, the smaller size of our syllabic data structures allows for more efficient iteration and CPU cache usage, granting Syllable-Query even faster runtime than existing solutions. The implementation of our algorithm is available athttps://github.com/ZhiGroup/Syllable-PBWT.
Title: Syllable-PBWT for space-efficient haplotype long-match query
Description:
AbstractThe positional Burrows-Wheeler transform (PBWT) has led to tremendous strides in haplotype matching on biobank-scale data.
For genetic genealogical search, PBWT-based methods have optimized the asymptotic runtime of finding long matches between a query haplotype and a predefined panel of haplotypes.
However, to enable fast query searches, the full-sized panel and PBWT data structures must be kept in memory, preventing existing algorithms from scaling up to modern biobank panels consisting of millions of haplotypes.
In this work, we propose a space-efficient variation of PBWT named Syllable-PBWT, which divides every haplotype into syllables, builds the PBWT positional prefix arrays on the compressed syllabic panel, and leverages the polynomial rolling hash function for positional substring comparison.
With the Syllable-PBWT data structures, we then present a long match query algorithm named Syllable-Query.
Compared to Algorithm 3 of Sanaullah et al.
(2021), the most time- and space-efficient previously published solution to the long match query problem, Syllable-Query reduced the memory use by a factor of over 100 on both the UK Biobank genotype data and the 1000 Genomes Project sequence data.
Surprisingly, the smaller size of our syllabic data structures allows for more efficient iteration and CPU cache usage, granting Syllable-Query even faster runtime than existing solutions.
The implementation of our algorithm is available athttps://github.
com/ZhiGroup/Syllable-PBWT.

Related Results

Haplotype-based Parallel PBWT for Biobank Scale Data
Haplotype-based Parallel PBWT for Biobank Scale Data
AbstractDurbin’s positional Burrows-Wheeler transform (PBWT) enables algorithms with the optimal time complexity ofO(MN) for reporting all vs all haplotype matches in a population ...
d-PBWT: dynamic positional Burrows-Wheeler transform
d-PBWT: dynamic positional Burrows-Wheeler transform
AbstractDurbin’s PBWT, a scalable data structure for haplotype matching, has been successfully applied to identical by descent (IBD) segment identification and genotype imputation....
Haplotype Matching with GBWT for Pangenome Graphs
Haplotype Matching with GBWT for Pangenome Graphs
Traditionally, variations from a linear reference genome were used to represent large sets of haplotypes compactly. In the linear reference genome based paradigm, the positional Bu...
Minimal Positional Substring Cover: A Haplotype Threading Alternative to Li & Stephens Model
Minimal Positional Substring Cover: A Haplotype Threading Alternative to Li & Stephens Model
AbstractThe Li & Stephens (LS) hidden Markov model (HMM) models the process of reconstructing a haplotype as a mosaic copy of haplotypes in a reference panel (haplotype threadi...
RaPID-Query for Fast Identity by Descent Search and Genealogical Analysis
RaPID-Query for Fast Identity by Descent Search and Genealogical Analysis
AbstractThe size of genetic databases has grown large enough such that, genetic genealogical search, a process of inferring familial relatedness by identifying DNA matches, has bec...
Blood Cross Matching Without Anti-Human Globulin (AHG) and Bovine Serum: A New Interest for an Old Idea
Blood Cross Matching Without Anti-Human Globulin (AHG) and Bovine Serum: A New Interest for an Old Idea
Abstract  Introduction Transfusion medicine promotes the safety of blood transfusions by rigorously testing to eliminate risks of infection and hemolytic. The efficacy (to correct ...
Seditious Spaces
Seditious Spaces
The title ‘Seditious Spaces’ is derived from one aspect of Britain’s colonial legacy in Malaysia (formerly Malaya): the Sedition Act 1948. While colonial rule may seem like it was ...
The Syllable Structure of Tiv
The Syllable Structure of Tiv
Linguistic studies reveal that every language has a particular way of combining its sounds to form words or parts of words called syllables. The paper looks at the syllable structu...

Back to Top