Minimal Positional Substring Cover: A Haplotype Threading Alternative to Li & Stephens Model
Abstract: The Li & Stephens (LS) hidden Markov model (HMM) models the process of reconstructing a haplotype as a mosaic copy of haplotypes in a reference panel (haplotype threading). For small panels, the probabilistic parameterization of LS enables modeling the uncertainties of such mosaics, and LS has been the foundational model for haplotype phasing and imputation. However, LS becomes inefficient when the sample size is large (tens of thousands to millions), because its time complexity is linear in the size of the panel (O(MN), where M is the number of haplotypes and N is the number of sites in the panel). Recently the PBWT, an efficient data structure capturing the local haplotype matching among haplotypes, was proposed to offer fast methods for giving some optimal solution (Viterbi) to the LS HMM. But the solution space of LS for large panels remains elusive. Previously we introduced the Minimal Positional Substring Cover (MPSC) problem as an alternative formulation of LS whose objective is to cover a query haplotype with a minimum number of segments from haplotypes in a reference panel. The MPSC formulation allows a haplotype threading to be generated in time independent of sample size (O(N)). This enables haplotype threading on very large, biobank-scale panels on which the LS model is infeasible. Here we present new results on the solution space of the MPSC by first identifying the property that any MPSC has a set of required regions, and then proposing an MPSC graph. In addition, we derive a number of optimal algorithms for MPSC, including solution enumeration, the Length Maximal MPSC, and h-MPSC solutions. In doing so, our algorithms reveal the solution space of LS for large panels. Even though we only solve an extreme case of LS where the emission probability is 0, our algorithms can be made more robust by PBWT smoothing. We show that our method is informative in revealing the characteristics of biobank-scale data sets and can improve genotype imputation.
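To make the MPSC objective concrete, the following is a minimal sketch of a greedy cover, assuming haplotypes are equal-length strings over site alleles and that segments must match the query exactly at the same positions (the emission-probability-0 case). It scans every panel haplotype at each step, so it runs in O(MN) time and does not reproduce the PBWT-based O(N) algorithm the abstract refers to. The function name `greedy_mpsc` and the `(start, end, haplotype_index)` output format are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Tuple


def greedy_mpsc(panel: List[str], query: str) -> List[Tuple[int, int, int]]:
    """Cover `query` with few positional segments copied from `panel`.

    Returns (start, end_exclusive, haplotype_index) triples. Naive sketch:
    every panel haplotype is scanned at each step, so this is O(M*N).
    """
    n = len(query)
    if any(len(hap) != n for hap in panel):
        raise ValueError("all haplotypes must have the same length as the query")

    cover: List[Tuple[int, int, int]] = []
    start = 0
    while start < n:
        best_end, best_hap = start, -1
        for idx, hap in enumerate(panel):
            # Extend the positional match from `start` as far as it goes.
            end = start
            while end < n and hap[end] == query[end]:
                end += 1
            if end > best_end:
                best_end, best_hap = end, idx
        if best_hap < 0:
            # No haplotype carries the query allele at this site.
            raise ValueError(f"no panel haplotype matches the query at site {start}")
        cover.append((start, best_end, best_hap))
        start = best_end
    return cover


if __name__ == "__main__":
    panel = ["0101", "1110", "0011"]
    print(greedy_mpsc(panel, "0110"))
```

Taking the longest extension at each step gives a cover with the fewest segments here because any sub-segment of a positional match is itself a match, so stopping earlier can never help. A cover need not be unique, which is the solution-space question the abstract raises; this sketch returns only one of them.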

