The Hitchhiker’s Guide to Sequencing Data Types and Volumes for Population-Scale Pangenome Construction
Abstract
Long-read (LR) technologies from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) have transformed genomics research by providing diverse data types such as HiFi, duplex, and ultra-long ONT (ULONT). Despite recent strides toward haplotype-phased, gapless genome assemblies using long-read technologies, concerns persist regarding the representation of genetic diversity, prompting the development of pangenome references. However, pangenome studies must balance data types, data volumes, and per-genome cost while maintaining sensitivity, and the absence of comprehensive guidance on optimal data selection exacerbates these challenges. To fill this gap, our study evaluates the available data types, their significance, and the volumes required for robust de novo assembly in population-level pangenome projects. The results show that achieving chromosome-level haplotype-resolved assembly requires 20x high-quality long reads (HQLR) such as PacBio HiFi or ONT duplex, combined with 15-20x of ULONT per haplotype and 30x of long-range data such as Omni-C. High-quality long reads from both platforms yield assemblies with comparable contiguity: HiFi excels in NG50 and phasing accuracy, while duplex generates more T2T contigs. As long-read technologies advance, our study reevaluates the recommended data types and volumes and provides practical guidelines for selecting sequencing platforms and coverage. These insights are intended to support the pangenome research community and to advance genomic studies with broader impact.
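To make the coverage figures above concrete, the following minimal sketch converts the fold-coverage values quoted in the abstract into approximate sequencing yield in gigabases. It assumes the usual relation yield = coverage x genome size. The function names and the genome size passed in are illustrative assumptions, not values from the study, and the sketch does not attempt to resolve whether each figure is meant per haplotype or per diploid genome; the caller chooses the genome size accordingly.

```python
# Fold-coverage ranges quoted in the abstract, as (low, high) pairs.
RECOMMENDED_COVERAGE = {
    "HQLR (HiFi or duplex)": (20, 20),
    "ULONT": (15, 20),
    "Omni-C": (30, 30),
}


def required_gigabases(genome_size_gb, coverage):
    """Return the sequencing yield (Gb) needed for a given fold coverage."""
    return genome_size_gb * coverage


def sequencing_plan(genome_size_gb):
    """Map each data type to its (low, high) required yield in gigabases.

    genome_size_gb is a hypothetical input: the size, in gigabases, of the
    genome (or haplotype) that the coverage should be computed against.
    """
    plan = {}
    for data_type, (low, high) in RECOMMENDED_COVERAGE.items():
        plan[data_type] = (
            required_gigabases(genome_size_gb, low),
            required_gigabases(genome_size_gb, high),
        )
    return plan


if __name__ == "__main__":
    # 3.1 Gb is an assumed, human-like genome size used only for illustration.
    for data_type, (low, high) in sequencing_plan(3.1).items():
        if low == high:
            print(f"{data_type}: about {low:.0f} Gb")
        else:
            print(f"{data_type}: about {low:.0f} to {high:.0f} Gb")
```

Multiplying these per-genome yields by the number of samples in a cohort gives a rough first estimate of total data volume for a population-scale project.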

