The Hitchhiker’s Guide to Sequencing Data Types and Volumes for Population-Scale Pangenome Construction

Abstract: Long-read (LR) technologies from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) have transformed genomics research by providing diverse data types such as HiFi, Duplex, and ultra-long ONT (ULONT) reads. Despite recent strides toward haplotype-phased, gapless genome assemblies using long-read technologies, concerns persist regarding the representation of genetic diversity, prompting the development of pangenome references. However, pangenome studies must balance data types, data volumes, and the cost of each assembled genome while maintaining sensitivity, and the absence of comprehensive guidance on optimal data selection exacerbates these challenges. To fill this gap, our study evaluates the available data types, their significance, and the volumes required for robust de novo assembly in population-level pangenome projects. The results show that achieving chromosome-level haplotype-resolved assembly requires 20x high-quality long reads (HQLR) such as PacBio HiFi or ONT duplex, combined with 15-20x of ULONT per haplotype and 30x of long-range data such as Omni-C. High-quality long reads from both platforms yield assemblies with comparable contiguity: HiFi excels in NG50 and phasing accuracy, while duplex generates more telomere-to-telomere (T2T) contigs. As long-read technologies advance, our study reevaluates the recommended data types and volumes and provides practical guidelines for selecting sequencing platforms and coverage. These insights are intended to support the pangenome research community and to advance genomic studies with broader impact.
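The coverage figures above can be turned into approximate sequencing volumes with the usual relation: bases sequenced = coverage x genome size. The sketch below is only an illustration of that arithmetic, not the authors' method. The 3.1 Gb genome size (roughly a haploid human genome), the function names, and the cohort-size parameter are all assumptions introduced here. The abstract also does not spell out whether each coverage figure is counted against a haploid or diploid genome size, so the genome size is left as an explicit input.

```python
# Minimal sketch: convert recommended coverage into sequencing volume.
# All names here are hypothetical; only the coverage figures come from the abstract.

# Assumption: ~3.1 Gb, roughly a haploid human genome. Change for other species
# or if coverage should be counted against a different genome size.
ASSUMED_GENOME_SIZE_GB = 3.1

# (low, high) coverage per data type, as stated in the abstract.
RECOMMENDED_COVERAGE = {
    "HQLR (PacBio HiFi or ONT duplex)": (20, 20),
    "ULONT": (15, 20),
    "Long-range (e.g. Omni-C)": (30, 30),
}


def required_gigabases(coverage: float, genome_size_gb: float) -> float:
    """Return the gigabases of sequence needed to reach `coverage`."""
    if coverage < 0 or genome_size_gb <= 0:
        raise ValueError("coverage must be >= 0 and genome size must be > 0")
    return coverage * genome_size_gb


def plan_volumes(genome_size_gb: float, n_genomes: int = 1) -> dict:
    """Return {data type: (low_gb, high_gb)} for a cohort of `n_genomes`."""
    if n_genomes < 1:
        raise ValueError("n_genomes must be at least 1")
    plan = {}
    for data_type, (low, high) in RECOMMENDED_COVERAGE.items():
        plan[data_type] = (
            required_gigabases(low, genome_size_gb) * n_genomes,
            required_gigabases(high, genome_size_gb) * n_genomes,
        )
    return plan


if __name__ == "__main__":
    for data_type, (low, high) in plan_volumes(ASSUMED_GENOME_SIZE_GB).items():
        if low == high:
            print(f"{data_type}: about {low:.0f} Gb per genome")
        else:
            print(f"{data_type}: about {low:.0f}-{high:.0f} Gb per genome")
```

Passing a larger `n_genomes` scales the same estimate to a cohort, which is where the cost trade-offs between data types become visible.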

openRxiv

Prasad Sarashetti, Josipa Lipovac, Filip Tomas, Mile Šikić, Jianjun Liu

2024


