Javascript must be enabled to continue!

On bridging paired-end RNA-seq data

Abstract Motivation The widely-used high-throughput RNA-sequencing technologies (RNA-seq) usually produce paired-end reads. We explore if full fragments can be computationally reconstructed from the sequenced two ends—a problem here we refer to as bridging . Solving this problem provides longer, more informative RNA-seq reads, and hence benefits downstream RNA-seq analysis such as transcriptome assembly and expression quantification. However, bridging is a challenging and complicated task owing to alternative splicing, transcript noises, and sequencing errors. It remains unclear if the data itself provides sufficient information for accurate bridging, let alone proper models and efficient algorithms that characterize and determine the true bridges. Algorithmic Results We studied this problem in two settings: reference-based bridging, which assumes reads alignments are available and reconstructs the alignments of full fragments, and de novo bridging, which reconstructs sequences of entire fragments from sequences of the two ends. We proposed a novel mathematical formulation that works for both settings—to seek a path in an underlying graph data structure (i.e., splice graph for reference-based bridging, and compacted de Bruijn graph for de novo bridging) such that its bottleneck weight is maximized. This formulation characterizes true bridges and is efficient in filtering out false bridges. This formulation admits optimal substructure property, and hence efficient dynamic programming algorithms can be designed. For reference-based bridging, we designed such an algorithm to calculate the top N bridging paths, followed by a voting approach to select one using the distribution of fragment length. For de novo bridging, we designed a new truncated Dijkstra’s algorithm. To further speed up, we proposed a novel algorithm that reuses the shortest path tree to avoid running the truncated Dijkstra’s algorithm from scratch for all vertices. These innovations result in scalable algorithms that can bridge all paired-end reads in a compacted de Bruijn graph with millions of vertices. Experimental Results We showed that paired-end RNA-seq reads can be accurately bridged to a large extend. Our reference-based bridging tool could correctly bridge more than 79.6% of reads. For de novo bridging, high precision was observed with varied sensitivity. We also showed that bridging can improve reference-based transcript assembly: the improvement was significant (up to 14.4% measured with adjusted precision), and universal in all combinations with different aligners and assemblers. Availability Implementations of the algorithms for reference-based and de novo bridging are available at https://github.com/Shao-Group/rnabridge-align and https://github.com/Shao-Group/rnabridge-denovo , respectively. Scripts, datasets, and documentations that can reproduce the experimental results in this manuscript are available at https://github.com/Shao-Group/rnabridge-test .

openRxiv

Xiang Li Qian Shi Mingfu Shao

2021

Title: On bridging paired-end RNA-seq data

Description:

Abstract Motivation The widely-used high-throughput RNA-sequencing technologies (RNA-seq) usually produce paired-end reads.

We explore if full fragments can be computationally reconstructed from the sequenced two ends—a problem here we refer to as bridging .

Solving this problem provides longer, more informative RNA-seq reads, and hence benefits downstream RNA-seq analysis such as transcriptome assembly and expression quantification.

However, bridging is a challenging and complicated task owing to alternative splicing, transcript noises, and sequencing errors.

It remains unclear if the data itself provides sufficient information for accurate bridging, let alone proper models and efficient algorithms that characterize and determine the true bridges.

Algorithmic Results We studied this problem in two settings: reference-based bridging, which assumes reads alignments are available and reconstructs the alignments of full fragments, and de novo bridging, which reconstructs sequences of entire fragments from sequences of the two ends.

We proposed a novel mathematical formulation that works for both settings—to seek a path in an underlying graph data structure (i.

, splice graph for reference-based bridging, and compacted de Bruijn graph for de novo bridging) such that its bottleneck weight is maximized.

This formulation characterizes true bridges and is efficient in filtering out false bridges.

This formulation admits optimal substructure property, and hence efficient dynamic programming algorithms can be designed.

For reference-based bridging, we designed such an algorithm to calculate the top N bridging paths, followed by a voting approach to select one using the distribution of fragment length.

For de novo bridging, we designed a new truncated Dijkstra’s algorithm.

To further speed up, we proposed a novel algorithm that reuses the shortest path tree to avoid running the truncated Dijkstra’s algorithm from scratch for all vertices.

These innovations result in scalable algorithms that can bridge all paired-end reads in a compacted de Bruijn graph with millions of vertices.

Experimental Results We showed that paired-end RNA-seq reads can be accurately bridged to a large extend.

Our reference-based bridging tool could correctly bridge more than 79.

6% of reads.

For de novo bridging, high precision was observed with varied sensitivity.

We also showed that bridging can improve reference-based transcript assembly: the improvement was significant (up to 14.

4% measured with adjusted precision), and universal in all combinations with different aligners and assemblers.

Availability Implementations of the algorithms for reference-based and de novo bridging are available at https://github.

com/Shao-Group/rnabridge-align and https://github.

com/Shao-Group/rnabridge-denovo , respectively.

Scripts, datasets, and documentations that can reproduce the experimental results in this manuscript are available at https://github.

com/Shao-Group/rnabridge-test .

Back

Abstract Introduction Variant calling based on DNA samples has been the gold standard of clinical testing since the advent of Sanger sequencing. The u...

Abstract 2323: Deciphering RNA degradation: Insights from a comparative analysis of paired fresh frozen/FFPE total RNA-seq

Abstract Background: Fresh frozen (FF) and formalin-fixed paraffin-embedded (FFPE) samples are primary resources for archival tissues in cancer studies. Despite the ...

Detection of Multiple Types of Cancer Driver Mutations Using Targeted RNA Sequencing in NSCLC

ABSTRACTCurrently, DNA and RNA are used separately to capture different types of gene mutations. DNA is commonly used for the detection of SNVs, indels and CNVs; RNA is used for an...

Detecting RNA–RNA interactome

AbstractThe last decade has seen a robust increase in various types of novel RNA molecules and their complexity in gene regulation. RNA molecules play a critical role in cellular e...

Generating Synthetic Single Cell Data from Bulk RNA-seq Using a Pretrained Variational Autoencoder

AbstractSingle cell RNA sequencing (scRNA-seq) is a powerful approach which generates genome-wide gene expression profiles at single cell resolution. Among its many applications, i...

Abstract 4875: HIVE Proteomics: Integrated, cloud-based RNA-Seq and proteomics analysis of prostate adenocarcinoma samples

Abstract Automated bottom-up proteomics workflows implemented with modern mass-spectrometry instrumentation can readily generate millions of peptide fragmentation sp...

Benchmarking Algorithms for Gene Set Scoring of Single-cell ATAC-seq Data

AbstractGene set scoring (GSS) has been routinely conducted for gene expression analysis of bulk or single-cell RNA-seq data, which helps to decipher single-cell heterogeneity and ...

A Nile Grass Rat Transcriptomic Landscape Across 22 Organs By Ultra-deep Sequencing and Comparative RNA-seq pipeline (CRSP)

Abstract RNA sequencing (RNA-seq) has been a widely used high-throughput method to characterize transcriptomic dynamics spatiotemporally. However...

Email:
Password:

Email:

On bridging paired-end RNA-seq data

Related Results