
On bridging paired-end RNA-seq data

Abstract

Motivation: Widely used high-throughput RNA-sequencing (RNA-seq) technologies usually produce paired-end reads. We explore whether full fragments can be computationally reconstructed from the two sequenced ends, a problem we refer to here as bridging. Solving this problem provides longer, more informative RNA-seq reads, and hence benefits downstream RNA-seq analyses such as transcriptome assembly and expression quantification. However, bridging is a challenging and complicated task owing to alternative splicing, transcriptional noise, and sequencing errors. It remains unclear whether the data itself provides sufficient information for accurate bridging, let alone which models and efficient algorithms can characterize and determine the true bridges.

Algorithmic Results: We studied this problem in two settings: reference-based bridging, which assumes read alignments are available and reconstructs the alignments of full fragments, and de novo bridging, which reconstructs the sequences of entire fragments from the sequences of the two ends. We proposed a novel mathematical formulation that works for both settings: seek a path in an underlying graph data structure (i.e., the splice graph for reference-based bridging, and the compacted de Bruijn graph for de novo bridging) such that its bottleneck weight is maximized. This formulation characterizes true bridges and is effective at filtering out false bridges. It also has the optimal-substructure property, and hence efficient dynamic programming algorithms can be designed. For reference-based bridging, we designed such an algorithm to calculate the top N bridging paths, followed by a voting approach that selects one of them using the distribution of fragment lengths. For de novo bridging, we designed a new truncated Dijkstra's algorithm. To speed this up further, we proposed a novel algorithm that reuses the shortest-path tree to avoid running the truncated Dijkstra's algorithm from scratch for all vertices. These innovations result in scalable algorithms that can bridge all paired-end reads in a compacted de Bruijn graph with millions of vertices.

Experimental Results: We showed that paired-end RNA-seq reads can be accurately bridged to a large extent. Our reference-based bridging tool could correctly bridge more than 79.6% of reads. For de novo bridging, high precision was observed with varied sensitivity. We also showed that bridging can improve reference-based transcript assembly: the improvement was significant (up to 14.4% measured with adjusted precision), and universal across all combinations of aligners and assemblers.

Availability: Implementations of the algorithms for reference-based and de novo bridging are available at https://github.com/Shao-Group/rnabridge-align and https://github.com/Shao-Group/rnabridge-denovo, respectively. Scripts, datasets, and documentation that can reproduce the experimental results in this manuscript are available at https://github.com/Shao-Group/rnabridge-test.
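The bottleneck formulation described in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation and does not reproduce their truncated Dijkstra's algorithm or top-N dynamic program; it is a generic maximum-bottleneck (widest) path search using a Dijkstra-style priority queue. The graph encoding, the function name `widest_path`, and the toy weights are all assumptions made for illustration, with edge weights standing in for some positive measure of read support.

```python
import heapq

def widest_path(graph, source, target):
    """Find a path from source to target whose smallest edge weight is as large as possible.

    graph maps each vertex to a list of (neighbor, weight) pairs; weights are
    assumed to be positive. Returns (bottleneck, path), or (0, []) if the
    target cannot be reached.
    """
    best = {source: float("inf")}      # best bottleneck known for each vertex
    parent = {source: None}            # predecessor on the best path found so far
    heap = [(-float("inf"), source)]   # negated bottleneck turns heapq into a max-heap
    done = set()
    while heap:
        neg_bottleneck, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break                      # the first time target is popped, its value is final
        for v, w in graph.get(u, []):
            candidate = min(-neg_bottleneck, w)
            if candidate > best.get(v, 0):
                best[v] = candidate
                parent[v] = u
                heapq.heappush(heap, (-candidate, v))
    if target not in best:
        return 0, []
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = parent[node]
    return best[target], path[::-1]

# Hypothetical toy graph: the route through "b" has a weak edge (weight 1),
# so the route through "c" (bottleneck 2) is preferred.
example_graph = {
    "a": [("b", 5), ("c", 2)],
    "b": [("d", 1)],
    "c": [("d", 2)],
    "d": [],
}
print(widest_path(example_graph, "a", "d"))
```

In this toy graph the heavier first edge toward "b" is not enough: the path is scored by its weakest edge, which is the property the abstract relies on to filter out weakly supported bridges.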

Related Results

Abstract P1-05-23: Utilities and challenges of RNA-Seq based expression and variant calling in a clinical setting
Abstract Introduction Variant calling based on DNA samples has been the gold standard of clinical testing since the advent of Sanger sequencing. The u...
Abstract 2323: Deciphering RNA degradation: Insights from a comparative analysis of paired fresh frozen/FFPE total RNA-seq
Abstract Background: Fresh frozen (FF) and formalin-fixed paraffin-embedded (FFPE) samples are primary resources for archival tissues in cancer studies. Despite the ...
Detecting RNA–RNA interactome
Abstract The last decade has seen a robust increase in various types of novel RNA molecules and their complexity in gene regulation. RNA molecules play a critical role in cellular e...
Detection of Multiple Types of Cancer Driver Mutations Using Targeted RNA Sequencing in NSCLC
Abstract Currently, DNA and RNA are used separately to capture different types of gene mutations. DNA is commonly used for the detection of SNVs, indels and CNVs; RNA is used for an...
Generating Synthetic Single Cell Data from Bulk RNA-seq Using a Pretrained Variational Autoencoder
Abstract Single cell RNA sequencing (scRNA-seq) is a powerful approach which generates genome-wide gene expression profiles at single cell resolution. Among its many applications, i...
Abstract 4875: HIVE Proteomics: Integrated, cloud-based RNA-Seq and proteomics analysis of prostate adenocarcinoma samples
Abstract Automated bottom-up proteomics workflows implemented with modern mass-spectrometry instrumentation can readily generate millions of peptide fragmentation sp...
Benchmarking Algorithms for Gene Set Scoring of Single-cell ATAC-seq Data
Abstract Gene set scoring (GSS) has been routinely conducted for gene expression analysis of bulk or single-cell RNA-seq data, which helps to decipher single-cell heterogeneity and ...
