
Abstract 2708: Toward improved cancer classification using PCA + tSNE dimensionality reduction on bulk RNA-seq data

Abstract
Intro: Minor variations in cancer type can have a major impact on therapeutic effectiveness and on the course of drug research and development. To improve upon existing -omic data classification methods, this study seeks to correctly classify previously unknown cancer samples and to determine -omic overlap between cancers by clustering patient RNA-seq data, using PCA+tSNE for dimensionality reduction.
Background: Although there have been major innovations in the use of bulk RNA-seq data for in silico modeling, RNA-seq analysis has limitations. RNA-seq data has over 20,000 dimensions, which makes it difficult to perform the clustering and differential analysis that in silico modeling depends on. Dimensionality reduction is therefore key to RNA-seq analysis. A novel method typically used in single-cell RNA-seq analysis combines PCA with tSNE. A common argument against using tSNE beyond data visualization is that its results are inconsistent and vary with its hyperparameters. This work demonstrates that combining PCA with tSNE creates a robust dimensionality reduction of bulk RNA-seq data that allows for more accurate clustering of patient samples.
Methods: Using the TCGA dataset with over 38 cancer subtypes and 10,351 total samples, the dimensionality was reduced to 50 principal components using PCA. tSNE was then applied with 1,000-2,000 iterations, a learning rate of 200, and perplexity varied from 5 to 50. Robustness was measured by how well cancers of the same subtype grouped together in the lower-dimensional space, in this case 2D. Agglomerative and k-means clustering algorithms were then applied to the dimensionality reduction results to track the accuracy of clustering on cancer subtypes and the movement of samples between clusters.
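As an illustration of the Methods described above, the following is a minimal sketch of PCA followed by tSNE, with k-means and agglomerative clustering applied to the 2D result. It assumes scikit-learn and NumPy. The function name `reduce_and_cluster`, the synthetic input matrix, and the fixed random seed are hypothetical and not taken from the study, and the tSNE iteration count is left at the library default rather than set to the reported range.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def reduce_and_cluster(expr, n_clusters, n_pcs=50, perplexity=30.0, seed=0):
    """Reduce a samples-by-genes matrix with PCA then tSNE, and cluster the 2D result."""
    # PCA cannot return more components than min(n_samples, n_features).
    n_pcs = min(n_pcs, min(expr.shape))
    pcs = PCA(n_components=n_pcs, random_state=seed).fit_transform(expr)

    # tSNE down to 2D, using the learning rate reported in the abstract (200).
    embedding = TSNE(
        n_components=2,
        perplexity=perplexity,
        learning_rate=200.0,
        random_state=seed,
    ).fit_transform(pcs)

    # Cluster the 2D embedding with both algorithms named in the abstract.
    kmeans_labels = KMeans(
        n_clusters=n_clusters, n_init=10, random_state=seed
    ).fit_predict(embedding)
    agglo_labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(embedding)
    return embedding, kmeans_labels, agglo_labels


# Synthetic stand-in for expression data (an assumption, not TCGA):
# 3 groups of 30 samples each, with 200 "genes".
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 200))
expr = np.vstack([c + rng.normal(size=(30, 200)) for c in centers])

embedding, kmeans_labels, agglo_labels = reduce_and_cluster(
    expr, n_clusters=3, perplexity=10.0
)
print(embedding.shape)
```

Repeating the call over a grid of `perplexity` values and comparing the resulting labels is one way to probe the hyperparameter sensitivity the abstract discusses.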
Results: For pure tSNE without PCA pre-reduction, clustering was not robust, showing less distinction between cancer subtypes and more outliers. Clustering accuracy in this setting was 50-60% for both the k-means and agglomerative models across all varied hyperparameters. PCA+tSNE produced the most robust results, with 60-70% accuracy for both the k-means and agglomerative models across all varied hyperparameters. A low perplexity of 5-20, a learning rate of 100-300, and more iterations (>1,000) produced the best results.
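The abstract does not say how clustering accuracy was computed. One common convention, shown here purely as an assumption, is majority-vote accuracy (cluster purity): each cluster is assigned the most frequent true subtype among its members, and accuracy is the fraction of samples that match their cluster's assigned subtype. The function name and the toy labels below are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote_accuracy(true_labels, cluster_ids):
    """Fraction of samples whose true label equals the majority label of their cluster."""
    if len(true_labels) != len(cluster_ids):
        raise ValueError("true_labels and cluster_ids must have the same length")
    if len(true_labels) == 0:
        raise ValueError("at least one sample is required")

    # Count how often each true label occurs inside each cluster.
    members = defaultdict(Counter)
    for label, cluster in zip(true_labels, cluster_ids):
        members[cluster][label] += 1

    # Samples carrying their cluster's most frequent label count as correct.
    correct = sum(counts.most_common(1)[0][1] for counts in members.values())
    return correct / len(true_labels)


# Toy example with made-up subtype labels and cluster assignments.
subtypes = ["BRCA", "BRCA", "BRCA", "LUAD", "LUAD", "KIRC"]
clusters = [0, 0, 1, 1, 1, 1]
accuracy = majority_vote_accuracy(subtypes, clusters)
print(f"{accuracy:.3f}")
```

Purity never decreases as the number of clusters grows, so comparisons between methods are only meaningful at a fixed cluster count.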
Conclusions: PCA+tSNE is able to create accurate low-dimensional representations of patient RNA-seq data that can be used to determine similarities and differences between patient samples based on gene expression data. The 10% variation in clustering accuracy suggests the method is robust and may help inform research and treatment by more reliably classifying cancer samples and by identifying similarities between cancers, especially among underserved, misdiagnosed, and rare cancers.
Citation Format: Michael Bocker, Mikhail G. Grushko, Katherine E. Arline. Toward improved cancer classification using PCA + tSNE dimensionality reduction on bulk RNA-seq data [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2022; 2022 Apr 8-13. Philadelphia (PA): AACR; Cancer Res 2022;82(12_Suppl):Abstract nr 2708.

Related Results

Abnormal Status Detection of Catenary Based on TSNE Dimensionality Reduction Method and IGWO-LSSVM Model
Background: Catenary is a crucial component of an electrified railroad's traction power supply system. There is a considerable incidence of abnormal status and failures due to prol...
MARS-seq2.0: an experimental and analytical pipeline for indexed sorting combined with single-cell RNA sequencing v1
Human tissues comprise trillions of cells that populate a complex space of molecular phenotypes and functions and that vary in abundance by 4–9 orders of magnitude. Relying solely ...
Abstract P1-05-23: Utilities and challenges of RNA-Seq based expression and variant calling in a clinical setting
Abstract Introduction Variant calling based on DNA samples has been the gold standard of clinical testing since the advent of Sanger sequencing. The u...
Generating Synthetic Single Cell Data from Bulk RNA-seq Using a Pretrained Variational Autoencoder
Abstract: Single cell RNA sequencing (scRNA-seq) is a powerful approach which generates genome-wide gene expression profiles at single cell resolution. Among its many applications, i...
Detection of Multiple Types of Cancer Driver Mutations Using Targeted RNA Sequencing in NSCLC
Abstract: Currently, DNA and RNA are used separately to capture different types of gene mutations. DNA is commonly used for the detection of SNVs, indels and CNVs; RNA is used for an...
MuSiC2: cell type deconvolution for multi-condition bulk RNA-seq data
Abstract: Cell type composition of intact bulk tissues can vary across samples. Deciphering cell type composition and its changes during disease progression is an important step towa...
Benchmarking Algorithms for Gene Set Scoring of Single-cell ATAC-seq Data
Abstract: Gene set scoring (GSS) has been routinely conducted for gene expression analysis of bulk or single-cell RNA-seq data, which helps to decipher single-cell heterogeneity and ...
