Novel Biomarker Prediction for Lung Cancer Using Random Forest Classifiers
Lung cancer is considered the most common and the deadliest cancer type. It has two main types: small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC). NSCLC accounts for about 85% of cases, while SCLC accounts for only about 14%. Over the last decade, functional genomics has emerged as a revolutionary tool for studying genetics and uncovering changes in gene expression. RNA-Seq has been applied to investigate rare and novel transcripts that aid in discovering the genetic changes that occur in tumours in different lung cancers. Although RNA-Seq helps to understand and characterise the gene expression involved in lung cancer diagnostics, discovering biomarkers remains a challenge. Classification models help uncover and classify biomarkers based on gene expression levels across the different lung cancers. The current research concentrates on computing transcript statistics from gene transcript files with a normalised fold change of genes and identifying quantifiable differences in gene expression levels between the reference genome and lung cancer samples. The collected data were analysed, and machine learning models were developed to classify genes as causing NSCLC, causing SCLC, causing both or causing neither. An exploratory data analysis was performed to identify the probability distribution and principal features. Because only a limited number of features were available, all of them were used in predicting the class. To address the imbalance in the dataset, the Near Miss under-sampling algorithm was applied. For classification, the research primarily focused on 4 supervised machine learning algorithms (Logistic Regression, KNN classifier, SVM classifier and Random Forest classifier); additionally, 2 ensemble algorithms were considered: XGBoost and AdaBoost.
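The abstract does not say which Near Miss variant or library was used. As a minimal sketch, assuming the NearMiss-1 variant, the idea can be written in plain Python: keep the majority-class samples whose mean distance to their k nearest minority-class samples is smallest. The function name `near_miss_1`, the parameter `k` and the sample values are illustrative assumptions, not data or code from the study.

```python
import math


def near_miss_1(majority, minority, n_keep, k=3):
    """Sketch of NearMiss-1 under-sampling (assumed variant).

    Keeps the n_keep majority points whose mean Euclidean distance
    to their k nearest minority points is smallest.
    """
    def mean_knn_distance(point):
        # Distances from this majority point to every minority point.
        dists = sorted(math.dist(point, m) for m in minority)
        nearest = dists[:k]
        return sum(nearest) / len(nearest)

    # Rank majority points from closest to farthest and keep the closest.
    ranked = sorted(majority, key=mean_knn_distance)
    return ranked[:n_keep]


# Hypothetical (LogFC, P value) pairs, invented for illustration only.
majority = [(0.1, 0.80), (0.2, 0.70), (1.9, 0.04),
            (2.1, 0.03), (0.0, 0.90), (1.8, 0.06)]
minority = [(2.0, 0.01), (2.2, 0.02), (2.4, 0.01)]

# Under-sample the majority class down to the minority class size.
kept = near_miss_1(majority, minority, n_keep=len(minority))
balanced = ([(x, "majority") for x in kept]
            + [(x, "minority") for x in minority])
```

In practice a maintained implementation such as the `NearMiss` class in the imbalanced-learn package would likely be preferable; the sketch only shows the selection rule.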
Of the six algorithms compared, based on the weighted metrics considered, the Random Forest classifier, with 87% accuracy, was judged the best-performing algorithm and was therefore used to predict the biomarkers causing NSCLC and SCLC. The imbalance and limited features in the dataset restrict any further improvement in the model’s accuracy or precision. In the present study, using the gene expression values (LogFC, P Value) as the feature set in the Random Forest classifier, BRAF, KRAS, NRAS and EGFR are predicted to be possible biomarkers causing NSCLC, and ATF6, ATF3, PGDFA, PGDFD, PGDFC and PIP5K1C are predicted to be possible biomarkers causing SCLC from the transcriptome analysis. After fine-tuning, the classifier gave a precision of 91.3% and a recall of 91%. Some of the common biomarkers predicted for both NSCLC and SCLC were CDK4, CDK6, BAK1, CDKN1A and DDB2.
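The abstract reports weighted metrics without defining them. A common reading, assumed here, is support-weighted averaging: per-class precision and recall are averaged with weights proportional to the number of true samples in each class. The sketch below illustrates that definition on the four class labels named in the abstract; the function name and the label sequences are made up for illustration and are not results from the study.

```python
from collections import Counter


def weighted_precision_recall(y_true, y_pred):
    """Support-weighted precision and recall (assumed definition)."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    precision = 0.0
    recall = 0.0
    for label in sorted(support):
        # True positives: predicted and actual class both equal label.
        tp = sum(1 for t, p in zip(y_true, y_pred)
                 if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        class_precision = tp / predicted if predicted else 0.0
        class_recall = tp / support[label]
        weight = support[label] / total
        precision += weight * class_precision
        recall += weight * class_recall
    return precision, recall


# Hypothetical labels, not taken from the study's dataset.
y_true = ["NSCLC", "NSCLC", "SCLC", "SCLC", "both", "neither"]
y_pred = ["NSCLC", "SCLC", "SCLC", "SCLC", "both", "neither"]

precision, recall = weighted_precision_recall(y_true, y_pred)
```

Under this definition, weighted recall always equals overall accuracy, so a gap between a reported accuracy and a reported weighted recall would suggest the two were measured on different model configurations or data splits.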
SAGE Publications
Related Results
Small Cell Lung Cancer and Tarlatamab: A Meta-Analysis of Clinical Trials
Abstract
Introduction
Tarlatamab is a Delta-like ligand 3 (DLL3) -directed bispecific T-cell engager recently approved for use in patients with advanced small cell lung cancer (SCL...
Abstract SY38-02: Clinical investigations of obesity in cancer: BMI and other confounders
Abstract
Obesity has been linked with increased incidence and worse outcomes of at least 13 human cancers. For other cancers, our understanding of their relationship...
Lung Cancer Prediction Using Random Forest
Background:
In recent years, lung cancer is a common cancer across the globe. For the early prediction of lung cancer, medical practitioners and researchers require an efficient pr...
Abstract 1345: Evidence for genetic mediation of lung cancer through hay fever.
Abstract
Introduction: In the past decade, advances in genetics have led to the discovery of numerous lung cancer susceptibility variants. The majority of these vari...
Advanced Machine Learning Techniques for Prognostic Analysis in Breast Cancer
Aims
The aim of this research is mainly to use machine learning methods for forecasting significant characteristics related to breast cancer using the data to f...
Time to Start Up: CT-Based Radiomics in Children’s Lung Diseases
Radiomics is a new interdisciplinary field and a fusion product consisting by large data technology and medical image to aid diagnosis. Radiomics can gather information from differ...
Abstract 1590: Robust evolutionary conservation and pair-wise co-mapping of polygenic colon and lung cancer susceptibility loci
Abstract
Comparing chromosomal locations of statistically significant colon and lung cancer susceptibility loci detected by linkage in mouse and rat and by GWAS i...
The RNA demethylase ALKBH5 promotes the progression and angiogenesis of lung cancer by regulating the stability of the LncRNA PVT1
Abstract
Background
N6-methyladenosine (m6A) is the most common posttranscriptional modification of RNA and plays critical roles in human cancer pro...

