Novel Biomarker Prediction for Lung Cancer Using Random Forest Classifiers
Lung cancer is considered the most common and the deadliest cancer type. There are two main types: small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC). NSCLC accounts for about 85% of cases, while SCLC accounts for only about 14%. Over the last decade, functional genomics has emerged as a revolutionary tool for studying genetics and uncovering changes in gene expression. RNA-Seq has been applied to investigate rare and novel transcripts that aid in discovering the genetic changes that occur in tumours in different lung cancers. Although RNA-Seq helps to understand and characterise the gene expression involved in lung cancer diagnostics, discovering biomarkers remains a challenge. Classification models help uncover and classify biomarkers based on gene expression levels across the different lung cancers. The current research concentrates on computing transcript statistics from gene transcript files with a normalised fold change of genes and identifying quantifiable differences in gene expression levels between the reference genome and lung cancer samples. The collected data were analysed, and machine learning models were developed to classify genes as causing NSCLC, causing SCLC, causing both or causing neither. An exploratory data analysis was performed to identify the probability distribution and principal features. Because only a limited number of features were available, all of them were used in predicting the class. To address the imbalance in the dataset, the Near Miss under-sampling algorithm was applied. For classification, the research primarily focused on four supervised machine learning algorithms (Logistic Regression, KNN classifier, SVM classifier and Random Forest classifier); two additional ensemble algorithms, XGBoost and AdaBoost, were also considered.
Of these, based on the weighted metrics considered, the Random Forest classifier, with 87% accuracy, was judged the best-performing algorithm and was therefore used to predict the biomarkers causing NSCLC and SCLC. After fine-tuning, it gave a precision of 91.3% and a recall of 91%. The imbalance and limited features in the dataset restrict any further improvement in the model’s accuracy or precision. In the present study, using the gene expression values (LogFC, P Value) as the feature set in the Random Forest classifier, BRAF, KRAS, NRAS and EGFR are predicted to be possible biomarkers causing NSCLC, and ATF6, ATF3, PGDFA, PGDFD, PGDFC and PIP5K1C are predicted to be possible biomarkers causing SCLC, based on the transcriptome analysis. Some of the common biomarkers predicted for both NSCLC and SCLC were CDK4, CDK6, BAK1, CDKN1A and DDB2.
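The Near Miss under-sampling step and the weighted metrics mentioned in the abstract can be illustrated with a minimal pure-Python sketch. The abstract does not say which Near Miss variant or which library was used, so the NearMiss-1 heuristic (keep the majority-class samples with the smallest mean distance to their k nearest minority-class samples), the function names, the parameter k and the toy (logFC, p-value) rows below are all assumptions made for illustration; they are not the authors' implementation or data.

```python
import math
from collections import Counter


def near_miss_1(X, y, k=3):
    """Under-sample every class down to the size of the rarest class.

    Assumed variant: NearMiss-1. For each larger class, keep the samples
    whose mean Euclidean distance to their k nearest rarest-class samples
    is smallest.
    """
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    target = counts[minority]
    minority_pts = [x for x, label in zip(X, y) if label == minority]

    # Every sample of the rarest class is kept.
    keep = [i for i, label in enumerate(y) if label == minority]

    for cls in counts:
        if cls == minority:
            continue
        scored = []
        for i, (x, label) in enumerate(zip(X, y)):
            if label != cls:
                continue
            nearest = sorted(math.dist(x, m) for m in minority_pts)[:k]
            scored.append((sum(nearest) / len(nearest), i))
        scored.sort()
        keep.extend(i for _, i in scored[:target])

    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


def weighted_precision_recall(y_true, y_pred):
    """Per-class precision and recall, averaged with class support as weights."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        predicted = sum(p == cls for p in y_pred)
        cls_precision = tp / predicted if predicted else 0.0
        cls_recall = tp / n
        precision += (n / total) * cls_precision
        recall += (n / total) * cls_recall
    return precision, recall


# Made-up toy rows of (logFC, p-value); NOT data from the study.
X_toy = [(2.0, 0.01), (1.8, 0.02), (0.1, 0.80),
         (0.2, 0.70), (1.5, 0.05), (0.0, 0.90)]
y_toy = ["NSCLC", "NSCLC", "neither", "neither", "neither", "neither"]

X_bal, y_bal = near_miss_1(X_toy, y_toy, k=2)
print(Counter(y_bal))
print(weighted_precision_recall(["a", "a", "a", "b"], ["a", "a", "b", "b"]))
```

In practice these steps would more likely come from existing libraries, for example the `NearMiss` sampler in imbalanced-learn and `precision_score` / `recall_score` with `average="weighted"` in scikit-learn; the hand-written versions here only make the arithmetic visible.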
SAGE Publications
Related Results
Abstract 1345: Evidence for genetic mediation of lung cancer through hay fever.
Abstract
Introduction: In the past decade, advances in genetics have led to the discovery of numerous lung cancer susceptibility variants. The majority of these vari...
Advanced Machine Learning Techniques for Prognostic Analysis in Breast Cancer
Aims
The aim of this research is mainly to use machine learning methods for forecasting significant characteristics related to breast cancer using the data to f...
Factors influencing and patterns of forest utilization in communities around the Huay Tak Teak Biosphere Reserve, Lampang Province
Background and Objectives: To establish the land regulation, it is necessary to know basic information of the surrounding community’s land use and to be aware of basic forest laws....
Abstract 1657: Genome-wide association study of lung cancer: Variation in TP63 gene confers the risk of lung adenocarcinoma
Abstract
Lung cancer is the most common cause of death from cancer worldwide, and its incidence is increasing in East Asian and Western countries. Lung cancer compri...
Optimizing Lung Cancer Risk Prediction with Advanced Machine Learning Algorithms and Techniques
Lung cancer is among the leading causes of cancer death in the U.S.A. as well as globally and causes more deaths than breast, prostate, and colorectal cancers combined. It thus pre...
Edoxaban and Cancer-Associated Venous Thromboembolism: A Meta-analysis of Clinical Trials
Abstract
Introduction
Cancer patients face a venous thromboembolism (VTE) risk that is up to 50 times higher compared to individuals without cancer. In 2010, direct oral anticoagul...
Are Cervical Ribs Indicators of Childhood Cancer? A Narrative Review
Abstract
A cervical rib (CR), also known as a supernumerary or extra rib, is an additional rib that forms above the first rib, resulting from the overgrowth of the transverse proce...
JAK2 variations and functions in lung adenocarcinoma.
e23181 Background: Lung cancer ranks as the first most common cancer and the first leading cause of cancer-related death in China and worldwide. Due to the difficulty in early dia...

