Deep learning methods may not outperform other machine learning methods on analyzing genomic studies

Deep learning (DL) has been broadly applied to big data problems in biomedical fields and has been most successful in image processing. Recently, many DL methods have been applied to genomic studies. However, genomic data sets usually have too few samples to fit a complex network, and, unlike images, they lack common structural patterns that would allow the use of pre-trained networks or convolution layers. Concern about the overuse of DL methods motivates us to compare their performance with that of popular non-deep machine learning (ML) methods for analyzing genomic data across a wide range of sample sizes. In this paper, we conduct a benchmark study using the UK Biobank data and many random subsets of it with different sample sizes. The original UK Biobank data has about 500k participants. Each participant has comprehensive patient characteristics, disease histories, and genomic information, i.e., the genotypes of millions of single-nucleotide polymorphisms (SNPs). We are interested in predicting the risk of three lung diseases: asthma, COPD, and lung cancer. Recorded disease outcomes for these three diseases are available for 205,238 participants. Five prediction models are investigated in this benchmark study: three non-deep ML methods (Elastic Net, XGBoost, and SVM) and two DL methods (DNN and LSTM). Besides the most popular performance metrics, such as the F1-score, we promote the hit curve, a visual tool for describing performance in predicting rare events. We found that DL methods frequently fail to outperform non-deep ML methods in analyzing genomic data, even in large datasets with over 200k samples. The experimental results suggest that DL methods should not be overused in genomic studies, even with biobank-level sample sizes. The performance differences between DL and non-deep ML methods decrease as the sample size increases. This suggests that, when the sample size is already large, further increases yield larger performance gains for DL methods. Hence, DL methods could perform better on genomic data sets larger than the one analyzed in this study.
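
The abstract names the hit curve without defining it. As a rough illustration only, the sketch below assumes the common definition: rank samples by predicted risk, then count how many true cases fall among the top k. The function name `hit_curve` and the toy labels and scores are hypothetical and are not taken from the paper.

```python
from itertools import accumulate


def hit_curve(y_true, y_score):
    """Cumulative number of true cases among the k highest-scored samples.

    Assumed definition (not confirmed by the abstract): entry k-1 of the
    result is the number of positives found in the top-k predictions.
    """
    # Indices sorted from highest to lowest predicted risk.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    # Running total of true cases encountered while walking down the ranking.
    return list(accumulate(y_true[i] for i in order))


# Toy data (hypothetical): 2 cases among 8 samples, i.e. a rare event.
y_true = [0, 1, 0, 0, 1, 0, 0, 0]
y_score = [0.10, 0.90, 0.20, 0.80, 0.70, 0.30, 0.05, 0.40]

hits = hit_curve(y_true, y_score)

# Baseline for comparison: expected hits if the top k were chosen at random.
prevalence = sum(y_true) / len(y_true)
random_hits = [k * prevalence for k in range(1, len(y_true) + 1)]
```

Plotting `hits` against k, alongside `random_hits`, gives a curve that rises faster the better the model concentrates true cases at the top of its ranking. That is one reason such a plot can be more informative than a single threshold-based score when cases are rare.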

Frontiers Media SA

Yao Dong, Shaoze Zhou, Li Xing, Yumeng Chen, Ziyu Ren, Yongfeng Dong, Xuekui Zhang

Frontiers in Genetics

2022
