Genomic benchmarks: a collection of datasets for genomic sequence classification
Abstract
Background
Recently, deep neural networks have been successfully applied in many biological fields. In 2020, the deep learning model AlphaFold won the protein folding competition with predicted structures within the error tolerance of experimental methods. However, this solution to the most prominent bioinformatic challenge of the past 50 years was possible only thanks to a carefully curated benchmark of experimentally determined protein structures. In genomics, we face similar challenges (annotation of genomes and identification of functional elements), but we currently lack benchmarks comparable to the protein folding competition.
Results
Here we present a collection of curated and easily accessible sequence classification datasets in the field of genomics. The collection combines novel datasets constructed by mining publicly available databases with existing datasets obtained from published articles. It currently contains nine datasets that focus on regulatory elements (promoters, enhancers, open chromatin regions) from three model organisms: human, mouse, and roundworm. A simple convolutional neural network is also included in the repository and can be used as a baseline model. The benchmarks and the baseline model are distributed as the Python package ‘genomic-benchmarks’, and the code is available at https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks.
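As a rough illustration of what sequence classification input can look like, the sketch below one-hot encodes a DNA string into a fixed-length matrix of the kind a convolutional network typically consumes. It is a minimal sketch under stated assumptions, not the package's own preprocessing: the names ALPHABET and one_hot_encode are hypothetical, and the choices to zero-pad short sequences and to map ambiguous bases such as N to an all-zero row are assumptions made for the example.

```python
# Hypothetical helper, not part of the genomic-benchmarks package.
ALPHABET = "ACGT"  # assumed channel order: A, C, G, T


def one_hot_encode(sequence, length):
    """Encode a DNA string as a `length` x 4 list of floats.

    Sequences longer than `length` are truncated; shorter ones are
    zero-padded. Bases outside ALPHABET (for example N) become
    all-zero rows. Both choices are assumptions for this sketch.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    rows = []
    for base in sequence.upper()[:length]:
        row = [0.0] * len(ALPHABET)
        if base in index:
            row[index[base]] = 1.0
        rows.append(row)
    # Pad with all-zero rows until the matrix has `length` rows.
    while len(rows) < length:
        rows.append([0.0] * len(ALPHABET))
    return rows


encoded = one_hot_encode("acgtN", length=8)
```

Here encoded has eight rows of four values each: the first four rows mark A, C, G and T in turn, and the remaining rows (the N and the padding) are all zeros.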
Conclusions
Deep learning techniques have revolutionized many biological fields, largely thanks to carefully curated benchmarks. For the field of genomics, we propose a collection of benchmark datasets for the classification of genomic sequences, with an interface for the most commonly used deep learning libraries, an implementation of a simple neural network, and a training framework that can be used as a starting point for future research. The main aim of this effort is to create a repository of shared datasets that will make machine learning for genomics more comparable and reproducible while reducing the overhead for researchers who want to enter the field, leading to healthy competition and new discoveries.
Springer Science and Business Media LLC

