NORTH: a highly accurate and scalable Naive Bayes based ORTHologous gene clustering algorithm
Abstract
Background:
The principal objective of comparative genomics is to infer attributes of an unknown gene by comparing it with well-studied genes. Identifying orthologous genes plays a pivotal role here, because orthologous genes remain less diverged over the course of evolution. However, identifying orthologous genes is often difficult, slow, and idiosyncratic, especially in the presence of multi-domain proteins, evolutionary dynamics (gene duplication, transfer, loss, introgression etc.), multiple paralogous genes, incomplete genome data, and distantly related species where similarity is hard to recognize.
Motivation:
Advances in identifying orthologs have mostly been confined to developing gene databases, or to methods that rely on computationally expensive BLAST searches or on constructing phylogenetic trees to infer orthologous relationships. These methods generally do not scale well and cannot analyze large amounts of data from diverse organisms with high accuracy. Moreover, most of them involve manual parameter tuning, and hence are neither fully automated nor free from human bias.
Results:
We present NORTH, a novel, automated, highly accurate and scalable machine-learning-based orthologous gene clustering method. We have utilized the biological basis and intuition of orthologous genes and made an effort to incorporate appropriate ideas from machine learning (ML) and natural language processing (NLP). We have discovered that BLAST-search-based protocols closely resemble a "text classification" problem. Thus, we employ the robust bag-of-words model accompanied by a Naive Bayes classifier to cluster the orthologous genes. We studied 1,255,877 genes in the largest 250 ortholog clusters from the KEGG database, across 3,880 organisms comprising the six major groups of life, namely Archaea, Bacteria, Animals, Fungi, Plants and Protists.
Despite covering more than a million genes from distantly related species with acute data imbalance, NORTH is able to cluster them with 98.48% precision, 98.43% recall and a 98.44% F1 score, showing that automatic orthologous gene clustering can be both highly accurate and scalable. NORTH is available as a web interface with a server-side application, along with cross-platform native applications (available at https://nibtehaz.github.io/NORTH/), allowing queries based on individual genes.
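The bag-of-words plus Naive Bayes idea described in the abstract can be illustrated with a small sketch. This is not NORTH's implementation: the abstract does not state how sequences are turned into "words", so the overlapping k-mers (with k = 3), the Laplace smoothing constant, the class name `KmerNaiveBayes`, and the toy sequences and cluster labels below are all assumptions made purely for illustration.

```python
import math
from collections import Counter, defaultdict


def kmers(seq, k=3):
    """Split a sequence into overlapping k-mers (assumed stand-in for 'words')."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class KmerNaiveBayes:
    """Multinomial Naive Bayes over k-mer counts (hypothetical helper class)."""

    def __init__(self, k=3, alpha=1.0):
        self.k = k                      # k-mer length (assumption)
        self.alpha = alpha              # Laplace smoothing constant (assumption)
        self.word_counts = defaultdict(Counter)  # per-cluster k-mer counts
        self.class_counts = Counter()   # number of training sequences per cluster
        self.vocab = set()              # all k-mers seen during training

    def fit(self, sequences, labels):
        for seq, label in zip(sequences, labels):
            words = kmers(seq, self.k)
            self.word_counts[label].update(words)
            self.class_counts[label] += 1
            self.vocab.update(words)
        return self

    def predict(self, seq):
        words = Counter(kmers(seq, self.k))
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            # log prior + sum of smoothed log likelihoods of each k-mer
            score = math.log(n / n_docs)
            for word, count in words.items():
                p = (self.word_counts[label][word] + self.alpha) / (
                    total + self.alpha * vocab_size
                )
                score += count * math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up sequences and cluster labels (not real KEGG data).
train_seqs = ["MKVLAAGIVGL", "MKVLAAGLVGL", "GSHMTTPEERW", "GSHMTSPEERW"]
train_labels = ["cluster_A", "cluster_A", "cluster_B", "cluster_B"]

model = KmerNaiveBayes(k=3).fit(train_seqs, train_labels)
print(model.predict("MKVLAAGIVAL"))
print(model.predict("GSHMTTPEEKW"))
```

Each query is assigned to the cluster whose k-mer profile gives it the highest posterior score, which mirrors how a text classifier assigns a document to a topic from its word counts.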
Cold Spring Harbor Laboratory
Related Results
The Kernel Rough K-Means Algorithm
Background:
Clustering is one of the most important data mining methods. The k-means (c-means) and its derivative methods are the hotspot in the field of clustering research in re...
Classification of Public Sentiment toward the President of Indonesia Using the Naive Bayes Method (Klasifikasi Sentimen Masyarakat terhadap Presiden Indonesia Menggunakan Metode Naive Bayes)
Abstract. Social media platform X has become an important platform for expressing public opinion, particularly in the political context, including the 2024 Presidential Election in...
Parallel density clustering algorithm based on MapReduce and optimized cuckoo algorithm
In the process of parallel density clustering, the boundary points of clusters with different densities are blurred and there is data noise, which affects the clustering performanc...
Development of an Interactive Chatbot Using the Naive Bayes Algorithm (PEMBANGUNAN CHATBOT INTERAKTIF DENGAN MENGGUNAKAN ALGORITMA NAIVE BAYES)
Chatbots have become integral components of modern digital services, facilitating efficient and responsive interactions between users and technology. As artificial intelligence (AI...
MR-DBIFOA: a parallel Density-based Clustering Algorithm by Using Improve Fruit Fly Optimization
Clustering is an important technique for data analysis and knowledge discovery. In the context of big data, the density-based clustering algorithm faces three challenging ...
Comparison of the C4.5 and Naive Bayes Algorithms in Detecting Hypertension at Puskesmas Banyubiru (PERBANDINGAN ALGORITMA C4.5 DAN NAIVE BAYES DALAM MENDETEKSI HIPERTENSI DI PUSKESMAS BANYUBIRU)
Hypertension is the number one cause of death worldwide each year because it is a gateway to other diseases, such as heart disease, kidney failure, diabetes, and stroke (Direktur ...
Sentiment Analysis of YouTube Comments on Videos Related to the Incident Involving an Online Motorcycle-Taxi Driver and a Brimob Member Using the Naive Algorithm (Analisis Sentimen Komentar YouTube pada Video Terkait Insiden Pengemudi Ojek Online dan Anggota Brimob Menggunakan Algoritma Naive)
Social media has become a dynamic space for public expression, where people voice their opinions on various current events. This study aims to analyze...
Expression and polymorphism of genes in gallstones
ABSTRACT
Through the method of clinical case control study, to explore the expression and genetic polymorphism of KLF14 gene (rs4731702 and rs972283) and SR-B1 gene (rs...

