Protein Embedding based Alignment

Despite much progress in alignment algorithms, aligning divergent protein sequences, including those sharing less than 20-35% pairwise identity (the so-called “twilight zone”), remains a difficult problem. Many alignment algorithms have relied on substitution matrices, which were first created in the 1970s, to generate alignments. These matrices, however, do not work well within the twilight zone. We developed PEbA (Protein Embedding based Alignment). Like the traditional Smith-Waterman algorithm, PEbA uses dynamic programming, but the matching score for a pair of amino acids is based on their embeddings from a protein language model. We tested PEbA on benchmark alignments, and the results showed that PEbA greatly outperformed pairwise alignments based on BLOSUM substitution matrices. The size of the improvement in alignment quality varied with the similarity of the sequence pair, with PEbA performing over five times as well for pairs of sequences with < 10% identity. We compared PEbA using embeddings generated by different protein language models (ProtT5 and ESM-2) and found that ProtT5-XL-U50 produced the most useful embeddings for aligning protein sequences. PEbA even outperformed DEDAL, a recently developed deep learning model created specifically for aligning protein sequences, particularly on longer alignments and on sequences with low pairwise identity. Our results suggest that general-purpose protein language models provide useful contextual information for accurate protein alignments.
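The idea of replacing substitution-matrix lookups with embedding-based scores inside a Smith-Waterman-style recursion can be illustrated with a minimal sketch. This is not the PEbA implementation: the use of cosine similarity as the matching score, the `scale` factor, the linear `gap` penalty, and the function names are all assumptions made for illustration, and the two-dimensional vectors stand in for real per-residue embeddings that would come from a protein language model such as ProtT5-XL-U50.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embedding_local_align(emb_a, emb_b, scale=10.0, gap=-4.0):
    """Smith-Waterman-style local alignment scored by embedding similarity.

    emb_a, emb_b: lists of per-residue embedding vectors (one vector per residue).
    scale, gap:   hypothetical parameters; a linear gap penalty is assumed here.
    Returns (best_score, aligned_pairs) with 0-based residue index pairs.
    """
    n, m = len(emb_a), len(emb_b)
    # Pairwise matching scores replace the usual substitution-matrix lookup.
    sim = [[scale * cosine(a, b) for b in emb_b] for a in emb_a]

    # H[i][j] holds the best local alignment score ending at residues i-1 and j-1.
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0.0,                                # start a new local alignment
                H[i - 1][j - 1] + sim[i - 1][j - 1],  # align residue i-1 with j-1
                H[i - 1][j] + gap,                  # gap in sequence b
                H[i][j - 1] + gap,                  # gap in sequence a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if math.isclose(H[i][j], H[i - 1][j - 1] + sim[i - 1][j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif math.isclose(H[i][j], H[i - 1][j] + gap):
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


# Toy stand-ins for per-residue embeddings (real ones have hundreds of dimensions).
toy_a = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
toy_b = [(0.0, 1.0), (1.0, 1.0)]
toy_score, toy_pairs = embedding_local_align(toy_a, toy_b)
print(toy_score, toy_pairs)
```

In this toy case the last two residues of the first sequence have the same embedding directions as the two residues of the second, so the traceback pairs them up. Because the score depends on each residue's vector rather than on its amino acid identity, two identical residues in different contexts can receive different matching scores, which is the property the abstract attributes to language-model embeddings.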

Related Results

Endothelial Protein C Receptor
Introduction: The protein C anticoagulant pathway plays a critical role in the negative regulation of the blood clotting response. The pathway is triggered by thrombin, which allows ...
Dietary Protein and Lysine Levels and Their Effect on Lysine Efficiency and Net Protein Utilization in 12-Week-Old Native Chickens
This study examined the effect of protein and lysine levels on lysine efficiency and net protein utilization in native chickens reared to the age of ...
An Efficient ZZW Construction Using Low-Density Generator-Matrix Embedding Techniques
A novel steganographic algorithm based on ZZW construction is proposed to improve the steganographic embedding efficiency. Low-density generator-matrix (LDGM) embedding is an effic...
Representing Hierarchical Structured Data Using Cone Embedding
Extracting hierarchical structure in graph data is becoming an important problem in fields such as natural language processing and developmental biology. Hierarchical structures ca...
An Alignment-free Method for Phylogeny Estimation using Maximum Likelihood
Abstract: While alignment has traditionally been the primary approach for establishing homology prior to phylogenetic inference, alignment-free me...
A Comparative Analysis of Word Embedding and Deep Learning for Arabic Sentiment Classification
Sentiment analysis on social media platforms (i.e., Twitter or Facebook) has become an important tool to learn about users’ opinions and preferences. However, the accuracy of senti...
Ontology Alignment Techniques
Sometimes the use of a single ontology is not sufficient to cover different vocabularies for the same domain, and it becomes necessary to use several ontologies in order to encompa...
Effective Attributed Network Embedding with Information Behavior Extraction
Abstract: Network embedding has shown its effectiveness in many tasks such as link prediction, node classification, and community detection. Most attributed network embeddin...
