
Protein Embedding based Alignment

Despite considerable progress in alignment algorithms, aligning divergent protein sequences, including those sharing less than 20-35% pairwise identity (the so-called “twilight zone”), remains a difficult problem. Many alignment algorithms have relied on substitution matrices to generate alignments since the matrices were created in the 1970s. These matrices, however, do not work well within the twilight zone. We developed PEbA, for Protein Embedding based Alignment. Like the traditional Smith-Waterman algorithm, PEbA uses dynamic programming, but the matching score for a pair of amino acids is based on their embeddings from a protein language model. We tested PEbA on benchmark alignments, and the results showed that it greatly outperformed BLOSUM substitution matrix-based pairwise alignments. The size of the improvement in alignment quality varied with the similarity of the sequence pairs; for pairs with < 10% identity, PEbA performed over five times as well. We compared PEbA using embeddings generated by different protein language models (ProtT5 and ESM-2) and found that ProtT5-XL-U50 produced the most useful embeddings for aligning protein sequences. PEbA even outperformed DEDAL, a recently developed deep learning model created specifically for aligning protein sequences, particularly on longer alignments and on sequences with low pairwise identity. Our results suggest that general-purpose protein language models provide useful contextual information for accurate protein alignments.
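The idea of replacing substitution-matrix scores with embedding-based scores inside a Smith-Waterman-style recurrence can be illustrated with a minimal sketch. Several choices here are assumptions for illustration and are not taken from the abstract: cosine similarity as the matching score, a linear gap penalty of -0.5, the function names `cosine` and `local_align`, and the tiny 2-D vectors that stand in for per-residue embeddings. In practice the vectors would come from a protein language model such as ProtT5-XL-U50, and the scoring and gap scheme used by PEbA itself may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors (assumed matching score)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def local_align(emb_a, emb_b, gap=-0.5):
    """Smith-Waterman-style local alignment over per-residue vectors.

    emb_a, emb_b: lists of vectors, one per residue.
    gap: linear gap penalty (hypothetical value).
    Returns (best_score, list of aligned (i, j) index pairs).
    """
    n, m = len(emb_a), len(emb_b)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)

    # Fill the score matrix; cell scores never drop below zero (local alignment).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1][j - 1] + cosine(emb_a[i - 1], emb_b[j - 1])
            H[i][j] = max(0.0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + cosine(emb_a[i - 1], emb_b[j - 1])
        if math.isclose(H[i][j], diag):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif math.isclose(H[i][j], H[i - 1][j] + gap):
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


# Toy vectors standing in for residue embeddings (not real model output).
seq_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
seq_b = [[0.0, 1.0], [1.0, 1.0]]
score, pairs = local_align(seq_a, seq_b)
print(score, pairs)
```

In this toy case the last two residues of `seq_a` align with both residues of `seq_b`. Note that cosine similarity between real embeddings is often positive for most residue pairs, so a practical scoring function would likely need rescaling or an offset for the local-alignment zero floor to be meaningful.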

Wiley

Benjamin Giovanni Iovino, Yuzhen Ye

2023



