Scoring alignments by embedding vector similarity
Abstract
Sequence similarity is of paramount importance in biology, as similar sequences tend to have similar function and share common ancestry. Scoring matrices, such as PAM or BLOSUM, play a crucial role in all bioinformatics algorithms for identifying similarities, but have the drawback that they are fixed and independent of context. We propose a new scoring method for amino acid similarity that remedies this weakness by being context dependent. It relies on recent advances in deep learning architectures that employ self-supervised learning to leverage enormous amounts of unlabelled data and generate contextual embeddings, which are vector representations for words. These ideas have been applied to protein sequences, producing embedding vectors for protein residues. We propose the E-score between two residues as the cosine similarity between their embedding vector representations. Thorough testing on a wide variety of reference multiple sequence alignments indicates that the alignments produced using the new E-score method, especially ProtT5-score, are significantly better than those obtained using BLOSUM matrices. The new method proposes to change the way alignments are computed, with far-reaching implications in all areas of textual data that use sequence similarity. The program to compute alignments based on various E-scores is available as a web server at e-score.csd.uwo.ca. The source code is freely available for download from github.com/lucian-ilie/E-score.
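The E-score definition can be illustrated with a minimal sketch. It assumes per-residue embedding vectors are already available as lists of floats; the toy 3-dimensional vectors below are placeholders, not output of ProtT5 or any other model. The function names, the linear gap penalty of -0.25, and the use of plain Needleman-Wunsch global alignment are all assumptions made for illustration and are not taken from the authors' program.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def e_score_matrix(emb_a, emb_b):
    """E-score for every residue pair: rows index sequence A, columns sequence B."""
    return [[cosine_similarity(u, v) for v in emb_b] for u in emb_a]

def global_alignment_score(scores, gap=-0.25):
    """Needleman-Wunsch optimal score with a linear gap penalty.

    The gap value is a hypothetical choice; the scores matrix replaces
    the usual lookup in a fixed substitution matrix such as BLOSUM.
    """
    n, m = len(scores), len(scores[0])
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap  # prefix of A aligned to gaps only
    for j in range(1, m + 1):
        dp[0][j] = j * gap  # prefix of B aligned to gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + scores[i - 1][j - 1],  # align residue i with residue j
                dp[i - 1][j] + gap,                       # gap in B
                dp[i][j - 1] + gap,                       # gap in A
            )
    return dp[n][m]

# Toy embeddings (placeholders): 3 residues in A, 2 residues in B.
emb_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
emb_b = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

scores = e_score_matrix(emb_a, emb_b)
total = global_alignment_score(scores)
print(total)  # 1.75: two matches of score 1.0 and one gap of -0.25
```

Because the score for a residue pair is computed from the embeddings of those two positions, the same pair of amino acids can receive different scores in different sequence contexts, which is the property a fixed matrix lacks.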

