Search engine for discovering works of Art, research articles, and books related to Art and Culture
ShareThis
Javascript must be enabled to continue!

Paying Attention to Attention: High Attention Sites as Indicators of Protein Family and Function in Language Models

View through CrossRef
Abstract Protein Language Models (PLMs) use transformer architectures to capture patterns within protein sequences, providing a powerful computational representation of the protein sequence [1]. Through large-scale training on protein sequence data, PLMs generate vector representations that encapsulate the biochemical and structural properties of proteins [2]. At the core of PLMs is the attention mechanism, which facilitates the capture of long-range dependencies by computing pairwise importance scores across residues, thereby highlighting regions of biological interaction within the sequence [3]. The attention matrices offer an untapped opportunity to uncover specific biological properties of proteins, particularly their functions. In this work, we introduce a novel approach, using the Evolutionary Scale Model (ESM) [4], for identifying High Attention (HA) sites within protein sequences, corresponding to key residues that define protein families. By examining attention patterns across multiple layers, we pinpoint residues that contribute most to family classification and function prediction. Our contributions are as follows: (1) we propose a method for identifying HA sites at critical residues from the middle layers of the PLM; (2) we demonstrate that these HA sites provide interpretable links to biological functions; and (3) we show that HA sites improve active site predictions for functions of unannotated proteins. We make available the HA sites for the human proteome. This work offers a broadly applicable approach to protein classification and functional annotation and provides a biological interpretation of the PLM’s representation. 1 Author Summary Understanding how proteins work is critical to advancements in biology and medicine, and protein language models (PLMs) facilitate studying protein sequences at scale. These models identify patterns within protein sequences by focusing on key regions of the sequence that are important to distinguish the protein. Our work focuses on the Evolutionary Scale Model (ESM), a state-of-the-art PLM, and we analyze the model’s internal attention mechanism to identify the significant residues. We developed a new method to identify “High Attention (HA)” sites—specific parts of a protein sequence that are essential for classifying proteins into families and predicting their functions. By analyzing how the model prioritizes certain regions of protein sequences, we discovered that these HA sites often correspond to residues critical for biological activity, such as active sites where chemical reactions occur. Our approach helps interpret how PLMs understand protein data and enhances predictions for proteins whose functions are still unknown. As part of this work, we provide HA-site information for the entire human proteome, offering researchers a resource to further study the potential functional relevance of these residues.
Title: Paying Attention to Attention: High Attention Sites as Indicators of Protein Family and Function in Language Models
Description:
Abstract Protein Language Models (PLMs) use transformer architectures to capture patterns within protein sequences, providing a powerful computational representation of the protein sequence [1].
Through large-scale training on protein sequence data, PLMs generate vector representations that encapsulate the biochemical and structural properties of proteins [2].
At the core of PLMs is the attention mechanism, which facilitates the capture of long-range dependencies by computing pairwise importance scores across residues, thereby highlighting regions of biological interaction within the sequence [3].
The attention matrices offer an untapped opportunity to uncover specific biological properties of proteins, particularly their functions.
In this work, we introduce a novel approach, using the Evolutionary Scale Model (ESM) [4], for identifying High Attention (HA) sites within protein sequences, corresponding to key residues that define protein families.
By examining attention patterns across multiple layers, we pinpoint residues that contribute most to family classification and function prediction.
Our contributions are as follows: (1) we propose a method for identifying HA sites at critical residues from the middle layers of the PLM; (2) we demonstrate that these HA sites provide interpretable links to biological functions; and (3) we show that HA sites improve active site predictions for functions of unannotated proteins.
We make available the HA sites for the human proteome.
This work offers a broadly applicable approach to protein classification and functional annotation and provides a biological interpretation of the PLM’s representation.
1 Author Summary Understanding how proteins work is critical to advancements in biology and medicine, and protein language models (PLMs) facilitate studying protein sequences at scale.
These models identify patterns within protein sequences by focusing on key regions of the sequence that are important to distinguish the protein.
Our work focuses on the Evolutionary Scale Model (ESM), a state-of-the-art PLM, and we analyze the model’s internal attention mechanism to identify the significant residues.
We developed a new method to identify “High Attention (HA)” sites—specific parts of a protein sequence that are essential for classifying proteins into families and predicting their functions.
By analyzing how the model prioritizes certain regions of protein sequences, we discovered that these HA sites often correspond to residues critical for biological activity, such as active sites where chemical reactions occur.
Our approach helps interpret how PLMs understand protein data and enhances predictions for proteins whose functions are still unknown.
As part of this work, we provide HA-site information for the entire human proteome, offering researchers a resource to further study the potential functional relevance of these residues.

Related Results

Hubungan Perilaku Pola Makan dengan Kejadian Anak Obesitas
Hubungan Perilaku Pola Makan dengan Kejadian Anak Obesitas
<p><em><span style="font-size: 11.0pt; font-family: 'Times New Roman',serif; mso-fareast-font-family: 'Times New Roman'; mso-ansi-language: EN-US; mso-fareast-langua...
On Flores Island, do "ape-men" still exist? https://www.sapiens.org/biology/flores-island-ape-men/
On Flores Island, do "ape-men" still exist? https://www.sapiens.org/biology/flores-island-ape-men/
<span style="font-size:11pt"><span style="background:#f9f9f4"><span style="line-height:normal"><span style="font-family:Calibri,sans-serif"><b><spa...
A Wideband mm-Wave Printed Dipole Antenna for 5G Applications
A Wideband mm-Wave Printed Dipole Antenna for 5G Applications
<span lang="EN-MY">In this paper, a wideband millimeter-wave (mm-Wave) printed dipole antenna is proposed to be used for fifth generation (5G) communications. The single elem...
Family Pediatrics
Family Pediatrics
ABSTRACT/EXECUTIVE SUMMARYWhy a Task Force on the Family?The practice of pediatrics is unique among medical specialties in many ways, among which is the nearly certain presence of ...
Autonomy on Trial
Autonomy on Trial
Photo by CHUTTERSNAP on Unsplash Abstract This paper critically examines how US bioethics and health law conceptualize patient autonomy, contrasting the rights-based, individualist...
Endothelial Protein C Receptor
Endothelial Protein C Receptor
IntroductionThe protein C anticoagulant pathway plays a critical role in the negative regulation of the blood clotting response. The pathway is triggered by thrombin, which allows ...
Indicators of infertility and fertility care: a systematic scoping review
Indicators of infertility and fertility care: a systematic scoping review
Abstract STUDY QUESTION What is the scope of literature regarding infertility and fertility care indicators in terms of types an...

Back to Top