
Pool PaRTI: A PageRank-Based Pooling Method for Identifying Critical Residues and Enhancing Protein Sequence Representations

Abstract

Motivation: Protein language models produce a token-level embedding for each residue, so the output matrix has dimensions that vary with sequence length. Downstream machine learning models, however, typically require fixed-length input vectors, which makes a pooling method necessary to compress the output matrix into a single vector representing the entire protein. Traditional pooling methods often lose substantial information, which degrades downstream task performance. We aim to develop a pooling method that produces more expressive general-purpose protein embedding vectors while offering biological interpretability.

Results: We introduce Pool PaRTI, a novel pooling method that uses internal transformer attention matrices and PageRank to assign token importance weights. Our unsupervised and parameter-free approach consistently prioritizes residues experimentally annotated as critical for function, assigning them higher importance scores. Across four diverse protein machine learning tasks, Pool PaRTI yields significant gains in predictive performance. It also enhances interpretability by identifying biologically relevant regions without relying on explicit structural data or annotated training. To assess generalizability, we evaluated Pool PaRTI with two encoder-only protein language models, confirming its robustness across different models.

Availability and Implementation: Pool PaRTI is implemented in Python with PyTorch and is available at https://github.com/Helix-Research-Lab/Pool_PaRTI.git. The Pool PaRTI sequence embeddings and residue importance values for all human proteins on UniProt are available at https://zenodo.org/records/15036725 for ESM2 and protBERT.
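The idea described in the abstract (rank tokens with PageRank over an attention matrix, then use the ranks as pooling weights) can be illustrated with a minimal sketch. This is not the authors' implementation, which lives in the repository linked above. It assumes a single attention matrix that has already been aggregated over layers and heads (the abstract does not say how that aggregation is done), reads attention[i][j] as a directed edge from token i to token j, and uses the conventional damping factor of 0.85. The function names pagerank, weighted_pool and mean_pool, and the toy data, are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def pagerank(attention: Matrix, damping: float = 0.85,
             tol: float = 1e-10, max_iter: int = 1000) -> List[float]:
    """Power-iteration PageRank on a dense, non-negative weight matrix.

    Assumption: attention[i][j] is the weight of a directed edge i -> j
    (token i attends to token j), so heavily attended tokens rank higher.
    """
    n = len(attention)
    # Row-normalise so every token distributes exactly one unit of score;
    # rows that sum to zero fall back to a uniform distribution.
    transition = []
    for row in attention:
        total = sum(row)
        transition.append([w / total for w in row] if total > 0 else [1.0 / n] * n)

    scores = [1.0 / n] * n
    for _ in range(max_iter):
        new_scores = [
            (1.0 - damping) / n
            + damping * sum(scores[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            break
    return scores


def weighted_pool(embeddings: Matrix, weights: List[float]) -> List[float]:
    """Collapse an (L x d) token-embedding matrix into one d-dimensional vector."""
    total = sum(weights)
    dim = len(embeddings[0])
    return [
        sum(w * token[k] for w, token in zip(weights, embeddings)) / total
        for k in range(dim)
    ]


def mean_pool(embeddings: Matrix) -> List[float]:
    """Baseline: every token gets the same weight."""
    return weighted_pool(embeddings, [1.0] * len(embeddings))


# Toy example: three tokens, all of which attend mostly to token 1.
attention = [
    [0.1, 0.8, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.8, 0.1],
]
embeddings = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 0.0],
]

importance = pagerank(attention)
pooled = weighted_pool(embeddings, importance)
baseline = mean_pool(embeddings)
print("importance:", importance)
print("PageRank-weighted pooling:", pooled)
print("mean pooling:", baseline)
```

In this toy case the pooled vector leans toward the embedding of the heavily attended token, whereas mean pooling treats all three tokens equally. The output length depends only on the embedding dimension, not on sequence length, which is the fixed-length property the abstract asks of a pooling method.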

Related Results

Pooling Operations in Deep Learning: From “Invariable” to “Variable”
Deep learning has become a research hotspot in multimedia, especially in the field of image processing. Pooling operation is an important operation in deep learning. Pooling operat...
Comparison of PageRank Algorithm Implementations on a Single Computer
The PageRank algorithm is used to calculate web page rankings in the Google search engine. A problem arises for the PageRank algorithm due to its large main-memory usage, thus make it i...
Paying Attention to Attention: High Attention Sites as Indicators of Protein Family and Function in Language Models
Protein Language Models (PLMs) use transformer architectures to capture patterns within protein sequences, providing a powerful computational representatio...
Endothelial Protein C Receptor
Introduction: The protein C anticoagulant pathway plays a critical role in the negative regulation of the blood clotting response. The pathway is triggered by thrombin, which allows ...
FPGA implementation of AAD pooling unit and performance analysis
Convolutional Neural Network (CNN) has been witnessing a massive growth for its various applications in different fields. It is a category of Neural Network or Deep learning that i...
Simple Hierarchical PageRank Graph Neural Networks
Graph neural networks (GNNs) have many variants for graph representation learning. Several works introduce PageRank into GNNs to improve its neighborhood aggregati...
Prediction of Protein Subcellular Localization Based on Fusion of Multi-view Features
The prediction of protein subcellular localization is critical for inferring protein functions, gene regulations and protein-protein interactions. With the advances of high-through...
Molecular Dynamics Studies of Intrinsically Disordered Peptides and Proteins
A tremendous amount of evidence has accumulated in regards to the importance of intrinsically disordered proteins (IDPs) in the functioning of the cell and their role in human dise...
