ProFET: Protein Feature Engineering Toolkit captures high-level protein functions

The number of sequenced genomes and proteins is growing at an unprecedented pace. Unfortunately, manual curation and functional knowledge lag behind. Homology-based inference often fails at labeling proteins with diverse functions and broad classes. Sequence-based approaches to functional annotation have several drawbacks: (i) some functions (e.g., cell processes) cannot be detected by sequence-based methods; (ii) statistical models mostly capture local patterns (e.g., protein domains); (iii) rare sequences, or those with very few homologues, cannot be successfully used for inference (Radivojac et al., 2013). We focus on identifying high-level protein functionality in view of these difficulties. We hypothesize that a universal feature engineering approach, combined with machine learning (ML), can classify high-level functions and unified properties without requiring external databases, structural knowledge or sequence alignment. In this study, we present a novel bioinformatics toolkit called ProFET (Protein Feature Engineering Toolkit). ProFET extracts hundreds of features covering elementary biophysical and sequence-derived attributes without using any external databases. As a feature extraction platform, ProFET serves many classification tasks. ProFET was built as a flexible tool for protein sequences of any length. Our platform adds to previous studies that use quantitative feature representations for sequences. What these methods have in common is a transformation step in which thousands of protein sequences are converted to hundreds of features. Traditionally, some of these features carry elementary biochemical and biophysical properties while others are statistically derived (e.g., frequencies of amino acids (AA) and dipeptides). ProFET introduces many novel additions to the elementary representation.
Examples include features based on reduced alphabets, high-performance AA scales, binary autocorrelation, sequence segmentation, mirror k-mers and more. Many of these features not only improve performance while allowing a compact representation, but also expose statistically important properties of proteins. The advantage of using reduced alphabets has been noted in the literature for 3D-structure representation. ProFET helps close the gap between the growth in protein sequences and the lack of functional knowledge. To assess performance, the output of ProFET was used as input for machine learning (ML) approaches. We analyze two test cases in depth: neuropeptide precursors and thermophiles (Fig 1). Classifying thermophilic proteins served as a test case for binary classification of a functionality that is not explicitly derived from the sequence. Classifying neuropeptide hormone precursors serves to assess classification in a poorly studied protein niche. The extracted features show excellent biological interpretability. We expanded our analysis to 17 established and novel protein benchmark classification datasets involving a variety of binary and multi-class tasks. These datasets cover multiple functional aspects, including 3D structure annotations, nucleic acid binding (for RNA and DNA), virus classes and more. The overall results show state-of-the-art performance. The success of ProFET applies to a wide range of high-level functions such as subcellular localization, structural classes and proteins with unique functional properties (e.g., neuropeptide precursors, thermophilic proteins). In addition, ProFET allows easy, universal discovery of new target proteins, as well as understanding of the features underlying different high-level protein functions. In summary, we generalize the approach to a range of tasks, from subcellular localization to viral phylogeny.
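Several of the feature families named above (AA frequencies, reduced alphabets, mirror k-mers) can be illustrated with a minimal Python sketch. This is not ProFET's code: the six-group reduced alphabet, the function names, and the reading of "mirror k-mers" as counting a k-mer together with its reversal are all assumptions made for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical six-letter reduced alphabet (an illustrative grouping by rough
# physicochemical class; ProFET's own alphabets may differ).
REDUCED_GROUPS = {
    "h": "AVLIMC",  # hydrophobic / aliphatic
    "r": "FWY",     # aromatic
    "p": "STNQ",    # polar, uncharged
    "+": "KRH",     # positively charged
    "-": "DE",      # negatively charged
    "g": "GP",      # glycine and proline
}
TO_REDUCED = {aa: symbol
              for symbol, members in REDUCED_GROUPS.items()
              for aa in members}


def aa_composition(seq):
    """Fraction of each of the 20 standard amino acids in seq."""
    counts = Counter(aa for aa in seq.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}


def reduce_sequence(seq):
    """Rewrite seq in the reduced alphabet; unknown residues become 'x'."""
    return "".join(TO_REDUCED.get(aa, "x") for aa in seq.upper())


def kmer_frequencies(seq, k=2, mirror=False):
    """Relative frequency of each k-mer observed in seq.

    With mirror=True a k-mer and its reversal share one key (assumed
    meaning of "mirror k-mers"), which roughly halves the feature count.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if mirror:
        kmers = [min(kmer, kmer[::-1]) for kmer in kmers]
    counts = Counter(kmers)
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


reduced = reduce_sequence("MKTAYIAK")
print(reduced)  # h+phrhh+
print(kmer_frequencies(reduced, k=2, mirror=True))
```

Computing k-mers over the reduced sequence rather than the full 20-letter alphabet keeps the number of possible dipeptide features at 36 instead of 400, which is one way to read the "compact representation" mentioned above.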
In all the illustrated cases, ProFET was used as a generic framework for feature extraction and prediction. External information that is often available (e.g., family PSSMs, GO annotations, structural predictions, disorder predictors) was not included. The results from ProFET were incorporated into data analysis pipelines, implemented in Python, and adapted for multi-genome scale analysis. ProFET provides automatic generation of protein sequence features for statistical learning. The ProFET source code and the datasets used are freely available at https://github.com/ddofer/ProFET.
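As a rough illustration of how sequences of any length can be turned into fixed-length feature vectors for statistical learning, the sketch below combines whole-sequence AA composition with per-segment composition. The equal-length split into three segments, the function names, and the example sequences are assumptions for illustration only; they are not taken from the ProFET source code.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Return the 20 standard amino acid fractions of seq as a list."""
    seq = seq.upper()
    total = sum(seq.count(aa) for aa in AMINO_ACIDS)
    return [seq.count(aa) / total if total else 0.0 for aa in AMINO_ACIDS]


def split_segments(seq, n_segments=3):
    """Split seq into n_segments contiguous parts of near-equal length."""
    bounds = [i * len(seq) // n_segments for i in range(n_segments + 1)]
    return [seq[bounds[i]:bounds[i + 1]] for i in range(n_segments)]


def feature_vector(seq, n_segments=3):
    """Whole-sequence composition followed by one composition per segment.

    The length is always 20 * (1 + n_segments), independent of len(seq).
    """
    vector = composition(seq)
    for segment in split_segments(seq, n_segments):
        vector.extend(composition(segment))
    return vector


# Made-up example sequences of very different lengths.
sequences = ["MKTAYIAKQR", "MKRKAADEGHHLLWWSTPQ" * 20]
feature_matrix = [feature_vector(s) for s in sequences]
print([len(row) for row in feature_matrix])  # [80, 80]
```

Each row of `feature_matrix` has the same length, so the rows can be passed as-is to a standard classifier. Per-segment features let a model distinguish, for example, an N-terminal signal region from the rest of the protein, which whole-sequence composition alone would average away.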

F1000 Research Ltd

Dan Ofer Michal Linial

2025
