CATHe2: Enhanced CATH Superfamily Detection Using ProstT5 and Structural Alphabets
Abstract
Motivation
The CATH database is a free, publicly available online resource that annotates the evolutionary and structural relationships of protein domains. Because the influx of protein structures, driven mainly by the recent AlphaFold breakthrough, has made manual curation infeasible, the CATH team recently developed CATHe, an automatic CATH superfamily classifier that feeds protein Language Model (pLM) embeddings into a feed-forward network classifier. In this paper we present CATHe2, which uses the same dataset and improves on CATHe by replacing the older pLM ProtT5 with the more recent ProstT5, and by supplying domain 3D information to the classifier in the form of a Structural Alphabet representation, namely 3Di sequence embeddings. Finally, CATHe2 implements a new feed-forward network (FNN, i.e., non-recurrent neural network) classifier architecture, tuned for the CATH superfamily prediction task.
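To make the classifier input concrete, here is a minimal sketch of a feed-forward pass over concatenated AA and 3Di embeddings. It is not the authors' architecture: the layer sizes, the single hidden layer, the random weights, and the names `predict_superfamily`, `EMB_DIM`, `HIDDEN` and `N_CLASSES` are all assumptions chosen for illustration, and the toy vectors stand in for real ProstT5 embeddings.

```python
import math
import random


def dense(x, weights, bias):
    """Affine layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    m = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Random weights for illustration only; a real model would be trained."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict_superfamily(aa_emb, di3_emb, layers):
    """Concatenate the two per-domain embeddings and run a feed-forward pass."""
    x = list(aa_emb) + list(di3_emb)
    for weights, bias in layers[:-1]:
        x = relu(dense(x, weights, bias))
    weights, bias = layers[-1]
    return softmax(dense(x, weights, bias))


# Toy sizes (assumptions): real embeddings and class counts are far larger.
rng = random.Random(0)
EMB_DIM, HIDDEN, N_CLASSES = 8, 16, 5
layers = [init_layer(2 * EMB_DIM, HIDDEN, rng), init_layer(HIDDEN, N_CLASSES, rng)]

aa_emb = [rng.gauss(0.0, 1.0) for _ in range(EMB_DIM)]   # stand-in for the AA embedding
di3_emb = [rng.gauss(0.0, 1.0) for _ in range(EMB_DIM)]  # stand-in for the 3Di embedding

probs = predict_superfamily(aa_emb, di3_emb, layers)
predicted = max(range(N_CLASSES), key=probs.__getitem__)
print(predicted, round(sum(probs), 6))
```

Dropping `di3_emb` and halving the input width of the first layer gives the shape of an AA-only variant.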
Results
The best CATHe2 model reaches an accuracy of 92.2 ± 0.7% and an F1 score of 82.3 ± 1.3% on CATHe’s largest dataset (~1700 superfamilies), an improvement of 9.9 percentage points in F1 score and 6.6 percentage points in accuracy over the previous CATHe version (85.6 ± 0.4% accuracy and 72.4 ± 0.7% F1 score). This model takes ProstT5 embeddings of both the amino acid (AA) sequence and the 3Di sequence as classifier input, but a simplified version requiring only AA sequences already improves CATHe’s F1 score by 6.7 ± 1.3% and accuracy by 6.6 ± 0.7% on the same dataset.
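The headline gains are absolute differences between the reported mean scores, which a short check makes explicit. The variable names are illustrative, and the relative gains are derived here rather than quoted from the paper.

```python
# Reported mean scores (in %) for the best CATHe2 model and the previous CATHe.
cathe2 = {"accuracy": 92.2, "f1": 82.3}
cathe = {"accuracy": 85.6, "f1": 72.4}

# Absolute gain in percentage points, and the same gain relative to CATHe.
abs_gain = {k: round(cathe2[k] - cathe[k], 1) for k in cathe}
rel_gain = {k: round(100 * (cathe2[k] - cathe[k]) / cathe[k], 1) for k in cathe}

print(abs_gain)
print(rel_gain)
```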
Availability & Implementation
The code is available at https://GitHub.com/Mouret-Orfeu/CATHe2
Datasets: https://doi.org/10.5281/zenodo.14534966
Contact
orfeu.mouret.pro@outlook.fr, j.abbass@kingston.ac.uk

