
CATHe2: Enhanced CATH Superfamily Detection Using ProstT5 and Structural Alphabets

Abstract

Motivation: The CATH database is a free, publicly available online resource that annotates the evolutionary and structural relationships of protein domains. Because the influx of protein structures, driven mainly by the recent breakthrough of AlphaFold, makes manual curation infeasible, the CATH team recently developed CATHe, an automatic CATH superfamily classifier that feeds protein Language Model (pLM) embeddings into a feed-forward network classifier. Using the same dataset, this paper presents CATHe2, which improves on CATHe in three ways. First, it replaces the older pLM ProtT5 with one of the most recent versions, ProstT5. Second, it introduces domain 3D information as classifier input in the form of a Structural Alphabet representation, namely 3Di sequence embeddings. Finally, it implements a new version of the feed-forward network (FNN, i.e., a non-recurrent neural network) classifier architecture, tuned for the CATH superfamily prediction task.

Results: The best CATHe2 model reaches an accuracy of 92.2 ± 0.7% and an F1 score of 82.3 ± 1.3% on CATHe's largest dataset (~1700 superfamilies). Compared with the previous CATHe version (85.6 ± 0.4% accuracy and 72.4 ± 0.7% F1 score), this is an improvement of 9.9 percentage points in F1 score and 6.6 percentage points in accuracy. This model takes ProstT5 amino acid (AA) sequence and 3Di sequence embeddings as classifier input, but a simplified version requiring only AA sequences already improves CATHe's F1 score by 6.7 ± 1.3% and accuracy by 6.6 ± 0.7% on the same dataset.

Availability & Implementation: The code is available at https://GitHub.com/Mouret-Orfeu/CATHe2 and the datasets at https://doi.org/10.5281/zenodo.14534966

Contact: orfeu.mouret.pro@outlook.fr, j.abbass@kingston.ac.uk
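To make the classifier input described in the abstract more concrete, the following is a minimal sketch of a feed-forward network that takes an AA-sequence embedding and a 3Di-sequence embedding for one domain and outputs superfamily probabilities. It is not the CATHe2 implementation. The mean-pooling of per-residue vectors, the concatenation of the two embeddings, the single hidden layer, the toy dimensions, the random weights, and every function name are assumptions made for illustration only; the real embeddings would come from ProstT5 rather than from a random number generator.

```python
import math
import random


def mean_pool(per_residue):
    """Average per-residue vectors into one fixed-length domain vector (assumed pooling)."""
    n = len(per_residue)
    return [sum(column) / n for column in zip(*per_residue)]


def dense(x, weights, bias):
    """Fully connected layer; `weights` holds one row of input weights per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    """Numerically stable softmax over a list of scores."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_in, n_out):
    """Random weights and zero biases; a trained model would load learned values."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict(aa_emb, tdi_emb, layers):
    """Concatenate both embeddings (assumed fusion) and run a one-hidden-layer FNN."""
    x = aa_emb + tdi_emb  # list concatenation: [AA features | 3Di features]
    (w1, b1), (w2, b2) = layers
    hidden = relu(dense(x, w1, b1))
    return softmax(dense(hidden, w2, b2))


# Toy sizes, far smaller than real pLM embeddings or the ~1700 superfamilies.
rng = random.Random(0)
EMB_DIM, HIDDEN, N_CLASSES, N_RESIDUES = 8, 16, 5, 30

# Stand-ins for per-residue ProstT5 embeddings of the AA and 3Di sequences.
aa_residues = [[rng.gauss(0, 1) for _ in range(EMB_DIM)] for _ in range(N_RESIDUES)]
tdi_residues = [[rng.gauss(0, 1) for _ in range(EMB_DIM)] for _ in range(N_RESIDUES)]

layers = [init_layer(rng, 2 * EMB_DIM, HIDDEN), init_layer(rng, HIDDEN, N_CLASSES)]
probs = predict(mean_pool(aa_residues), mean_pool(tdi_residues), layers)
predicted_class = max(range(N_CLASSES), key=probs.__getitem__)
print(predicted_class, [round(p, 3) for p in probs])
```

Dropping `tdi_emb` from the concatenation (and halving the input width of the first layer) would give the shape of the simplified AA-only variant mentioned in the results.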
