From features to functions : leveraging protein feature architectures in comparative genomics

When analyzing genomic data, one of the key challenges is the annotation of new genes. The toolkit for incorporating newly discovered proteins into a comprehensive evolutionary and functional network is diverse. It includes search heuristics that use sequence similarity to identify significantly similar sequences, as well as the identification of orthologs, which are also used for preliminary functional annotation. However, since the function of a gene can change given enough time, it is necessary to consider other information to identify functional divergence between proteins. As one complementary form of information, protein sequences can be annotated with features such as functional domains, transmembrane domains, low complexity regions, secondary structure elements, or compositional properties. The sum of all features annotated onto a sequence, the feature architecture, can provide further information on a protein's function. To exploit this information effectively, tools that can compare and classify feature architectures on a large scale are necessary. However, multiple challenges arise when dealing with feature architectures. Many existing schemes for comparing feature architectures cannot cope with features arising from multiple annotation sources. Those that can still fall short in resolving overlapping and redundant feature annotations.

In this thesis, I present different approaches to leveraging feature architectures as a complementary information source for evolutionary and functional studies. First, we introduce Feature Architecture Similarity (FAS), a tool for pairwise scoring of the similarity between two feature architectures. Its scoring method considers the presence or absence of features as well as positional information. It also allows the integration of features from multiple annotation sources into one feature architecture while resolving overlapping feature annotations.
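To make the idea of scoring on both presence/absence and position concrete, here is a minimal Python sketch. It is not the FAS algorithm: the abstract does not give the scoring formula, so the `Feature` class, the `position_weight` parameter, and the use of relative feature midpoints are all illustrative assumptions. Overlap resolution and multiple annotation sources are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """One annotated feature on a protein (hypothetical representation)."""
    name: str   # feature identifier, e.g. a domain name
    start: int  # 1-based start position on the sequence
    end: int    # inclusive end position on the sequence


def relative_midpoint(feature: Feature, protein_length: int) -> float:
    """Midpoint of the feature as a fraction of the protein length."""
    return ((feature.start + feature.end) / 2) / protein_length


def architecture_similarity(seed, seed_length, query, query_length,
                            position_weight=0.3):
    """Toy similarity in [0, 1] of a query architecture to a seed architecture.

    Assumed scheme (not taken from the thesis): every seed feature found in
    the query earns (1 - position_weight) for presence, plus position_weight
    scaled by how close the best-matching instance lies in relative position.
    Seed features absent from the query contribute 0.
    """
    if not seed:
        raise ValueError("seed architecture must contain at least one feature")
    total = 0.0
    for seed_feature in seed:
        candidates = [q for q in query if q.name == seed_feature.name]
        if not candidates:
            continue  # feature absent in the query
        seed_pos = relative_midpoint(seed_feature, seed_length)
        best_position_score = max(
            1.0 - abs(seed_pos - relative_midpoint(q, query_length))
            for q in candidates
        )
        total += (1.0 - position_weight) + position_weight * best_position_score
    return total / len(seed)


# Usage with made-up feature names and coordinates.
seed_arch = [Feature("kinase", 10, 110), Feature("tm", 150, 170)]
query_arch = [Feature("kinase", 50, 150)]
print(architecture_similarity(seed_arch, 200, query_arch, 200))
```

Note that this toy score is asymmetric: it measures how well the query covers the seed, so swapping the two arguments can change the result.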
In a benchmark on more than 10,000 human-yeast ortholog pairs, architecture similarities assessed with FAS are consistently more plausible than those obtained by using e-values to resolve overlaps or by leaving overlaps unresolved. We then demonstrate the utility of FAS for feature architecture comparison in three case studies.

In the second work package, we apply FAS to assess the functional impact of alternative splicing. Here, we present SPICE, a tool that aids in understanding the functional variation within a proteome that results from different protein isoforms expressed through alternative splicing. In this pipeline, we introduce a new measure, the Expression Weighted Feature Disturbance (EWFD), which combines the FAS score between protein isoforms with their relative transcript expression values. We demonstrate the use of SPICE with two datasets. First, we perform an example analysis using long-read sequencing data from the Long-read RNA-seq Genome Annotation Assessment Project and show how the results can be explored within the SPICE dashboard. Second, we explore how SPICE performs when novel, unannotated transcripts are included, using a larger, more diverse dataset provided by the ENCODE consortium.

In the last part, we move away from pairwise comparisons of feature architectures and instead consider groups of functionally equivalent proteins for machine learning. Here, we train AI models on the feature architectures of a functional group of proteins to identify novel members. We propose a novel approach, in two variants, to encode feature architectures for later machine learning applications. In the first, we dynamically define, for each model, a feature space from the features annotated with FAS. In the second, we consider a static feature space based on the DSSP 8-state secondary structure representation and disordered regions.
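The two encoding variants can be illustrated with a short Python sketch. The abstract does not specify the actual encoding, so everything below is an assumption made for illustration: the dynamic variant is shown as per-feature counts over the feature names seen in a training group, and the static variant as the fraction of residues in each DSSP 8-state class plus the fraction of disordered residues. The function names are hypothetical.

```python
from collections import Counter

# The eight DSSP secondary structure states; "-" stands for loop/other here.
DSSP_STATES = ("H", "B", "E", "G", "I", "T", "S", "-")


def build_dynamic_space(training_architectures):
    """Derive a feature space from the feature names seen in a training group."""
    return sorted({name for arch in training_architectures for name in arch})


def encode_dynamic(architecture, feature_space):
    """Count vector over a dynamically defined feature space.

    Features that are not part of the feature space are ignored.
    """
    counts = Counter(architecture)
    return [counts[name] for name in feature_space]


def encode_static(dssp_string, disorder_mask):
    """Fixed 9-dimensional vector: 8 DSSP state fractions + disorder fraction.

    dssp_string   one DSSP state character per residue
    disorder_mask one 0/1 flag per residue (1 = predicted disordered)
    """
    if len(dssp_string) != len(disorder_mask):
        raise ValueError("DSSP string and disorder mask differ in length")
    if not dssp_string:
        raise ValueError("empty sequence")
    counts = Counter(dssp_string)
    unknown = set(counts) - set(DSSP_STATES)
    if unknown:
        raise ValueError(f"unknown DSSP states: {sorted(unknown)}")
    n = len(dssp_string)
    vector = [counts[state] / n for state in DSSP_STATES]
    vector.append(sum(disorder_mask) / n)
    return vector


# Usage with made-up feature names and a made-up 8-residue protein.
space = build_dynamic_space([["PF1", "TM"], ["PF1", "LCR"]])
print(space, encode_dynamic(["PF1", "PF1", "SignalP"], space))
print(encode_static("HHHHEE--", [0, 0, 0, 0, 0, 0, 1, 1]))
```

The design difference is that the dynamic space changes with each training group, so vectors from different models are not comparable, whereas the static space yields vectors of the same length and meaning for every protein.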
In a test on 59 orthologous groups of the tricarboxylic acid cycle annotated in the Kyoto Encyclopedia of Genes and Genomes, we confirm that both variants achieve high recall. With the three work packages introduced in this thesis, we extend the toolbox for studies on protein function with tools that effectively leverage feature architecture information on large-scale datasets and provide ideas for the future development of further methods for protein comparison and annotation.
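The abstract describes the Expression Weighted Feature Disturbance only as a combination of FAS scores between isoforms and their relative transcript expression. The Python sketch below shows one plausible reading of that description, not the definition used in SPICE: for a given isoform, the dissimilarity (1 minus the FAS score) to every other isoform of the same gene is averaged, weighted by that isoform's share of the gene's expression. The function names and the dictionary layout are assumptions.

```python
def relative_expression(expression):
    """Convert raw per-isoform expression values into fractions of the gene total."""
    total = sum(expression.values())
    if total <= 0:
        raise ValueError("total expression must be positive")
    return {isoform: value / total for isoform, value in expression.items()}


def expression_weighted_disturbance(isoform, expression, fas_scores):
    """Assumed EWFD-like measure for one isoform of a gene.

    expression  maps isoform id -> raw expression value
    fas_scores  maps (isoform_a, isoform_b) -> similarity score in [0, 1]

    An isoform compared with itself is treated as perfectly similar (score 1),
    so it contributes no disturbance.
    """
    weights = relative_expression(expression)
    disturbance = 0.0
    for other, weight in weights.items():
        score = 1.0 if other == isoform else fas_scores[(isoform, other)]
        disturbance += weight * (1.0 - score)
    return disturbance


# Usage with made-up isoform ids, expression values, and scores.
expression = {"iso1": 30.0, "iso2": 10.0}
fas_scores = {("iso1", "iso2"): 0.5, ("iso2", "iso1"): 0.5}
print(expression_weighted_disturbance("iso1", expression, fas_scores))
print(expression_weighted_disturbance("iso2", expression, fas_scores))
```

Under this reading, a weakly expressed isoform whose architecture differs from the dominant isoform receives a high value, while the dominant isoform itself receives a low one.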

University Library J. C. Senckenberg

Julian Dosch

2024
