
Genome-wide assessment of ortholog quality in Ensembl

Ortholog prediction can be difficult, particularly on a large scale. While methods for clustering genes into protein families have improved greatly with the use of HMM profiling, predicted orthologs should still be verified for robustness. Here, we propose a new quality check that focuses on large-scale rearrangement information. The gene neighbourhoods surrounding each gene of an orthologous pair are expected to be well conserved. Our approach assumes that rearrangements are more likely to affect a group of contiguous genes than genes in isolation (1). Our method uses local synteny information (the 2 genes upstream and the 2 genes downstream of each gene of the queried pair) to calculate a percentage score for orthologous pairs of genes. The higher the gene order conservation (GOC) percentage score of an ortholog, the more confidence we have in its prediction. The distribution of GOC scores across all the orthologs inferred for a genome can also be used as an indication of the quality of that genome's assembly. We have also used the distribution of GOC scores across all the orthologs inferred for a set of species to compare two gene tree pipelines that employ different approaches to gene tree inference. Our GOC scores are combined with whole-genome-alignment-based (WGA) scores and more traditional filters, such as a threshold on the percentage of sequence identity, to identify a subset of high-confidence orthologs. The GOC and WGA scores and the sets of high-confidence orthologs have been available since version 86 of Ensembl.
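To make the scoring idea concrete, here is a minimal Python sketch of a neighbourhood-based score. It is a simplification under stated assumptions, not the Ensembl implementation: it counts how many of the four neighbours of the first gene have a predicted ortholog among the four neighbours of the second gene, and it ignores gene orientation and the relative order of the neighbours. The function names, the data layout (an ordered gene list per chromosome plus an ortholog lookup), the threshold values, and the rule for combining the scores are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Set


def neighbours(order: List[str], gene: str, flank: int = 2) -> List[str]:
    """Return up to `flank` genes on each side of `gene` in a chromosome's gene order."""
    i = order.index(gene)
    return order[max(0, i - flank):i] + order[i + 1:i + 1 + flank]


def goc_score(gene_a: str, gene_b: str,
              order_a: List[str], order_b: List[str],
              orthologs: Dict[str, Set[str]], flank: int = 2) -> int:
    """Simplified gene order conservation score (0, 25, 50, 75 or 100 for flank=2).

    Assumption: a neighbour of gene_a counts as conserved if any of its
    predicted orthologs lies in the neighbourhood of gene_b.
    """
    near_a = neighbours(order_a, gene_a, flank)
    near_b = set(neighbours(order_b, gene_b, flank))
    conserved = sum(1 for g in near_a if orthologs.get(g, set()) & near_b)
    # The denominator is fixed at 2 * flank, so genes near the end of a
    # short scaffold cannot reach 100 in this sketch.
    return 100 * conserved // (2 * flank)


def is_high_confidence(goc: int, wga: float, identity: float,
                       min_goc: int = 50, min_wga: float = 50.0,
                       min_identity: float = 25.0) -> bool:
    """One possible way to combine the filters (hypothetical rule and thresholds)."""
    return identity >= min_identity and (goc >= min_goc or wga >= min_wga)


# Toy example with made-up gene names.
order_a = ["a1", "a2", "a3", "a4", "a5"]
order_b = ["b1", "b2", "b3", "b4", "b5"]
orthologs = {"a1": {"b1"}, "a2": {"b2"}, "a4": {"b4"}, "a5": {"x9"}}

score = goc_score("a3", "b3", order_a, order_b, orthologs)
print(score)  # 75: three of the four neighbours of a3 have an ortholog near b3
```

Because the denominator stays fixed, orthologs on fragmented scaffolds score lower in this sketch, which is in line with the abstract's remark that the score distribution can hint at assembly quality.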

F1000 Research Ltd

Wasiu Akanni, Mateus Patricio, Matthieu Muffato, Bronwen Aken, Paul Flicek

2025


