Applying negative rule mining to improve genome annotation
Abstract
Background
Unsupervised annotation of proteins by software pipelines suffers from very high error rates. Spurious functional assignments are usually caused by unwarranted homology-based transfer of information from existing database entries to new target sequences. We have previously demonstrated that data mining in large sequence annotation databanks can help identify annotation items that are strongly associated with each other, and that exceptions to strong positive association rules often point to potential annotation errors. Here we investigate whether negative association rule mining can likewise reveal erroneously assigned annotation items.
Results
Almost all exceptions from strong negative association rules are connected to at least one wrong attribute in the feature combination making up the rule. The fraction of annotation features flagged by this approach as suspicious is strongly enriched in errors and constitutes about 0.6% of the whole body of the similarity-transferred annotation in the PEDANT genome database. Positive rule mining does not identify two thirds of these errors. The approach based on exceptions from negative rules is much more specific than positive rule mining, but its coverage is significantly lower.
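To make the idea of an exception from a strong negative rule concrete, the following is a minimal sketch and not the authors' implementation. It assumes each database entry is a set of annotation items, treats a negative rule "a implies not b" as strong when its confidence reaches a threshold, and reports the entries that carry both items as exceptions. The function name, the thresholds, and the item labels such as `loc:membrane` are hypothetical and chosen only for illustration.

```python
from itertools import permutations


def negative_rule_exceptions(entries, min_support=3, min_confidence=0.9):
    """Find exceptions from strong negative rules of the form a -> not b.

    entries: dict mapping an entry id to its set of annotation items.
    Returns a list of (a, b, confidence, exception_ids) tuples.
    Thresholds are illustrative assumptions, not values from the paper.
    """
    items = set().union(*entries.values())
    # Support = number of entries carrying each item.
    support = {it: sum(it in feats for feats in entries.values()) for it in items}
    results = []
    for a, b in permutations(sorted(items), 2):
        # Both items must be frequent; otherwise "a -> not b" is trivially strong.
        if support[a] < min_support or support[b] < min_support:
            continue
        with_a = [eid for eid, feats in entries.items() if a in feats]
        # An exception is an entry that has a and, against the rule, also b.
        exceptions = [eid for eid in with_a if b in entries[eid]]
        confidence = 1 - len(exceptions) / len(with_a)
        if exceptions and confidence >= min_confidence:
            results.append((a, b, confidence, exceptions))
    return results


# Toy data (hypothetical): two annotation items that almost never co-occur.
entries = {f"p{i}": {"loc:membrane"} for i in range(10)}
entries.update({f"p{i}": {"loc:cytoplasm"} for i in range(10, 20)})
entries["p20"] = {"loc:membrane", "loc:cytoplasm"}  # the suspicious entry

results = negative_rule_exceptions(entries)
for a, b, conf, exc in results:
    print(f"{a} -> not {b} (confidence {conf:.2f}); exceptions: {exc}")
```

In this toy set the single entry carrying both items is flagged under both rule directions. Requiring a minimum support for both items reflects the general point that a negative rule is only informative when each item is common enough for co-occurrence to be expected by chance.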
Conclusion
Mining of both negative and positive association rules is a potent tool for finding significant trends in protein annotation and flagging doubtful features for further inspection.
Springer Science and Business Media LLC

