Large-scale Manual Curation and Harmonization of Metadata from Metagenomic and Cancer Genomic Repositories: Challenges and Solutions
Abstract
Public omics repositories contain vast amounts of valuable data, but their metadata suffer from extreme heterogeneity, unstandardized terminologies, and quality issues that severely limit data reusability and cross-study integration. While prospective metadata standards exist, the majority of published omics data remain in non-standardized formats that require retrospective curation. We performed comprehensive manual curation and harmonization of clinical metadata from 212,027 samples across 468 studies in two major repositories: curatedMetagenomicData (93 studies, 22,588 samples) and cBioPortal (375 studies, 189,438 samples). Through systematic ontology mapping, we consolidated redundant, dispersed information into far fewer harmonized columns, reduced the number of unique values, and increased the completeness of major attributes. This curation process revealed common metadata quality issues, including typos, inconsistent terminologies, misplaced values, conflicting annotations, and information inappropriately merged across attributes. We document the challenges, decisions, and solutions encountered during large-scale metadata harmonization across two distinct omics domains. The harmonized metadata, accessible through the OmicsMLRepoR Bioconductor package, enable repository-wide queries and cross-study analyses that were previously challenging with heterogeneous metadata. Our experience provides practical guidance for similar curation efforts and demonstrates the value of investing in retrospective metadata improvement for existing public omics resources.
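To make the harmonization idea concrete, the following is a minimal sketch of how dispersed columns might be consolidated into one harmonized attribute and how completeness could be measured. It is an illustration only, not the authors' pipeline and not the OmicsMLRepoR API: the column names, the synonym table, the harmonized terms, and the helper functions (`harmonize_value`, `consolidate`, `completeness`) are all hypothetical.

```python
# Hypothetical synonym table: normalized raw value -> harmonized term.
# These entries are illustrative and are not the curated maps from the paper.
SYNONYMS = {
    "crc": "Colorectal Carcinoma",
    "colorectal cancer": "Colorectal Carcinoma",
    "colorectal carcinoma": "Colorectal Carcinoma",
    "t2d": "Type 2 Diabetes Mellitus",
    "type 2 diabetes": "Type 2 Diabetes Mellitus",
    "healthy": "Healthy",
}

# Placeholder strings that should count as missing rather than as values.
MISSING = {"", "na", "n/a", "none", "unknown"}


def harmonize_value(raw):
    """Map one raw value to a harmonized term, or None if missing/unmapped."""
    if raw is None:
        return None
    # Normalize case and collapse runs of whitespace before the lookup.
    key = " ".join(str(raw).strip().lower().split())
    if key in MISSING:
        return None
    # Unmapped values return None so they can be flagged for manual review.
    return SYNONYMS.get(key)


def consolidate(record, source_columns):
    """Return the first mappable value found across several source columns."""
    for column in source_columns:
        term = harmonize_value(record.get(column))
        if term is not None:
            return term
    return None


def completeness(values):
    """Fraction of entries that carry a harmonized (non-missing) value."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


# Toy records in which the same information is spread over three columns.
samples = [
    {"disease": "CRC", "diagnosis": ""},
    {"disease": "NA", "diagnosis": "colorectal  cancer"},
    {"study_condition": "Healthy "},
    {"disease": "T2D"},
    {"disease": "unknown"},
]
SOURCE_COLUMNS = ["disease", "diagnosis", "study_condition"]

harmonized = [consolidate(s, SOURCE_COLUMNS) for s in samples]
print(harmonized)
print(f"completeness: {completeness(harmonized):.2f}")
```

In this toy data, three source columns collapse into one attribute, and spelling variants such as "CRC" and "colorectal  cancer" resolve to a single term. Returning `None` for unmapped values, rather than guessing, keeps ambiguous entries visible for manual review.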
Related Results
Big Metadata, Smart Metadata, and Metadata Capital: Toward Greater Synergy Between Data Science and Metadata
Abstract
Purpose
The purpose of the paper is to provide a framework for addressing the disconnect between metadata and data scie...
Literature Review on Metadata Governance
The framework of metadata governance is a subset of the primary data governance framework implementation within an enterprise. Metadata management helps identify data provenance an...
Globally Findable Planetary Data: The Interdisciplinary TRR170-DB Repository
Introduction: The TRR170-DB data repository (https://planetary-data-portal.org/) manages the research data from the collaborative research center ‘Late Accretion onto Ter...
FAIR Digital Objects in Official Statistics
Introduction
Statistical offices on national and international scale provide statistics on demography, labour, income, society, economy, environment and othe...
Metadata quality and interoperability of GLAM digital images
Purpose: This study aims to explore how metadata have been applied in GLAM (galleries, libraries, archives and museums) institutions in New Zealand (NZ) and to analyse its overall qu...
Digital Curation and Doctoral Research
This article considers digital curation in doctoral study and the role of the doctoral supervisor and institution in facilitating students’ acquisition of digital curation skills...
Using Metadata to Understand Search Behavior in Digital Libraries
This thesis explores how search log analysis can be used to gain a deeper understanding of online search behavior in curated collections by leveraging the metadata. For this, we us...

