Javascript must be enabled to continue!
Deep graph convolutional network for US birth data harmonization (Preprint)
View through CrossRef
BACKGROUND
In the United States, State laws require birth certificates to be completed for all births; and federal law mandates national collection and publication of births and other vital statistics data. National Center for Health Statistics (NCHS) has published the key statistics of birth data over the years. These data files, from as early as the 1970s, have been released and made publicly available. There are about 3 million new births each year, and every birth is a record in the data set described by hundreds of variables. The total data cover more than half of the current US population, making it an invaluable resource to study and examine birth epidemiology. Using such big data, researchers can ask interesting questions and study longitudinal patterns, for example, the impact of mother's drinking status to infertility in metropolitans in the last decade, or the education level of the biological father to the c-sections over the years.
However, existing published data sets cannot directly support these research questions as there are adjustments to the variables and their categories, which makes these individually published data files fragmented. The information contained in the published data files is highly diverse, containing hundreds of variables each year. Besides minor adjustments like renaming and increasing variable categories, some major updates significantly changed the fields of statistics (including removal, addition, and modification of the variables), making the published data disconnected and ambiguous to use over multiple years. Researchers have previously reconstructed features to study temporal patterns, but the scale is limited (focusing only on a few variables of interest). Many have reinvented the wheels, and such reconstructions lack consistency as different researchers might use different criteria to harmonize variables, leading to inconsistent findings and limiting the reproducibility of research. There is no systematic effort to combine about five decades of data files into a database that includes every variable that has ever been released by NCHS.
OBJECTIVE
To utilize machine learning techniques to combine the United States (US) natality data for the last five decades, with changing variables and factors, into a consistent database.
METHODS
We developed a feasible and efficient deep-learning-based framework to harmonize data sets of live births in the US from 1970 to 2018. We constructed a graph based on the property and elements of databases including variables and conducted a graph convolutional network (GCN) on the graph to learn the graph embeddings for nodes where the learned embeddings implied the similarity of variables. We devised a novel loss function with a slack margin and a banlist mechanism (for a random walk) to learn the desired structure (two nodes sharing more information were more similar to each other.). We developed an active learning mechanism to conduct the harmonization.
RESULTS
We harmonized historical US birth data and resolved conflicts in ambiguous terms. From a total of 9,321 variables (i.e., 783 stemmed variables, from 1970 to 2018) we applied our model iteratively together with human review, obtaining 323 hyperchains of variables. Hyperchains for harmonization were composed of 201 stemmed variable pairs when considering any pairs of different stemmed variables changed over years. During the harmonization, the first round of our model provided 305 candidates stemmed variable pairs (based on the top-20 most similar variables of each variable based on the learned embeddings of variables) and achieved recall and precision of 87.56%, 57.70%, respectively.
CONCLUSIONS
Our harmonized graph neural network (HGNN) method provides a feasible and efficient way to connect relevant databases at a meta-level. Adapting to databases' property and characteristics, HGNN can learn patterns and search relations globally, which is powerful to discover the similarity between variables among databases. Smart utilization of machine learning can significantly reduce the manual effort in database harmonization and integration of fragmented data into useful databases for future research.
Title: Deep graph convolutional network for US birth data harmonization (Preprint)
Description:
BACKGROUND
In the United States, State laws require birth certificates to be completed for all births; and federal law mandates national collection and publication of births and other vital statistics data.
National Center for Health Statistics (NCHS) has published the key statistics of birth data over the years.
These data files, from as early as the 1970s, have been released and made publicly available.
There are about 3 million new births each year, and every birth is a record in the data set described by hundreds of variables.
The total data cover more than half of the current US population, making it an invaluable resource to study and examine birth epidemiology.
Using such big data, researchers can ask interesting questions and study longitudinal patterns, for example, the impact of mother's drinking status to infertility in metropolitans in the last decade, or the education level of the biological father to the c-sections over the years.
However, existing published data sets cannot directly support these research questions as there are adjustments to the variables and their categories, which makes these individually published data files fragmented.
The information contained in the published data files is highly diverse, containing hundreds of variables each year.
Besides minor adjustments like renaming and increasing variable categories, some major updates significantly changed the fields of statistics (including removal, addition, and modification of the variables), making the published data disconnected and ambiguous to use over multiple years.
Researchers have previously reconstructed features to study temporal patterns, but the scale is limited (focusing only on a few variables of interest).
Many have reinvented the wheels, and such reconstructions lack consistency as different researchers might use different criteria to harmonize variables, leading to inconsistent findings and limiting the reproducibility of research.
There is no systematic effort to combine about five decades of data files into a database that includes every variable that has ever been released by NCHS.
OBJECTIVE
To utilize machine learning techniques to combine the United States (US) natality data for the last five decades, with changing variables and factors, into a consistent database.
METHODS
We developed a feasible and efficient deep-learning-based framework to harmonize data sets of live births in the US from 1970 to 2018.
We constructed a graph based on the property and elements of databases including variables and conducted a graph convolutional network (GCN) on the graph to learn the graph embeddings for nodes where the learned embeddings implied the similarity of variables.
We devised a novel loss function with a slack margin and a banlist mechanism (for a random walk) to learn the desired structure (two nodes sharing more information were more similar to each other.
).
We developed an active learning mechanism to conduct the harmonization.
RESULTS
We harmonized historical US birth data and resolved conflicts in ambiguous terms.
From a total of 9,321 variables (i.
e.
, 783 stemmed variables, from 1970 to 2018) we applied our model iteratively together with human review, obtaining 323 hyperchains of variables.
Hyperchains for harmonization were composed of 201 stemmed variable pairs when considering any pairs of different stemmed variables changed over years.
During the harmonization, the first round of our model provided 305 candidates stemmed variable pairs (based on the top-20 most similar variables of each variable based on the learned embeddings of variables) and achieved recall and precision of 87.
56%, 57.
70%, respectively.
CONCLUSIONS
Our harmonized graph neural network (HGNN) method provides a feasible and efficient way to connect relevant databases at a meta-level.
Adapting to databases' property and characteristics, HGNN can learn patterns and search relations globally, which is powerful to discover the similarity between variables among databases.
Smart utilization of machine learning can significantly reduce the manual effort in database harmonization and integration of fragmented data into useful databases for future research.
Related Results
Graph convolutional neural networks for 3D data analysis
Graph convolutional neural networks for 3D data analysis
(English) Deep Learning allows the extraction of complex features directly from raw input data, eliminating the need for hand-crafted features from the classical Machine Learning p...
Domination of Polynomial with Application
Domination of Polynomial with Application
In this paper, .We .initiate the study of domination. polynomial , consider G=(V,E) be a simple, finite, and directed graph without. isolated. vertex .We present a study of the Ira...
Bilangan Terhubung Titik Pelangi pada Graf Garis dan Graf Tengah dari Hasil Operasi Comb Graf Bintang C<sub>3</sub> dan Graf Bintang S<sub>n</sub>
Bilangan Terhubung Titik Pelangi pada Graf Garis dan Graf Tengah dari Hasil Operasi Comb Graf Bintang C<sub>3</sub> dan Graf Bintang S<sub>n</sub>
Penelitian ini bertujuan menentukan bilangan terhubung titik pelangi (rainbow vertex connection number) pada graf garis dan graf tengah yang diperoleh dari hasil operasi comb antar...
Effect of data harmonization of multicentric dataset in ASD/TD classification
Effect of data harmonization of multicentric dataset in ASD/TD classification
Abstract
Machine Learning (ML) is nowadays an essential tool in the analysis of Magnetic Resonance Imaging (MRI) data, in particular in the identification of brain correlat...
Bootstrapping a Biodiversity Knowledge Graph
Bootstrapping a Biodiversity Knowledge Graph
The "biodiversity knowledge graph" is a nice metaphor for connecting biodiversity data sources, but can we actually build it? Do we have sufficient linked data available? Given tha...
CommunityGCN: community detection using node classification with graph convolution network
CommunityGCN: community detection using node classification with graph convolution network
PurposeA community demonstrates the unique qualities and relationships between its members that distinguish it from other communities within a network. Network analysis relies heav...
CG-TGAN: Conditional Generative Adversarial Networks with Graph Neural Networks for Tabular Data Synthesizing
CG-TGAN: Conditional Generative Adversarial Networks with Graph Neural Networks for Tabular Data Synthesizing
Data sharing is necessary for AI to be widely used, but sharing sensitive data with others with privacy is risky.
To solve these problems, it is necessary to synthesize realistic t...
Harmonization of Islamic Legal Institutions and Customary Law in Marriage Dispensation Cases at The Panyabungan Religious Court
Harmonization of Islamic Legal Institutions and Customary Law in Marriage Dispensation Cases at The Panyabungan Religious Court
Harmonization between customary law and Islamic law (fiqh) has long occurred in our homeland. This study aims to illustrate the harmonization between Islamic legal institutions and...

