
Privacy preserving synthetic health data

We examine the feasibility of using synthetic medical data generated by GANs in the classroom to teach data science in health informatics. Teaching data analysis with actual medical data, such as electronic healthcare records, is greatly restricted by laws protecting patient privacy, such as HIPAA in the United States. While beneficial, these laws severely limit access to medical data, thus stagnating innovation and limiting educational opportunities. Obfuscating medical data is costly and time-consuming, with high penalties for accidental release. Health histories recovered from deidentified data may result in harm to the subject. Research and education are biased towards the few publicly available datasets, such as the ICU dataset "Medical Information Mart for Intensive Care" (MIMIC). Our focus is on making a wider variety of medical datasets available to medical students and researchers by creating synthetic data that retains utility for teaching purposes, and ideally even for research, while definitively preserving privacy. Our proposed workflow consists of training a generative model on real data in a secure sandboxed environment, exporting the model to the outside, and then synthesizing data from it. This procedure complies with our healthcare partners’ regulatory requirements. We develop a novel Wasserstein GAN and conduct a benchmark study on MIMIC data, comparing it to 5 other approaches using novel metrics, based on nearest neighbor adversarial accuracy, that define the resemblance and privacy of synthetic data generated from real data. We compare 6 data generative methods on the MIMIC-III mortality problem: (1) Gaussian Multivariate, (2) Wasserstein GAN (WGAN), (3) Parzen Windows, (4) Additive Noise Model (ANM), (5) differential-privacy-preserving data obfuscation, and (6) copying the original data.
Using our resemblance metric, only the WGAN and Parzen Windows show enough resemblance when compared to both the training and testing data. With our privacy metric, the WGAN, Gaussian Multivariate, and Parzen Windows all excel. The Gaussian Multivariate method is ultimately ruled out because it does not retain enough utility for education and research purposes. The Parzen Windows method is eliminated because it stores the original data and therefore sacrifices privacy in the model itself. The WGAN was therefore the only effective method that both maintained privacy and allowed model export. This workflow can address the vital need to create datasets for health education and research without deidentification, which can be costly and risky and can lose information.
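
The abstract names nearest neighbor adversarial accuracy as the basis of its metrics but does not give a formula. The sketch below shows one common formulation, offered as an assumption rather than the authors' exact definition: for each record, check whether its nearest neighbor in the other set is farther away than its nearest neighbor in its own set, then average the two resulting rates. Under this reading, a score near 0.5 suggests the synthetic data is hard to tell apart from the real data, a score near 0 suggests near-copies of the real records, and a score near 1 suggests the two sets are easily separated. The function names and the 2-D toy data are hypothetical and stand in for real patient records.

```python
import math
import random


def nn_distance(points, others, skip_same_index=False):
    """Distance from each point to its nearest neighbor in `others`.

    With skip_same_index=True, `others` is the same set as `points`,
    so the point itself (distance 0) is excluded from the search.
    """
    result = []
    for i, p in enumerate(points):
        best = math.inf
        for j, q in enumerate(others):
            if skip_same_index and i == j:
                continue
            best = min(best, math.dist(p, q))
        result.append(best)
    return result


def adversarial_accuracy(real, synth):
    """Assumed formulation of nearest neighbor adversarial accuracy."""
    d_rs = nn_distance(real, synth)                        # real -> synthetic
    d_rr = nn_distance(real, real, skip_same_index=True)   # real -> real
    d_sr = nn_distance(synth, real)                        # synthetic -> real
    d_ss = nn_distance(synth, synth, skip_same_index=True) # synthetic -> synthetic
    # Fraction of records whose own set is closer than the other set.
    real_side = sum(a > b for a, b in zip(d_rs, d_rr)) / len(real)
    synth_side = sum(a > b for a, b in zip(d_sr, d_ss)) / len(synth)
    return 0.5 * (real_side + synth_side)


# Toy data: three hypothetical "generators" compared against the same real set.
random.seed(0)
real = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
fresh = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
copied = list(real)                                   # "copy the original data"
shifted = [(x + 100.0, y + 100.0) for x, y in fresh]  # clearly unlike the real data

print(adversarial_accuracy(real, copied))   # 0.0: every real record has an exact twin
print(adversarial_accuracy(real, shifted))  # 1.0: the two sets never mix
print(adversarial_accuracy(real, fresh))    # close to 0.5 for same-distribution samples
```

Computing the same score once against the training data and once against held-out testing data, as the comparison in the abstract suggests, would be one way to separate resemblance from memorization; how the authors combine the two scores into their privacy metric is not stated here.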

Related Results

Privacy and Security for Digital Health: Assessing Risks and Harms to Users
Electronic Health (e-Health), such as mobile health (mHealth) and Health Information Systems (HIS), benefits healthcare consumers and professionals. However, it also poses potentia...
Augmented Differential Privacy Framework for Data Analytics
Abstract Differential privacy has emerged as a popular privacy framework for providing privacy preserving noisy query answers based on statistical properties of databases. ...
Privacy Risk in Recommender Systems
Nowadays, recommender systems are mostly used in many online applications to filter information and help users in selecting their relevant requirements. It avoids users to become o...
Privacy Threats and Privacy Preservation in Multiple Data Releases of High-Dimensional Datasets
A major challenge is when datasets are released to be utilized in the outside scope of data-collecting organizations, it is how to balance data utilities and data privacy. To achie...
THE SECURITY AND PRIVACY MEASURING SYSTEM FOR THE INTERNET OF THINGS DEVICES
The purpose of the article: elimination of the gap in existing need in the set of clear and objective security and privacy metrics for the IoT devices users and manufacturers and a...
Enhancing Cloud Data Privacy with a Scalable Hybrid Approach: HE-DP-SMC
Cloud computing has become a popular way to store and access data. However, there are concerns about the privacy of data stored in the cloud. This paper proposes a novel privacy-pr...
Housing Improvements for Health and Associated Socio‐Economic Outcomes: A Systematic Review
Poor housing is associated with poor health. This suggests that improving housing conditions might lead to improved health for residents. This review searched widely for studies fr...
