The Costs of Anonymization: Case Study Using Clinical Data (Preprint)
BACKGROUND
Sharing data from clinical studies can accelerate scientific progress, improve transparency, and increase the potential for innovation and collaboration. However, privacy concerns remain a barrier to data sharing. Certain concerns, such as reidentification risk, can be addressed by applying anonymization algorithms, which alter the data so that they can no longer reasonably be related to a person. Yet such alterations can influence the data set’s statistical properties, so the privacy-utility trade-off must be considered. This trade-off has been studied in theory, but evidence based on real-world individual-level clinical data is rare, and anonymization has not been broadly adopted in clinical practice.
OBJECTIVE
The goal of this study is to contribute to a better understanding of anonymization in the real world by comprehensively evaluating the privacy-utility trade-off of differently anonymized data using data and scientific results from the German Chronic Kidney Disease (GCKD) study.
METHODS
The GCKD data set extracted for this study consists of 5217 records and 70 variables. A 2-step procedure was followed to determine which variables constituted reidentification risks. To capture a large portion of the risk-utility space, we decided on risk thresholds ranging from 0.02 to 1. The data were then transformed via generalization and suppression, and the anonymization process was varied using a generic and a use case–specific configuration. To assess the utility of the anonymized GCKD data, general-purpose metrics (ie, data granularity and entropy), as well as use case–specific metrics (ie, reproducibility), were applied. Reproducibility was assessed by measuring the overlap of the 95% CI lengths between anonymized and original results.
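As a rough illustration of how generalization and suppression interact with a risk threshold, the following minimal sketch coarsens one variable and then drops records that remain too identifiable. It is not the study's implementation: the abstract does not name the tool or risk model used, so the sketch assumes that a record's risk is 1 divided by the size of its group of records sharing the same generalized values, and the variables `age` and `sex` and the functions `generalize_age` and `anonymize` are hypothetical.

```python
from collections import Counter


def generalize_age(age, width=10):
    """Replace an exact age with a coarse interval such as '40-49'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def group_key(record):
    """Hypothetical quasi-identifiers that define a group of similar records."""
    return (record["age"], record["sex"])


def anonymize(records, risk_threshold, width=10):
    """Generalize ages, then suppress records whose risk exceeds the threshold.

    Assumption: per-record risk = 1 / (number of records in the same group).
    """
    generalized = [{**r, "age": generalize_age(r["age"], width)} for r in records]
    sizes = Counter(group_key(r) for r in generalized)
    # Keep only records whose group is large enough for the chosen threshold.
    return [r for r in generalized if 1 / sizes[group_key(r)] <= risk_threshold]


records = [
    {"age": 41, "sex": "F"},
    {"age": 45, "sex": "F"},
    {"age": 47, "sex": "F"},
    {"age": 52, "sex": "M"},
]
kept = anonymize(records, risk_threshold=0.5)
```

Under this assumption, a threshold of 1 suppresses nothing, while a threshold of 0.02 would require every group to contain at least 50 records, which is why lower thresholds tend to cost more utility.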
RESULTS
Reproducibility measured by 95% CI overlap was higher than utility obtained from general-purpose metrics. For example, granularity varied between 68.2% and 87.6%, and entropy varied between 25.5% and 46.2%, whereas the average 95% CI overlap was above 90% for all risk thresholds applied. A nonoverlapping 95% CI was detected in 6 estimates across all analyses, but the overwhelming majority of estimates exhibited an overlap over 50%. The use case–specific configuration outperformed the generic one in terms of actual utility (ie, reproducibility) at the same level of privacy.
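To make the overlap figures above more concrete, here is a minimal sketch of one common way to quantify the overlap of two confidence intervals: the length of their intersection is divided by the length of each interval, and the two proportions are averaged. The abstract does not state the exact formula used in the study, so this definition and the function name `ci_overlap` are assumptions made for illustration only.

```python
def ci_overlap(orig, anon):
    """Average share of each interval covered by their intersection (0.0 to 1.0).

    Each argument is a (lower, upper) tuple with upper > lower.
    Assumption: this averaging definition may differ from the study's metric.
    """
    (lo_o, hi_o), (lo_a, hi_a) = orig, anon
    inter = min(hi_o, hi_a) - max(lo_o, lo_a)
    if inter <= 0:
        # The intervals do not overlap at all.
        return 0.0
    return 0.5 * (inter / (hi_o - lo_o) + inter / (hi_a - lo_a))


identical = ci_overlap((1.0, 2.0), (1.0, 2.0))
disjoint = ci_overlap((0.0, 1.0), (2.0, 3.0))
partial = ci_overlap((0.0, 2.0), (1.0, 3.0))
```

With this definition, identical intervals score 1, disjoint intervals score 0, and an estimate whose interval widens or shifts after anonymization falls somewhere in between.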
CONCLUSIONS
Our results illustrate the challenges that anonymization faces when aiming to support multiple likely and possibly competing uses, while use case–specific anonymization can provide greater utility. This aspect should be taken into account when evaluating the associated costs of anonymized data and attempting to maintain sufficiently high levels of privacy for anonymized data.
CLINICALTRIAL
German Clinical Trials Register DRKS00003971; https://drks.de/search/en/trial/DRKS00003971
INTERNATIONAL REGISTERED REPORT
RR2-10.1093/ndt/gfr456
JMIR Publications Inc.

