Exploring the Privacy-Preserving Properties of Word Embeddings: Algorithmic Validation Study (Preprint)
BACKGROUND
Word embeddings are dense numeric vectors used to represent language in neural networks. Until recently, there had been no publicly released embeddings trained on clinical data. Our work is the first to study the privacy implications of releasing these models.
OBJECTIVE
This paper aims to demonstrate that traditional word embeddings created on clinical corpora that have been deidentified by removing personal health information (PHI) can nonetheless be exploited to reveal sensitive patient information.
METHODS
We used embeddings created from 400,000 doctor-written consultation notes and experimented with 3 common word embedding methods to explore the privacy-preserving properties of each.
RESULTS
We found that if publicly released embeddings are trained on a corpus anonymized by PHI removal, it is possible to reconstruct up to 68.5% (n=411/600) of the full names that remain in the deidentified corpus and to associate sensitive information with specific patients in the corpus from which the embeddings were created. We also found that the distance between the word vector representation of a patient’s name and a diagnostic billing code is informative: it differs significantly from the distance between the name and a code not billed for that patient.
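The distance comparison described above can be illustrated with a minimal sketch. The abstract does not state which distance metric was used, so cosine distance is an assumption here, and the vectors, the token names (`jane_doe`, `code_billed`, `code_not_billed`), and the helper `rank_codes` are all hypothetical toy values rather than anything from the study.

```python
import numpy as np


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two vectors (assumed metric)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Hypothetical 3-dimensional embeddings; real embeddings have far more
# dimensions and would be loaded from the released model.
embeddings = {
    "jane_doe": np.array([0.9, 0.1, 0.3]),
    "code_billed": np.array([0.8, 0.2, 0.4]),
    "code_not_billed": np.array([-0.2, 0.9, -0.5]),
}


def rank_codes(name, codes, table):
    """Sort billing-code tokens by increasing distance from a name token."""
    return sorted(codes, key=lambda c: cosine_distance(table[name], table[c]))


ranked = rank_codes("jane_doe", ["code_not_billed", "code_billed"], embeddings)
for code in ranked:
    print(code, round(cosine_distance(embeddings["jane_doe"], embeddings[code]), 3))
```

In this toy setup the billed code ranks closest to the name. If a similar gap holds in released embeddings, as the reported result suggests, anyone holding the vectors could rank candidate codes for a name that survived deidentification.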
CONCLUSIONS
Special care must be taken when sharing word embeddings created from clinical texts, as current approaches may compromise patient privacy. If PHI removal is used for anonymization before traditional word embeddings are trained, it is possible to attribute sensitive information to patients who have not been fully deidentified by the (necessarily imperfect) removal algorithms. A promising alternative (ie, anonymization by PHI replacement) may avoid these flaws. Our results are timely and critical, as an increasing number of researchers are pushing for publicly available health data.
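The difference between PHI removal and PHI replacement can be sketched as follows. The abstract does not describe how either step was implemented, so this is only an illustration under stated assumptions: the detector output `DETECTED_NAMES`, the surrogate list `SURROGATES`, the sample note, and both helper functions are hypothetical.

```python
import re

# Hypothetical output of an imperfect name detector: it finds "Alice Smith"
# but misses "Bob Jones", which therefore stays in the text either way.
DETECTED_NAMES = ["Alice Smith"]
SURROGATES = ["Carol White"]  # hypothetical surrogate names


def remove_phi(text, detected):
    """Delete every detected name; any name left in the text is real."""
    for name in detected:
        text = re.sub(re.escape(name), "", text)
    return text


def replace_phi(text, detected, surrogates):
    """Swap each detected name for a surrogate; remaining names are a mix
    of real (missed) and fake (surrogate) names."""
    for name, fake in zip(detected, surrogates):
        text = re.sub(re.escape(name), fake, text)
    return text


note = "Alice Smith referred by Bob Jones."
removed = remove_phi(note, DETECTED_NAMES)
replaced = replace_phi(note, DETECTED_NAMES, SURROGATES)
print(repr(removed))
print(repr(replaced))
```

After removal, every name still present is a real one that the detector missed. After replacement, a missed real name sits alongside surrogate names, so a reader of the text (or of embeddings trained on it) cannot tell which names are genuine.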

