Search engine for discovering works of Art, research articles, and books related to Art and Culture
ShareThis
Javascript must be enabled to continue!

Exploring the Privacy-Preserving Properties of Word Embeddings: Algorithmic Validation Study (Preprint)

View through CrossRef
BACKGROUND Word embeddings are dense numeric vectors used to represent language in neural networks. Until recently, there had been no publicly released embeddings trained on clinical data. Our work is the first to study the privacy implications of releasing these models. OBJECTIVE This paper aims to demonstrate that traditional word embeddings created on clinical corpora that have been deidentified by removing personal health information (PHI) can nonetheless be exploited to reveal sensitive patient information. METHODS We used embeddings created from 400,000 doctor-written consultation notes and experimented with 3 common word embedding methods to explore the privacy-preserving properties of each. RESULTS We found that if publicly released embeddings are trained from a corpus anonymized by PHI removal, it is possible to reconstruct up to 68.5% (n=411/600) of the full names that remain in the deidentified corpus and associated sensitive information to specific patients in the corpus from which the embeddings were created. We also found that the distance between the word vector representation of a patient’s name and a diagnostic billing code is informative and differs significantly from the distance between the name and a code not billed for that patient. CONCLUSIONS Special care must be taken when sharing word embeddings created from clinical texts, as current approaches may compromise patient privacy. If PHI removal is used for anonymization before traditional word embeddings are trained, it is possible to attribute sensitive information to patients who have not been fully deidentified by the (necessarily imperfect) removal algorithms. A promising alternative (ie, anonymization by PHI replacement) may avoid these flaws. Our results are timely and critical, as an increasing number of researchers are pushing for publicly available health data.
Title: Exploring the Privacy-Preserving Properties of Word Embeddings: Algorithmic Validation Study (Preprint)
Description:
BACKGROUND Word embeddings are dense numeric vectors used to represent language in neural networks.
Until recently, there had been no publicly released embeddings trained on clinical data.
Our work is the first to study the privacy implications of releasing these models.
OBJECTIVE This paper aims to demonstrate that traditional word embeddings created on clinical corpora that have been deidentified by removing personal health information (PHI) can nonetheless be exploited to reveal sensitive patient information.
METHODS We used embeddings created from 400,000 doctor-written consultation notes and experimented with 3 common word embedding methods to explore the privacy-preserving properties of each.
RESULTS We found that if publicly released embeddings are trained from a corpus anonymized by PHI removal, it is possible to reconstruct up to 68.
5% (n=411/600) of the full names that remain in the deidentified corpus and associated sensitive information to specific patients in the corpus from which the embeddings were created.
We also found that the distance between the word vector representation of a patient’s name and a diagnostic billing code is informative and differs significantly from the distance between the name and a code not billed for that patient.
CONCLUSIONS Special care must be taken when sharing word embeddings created from clinical texts, as current approaches may compromise patient privacy.
If PHI removal is used for anonymization before traditional word embeddings are trained, it is possible to attribute sensitive information to patients who have not been fully deidentified by the (necessarily imperfect) removal algorithms.
A promising alternative (ie, anonymization by PHI replacement) may avoid these flaws.
Our results are timely and critical, as an increasing number of researchers are pushing for publicly available health data.

Related Results

Privacy and Security for Digital Health: Assessing Risks and Harms to Users
Privacy and Security for Digital Health: Assessing Risks and Harms to Users
Electronic Health (e-Health), such as mobile health (mHealth) and Health Information Systems (HIS), benefits healthcare consumers and professionals. However, it also poses potentia...
Exploring Word Embeddings for Text Classification: A Comparative Analysis
Exploring Word Embeddings for Text Classification: A Comparative Analysis
For language tasks like text classification and sequence labeling, word embeddings are essential for providing input characteristics in deep models. There have been many word embed...
Augmented Differential Privacy Framework for Data Analytics
Augmented Differential Privacy Framework for Data Analytics
Abstract Differential privacy has emerged as a popular privacy framework for providing privacy preserving noisy query answers based on statistical properties of databases. ...
Validation in Doctoral Education: Exploring PhD Students’ Perceptions of Belonging to Scaffold Doctoral Identity Work
Validation in Doctoral Education: Exploring PhD Students’ Perceptions of Belonging to Scaffold Doctoral Identity Work
Aim/Purpose: The aim of this article is to make a case of the role of validation in doctoral education. The purpose is to detail findings from three studies which explore PhD stude...
Privacy Risk in Recommender Systems
Privacy Risk in Recommender Systems
Nowadays, recommender systems are mostly used in many online applications to filter information and help users in selecting their relevant requirements. It avoids users to become o...
When Word Embeddings Become Endangered
When Word Embeddings Become Endangered
Big languages such as English and Finnish have many natural language processing (NLP) resources and models, but this is not the case for low-resourced and endangered languages as s...
Advancements in Word Embeddings: A Comprehensive Survey and Analysis
Advancements in Word Embeddings: A Comprehensive Survey and Analysis
In recent years, the field of Natural Language Processing (NLP) has seen significant growth in the study of word representation, with word embeddings proving valuable for various N...

Back to Top