Exploring the Privacy-Preserving Properties of Word Embeddings: Algorithmic Validation Study (Preprint)

BACKGROUND Word embeddings are dense numeric vectors used to represent language in neural networks. Until recently, there had been no publicly released embeddings trained on clinical data. Our work is the first to study the privacy implications of releasing these models.

OBJECTIVE This paper aims to demonstrate that traditional word embeddings created on clinical corpora that have been deidentified by removing personal health information (PHI) can nonetheless be exploited to reveal sensitive patient information.

METHODS We used embeddings created from 400,000 doctor-written consultation notes and experimented with 3 common word embedding methods to explore the privacy-preserving properties of each.

RESULTS We found that if publicly released embeddings are trained from a corpus anonymized by PHI removal, it is possible to reconstruct up to 68.5% (n=411/600) of the full names that remain in the deidentified corpus and to associate sensitive information with specific patients in the corpus from which the embeddings were created. We also found that the distance between the word vector representation of a patient’s name and a diagnostic billing code is informative and differs significantly from the distance between the name and a code not billed for that patient.

CONCLUSIONS Special care must be taken when sharing word embeddings created from clinical texts, as current approaches may compromise patient privacy. If PHI removal is used for anonymization before traditional word embeddings are trained, it is possible to attribute sensitive information to patients who have not been fully deidentified by the (necessarily imperfect) removal algorithms. A promising alternative (ie, anonymization by PHI replacement) may avoid these flaws. Our results are timely and critical, as an increasing number of researchers are pushing for publicly available health data.
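The name-to-code distance comparison described in the results can be illustrated with a minimal sketch. The abstract does not state which distance measure was used, so cosine distance is an assumption here. The tokens `jane_doe`, `code_A`, and `code_B` and their vectors are hypothetical toy values invented for illustration, not data from the study.

```python
import math

def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical toy embeddings (assumption: invented values, not study data).
embeddings = {
    "jane_doe": [0.9, 0.1, 0.2],  # imagined name token left in the corpus
    "code_A": [0.8, 0.2, 0.1],    # imagined code billed for this patient
    "code_B": [0.1, 0.9, 0.3],    # imagined code not billed for this patient
}

def rank_codes(name, codes, table):
    """Sort candidate codes by distance to the name vector, nearest first."""
    return sorted(codes, key=lambda c: cosine_distance(table[name], table[c]))

ranking = rank_codes("jane_doe", ["code_A", "code_B"], embeddings)
print(ranking)
```

In this toy setup the code placed close to the name vector ranks first, which mirrors the kind of signal the abstract reports: if billed codes tend to sit nearer to a patient's name than unbilled ones, the released vectors alone can leak the association.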

JMIR Publications Inc.

Mohamed Abdalla, Moustafa Abdalla, Graeme Hirst, Frank Rudzicz

2020
