
Evaluating Binary Encoding Techniques in The Presence of Missing Values in Privacy-Preserving Record Linkage

Introduction
Applications in domains ranging from healthcare to national security increasingly require records about individuals in sensitive databases to be linked in privacy-preserving ways. Missing values make the linkage process challenging because they can affect the encoding of attribute values. No study has systematically investigated how missing values affect the outcomes of different encoding techniques used in privacy-preserving linkage applications.

Objectives and Approach
Binary encodings, such as Bloom filters, are popular for linking sensitive databases and are now employed in real-world linkage applications. However, existing encoding techniques assume that the quasi-identifying attributes used for encoding are complete. Missing values can lead to incomplete encodings, which can decrease or increase similarities and therefore cause false non-matches or false matches. In this study we empirically evaluate three binary encoding techniques using real voter databases, where pairs of records that correspond to the same voter (with name or address changes) resulted in files of 100,000 and 500,000 records containing from 0% to 50% missing values.

Results
We encoded between two and four of the attributes first name, last name, street, and city into three record-level binary encodings: cryptographic long-term key (CLK) [Schnell et al. 2009], record-level Bloom filter (RBF) [Durham et al. 2014], and tabulation Min-hashing (TBH) [Smith 2017]. Experiments showed a 10% to 25% drop on average in both precision and recall for all encoding techniques as the proportion of missing values increased. CLK showed the largest decrease in precision, while TBH showed the largest decrease in recall.

Conclusion
Binary encodings such as Bloom filters are now used in practical applications for linking sensitive databases. Our evaluation shows that such encoding techniques can result in lower linkage quality if there are missing values in quasi-identifying attributes. This highlights the need for novel encoding techniques that can overcome the challenge of missing values.
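
The effect described in the abstract, where a missing value can either lower or raise the similarity of two encodings, can be illustrated with a small sketch. This is not the study's implementation: it assumes a simplified CLK-style encoding in which the bigrams of all attributes are hashed into one Bloom filter, uses unkeyed SHA-256 purely for illustration (deployed CLK encodings use keyed hashing), stores set bit positions in a Python set, and compares encodings with the Dice coefficient. The function names, the filter length of 1000, the 10 hash functions, and the example records are all hypothetical.

```python
import hashlib

def qgrams(value, q=2):
    """Split a string into overlapping q-grams; a missing (empty) value yields none."""
    value = value.strip().lower()
    return {value[i:i + q] for i in range(len(value) - q + 1)}

def clk_encode(record, length=1000, num_hashes=10):
    """Hash the q-grams of every attribute into one set of Bloom filter bit positions.

    Illustrative only: unkeyed SHA-256 stands in for the keyed hash functions
    that a real privacy-preserving encoding would use.
    """
    bits = set()
    for value in record:
        for gram in qgrams(value):
            for seed in range(num_hashes):
                digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).hexdigest()
                bits.add(int(digest, 16) % length)
    return bits

def dice(bits_a, bits_b):
    """Dice coefficient of two sets of bit positions."""
    total = len(bits_a) + len(bits_b)
    return 2 * len(bits_a & bits_b) / total if total else 0.0

# Hypothetical records: (first name, last name, street, city).
voter = ("peter", "miller", "main street", "durham")
same_voter_missing_street = ("peter", "miller", "", "durham")
other_voter = ("peter", "miller", "oak avenue", "durham")
other_voter_missing_street = ("peter", "miller", "", "durham")

# True match: the missing street removes bits and lowers the similarity.
match_complete = dice(clk_encode(voter), clk_encode(voter))
match_missing = dice(clk_encode(voter), clk_encode(same_voter_missing_street))

# True non-match: when both records lack the only differing attribute,
# the encodings become identical and the similarity rises.
nonmatch_complete = dice(clk_encode(voter), clk_encode(other_voter))
nonmatch_missing = dice(clk_encode(same_voter_missing_street),
                        clk_encode(other_voter_missing_street))

print("match, complete:     ", match_complete)
print("match, one missing:  ", match_missing)
print("non-match, complete: ", nonmatch_complete)
print("non-match, missing:  ", nonmatch_missing)
```

With a fixed similarity threshold, the first effect can turn a true match into a false non-match, and the second can turn a true non-match into a false match, which is consistent with the drop in both recall and precision reported above.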

Related Results

Privacy Attack on Multiple Dynamic Match-key based Privacy-Preserving Record Linkage
Introduction: Over the last decade, the demand for linking records about people across databases has increased in various domains. Privacy challenges associated with linking sensit...
Linking Sensitive Data – Applications, Techniques, and Challenges
Introduction: The linking of sensitive databases containing personal identifying information across organisations is an increasingly important task in application domains ranging fro...
Taxonomy of Attacks on Privacy-Preserving Record Linkage
Record linkage is the process of identifying records that correspond to the same real-world entities across different databases. Due to the absence of unique entity identifiers, r...
Augmented Differential Privacy Framework for Data Analytics
Abstract: Differential privacy has emerged as a popular privacy framework for providing privacy-preserving noisy query answers based on statistical properties of databases. ...
Evaluation measure for group-based record linkage
Introduction: The robustness of record linkage evaluation measures is of high importance since linkage techniques are assessed based on these. However, minimal research has been con...
Federated Data Linkage in Practice
In recent years, great strides have been made towards the deployment of federated systems for data research, including exploring federated trusted research environments (TREs). The...
Privacy Risk in Recommender Systems
Nowadays, recommender systems are widely used in online applications to filter information and help users select items relevant to their requirements. It avoids users to become o...
An Evaluation Framework for Privacy-Preserving Record Linkage
Privacy-preserving record linkage (PPRL) addresses the problem of identifying matching records from different databases that correspond to the same real-world entities using quasi-...
