Privacy risk quantification in education data using Markov model
Abstract
With the Big Data revolution, the education sector is being reshaped. The current data-driven education system provides many opportunities to utilize the enormous amount of data collected about students' activities and performance for personalized education, adapting teaching methods, and decision making. On the other hand, such benefits come at a cost to privacy; for example, a student's poor performance across multiple courses may become identifiable. While several works have quantified the re-identification risks of individuals in released datasets, they assume an adversary has specific prior knowledge about target individuals, and most do not utilize all the information available in the datasets, for example, event-level information that associates multiple records with the same individual and correlation between attributes. In this work, we propose a method using a Markov Model (MM) to quantify re-identification risks using all available information in the data, under a more realistic threat model that assumes different levels of adversary knowledge about the target individual, ranging from any one of the attributes to all given attributes. Moreover, we propose a workflow for efficiently calculating MM risk that is highly scalable to a large number of attributes. Experimental results on real education datasets show the efficacy of our model for re-identification risk quantification.
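The abstract summarizes the method without an implementation, so the following is a minimal sketch of the underlying idea only: a first-order Markov chain fitted over attribute values in a fixed column order, with maximum-likelihood transition estimates, where rare (low-probability) records are treated as higher re-identification risk. The function names and toy records are hypothetical, and the paper's actual model may differ in chain order, smoothing, and how event-level records are handled.

```python
from collections import Counter, defaultdict

def fit_markov_model(records):
    """Estimate start and transition probabilities of a first-order
    Markov chain over attribute values, reading each record as a
    sequence of (column index, value) states in a fixed column order."""
    starts = Counter()
    transitions = defaultdict(Counter)
    for rec in records:
        states = list(enumerate(rec))          # (column index, value) pairs
        starts[states[0]] += 1
        for prev, curr in zip(states, states[1:]):
            transitions[prev][curr] += 1
    total = sum(starts.values())
    start_p = {s: c / total for s, c in starts.items()}
    trans_p = {
        prev: {curr: c / sum(cnt.values()) for curr, c in cnt.items()}
        for prev, cnt in transitions.items()
    }
    return start_p, trans_p

def record_probability(rec, start_p, trans_p):
    """Probability the chain assigns to a full record; rarer (more
    unique) attribute combinations get lower probability and hence
    higher re-identification risk."""
    states = list(enumerate(rec))
    p = start_p.get(states[0], 0.0)
    for prev, curr in zip(states, states[1:]):
        p *= trans_p.get(prev, {}).get(curr, 0.0)
    return p

# Hypothetical education-style records: (course, grade band, attendance band)
data = [("math", "low", "high"), ("math", "low", "high"),
        ("art", "high", "low"), ("math", "high", "high")]
start_p, trans_p = fit_markov_model(data)
for rec in set(data):
    print(rec, record_probability(rec, start_p, trans_p))
```

A record whose sequence probability is low relative to the rest of the dataset corresponds to a rare attribute combination, which is exactly where uniqueness-driven re-identification risk concentrates; the chain's transition structure is what lets correlation between adjacent attributes inform the estimate.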
Practitioner notes
What is already known about this topic?
A number of works have been conducted on privacy risk quantification in datasets and on the Web.
Most of them make strong assumptions about an adversary's prior knowledge of the target individual(s).
Most of them do not utilize all the information available in the datasets, eg, event-level or duplicate records and correlations between attributes.
What this paper adds
This paper proposes a new re-identification risk quantification model using Markov models. Our model addresses the shortcomings of existing works, eg, strong assumptions about the adversary's knowledge, lack of explainability, and failure to utilize all available information in the datasets. Specifically, our proposed model not only considers the uniqueness of data points in the datasets (as most existing methods do) but also takes into account the uniformity and correlation characteristics of these data points.
Re-identification risk quantification is computationally expensive and does not scale to large datasets as the number of attributes increases. This paper introduces a workflow that data custodians can use to efficiently evaluate the worst-case re-identification risk in their datasets before release (a brute-force baseline illustrating this cost is sketched after these notes).
It presents extensive experimental evaluation results of the proposed model for quantifying re‐identification risks on several real education datasets.
Implications for practice and/or policy
Empirical results on real education datasets validate the significance and efficacy of the proposed model for re‐identification risk quantification compared to existing approaches.
Our model can be used by data custodians as a tool to evaluate the worst-case risk of a dataset. It empowers them to make informed decisions on appropriate actions to mitigate these risks (eg, data perturbation) before sharing or releasing their datasets to third parties. A typical use case is one where the data custodian is an online course/program provider that collects data about students' engagement with its courses and would like to share those data with third parties so they can run learning analytics that provide value-added benefits back to the custodian.
We specifically study privacy risk quantification for education data; however, our model is applicable to any tabular data release.
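To make the scalability claim in the notes concrete, the sketch below enumerates every non-empty subset of attributes an adversary might know and reports the worst (largest) fraction of records that are unique on that subset. Note this uses a k-anonymity-style uniqueness count as a stand-in baseline, not the paper's MM risk, and the names and data are illustrative. With d attributes there are 2^d - 1 subsets to check, which is exactly why an efficient workflow such as the one the paper proposes matters.

```python
from itertools import combinations
from collections import Counter

def worst_case_uniqueness(records, num_attrs):
    """Brute-force worst-case baseline: for every non-empty subset of
    attributes an adversary might know, compute the fraction of records
    that are unique on that subset, and return the maximum. The cost
    grows as 2**num_attrs - 1 subsets."""
    worst, worst_subset = 0.0, None
    for k in range(1, num_attrs + 1):
        for subset in combinations(range(num_attrs), k):
            # Project each record onto the attributes the adversary knows.
            proj = Counter(tuple(rec[i] for i in subset) for rec in records)
            unique_frac = sum(1 for c in proj.values() if c == 1) / len(records)
            if unique_frac > worst:
                worst, worst_subset = unique_frac, subset
    return worst, worst_subset

# Hypothetical records: (course, grade band, attendance band)
data = [("math", "low", "high"), ("math", "low", "high"),
        ("art", "high", "low"), ("math", "high", "high")]
risk, subset = worst_case_uniqueness(data, 3)
print(f"worst-case unique fraction {risk:.2f} on attribute subset {subset}")
```

Even this toy example checks 7 subsets; at 30 attributes it would exceed a billion, which is the combinatorial blow-up the proposed workflow is designed to avoid.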