Towards Streamlined Transparent Data Linkage
Linked data is a powerful resource within data analytics and population-level research. However, methods for linkage vary, and the choice of approach can affect downstream use of the data by introducing assumptions and biases into the resulting links. Stringent linkage methods strengthen confidence in the links they identify, at the risk of missing true links; lenient rules or ill-considered comparisons may instead introduce false-positive links. Choosing an approach is therefore non-trivial: it requires careful selection of preprocessing steps, model development and quality review to ensure suitable outputs, all of which can demand significant human expertise and insight.
Real-world population-scale linkage can benefit from automation and scalability offered within modern data centres, with many tasks eligible for pipelining, such as applying predefined cleaning routines, training defined models, and generating mapping tables. Despite this, there are still pinch points requiring human interaction, such as selecting appropriate linkage fields, blocking rules and comparison methods, and reviewing quality of predictions.
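To make the pinch points concrete, the following is a minimal sketch of a blocking rule, a comparison method and a match threshold. It is not the pipeline described here: the field names (`surname`, `dob`), the choice of date of birth as the blocking key, the use of `difflib.SequenceMatcher` as the comparison, and the function `candidate_links` are all assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical records; field names and values are illustrative only.
left = [
    {"id": "a1", "surname": "Smith", "dob": "1980-01-02"},
    {"id": "a2", "surname": "Jones", "dob": "1975-06-30"},
]
right = [
    {"id": "b1", "surname": "Smyth", "dob": "1980-01-02"},
    {"id": "b2", "surname": "Jones", "dob": "1990-11-11"},
]


def block_key(record):
    # Blocking rule (assumed): only compare records sharing a date of birth.
    return record["dob"]


def similarity(a, b):
    # Comparison method (assumed): similarity ratio on lower-cased surnames.
    return SequenceMatcher(
        None, a["surname"].lower(), b["surname"].lower()
    ).ratio()


def candidate_links(left, right, threshold):
    # Index the right-hand dataset by blocking key to avoid all-pairs comparison.
    blocks = defaultdict(list)
    for r in right:
        blocks[block_key(r)].append(r)

    links = []
    for l in left:
        for r in blocks.get(block_key(l), []):
            score = similarity(l, r)
            if score >= threshold:
                links.append((l["id"], r["id"], round(score, 2)))
    return links


strict = candidate_links(left, right, threshold=0.9)
lenient = candidate_links(left, right, threshold=0.7)
```

In this toy data a strict threshold of 0.9 rejects the Smith/Smyth pair, while a lenient threshold of 0.7 accepts it. The two Jones records never get compared because their dates of birth differ, which shows how a blocking rule can itself cause missed links if the blocking field contains errors.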
We present an approach to provide scalable automation in linkage pipelines, whilst retaining transparency of the linkage process for downstream users, providing them with a dataset’s life history.
The work output for a given dataset is a versioned catalogue documenting the dataset’s journey, with transparent reporting of data origin, linkage settings, routines, and privacy-preserving quality analysis for inspection. This gives researchers insight into how the linkage process may affect their data and provides confidence in using it. The insight also works in both directions, allowing users to provide feedback and iteratively refine linkage approaches.
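One way such a catalogue entry could be represented is sketched below. The structure is hypothetical: the class `CatalogueEntry`, its field names, the helper `revise` and the example values are assumptions for illustration, not the schema used in this work. The quality summary holds aggregate figures only, in keeping with the privacy-preserving intent.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class CatalogueEntry:
    # All field names are hypothetical, chosen to mirror the text above.
    dataset: str
    version: int
    origin: str
    linkage_settings: dict
    routines: list
    quality_summary: dict  # aggregate-only figures, no record-level data

    def to_json(self):
        # Sorted keys give a stable serialisation for hashing and diffing.
        return json.dumps(asdict(self), sort_keys=True, indent=2)

    def fingerprint(self):
        # Content hash so downstream users can cite an exact catalogue version.
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()


def revise(entry, **changes):
    # Produce the next version rather than mutating the existing record.
    return replace(entry, version=entry.version + 1, **changes)


v1 = CatalogueEntry(
    dataset="example_dataset",
    version=1,
    origin="hypothetical source system",
    linkage_settings={"blocking": ["dob"], "threshold": 0.9},
    routines=["lowercase_names", "standardise_dates"],
    quality_summary={"candidate_pairs": 1000, "links_accepted": 640},
)

# Feedback from users leads to a revised threshold and a new version.
v2 = revise(v1, linkage_settings={"blocking": ["dob"], "threshold": 0.8})
```

Keeping entries immutable and issuing a new version for each change means earlier linkage settings remain inspectable after feedback leads to a refinement.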

