
Consistently evaluating data linkage classification results

Description:
Objectives: Data linkage is commonly viewed as the problem of classifying record pairs into matches and non-matches.
In situations where ground truth data are available, performance measures such as precision, recall, F-measure, sensitivity, and specificity are commonly used to evaluate the quality of matches obtained with a trained data linkage classifier.
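As a reminder of how these measures relate to one another, the following minimal sketch computes them from the four confusion-matrix counts of classified record pairs. The class name, function names, and example counts are illustrative assumptions and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of classified record pairs (illustrative names, not from the paper)."""
    tp: int  # true matches classified as matches
    fp: int  # true non-matches classified as matches
    fn: int  # true matches classified as non-matches
    tn: int  # true non-matches classified as non-matches


def _ratio(num: int, den: int) -> float:
    # Convention chosen for this sketch: return 0.0 when the denominator is zero.
    return num / den if den else 0.0


def precision(c: ConfusionCounts) -> float:
    return _ratio(c.tp, c.tp + c.fp)


def recall(c: ConfusionCounts) -> float:
    # Recall is the same quantity as sensitivity.
    return _ratio(c.tp, c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    return _ratio(c.tn, c.tn + c.fp)


def f_measure(c: ConfusionCounts) -> float:
    # Harmonic mean of precision and recall, written in terms of counts.
    return _ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn)


# Hypothetical counts, chosen only to illustrate the calculations.
example = ConfusionCounts(tp=80, fp=20, fn=20, tn=880)
print(precision(example), recall(example), f_measure(example), specificity(example))
```

Because record pairs in data linkage are usually dominated by non-matches, the true-negative count tends to be very large, which is one reason measures that ignore it (precision, recall, F-measure) are often reported alongside specificity.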
Methods: Comparing multiple classifiers using such measures can, however, lead to inconsistent evaluation because for a given measure the same numerical result can be obtained from different classification outcomes.
This can cause a suboptimal classifier to be selected and potentially result in linked data sets of poor quality.
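A small worked example may make this inconsistency concrete. The sketch below uses two hypothetical classifiers evaluated on the same data set of 100 true matches; the counts are invented for illustration and are not results from the paper. Both obtain the same F-measure even though their error profiles differ.

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)


def f_measure(tp: int, fp: int, fn: int) -> float:
    # Harmonic mean of precision and recall, written in terms of counts.
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical outcomes on the same data set (tp + fn = 100 true matches each).
classifier_a = {"tp": 80, "fp": 20, "fn": 20}
classifier_b = {"tp": 90, "fp": 35, "fn": 10}

for name, c in (("A", classifier_a), ("B", classifier_b)):
    print(
        name,
        round(precision(c["tp"], c["fp"]), 3),
        round(recall(c["tp"], c["fn"]), 3),
        round(f_measure(c["tp"], c["fp"], c["fn"]), 3),
    )
```

Classifier A misses more true matches, while classifier B produces more false matches, yet the F-measure alone cannot distinguish them. Which of the two is preferable depends on which type of error the application can tolerate.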
To overcome this problem, we propose the Consistent Record Linkage (CRL) measure, an application-focused evaluation method that ensures data linkage classifiers are assessed in a fair and transparent way.
The CRL-measure allows the definition of maximum acceptable error rates, and it provides information about the robustness of a classifier based on identified classification thresholds.
Results: Using both synthetic and real-world data sets, we illustrate how the CRL-measure can provide more detailed information about the performance of data linkage classification results compared to traditional performance measures.
Based on user-selected maximum acceptable error rates, the CRL-measure identifies the range of classification thresholds where error rates are below these maximums and high linkage quality is therefore obtained.
This indicates the robustness of a classifier with regard to a varying classification threshold.
Furthermore, the CRL-measure shows a user whether a given data linkage classifier is actually able to achieve a certain linkage quality.
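The abstract does not give the formal definition of the CRL-measure, so the following is only a rough sketch of the general idea of scanning classification thresholds against user-selected maximum error rates. It is not the authors' method. The two error rates used here (the fraction of predicted matches that are false, and the fraction of true matches that are missed), the function names, and the toy scores are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def error_rates(
    scores: Sequence[float], labels: Sequence[bool], threshold: float
) -> Tuple[float, float]:
    """Return (false-match rate, missed-match rate) at one threshold.

    Both definitions are assumptions of this sketch, not taken from the paper.
    """
    tp = fp = fn = 0
    for score, is_match in zip(scores, labels):
        predicted_match = score >= threshold
        if predicted_match and is_match:
            tp += 1
        elif predicted_match:
            fp += 1
        elif is_match:
            fn += 1
    false_match_rate = fp / (tp + fp) if (tp + fp) else 0.0
    missed_match_rate = fn / (tp + fn) if (tp + fn) else 0.0
    return false_match_rate, missed_match_rate


def acceptable_thresholds(
    scores: Sequence[float],
    labels: Sequence[bool],
    max_false_match: float,
    max_missed_match: float,
    candidates: Sequence[float],
) -> List[float]:
    """Candidate thresholds at which both error rates stay within the maximums."""
    kept = []
    for t in candidates:
        false_match, missed_match = error_rates(scores, labels, t)
        if false_match <= max_false_match and missed_match <= max_missed_match:
            kept.append(t)
    return kept


# Toy similarity scores and ground truth labels (invented for illustration).
scores = [0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [True, True, True, True, False, True, False, False, False, False]
candidates = [i / 10 for i in range(1, 10)]

lenient = acceptable_thresholds(scores, labels, 0.25, 0.25, candidates)
strict = acceptable_thresholds(scores, labels, 0.05, 0.05, candidates)
print(lenient)  # thresholds that satisfy the lenient maximums
print(strict)   # an empty list means the required quality is not reachable
```

In this sketch, a wide set of acceptable thresholds suggests a classifier that is robust to the choice of threshold, while an empty set indicates that the classifier cannot reach the requested linkage quality at any of the candidate thresholds. In general the acceptable thresholds need not form a single contiguous range.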
Conclusion: The CRL-measure provides users with consistent information about how multiple data linkage classifiers trained on the same data set perform comparatively.
This will allow a better selection of the most suitable classifier for a given data linkage problem and lead to improved quality of linked data sets.

