
A scalable MapReduce-based design of an unsupervised entity resolution system

Traditional data curation processes typically depend on human intervention. As data volume and variety grow exponentially, organizations are striving to increase the efficiency of their data processes by automating manual steps and making them as unsupervised as possible. A further challenge is to make these unsupervised processes scalable enough to meet the demands of growing data volume. This paper describes the parallelization of an unsupervised entity resolution (ER) process. ER is a component of many data curation processes because it clusters records from multiple data sources that refer to the same real-world entity, such as the same customer, patient, or product. The ability to scale ER processes is particularly important because the computational effort of ER grows quadratically with data volume. The Data Washing Machine (DWM) is a previously proposed unsupervised ER system that clusters references from diverse data sources. This work addresses the single-threaded design of the DWM by exploiting the parallelism of Hadoop MapReduce. The proposed parallelization method, however, applies to both supervised systems, where matching rules are created by experts, and unsupervised systems, where expert intervention is not required. The DWM uses an entropy measure to self-evaluate the quality of record clustering. The current single-threaded implementations of the DWM in Python and Java do not scale beyond a few thousand records and rely on large, shared memory. The objective of this research is to resolve the two major shortcomings of the current DWM design, its reliance on shared memory and its lack of scalability, by leveraging Hadoop MapReduce. We propose the Hadoop Data Washing Machine (HDWM), a MapReduce implementation of the legacy DWM. The scalability of the proposed system is demonstrated using publicly available ER datasets.
Based on the results of our experiments, we conclude that the HDWM can cluster from thousands to millions of equivalent references using multiple computational nodes with independent RAM and CPU cores.
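The abstract's core idea, splitting ER clustering into map and reduce phases so that no single node needs all records in shared memory, can be sketched as follows. This is a minimal single-process simulation of the MapReduce pattern, not the HDWM implementation itself; the blocking key (first three characters of each token) and all function names are illustrative assumptions:

```python
from collections import defaultdict

def map_phase(record_id, text):
    """Emit (blocking_key, record) pairs. The blocking key here is an
    illustrative choice: the first three characters of each token."""
    for token in text.lower().split():
        yield token[:3], (record_id, text)

def reduce_phase(blocking_key, records):
    """Group records sharing a blocking key. A real reducer would score
    pairwise similarity and cluster, not merely collect candidates."""
    return blocking_key, sorted({rid for rid, _ in records})

references = {
    1: "John Smith 123 Oak St",
    2: "Jon Smith 123 Oak Street",
    3: "Mary Jones 45 Pine Ave",
}

# Shuffle step: gather all map outputs by key, as MapReduce would.
groups = defaultdict(list)
for rid, text in references.items():
    for key, value in map_phase(rid, text):
        groups[key].append(value)

clusters = dict(reduce_phase(k, v) for k, v in groups.items())
print(clusters["smi"])  # records 1 and 2 share the blocking key "smi"
```

Because each reducer sees only the records that share a blocking key, the quadratic pairwise comparison is confined to small candidate groups, which is what makes the approach scale across nodes with independent RAM and CPU.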
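The entropy-based self-evaluation of cluster quality can be illustrated with a small sketch. The exact entropy formulation used by the DWM is not specified here; this uses plain Shannon entropy over the token distribution within a cluster, which is an assumption:

```python
import math
from collections import Counter

def cluster_entropy(records):
    """Shannon entropy of the token distribution within a cluster.
    Lower entropy means the records agree on more tokens, which a
    self-evaluating ER process can read as higher cluster quality."""
    tokens = [t for rec in records for t in rec.lower().split()]
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

tight = ["john smith", "john smith", "john smyth"]
loose = ["john smith", "mary jones", "alice brown"]

# A tight cluster has a concentrated token distribution, hence lower
# entropy than a cluster of unrelated records.
assert cluster_entropy(tight) < cluster_entropy(loose)
```

A measure like this lets an unsupervised process decide whether a proposed clustering improved without any expert-supplied matching rules or labeled truth data.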

Related Results

Design
Conventional definitions of design rarely capture its reach into our everyday lives. The Design Council, for example, estimates that more than 2.5 million people use design-related...
Unsupervised Evaluation of Entity Resolution
Entity resolution is the problem of identifying records that refer to the same entity from one or multiple databases. Applications of entity resolution range from health and social...
Dynamics of Mutations in Patients with ET Treated with Imetelstat
Abstract Background: Imetelstat, a first in class specific telomerase inhibitor, induced hematologic responses in all patients (pts) with essential thrombocythemia (...
Combinatorial Antigen Targeting Strategy for Acute Myeloid Leukemia
Introduction: Efforts to safely and effectively treat acute myeloid leukemia (AML) by targeting a single leukemia associated antigen with chimeric antigen receptor T (CAR T) cells ...
Unsupervised entity linking using graph-based semantic similarity
Nowadays, the human textual data constitutes a great proportion of the shared information resources such as World Wide Web (WWW). Social networks, news and learning resources as we...
Unsupervised Deep Learning for Enhanced holoentropy Image Stitching
Traditional feature-based image stitching technologies rely heavily on feature detection quality, often failing to stitch images with few features or low resolution. The learning b...
DLUT: Decoupled Learning-Based Unsupervised Tracker
Unsupervised learning has shown immense potential in object tracking, where accurate classification and regression are crucial for unsupervised trackers. However, the classificatio...
