Clustering Heterogeneous Data Values for Data Quality Analysis

Data is of high quality if it is fit for its intended purpose. Data heterogeneity can be a major quality problem, as quality aspects such as understandability and consistency can be compromised. Heterogeneity of data values is particularly common when data is manually entered by different people using inadequate control rules. In this case, syntactic and semantic heterogeneity often go hand in hand. Heterogeneity of data values may be a direct result of problems in the acquisition process, quality problems of the underlying data model, or possibly erroneous data transformations. For example, in the cultural heritage domain, it is common to analyze data fields by manually searching lists of data values sorted alphabetically or by number of occurrences. Additionally, search functions such as regular expression matching are used to detect specific patterns. However, this requires a priori knowledge and technical skills that domain experts often do not have. Since such datasets often contain thousands of values, the entire process is very time-consuming. Outliers or subtle differences between values that may be critical to data quality can be easily overlooked. To improve this process of analyzing the quality of data values, we propose a bottom-up human-in-the-loop approach that clusters values of a data field according to syntactic similarity. The clustering is intended to help domain experts explore the heterogeneity of values in a data field and can be configured by domain experts according to their domain knowledge. The overview of the syntactic diversity of the data values gives an impression of the rules and practices of data acquisition as well as their violations. From this, experts can infer potential quality issues with the data acquisition process and system, as well as the data model and data transformations. We outline a proof-of-concept implementation of the approach. Our evaluation found that clustering adds value to data quality analysis, especially for detecting quality problems in data models.
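
To make the idea of clustering data values by syntactic similarity more concrete, the following is a minimal sketch and not the authors' implementation. It assumes one simple notion of syntactic similarity: each value is abstracted into a pattern (runs of letters become `A`, runs of digits become `9`, all other characters are kept), and values with the same pattern fall into the same cluster. The function names `syntactic_pattern` and `cluster_by_pattern`, the abstraction rules, and the sample date values are all hypothetical choices made for illustration.

```python
import re
from collections import defaultdict


def syntactic_pattern(value: str) -> str:
    """Abstract a value into a coarse syntactic pattern (illustrative rules only)."""
    # Collapse each run of letters into a single 'A'.
    pattern = re.sub(r"[^\W\d_]+", "A", value)
    # Collapse each run of digits into a single '9'.
    pattern = re.sub(r"\d+", "9", pattern)
    return pattern


def cluster_by_pattern(values):
    """Group values by pattern; largest clusters first, ties ordered by pattern."""
    clusters = defaultdict(list)
    for value in values:
        clusters[syntactic_pattern(value.strip())].append(value)
    return sorted(clusters.items(), key=lambda item: (-len(item[1]), item[0]))


if __name__ == "__main__":
    # Hypothetical values of a manually entered date field.
    sample = ["1850", "1900", "ca. 1850", "1850-1900", "ca. 1700", "1850?"]
    for pattern, members in cluster_by_pattern(sample):
        print(f"{pattern!r}: {members}")
```

In a sketch like this, small clusters hint at rare notations or outliers that an expert may want to inspect, and the abstraction rules are the natural place for a domain expert to configure how coarse or fine the grouping should be.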

Association for Computing Machinery (ACM)

Viola Wenz Arno Kesper Gabriele Taentzer

Journal of Data and Information Quality

2023

Related Results

The Kernel Rough K-Means Algorithm

Background: Clustering is one of the most important data mining methods. The k-means (c-means) and its derivative methods are the hotspot in the field of clustering research in re...

Image clustering using exponential discriminant analysis

Local learning based image clustering models are usually employed to deal with images sampled from the non‐linear manifold. Recently, linear discriminant analysis (LDA) based vario...

A COMPARATIVE ANALYSIS OF K-MEANS AND HIERARCHICAL CLUSTERING

Clustering is the process of arranging comparable data elements into groups. One of the most frequent data mining analytical techniques is clustering analysis; the clustering algor...

Clustering Analysis of Data with High Dimensionality

Clustering analysis has been widely applied in diverse fields such as data mining, access structures, knowledge discovery, software engineering, organization of information systems...

Parallel density clustering algorithm based on MapReduce and optimized cuckoo algorithm

In the process of parallel density clustering, the boundary points of clusters with different densities are blurred and there is data noise, which affects the clustering performanc...

Research on a microseismic signal picking algorithm based on GTOA clustering

Abstract. Clustering is one of the challenging problems in machine learning. Adopting clustering methods for the picking of microseismic signals has emerged as a new approach. Howe...

MR-DBIFOA: a parallel Density-based Clustering Algorithm by Using Improve Fruit Fly Optimization

Clustering is an important technique for data analysis and knowledge discovery. In the context of big data, the density-based clustering algorithm faces three challenging ...

CHOOSING SEEDS FOR SEMI-SUPERVISED GRAPH BASED CLUSTERING

Though clustering algorithms have long history, nowadays clustering topic still attracts a lot of attention because of the need of efficient data analysis tools in many application...
