Correlation and Probability Based Similarity Measure for Detecting Outliers in Categorical Data

Determining the similarity or distance between data objects is an important part of many research fields, such as statistics, data mining, and machine learning. Many measures are available in the literature for defining the distance between two numerical data objects. Defining such a metric for two categorical data objects is difficult because categorical values are not ordered, and only a few distance measures are available in the literature for categorical data. This paper presents a comparative evaluation of various similarity measures for categorical data and introduces a novel similarity measure based on occurrence frequency and correlation. We evaluated the performance of these similarity measures on the outlier detection task in data mining using real-world data sets. Experimental results show that the proposed similarity measure outperforms the existing similarity measures at detecting outliers in categorical data sets.
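To make the setting concrete, the following is a minimal sketch of how a categorical similarity measure can drive outlier detection. It is not the measure proposed in the paper, whose definition is not given in this abstract. It assumes two commonly described baselines: the simple overlap measure, and an occurrence-frequency style weighting in which a mismatch between two frequent values counts as more similar than a mismatch involving rare values. The function names, the toy records, and the nearest-neighbour scoring rule are all illustrative assumptions.

```python
import math
from collections import Counter


def overlap_similarity(x, y):
    """Fraction of attributes on which two categorical records agree."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


def make_frequency_similarity(records):
    """Build an occurrence-frequency style similarity (an assumed baseline).

    Matching values score 1. Mismatching values score
    1 / (1 + log(n / f(a)) * log(n / f(b))), where f is the per-attribute
    value count. Both records must come from `records`, otherwise a count
    of zero would cause a division by zero.
    """
    n = len(records)
    counts = [Counter(column) for column in zip(*records)]

    def similarity(x, y):
        total = 0.0
        for k, (a, b) in enumerate(zip(x, y)):
            if a == b:
                total += 1.0
            else:
                fa, fb = counts[k][a], counts[k][b]
                total += 1.0 / (1.0 + math.log(n / fa) * math.log(n / fb))
        return total / len(x)

    return similarity


def outlier_scores(records, similarity, k=2):
    """Score each record as 1 minus its mean similarity to its k most
    similar other records; higher scores suggest outliers."""
    scores = []
    for i, x in enumerate(records):
        sims = sorted(
            (similarity(x, y) for j, y in enumerate(records) if j != i),
            reverse=True,
        )
        scores.append(1.0 - sum(sims[:k]) / k)
    return scores


# Hypothetical toy data: (colour, size, shape). The last record is unusual.
records = [
    ("red", "small", "round"),
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("red", "small", "round"),
    ("blue", "large", "square"),
]

overlap_scores = outlier_scores(records, overlap_similarity, k=2)
frequency_scores = outlier_scores(records, make_frequency_similarity(records), k=2)
print(overlap_scores)
print(frequency_scores)
```

Under both measures the last toy record receives the highest score, which is the behaviour an evaluation of similarity measures for outlier detection would look for. The neighbourhood size `k` is a free parameter in this sketch.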

Blue Eyes Intelligence Engineering and Sciences Publication - BEIESP

Roy Thomas*, J.E. Judith

International Journal of Innovative Technology and Exploring Engineering

2020

Related Results

Similarity Search with Data Missing

Similarity search is a fundamental research problem with broad applications in various research fields, including data mining, information retrieval, and machine learning. The core...

An Improved Innovation Robust Outliers Detection Method for Airborne Array Position and Orientation Measurement System

The airborne array position and orientation measurement system (array POS) is a key device for high-resolution multi-dimensional real-time imaging motion compensation of military r...

A Comparative Evaluation of Outlier Detection in Categorical and Mixed Data

Outlier detection is essential in different domains such as cybersecurity and fraud detection, to name a few. However, identifying the best way to detect o...

Analysis of a Similarity Measure for Non-Overlapped Data

A similarity measure is a measure evaluating the degree of similarity between two fuzzy data sets and has become an essential tool in many applications including data mining, patte...

Bagan Kendali Robust Multivariat untuk Pengamatan Individual

The most widely used control chart in multivariate process control is the T2 Hotelling control chart. There are 2 kinds of T2 Hotelling control chart, namely T2 Hotelling...

Research Note: A Study of Outliers of International Tourism Statistics

As international tourism is an industry that is easily affected by external shocks, the time series related to it are always subject to structural change, which causes the existen...

Using covariance weighted euclidean distance to assess the dissimilarity between integral experiments

Integral experiments, especially criticality experiments, help a lot in designing either a new nuclear reactor or a criticality assembly. The calculation uncertainty of the integral para...

Outliers in official statistics

The purpose of this manuscript is to provide a survey on the important methods addressing outliers while producing official statistics. Outliers are often unavoidable in su...
