Correlation and Probability Based Similarity Measure for Detecting Outliers in Categorical Data
Determining the similarity or distance between data objects is an important step in many research fields, such as statistics, data mining, and machine learning. Many measures are available in the literature for defining the distance between two numerical data objects. Defining such a metric for two categorical data objects is difficult because categorical values are not ordered. Only a few distance measures are available in the literature for categorical data objects. This paper presents a comparative evaluation of various similarity measures for categorical data and introduces a novel similarity measure for categorical data based on occurrence frequency and correlation. We evaluated the performance of these similarity measures on the outlier detection task in data mining, using real-world data sets. Experimental results show that the proposed similarity measure outperforms the existing similarity measures at detecting outliers in categorical data sets.
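To make the idea of a categorical similarity measure concrete, here is a minimal Python sketch. It is not the measure proposed in the paper, whose definition is not given in this abstract. It contrasts plain overlap (simple matching) with a frequency-aware similarity whose mismatch formula is an assumption chosen for illustration, and then scores each record by its mean dissimilarity to all other records. All function names and the toy records are hypothetical.

```python
import math
from collections import Counter


def overlap_similarity(a, b):
    """Fraction of attributes on which records a and b agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def attribute_frequencies(records):
    """Per-attribute value counts: one Counter per column."""
    return [Counter(column) for column in zip(*records)]


def frequency_similarity(a, b, freqs, n):
    """Illustrative frequency-aware similarity (an assumption, not the paper's).

    A match scores 1. A mismatch scores closer to 1 when both values
    are common in the data, and lower when either value is rare.
    """
    total = 0.0
    for x, y, counts in zip(a, b, freqs):
        if x == y:
            total += 1.0
        else:
            rarity = math.log(n / counts[x]) * math.log(n / counts[y])
            total += 1.0 / (1.0 + rarity)
    return total / len(a)


def outlier_scores(records, similarity):
    """Score each record as 1 minus its mean similarity to all others."""
    scores = []
    for i, a in enumerate(records):
        sims = [similarity(a, b) for j, b in enumerate(records) if j != i]
        scores.append(1.0 - sum(sims) / len(sims))
    return scores


# Hypothetical toy data: four similar records and one unusual record.
records = [
    ("red", "small", "round"),
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("red", "small", "round"),
    ("blue", "large", "square"),
]
n = len(records)
freqs = attribute_frequencies(records)

scores = outlier_scores(
    records, lambda a, b: frequency_similarity(a, b, freqs, n)
)
most_anomalous = max(range(n), key=scores.__getitem__)
print(most_anomalous, [round(s, 3) for s in scores])
```

Passing `overlap_similarity` to `outlier_scores` instead gives a baseline that ignores how common each value is, which is the kind of comparison between measures the abstract describes.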
Blue Eyes Intelligence Engineering and Sciences Publication - BEIESP
Related Results
Similarity Search with Data Missing
Similarity search is a fundamental research problem with broad applications in various research fields, including data mining, information retrieval, and machine learning. The core...
An Improved Innovation Robust Outliers Detection Method for Airborne Array Position and Orientation Measurement System
The airborne array position and orientation measurement system (array POS) is a key device for high-resolution multi-dimensional real-time imaging motion compensation of military r...
A Comparative Evaluation of Outlier Detection in Categorical and Mixed Data
Outlier detection is essential in different domains such as cybersecurity and fraud detection, to name a few. However, identifying the best way to detect o...
Analysis of a Similarity Measure for Non-Overlapped Data
A similarity measure is a measure evaluating the degree of similarity between two fuzzy data sets and has become an essential tool in many applications including data mining, patte...
Bagan Kendali Robust Multivariat untuk Pengamatan Individual
The most widely used control chart in multivariate process control is the Hotelling T2 chart. There are two kinds of Hotelling T2 chart, namely T2 Hotelling...
Research Note: A Study of Outliers of International Tourism Statistics
As international tourism is an industry that is easily impacted by external shocks, there is always structural mutation of the time series related with it, which causes the existen...
Using covariance weighted euclidean distance to assess the dissimilarity between integral experiments
Integral experiments especially criticality experiments help a lot in designing either new nuclear reactor or criticality assembly. The calculation uncertainty of the integral para...
Outliers in official statistics
The purpose of this manuscript is to provide a survey on the important methods addressing outliers while producing official statistics. Outliers are often unavoidable in su...

