A Comparative Evaluation of Outlier Detection in Categorical and Mixed Data
Abstract
Outlier detection is essential in domains such as cybersecurity and fraud detection. However, identifying the best way to detect outliers is often a challenge. Although most algorithms are designed for numerical data, many real-world datasets contain categorical attributes or a mixture of categorical and numerical ones. Given a dataset with one or more categorical attributes, how should outliers be detected? This survey evaluates three potential solutions: (1) applying algorithms that can process categorical data directly, (2) converting categorical attributes into numerical ones before detection, and (3) removing categorical attributes so that only the numerical ones are considered. We performed experiments using 47 datasets and 14 detection algorithms, and found that Solution (1) is usually preferred, especially with the detection algorithm CBRW. However, Solution (2) with detection algorithms such as iForest and KNN-outlier achieves better results in certain contexts, depending on the data characteristics. Based on these findings, we also introduce a predictive model that achieves 80% accuracy in identifying the best of the three solutions for a new dataset. Additionally, we compared approaches for converting categorical attributes into numerical ones, and showed that the Correspondence Analysis data-conversion method often yields the best results. This survey provides comparative insights, methodological guidance, and predictive support for outlier detection in categorical and mixed data.
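The three solutions named in the abstract can be illustrated with a minimal sketch in plain Python. Everything here is an assumption made for illustration and not the paper's experimental setup: the toy dataset is hypothetical, Solution (1) uses a simple value-frequency score as a stand-in (it is not CBRW), Solution (2) uses one-hot encoding (not Correspondence Analysis) without feature scaling, and the detector for Solutions (2) and (3) is a basic k-th-nearest-neighbour distance score in the spirit of KNN-outlier.

```python
import math
from collections import Counter

# Hypothetical toy mixed dataset: (numeric attributes, categorical attributes).
rows = [
    ((1.0, 2.0), ("red", "small")),
    ((1.1, 2.1), ("red", "small")),
    ((0.9, 1.9), ("red", "small")),
    ((1.0, 2.2), ("red", "small")),
    ((1.2, 2.0), ("blue", "small")),
    ((1.0, 2.0), ("green", "large")),  # rare categories, ordinary numbers
]

def frequency_scores(cat_rows):
    """Stand-in for Solution (1): sum of (1 - relative frequency) per value.

    This is a simple illustrative score, NOT the CBRW algorithm.
    """
    n = len(cat_rows)
    counts = [Counter(r[j] for r in cat_rows) for j in range(len(cat_rows[0]))]
    return [sum(1.0 - counts[j][v] / n for j, v in enumerate(r)) for r in cat_rows]

def one_hot(cat_rows):
    """Conversion step for Solution (2): one 0/1 column per category value."""
    n_attrs = len(cat_rows[0])
    vocab = [sorted({r[j] for r in cat_rows}) for j in range(n_attrs)]
    return [
        tuple(1.0 if r[j] == v else 0.0 for j in range(n_attrs) for v in vocab[j])
        for r in cat_rows
    ]

def knn_scores(points, k=2):
    """Distance to the k-th nearest neighbour; larger means more outlying."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

numeric = [num for num, _ in rows]
categorical = [cat for _, cat in rows]

# Solution (1): score the categorical attributes directly.
scores_direct = frequency_scores(categorical)

# Solution (2): convert categorical attributes, then run a numerical detector.
converted = [num + enc for num, enc in zip(numeric, one_hot(categorical))]
scores_converted = knn_scores(converted)

# Solution (3): drop the categorical attributes entirely.
scores_numeric_only = knn_scores(numeric)

for name, scores in [
    ("direct", scores_direct),
    ("converted", scores_converted),
    ("numeric only", scores_numeric_only),
]:
    print(name, [round(s, 3) for s in scores])
```

On this toy data the last row stands out under the first two strategies but not under the third, because its only unusual properties are categorical. That is one intuition for why discarding categorical attributes can lose the signal; it says nothing about which strategy wins on real datasets.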
Springer Science and Business Media LLC