Clustering Analysis of Data with High Dimensionality
Clustering analysis has been widely applied in diverse fields such as data mining, access structures, knowledge discovery, software engineering, organization of information systems, and machine learning. The main objective of cluster analysis is to create groups of objects based on the degree of their association (Kaufman & Rousseeuw, 1990; Romesburg, 1990). With respect to the output structure, clustering algorithms fall into two major categories: partitional and hierarchical (Romesburg, 1990). K-means is a representative partitional algorithm; its output is a flat structure of clusters. K-means is attractive because of its simplicity and efficiency, which make it a favorite choice for handling large datasets. Its main drawback is a dependency on the initial choice of the number of clusters. This choice may not be optimal, since it must be made at the outset, before there is any informed expectation of how many natural clusters the data contain. Hierarchical clustering algorithms produce a hierarchical structure, often presented graphically as a dendrogram. There are two main types of hierarchical algorithms: agglomerative and divisive. The agglomerative method uses a bottom-up approach: it starts with the individual objects, each considered to be in its own cluster, and repeatedly merges clusters until the desired number of clusters is reached. The divisive method takes the opposite, top-down approach: it starts with all objects in one cluster and recursively divides them into separate clusters. The clusters form a tree in which each higher level represents a higher degree of dissimilarity; the height of a merging point in the tree represents the similarity distance at which the objects merge into one cluster. Agglomerative algorithms are usually able to generate high-quality clusters, but at a higher computational cost than divisive algorithms.
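The bottom-up merging process described above can be sketched in a few lines of Python. This is a deliberately naive single-linkage implementation for illustration only (the function names and the stopping criterion "merge until k clusters remain" are our own choices, not the paper's experimental setup):

```python
from itertools import combinations

def euclidean(a, b):
    # Straight-line distance between two points given as tuples.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_linkage(c1, c2):
    # Single linkage: distance between the closest pair of points
    # drawn from the two clusters.
    return min(euclidean(p, q) for p in c1 for q in c2)

def agglomerate(points, k):
    """Naive agglomerative clustering: start with singleton clusters,
    then repeatedly merge the two closest clusters until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: single_linkage(clusters[ij[0]],
                                                 clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))  # i < j, so index i stays valid
    return clusters

# Two well-separated groups on the line collapse into two clusters.
print(agglomerate([(0.0,), (0.5,), (10.0,), (10.5,)], k=2))
```

Replacing `single_linkage` with the average pairwise distance between clusters turns this sketch from SLINK-style into UPGMA-style clustering; the merge loop itself is unchanged.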
In this paper, we investigate the behavior of agglomerative hierarchical algorithms. We further divide these algorithms into two major categories: group-based and single-object-based clustering methods. Typical examples of the former category include Unweighted Pair-Group using Arithmetic averages (UPGMA), Centroid Linkage, and Ward's method; Single LINKage (SLINK) clustering and Complete LINKage (CLINK) clustering fall into the latter. We choose UPGMA and SLINK as representatives of each category; comparing these two techniques also reflects the similarities and differences between the two families of clustering methods. The study examines three key issues for clustering analysis: (1) the computation of the degree of association between different objects; (2) the designation of an acceptable criterion for evaluating how good and/or successful a clustering method is; and (3) the adaptability of the clustering method under different statistical distributions of data, including random, skewed, and concentrated around certain regions. Two different statistical distributions are used to express how data objects are drawn from a 50-dimensional space. This also differentiates our work from previous studies, where only a limited number of data-feature dimensions (typically up to three) were considered (Bouguettaya, 1996; Bouguettaya & LeViet, 1998). In addition, three types of distances are used to compare the resultant clustering trees: Euclidean, Canberra Metric, and Bray-Curtis distances. We present the results of an exhaustive set of experiments on data drawn from this 50-dimensional space. These experiments indicate a surprisingly high level of similarity between the two clustering techniques under most combinations of parameter settings.
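The three dissimilarity measures named above have standard definitions, which can be sketched as follows (the formulas are the textbook ones; the variable names and the example vectors are ours, and the Bray-Curtis form shown assumes non-negative data):

```python
import math

def euclidean(x, y):
    # Straight-line distance in d-dimensional space.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def canberra(x, y):
    # Sum of per-coordinate relative differences; coordinates with
    # |a| + |b| == 0 are conventionally treated as contributing zero.
    return sum(abs(a - b) / (abs(a) + abs(b))
               for a, b in zip(x, y) if abs(a) + abs(b) > 0)

def bray_curtis(x, y):
    # Normalized sum of absolute differences (non-negative data assumed).
    return sum(abs(a - b) for a, b in zip(x, y)) / sum(a + b for a, b in zip(x, y))

x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(euclidean(x, y))    # sqrt(1 + 4 + 9) = sqrt(14)
print(canberra(x, y))     # 1/3 + 2/6 + 3/9 = 1.0
print(bray_curtis(x, y))  # 6 / 18 = 1/3
```

Note that Canberra and Bray-Curtis both normalize each difference by the magnitudes involved, so they weight coordinates near zero more heavily than Euclidean distance does, which is one reason the choice of measure can change the shape of the resulting clustering tree.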

