A Parallel Multiobjective PSO Weighted Average Clustering Algorithm Based on Apache Spark
Multiobjective clustering algorithms based on particle swarm optimization (PSO) have been applied successfully in a number of applications. However, existing algorithms are implemented on a single machine and cannot be directly parallelized on a cluster, which makes it difficult for them to handle large-scale data. Distributed parallel computing frameworks make data parallelism possible, but increasing the degree of parallelism can leave data unevenly distributed across partitions, which degrades the clustering result. In this paper, we propose a parallel multiobjective PSO weighted average clustering algorithm based on Apache Spark (Spark-MOPSO-Avg). First, the entire data set is divided into multiple partitions and cached in memory using Apache Spark's distributed, in-memory computing. The local fitness value of each particle is calculated in parallel from the data in each partition. After the calculation is completed, only particle information is transmitted; large numbers of data objects do not need to be moved between nodes, which reduces network communication and thus the algorithm’s running time. Second, a weighted average of the local fitness values is computed to mitigate the effect of unbalanced data distribution on the results. Experimental results show that Spark-MOPSO-Avg incurs low information loss under data parallelism, losing about 1% to 9% accuracy, while effectively reducing the time overhead. It shows good execution efficiency and parallel computing capability on a Spark distributed cluster.
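The weighted-average step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the fitness function or the weights, so the sketch assumes a single compactness objective (mean squared distance to the nearest centroid) and weights equal to partition sizes. The names `local_fitness` and `weighted_average_fitness` are hypothetical, and plain Python lists stand in for Spark partitions.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def local_fitness(partition: Sequence[Point],
                  centroids: Sequence[Point]) -> Tuple[float, int]:
    """Return (local fitness, point count) for one partition.

    Assumed objective: mean squared distance from each point to its
    nearest centroid. A particle is represented by its centroids.
    """
    if not partition:
        return 0.0, 0  # an empty partition contributes nothing
    total = 0.0
    for p in partition:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c))
                     for c in centroids)
    return total / len(partition), len(partition)


def weighted_average_fitness(local_results: List[Tuple[float, int]]) -> float:
    """Combine local fitness values, weighting each by its partition size
    (an assumed weighting; the abstract does not specify one)."""
    n_total = sum(n for _, n in local_results)
    return sum(f * n for f, n in local_results) / n_total


# One particle (two centroids) and two deliberately unbalanced partitions.
centroids = [(0.0, 0.0), (10.0, 10.0)]
partitions = [
    [(0.0, 1.0), (1.0, 0.0), (0.0, 0.0), (10.0, 11.0)],  # 4 points
    [(10.0, 13.0)],                                      # 1 point
]

# Only (fitness, count) pairs leave each partition, not the data points.
local_results = [local_fitness(part, centroids) for part in partitions]

weighted = weighted_average_fitness(local_results)
unweighted = sum(f for f, _ in local_results) / len(local_results)

# Reference value computed on the whole data set at once.
all_points = [p for part in partitions for p in part]
global_fitness, _ = local_fitness(all_points, centroids)

print(weighted, unweighted, global_fitness)
```

Under these assumptions the size-weighted average matches the fitness computed on the full data set, while the plain average is pulled toward the small partition. On a real cluster the per-partition step would run inside Spark rather than in a list comprehension.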
Related Results
Estimating PM10 Concentration from Drilling Operations in Open-Pit Mines Using an Assembly of SVR and PSO
Dust is one of the components causing heavy environmental pollution in open-pit mines, especially PM10. Some pathologies related to the lung, respiratory system, and occupational d...
Pengaruh Penggunaan Busi Standar, Dan Busi Iridium Terhadap Daya Dan Torsi Pada Mesin Yamaha Force One
A spark plug is a part of an internal combustion engine with an electrode tip in the combustion chamber. Spar...
Optical Measurement of Spark Deflection Inside a Pre-chamber for Spark-Ignition Engines
The start of combustion in a spark-ignited engine is highly dependent upon the conditions between the two ...
A Synchronous-Asynchronous Particle Swarm Optimisation Algorithm
In the original particle swarm optimisation (PSO) algorithm, the particles’ velocities and positions are updated after the whole swarm performance is evaluated. This algorithm is a...
Big data clustering techniques based on Spark: a literature review
A popular unsupervised learning method, known as clustering, is extensively used in data mining, machine learning and pattern recognition. The procedure involves grouping of single...
Parallel density clustering algorithm based on MapReduce and optimized cuckoo algorithm
In the process of parallel density clustering, the boundary points of clusters with different densities are blurred and there is data noise, which affects the clustering performanc...
Validity of Acute Physiology and Chronic Health Evaluation (APACHE) IV for the Prediction of Prolonged Intensive Care Unit (ICU) Length of Stay in Dr. Sardjito General Hospital in the COVID Era
Introduction: APACHE IV was a good predictor of ICU length of stay in the USA and some countries outside the USA but poor in others. It is important to develop a scoring system for...
The Kernel Rough K-Means Algorithm
Background: Clustering is one of the most important data mining methods. The k-means (c-means) and its derivative methods are the hotspot in the field of clustering research in re...

