
SpeCA: A Speculative Parallel Crawling Approach on Apache Spark

Abstract: The World Wide Web is growing at a phenomenal rate, so the choice of crawling approach is vital to crawling efficiency. Existing crawling algorithms on multicore platforms are time-consuming and do not handle large data sets well. To improve the parallelism and efficiency of crawlers in distributed network environments, this paper proposes SpeCA, a speculative parallel crawling approach on Apache Spark based on the software thread-level speculation technique. By analyzing the web crawling process, SpeCA first uses a function to divide a crawling process into several subprocesses that can be executed independently, and then spawns a number of threads to crawl speculatively in parallel. Finally, the speculative results are merged to form the final outcome. Compared with the conventional parallel approach on a multicore platform, SpeCA is highly efficient and achieves a high degree of parallelism by making the best use of cluster resources. Experiments show that, on average, the proposed approach achieves a significant speedup over the traditional approach. In addition, as the number of worker nodes grows, the execution time decreases gradually and the speedup scales linearly. The results indicate that crawling efficiency can be significantly enhanced by adopting this speculative parallel algorithm.
Springer Science and Business Media LLC
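
The divide, speculate, and merge steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation and does not use Apache Spark: it assumes a small in-memory link graph (`LINKS`) in place of real HTTP fetches and uses Python threads as stand-ins for speculative workers. All names (`divide`, `crawl_speculatively`, `merge`, `crawl`) are hypothetical. Each worker crawls from its own seeds with a private visited set, and the merge step unions the results and counts pages that more than one worker fetched.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory link graph standing in for the web (page -> outgoing links).
LINKS = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": [],
    "e": ["a"],
    "x": ["y", "d"],
    "y": [],
}


def fetch_links(page):
    """Stand-in for downloading a page and extracting its links."""
    return LINKS.get(page, [])


def divide(seeds, parts):
    """Split the seed list round-robin into at most `parts` sub-frontiers."""
    chunks = [seeds[i::parts] for i in range(parts)]
    return [chunk for chunk in chunks if chunk]


def crawl_speculatively(seeds):
    """Crawl from `seeds` with a private visited map, ignoring other workers."""
    visited = {}
    frontier = list(seeds)
    while frontier:
        page = frontier.pop()
        if page in visited:
            continue
        visited[page] = fetch_links(page)
        frontier.extend(visited[page])
    return visited


def merge(partials):
    """Union the speculative results; count pages fetched by more than one worker."""
    merged, redundant = {}, 0
    for partial in partials:
        for page, links in partial.items():
            if page in merged:
                redundant += 1  # speculative work that turned out to be duplicated
            else:
                merged[page] = links
    return merged, redundant


def crawl(seeds, workers=2):
    """Divide the seeds, crawl each part in its own thread, then merge."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(crawl_speculatively, divide(seeds, workers)))
    return merge(partials)


merged, redundant = crawl(["a", "x"], workers=2)
print(sorted(merged), redundant)
```

In this sketch the only cost of a wrong speculation is redundant fetching, which the merge step discards. A fuller thread-level speculation scheme would also need to validate dependencies between subprocesses, which the abstract does not detail.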
