
SpeCA: A Speculative Parallel Crawling Approach on Apache Spark

Abstract The World Wide Web is growing at a phenomenal rate, so the choice of crawling approach is vital to crawling the web efficiently. Existing crawling algorithms on multicore platforms are time-consuming and do not handle large data volumes well. To improve the parallelism and efficiency of crawlers in distributed network environments, this paper proposes a speculative parallel crawling approach (SpeCA) on Apache Spark, based on the software thread-level speculation technique. By analyzing the web crawling process, SpeCA first uses a function to divide a crawling process into several subprocesses that can be executed independently, and then spawns a number of threads to crawl speculatively in parallel. Finally, the speculative results are merged to form the final outcome. Compared with the conventional parallel approach on a multicore platform, SpeCA is highly efficient and achieves a high degree of parallelism by making the best use of the cluster's resources. Experiments show that, on average, the proposed approach achieves a significant speedup over the traditional approach. In addition, as the number of worker nodes grows, the execution time decreases gradually and the speedup scales linearly. The results indicate that crawling efficiency can be significantly enhanced by adopting this speculative parallel algorithm.
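
The divide, speculate, and merge steps described in the abstract can be illustrated with a small sketch. This is not the authors' implementation and it does not use Apache Spark: it assumes a hypothetical in-memory link graph (TOY_WEB) in place of real HTTP fetches and uses Python threads on a single machine. The names fetch_links, partition, crawl_part, speculative_crawl, and the max_pages limit are all illustrative assumptions. Each worker crawls from its own seeds using only a private visited set, so two workers may fetch the same page; the merge step keeps one copy and discards the redundant speculative work.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy "web": page -> outgoing links (stands in for HTTP fetches).
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": [],
    "e": ["a"],
}


def fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return TOY_WEB.get(url, [])


def partition(frontier, n_parts):
    """Divide the frontier into at most n_parts non-empty sub-frontiers."""
    parts = [frontier[i::n_parts] for i in range(n_parts)]
    return [p for p in parts if p]


def crawl_part(seeds, snapshot, max_pages):
    """Speculatively crawl from seeds with a private visited set.

    `snapshot` is a read-only view of pages merged in earlier rounds;
    the worker does not see what other workers fetch in this round.
    """
    local = {}
    queue = deque(seeds)
    while queue and len(local) < max_pages:
        url = queue.popleft()
        if url in local or url in snapshot:
            continue
        local[url] = fetch_links(url)
        queue.extend(local[url])
    return local


def speculative_crawl(seeds, n_threads=2, max_pages=2):
    """Run rounds of divide -> speculative parallel crawl -> merge."""
    visited = {}
    frontier = list(dict.fromkeys(seeds))  # de-duplicate, keep order
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        while frontier:
            snapshot = frozenset(visited)
            parts = partition(frontier, n_threads)
            partials = list(
                pool.map(lambda p: crawl_part(p, snapshot, max_pages), parts)
            )
            discovered = []
            for partial in partials:
                for url, links in partial.items():
                    # Merge: keep the first copy, drop duplicated speculative work.
                    if url not in visited:
                        visited[url] = links
                    discovered.extend(links)
            frontier = [u for u in dict.fromkeys(discovered) if u not in visited]
    return visited


if __name__ == "__main__":
    pages = speculative_crawl(["a"])
    print(sorted(pages))
```

In this sketch the cost of speculation is visible as pages fetched by more than one worker in the same round; a distributed version would presumably replace the thread pool with cluster tasks, which is outside what the abstract specifies.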

Springer Science and Business Media LLC

Li Yuxiang, Su Yaning

2025



