
SpeCA: A Speculative Parallel Crawling Approach on Apache Spark

Abstract: The World Wide Web is growing at a phenomenal rate, so the choice of crawling approach is vital to crawling it efficiently. Existing crawling algorithms on multicore platforms are time-consuming and do not handle large data volumes well. To improve the parallelism and efficiency of crawlers in distributed network environments, this paper proposes a speculative parallel crawling approach (SpeCA) on Apache Spark, based on the software thread-level speculation technique. After analyzing the web crawling process, SpeCA first uses a function to divide a crawling process into several subprocesses that can run independently, and then spawns a number of threads to crawl speculatively in parallel. Finally, the speculative results are merged to form the final outcome. Compared with the conventional parallel approach on a multicore platform, SpeCA is efficient and obtains a high degree of parallelism by making full use of the cluster's resources. Experiments show that, on average, the proposed approach achieves a significant speedup over the traditional approach. In addition, as the number of worker nodes grows, the execution time decreases gradually and the speedup scales linearly. The results indicate that crawling efficiency can be significantly enhanced by adopting this speculative parallel algorithm.
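The divide, speculate, and merge steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation and does not use Spark: it assumes plain Python threads and a hypothetical in-memory link graph (`LINKS`) in place of real HTTP fetches. The helper names `fetch_links`, `partition`, `crawl_chunk`, and `crawl` are all made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory link graph standing in for the web (assumption).
LINKS = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": [],
    "e": ["a"],
}


def fetch_links(url):
    """Stand-in for an HTTP fetch plus link extraction."""
    return LINKS.get(url, [])


def partition(urls, n):
    """Divide the frontier into at most n independent, non-empty chunks."""
    ordered = sorted(urls)
    return [ordered[i::n] for i in range(n) if ordered[i::n]]


def crawl_chunk(chunk):
    """Crawl one chunk speculatively, without coordinating with other workers."""
    return {url: fetch_links(url) for url in chunk}


def crawl(seeds, workers=2):
    """Crawl from the seeds in rounds: divide, crawl in parallel, merge."""
    visited = {}
    frontier = set(seeds)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            chunks = partition(frontier, workers)
            results = list(pool.map(crawl_chunk, chunks))
            # Merge step: commit each speculative result once and
            # collect the links discovered in this round.
            discovered = set()
            for result in results:
                for url, out_links in result.items():
                    visited.setdefault(url, out_links)
                    discovered.update(out_links)
            # Only pages not yet committed go into the next round.
            frontier = discovered.difference(visited)
    return visited


if __name__ == "__main__":
    print(sorted(crawl(["a"])))
```

In this sketch each worker runs without consulting shared state, and the merge step is the only place where results are committed and duplicates are discarded. On a cluster, the chunks would presumably be distributed across worker nodes rather than local threads.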

Springer Science and Business Media LLC

Li Yuxiang, Su Yaning

2025



