
SpeCA: A Speculative Parallel Crawling Approach on Apache Spark

Abstract The World Wide Web is growing at a phenomenal rate, and the crawling approach is vital to the efficiency of crawling it. Existing crawling algorithms on multicore platforms are time-consuming and do not handle large data well. To improve the parallelism and efficiency of crawlers in distributed network environments, this paper proposes a speculative parallel crawling approach (SpeCA) on Apache Spark, based on the software thread-level speculation technique. By analyzing the web crawling process, SpeCA first uses a function to divide a crawling process into several subprocesses that can be executed independently, and then spawns a number of threads to crawl speculatively in parallel. Finally, the speculative results are merged to form the final outcome. Compared with the conventional parallel approach on a multicore platform, SpeCA is highly efficient and obtains a high degree of parallelism by making the best use of the cluster's resources. Experiments show that, on average, the proposed approach achieves a significant speedup over the traditional approach. In addition, as the number of worker nodes grows, the execution time decreases gradually and the speedup scales linearly. The results indicate that crawling efficiency can be significantly enhanced by adopting this speculative parallel algorithm.
Springer Science and Business Media LLC
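
The three steps named in the abstract (divide the crawl into independent subprocesses, crawl them speculatively in parallel threads, merge the speculative results) can be illustrated with a minimal sketch. This is not the paper's implementation and does not use Apache Spark: it uses plain Python threads, and `FAKE_WEB`, `partition`, `crawl_part`, and `speculative_crawl` are hypothetical names chosen for illustration. The abstract does not specify the partitioning function or the merge rule, so round-robin partitioning and "first committed result wins" are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "web": URL -> outgoing links.
# It stands in for real HTTP fetches so the sketch runs offline.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": [],
    "e": ["a"],
}

def partition(frontier, n_parts):
    """Divide the frontier into n_parts independent sub-tasks (round-robin; an assumption)."""
    return [frontier[i::n_parts] for i in range(n_parts)]

def crawl_part(urls, visited_snapshot):
    """Speculatively crawl one partition against a read-only snapshot of visited URLs."""
    results = {}
    for url in urls:
        if url not in visited_snapshot:
            results[url] = FAKE_WEB.get(url, [])
    return results

def speculative_crawl(seeds, n_threads=2):
    visited = {}            # committed pages: URL -> links
    frontier = list(seeds)
    while frontier:
        parts = partition(frontier, n_threads)
        snapshot = frozenset(visited)
        # Each thread works without coordinating with the others.
        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            speculative = list(pool.map(lambda p: crawl_part(p, snapshot), parts))
        # Merge in partition order: commit a result only if no earlier
        # partition already committed the same URL; otherwise discard it.
        next_frontier = []
        for result in speculative:
            for url, links in result.items():
                if url in visited:
                    continue  # mis-speculated duplicate work is dropped
                visited[url] = links
                for link in links:
                    if link not in visited and link not in next_frontier:
                        next_frontier.append(link)
        frontier = [u for u in next_frontier if u not in visited]
    return visited

if __name__ == "__main__":
    pages = speculative_crawl(["a"], n_threads=2)
    print(sorted(pages))
```

The snapshot lets every thread proceed without locks; the cost is that two partitions may occasionally fetch the same page, which the ordered merge detects and discards. On a cluster the thread pool would be replaced by distributed tasks, but how SpeCA maps this onto Spark is not described in the abstract.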

Related Results

Software analysis of scientific texts: comparative study of distributed computing frameworks
The relevance of this study is related to the need for efficient analysis of scientific texts in the context of the growing amount of information. This study aims to conduct a stud...
Speculative Fiction
The term “speculative fiction” has three historically located meanings: a subgenre of science fiction that deals with human rather than technological problems, a genre distinct fro...
Analisis Variasi Busi Terhadap Performa dan Bahan Bakar Motor Bensin 2 Langkah Yamaha F1ZR 110CC
Spark plugs have various types and specifications that can improve motorcycle performance. The purpose of this study was to determine the ratio of torque, power, and fuel consumpti...
Tools and techniques for real-time data processing: A review
Real-time data processing is an essential component in the modern data landscape, where vast amounts of data are generated continuously from various sources such as Internet of Thi...
Populating the Future: Families and Reproduction in Speculative Fiction
Speculative fiction opens doors for imagining beyond what is possible, conventional or acceptable. Speculative fiction has an acute ear for the social, the scientific and for polit...
ANALISIS PENGGUNAAN BUSI RACING TERHADAP UNJUK KERJA MESIN TOYOTA AVANZA 1300 CC
To overcome this is certainly a need to increase engine performance either by the use of spark plugs can improve the performance of a gasoline engine. Thus improve fuel efficiency ...
SPARK PLUG PROBLEMS IN AUTOMOTIVE SERVICE
The selection of a spark plug of the proper heat range for automotive service is becoming increasingly difficult in spite of the many improvem...
