A System For Storing And Processing Big Data Based On The Apache Spark Platform
The primary objective of this paper is to investigate and implement the Apache Spark big data processing platform on a stock dataset, followed by the application of a machine learning technique for prediction and modeling. Specifically, PySpark, the Python API for Apache Spark, is used to interact with the Spark framework. The Spark MLlib library is employed for data transformation, whereas the GraphX library is used for data modeling. Multiple executions of the experimental program demonstrated significant performance improvements, with notably shorter runtimes on the Spark cluster than on a single machine. These results highlight the advantages of distributed and parallel processing in large-scale data analysis.
Viet Nam National University Ho Chi Minh City
Related Results
Distributed Computing Engines for Big Data Analytics
Technologies like cloud computing paved the way for dealing with massive amounts of data. Prior to cloud, it was not possible unless you invested large amounts in computing resources. N...
Software analysis of scientific texts: comparative study of distributed computing frameworks
The relevance of this study is related to the need for efficient analysis of scientific texts in the context of the growing amount of information. This study aims to conduct a stud...
Tools and techniques for real-time data processing: A review
Real-time data processing is an essential component in the modern data landscape, where vast amounts of data are generated continuously from various sources such as Internet of Thi...
Digital Footprint as a Source of Big Data in Education
The purpose of this study is to consider the prospects and problems of using big data in education. Materials and methods. The research methods include analysis, systematization and...
A comparative analysis of big data processing paradigms: MapReduce vs. Apache Spark
The paper addresses a highly relevant and contemporary topic in the field of data processing. Big data is a crucial aspect of modern computing, and the choice of processing framewo...
Scalability and Efficiency in Distributed Big Data Architectures: A Comparative Study
With the rapid expansion of the size of data, there is a need for the development of scalable and efficient architectures for large scale data processing. This research conducts a ...
Why Should Big Data-based Price Discrimination be Governed?
The e-commerce platform provides data services for resident merchants for precise marketing, but this also leads to frequent occurrence of big data-based price dis...
Compressive structural bioinformatics
We are developing compressed 3D molecular data representations and workflows (“Compressive Structural Bioinformatics”) to speed up mining and visualization of 3D structural data by...

