IRPDP_HT2: A Scalable Data Pre-processing Method in Web Usage Mining using Hadoop-MapReduce
Abstract
Data preparation is a vital step in the Web usage mining process, since it provides structured data for the subsequent stages. Raw server logs must be transformed into user sessions in order to generate structured data for the pattern discovery phase. Over the past decade, the daily server-log output of popular websites has grown to many terabytes, even petabytes. As a result, server logs pose Big Data challenges in both storage and processing. Data cleaning, user identification, and session identification, the sub-phases of data pre-processing, are data-intensive and computation-intensive operations. In recent years, Google's MapReduce parallel programming framework has grown in popularity for data-intensive applications, as it provides scalable, flexible, fault-tolerant, and distributed processing over a cluster of nodes. In this paper, we present IRPDP_HT2, a parallel data pre-processing algorithm using MapReduce that comprises data cleaning, user identification, and session identification techniques. The relevance and scalability of IRPDP_HT2 are assessed on real web server logs, and the presented approach is found to be both efficient and scalable.
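The abstract does not spell out the IRPDP_HT2 algorithm itself, but the session-identification stage it names follows a well-known MapReduce pattern, which the hypothetical sketch below illustrates: the map phase emits (user, request) pairs keyed by client IP, and the reduce phase sorts each user's requests by time and cuts a new session whenever the gap between consecutive requests exceeds an inactivity timeout (30 minutes is a common heuristic; the log format and field names here are assumptions, not the paper's).

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed session-timeout heuristic; the paper's actual threshold may differ.
SESSION_TIMEOUT = timedelta(minutes=30)

def map_phase(log_line):
    """Map: parse one cleaned log line 'ip timestamp url' -> (user_key, record)."""
    ip, ts, url = log_line.split(" ", 2)
    return ip, (datetime.fromisoformat(ts), url)

def reduce_phase(user_key, records):
    """Reduce: sort one user's requests by time, split at the timeout."""
    records = sorted(records)
    sessions, current = [], [records[0]]
    for prev, cur in zip(records, records[1:]):
        if cur[0] - prev[0] > SESSION_TIMEOUT:
            sessions.append(current)  # inactivity gap: close the session
            current = []
        current.append(cur)
    sessions.append(current)
    return user_key, sessions

def run(log_lines):
    """Simulate the MapReduce shuffle locally: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for line in log_lines:
        key, value = map_phase(line)
        grouped[key].append(value)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

logs = [
    "10.0.0.1 2023-05-01T10:00:00 /index.html",
    "10.0.0.1 2023-05-01T10:05:00 /about.html",
    "10.0.0.1 2023-05-01T11:00:00 /index.html",  # > 30 min gap: new session
]
result = run(logs)
# result["10.0.0.1"] holds two sessions: [two requests], [one request]
```

On a real Hadoop cluster the `run` helper disappears: the framework performs the shuffle, so only the map and reduce functions (e.g. via Hadoop Streaming) need to be supplied, and the same logic scales across nodes.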