IRPDP_HT2: A Scalable Data Pre-processing Method in Web Usage Mining using Hadoop-MapReduce
Abstract
Data preparation is a vital step in the Web usage mining process, since it provides structured data for the subsequent stages. Raw server logs must be transformed into user sessions in order to generate structured data for the pattern discovery phase. Over the past decade, the daily server-log output of popular websites has grown to many terabytes, even petabytes. As a result, server logs pose Big Data challenges in both storage and processing. Data cleaning, user identification, and session identification, the sub-phases of data pre-processing, are data-intensive and computation-intensive operations. In recent years, Google's MapReduce parallel programming framework has grown in popularity for data-intensive applications, as it provides scalable, flexible, fault-tolerant, and distributed processing over a cluster of nodes. In this paper, we present IRPDP_HT2, a parallel data pre-processing algorithm using MapReduce that comprises data cleaning, user identification, and session identification techniques. The relevance and scalability of IRPDP_HT2 are assessed on real web server logs, and the presented approach is found to be both efficient and scalable.
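The abstract does not spell out the IRPDP_HT2 algorithm itself, but the session-identification stage it names follows a well-known MapReduce pattern, which the hypothetical sketch below illustrates: the map phase emits (user, request) pairs keyed by client IP, and the reduce phase sorts each user's requests by time and cuts a new session whenever the gap between consecutive requests exceeds an inactivity timeout (30 minutes is a common heuristic; the log format and field names here are assumptions, not the paper's).

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed session-timeout heuristic; the paper's actual threshold may differ.
SESSION_TIMEOUT = timedelta(minutes=30)

def map_phase(log_line):
    """Map: parse one cleaned log line 'ip timestamp url' -> (user_key, record)."""
    ip, ts, url = log_line.split(" ", 2)
    return ip, (datetime.fromisoformat(ts), url)

def reduce_phase(user_key, records):
    """Reduce: sort one user's requests by time, split at the timeout."""
    records = sorted(records)
    sessions, current = [], [records[0]]
    for prev, cur in zip(records, records[1:]):
        if cur[0] - prev[0] > SESSION_TIMEOUT:
            sessions.append(current)  # inactivity gap: close the session
            current = []
        current.append(cur)
    sessions.append(current)
    return user_key, sessions

def run(log_lines):
    """Simulate the MapReduce shuffle locally: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for line in log_lines:
        key, value = map_phase(line)
        grouped[key].append(value)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

logs = [
    "10.0.0.1 2023-05-01T10:00:00 /index.html",
    "10.0.0.1 2023-05-01T10:05:00 /about.html",
    "10.0.0.1 2023-05-01T11:00:00 /index.html",  # > 30 min gap: new session
]
result = run(logs)
# result["10.0.0.1"] holds two sessions: [two requests], [one request]
```

On a real Hadoop cluster the `run` helper disappears: the framework performs the shuffle, so only the map and reduce functions (e.g. via Hadoop Streaming) need to be supplied, and the same logic scales across nodes.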