Javascript must be enabled to continue!

Multi-constraint scheduling of MapReduce workloads

In recent years there has been an extraordinary growth of large-scale data processing and related technologies in both, industry and academic communities. This trend is mostly driven by the need to explore the increasingly large amounts of information that global companies and communities are able to gather, and has lead the introduction of new tools and models, most of which are designed around the idea of handling huge amounts of data. A good example of this trend towards improved large-scale data processing is MapReduce, a programming model intended to ease the development of massively parallel applications, and which has been widely adopted to process large datasets thanks to its simplicity. While the MapReduce model was originally used primarily for batch data processing in large static clusters, nowadays it is mostly deployed along with other kinds of workloads in shared environments in which multiple users may be submitting concurrent jobs with completely different priorities and needs: from small, almost interactive, executions, to very long applications that take hours to complete. Scheduling and selecting tasks for execution is extremely relevant in MapReduce environments since it governs a job's opportunity to make progress and determines its performance. However, only basic primitives to prioritize between jobs are available at the moment, constantly causing either under or over-provisioning, as the amount of resources needed to complete a particular job are not obvious a priori. This thesis aims to address both, the lack of management capabilities and the increased complexity of the environments in which MapReduce is executed. To that end, new models and techniques are introduced in order to improve the scheduling of MapReduce in the presence of different constraints found in real-world scenarios, such as completion time goals, data locality, hardware heterogeneity, or availability of resources. The focus is on improving the integration of MapReduce with the computing infrastructures in which it usually runs, allowing alternative techniques for dynamic management and provisioning of resources. More specifically, it is focused in three scenarios that are incremental in its scope. First, it studies the prospects of using high-level performance criteria to manage and drive the performance of MapReduce applications, taking advantage of the fact that MapReduce is executed in controlled environments in which the status of the cluster is known. Second, it examines the feasibility and benefits of making the MapReduce runtime more aware of the underlying hardware and the characteristics of applications. And finally, it also considers the interaction between MapReduce and other kinds of workloads, proposing new techniques to handle these increasingly complex environments. Following these three items described above, this thesis contributes to the management of MapReduce workloads by 1) proposing a performance model for MapReduce workloads and a scheduling algorithm that leverages the proposed model and is able to adapt depending on the various needs of its users in the presence of completion time constraints; 2) proposing a new resource model for MapReduce and a placement algorithm aware of the underlying hardware as well as the characteristics of the applications, capable of improving cluster utilization while still being guided by job performance metrics; and 3) proposing a model for shared environments in which MapReduce is executed along with other kinds of workloads such as transactional applications, and a scheduler aware of these workloads and its expected demand of resources, capable of improving resource utilization across machines while observing completion time goals.

Universitat Politècnica de Catalunya

Jordà Polo Bardés

2023

Title: Multi-constraint scheduling of MapReduce workloads

Description:

In recent years there has been an extraordinary growth of large-scale data processing and related technologies in both, industry and academic communities.

This trend is mostly driven by the need to explore the increasingly large amounts of information that global companies and communities are able to gather, and has lead the introduction of new tools and models, most of which are designed around the idea of handling huge amounts of data.

A good example of this trend towards improved large-scale data processing is MapReduce, a programming model intended to ease the development of massively parallel applications, and which has been widely adopted to process large datasets thanks to its simplicity.

While the MapReduce model was originally used primarily for batch data processing in large static clusters, nowadays it is mostly deployed along with other kinds of workloads in shared environments in which multiple users may be submitting concurrent jobs with completely different priorities and needs: from small, almost interactive, executions, to very long applications that take hours to complete.

Scheduling and selecting tasks for execution is extremely relevant in MapReduce environments since it governs a job's opportunity to make progress and determines its performance.

However, only basic primitives to prioritize between jobs are available at the moment, constantly causing either under or over-provisioning, as the amount of resources needed to complete a particular job are not obvious a priori.

This thesis aims to address both, the lack of management capabilities and the increased complexity of the environments in which MapReduce is executed.

To that end, new models and techniques are introduced in order to improve the scheduling of MapReduce in the presence of different constraints found in real-world scenarios, such as completion time goals, data locality, hardware heterogeneity, or availability of resources.

The focus is on improving the integration of MapReduce with the computing infrastructures in which it usually runs, allowing alternative techniques for dynamic management and provisioning of resources.

More specifically, it is focused in three scenarios that are incremental in its scope.

First, it studies the prospects of using high-level performance criteria to manage and drive the performance of MapReduce applications, taking advantage of the fact that MapReduce is executed in controlled environments in which the status of the cluster is known.

Second, it examines the feasibility and benefits of making the MapReduce runtime more aware of the underlying hardware and the characteristics of applications.

And finally, it also considers the interaction between MapReduce and other kinds of workloads, proposing new techniques to handle these increasingly complex environments.

Following these three items described above, this thesis contributes to the management of MapReduce workloads by 1) proposing a performance model for MapReduce workloads and a scheduling algorithm that leverages the proposed model and is able to adapt depending on the various needs of its users in the presence of completion time constraints; 2) proposing a new resource model for MapReduce and a placement algorithm aware of the underlying hardware as well as the characteristics of the applications, capable of improving cluster utilization while still being guided by job performance metrics; and 3) proposing a model for shared environments in which MapReduce is executed along with other kinds of workloads such as transactional applications, and a scheduler aware of these workloads and its expected demand of resources, capable of improving resource utilization across machines while observing completion time goals.

Back

Optimisation de la gestion des données pour les applications MapReduce sur des infrastructures distribuées à grande échelle Les applications data-intensive sont lar...

Pengurangan Work In Process Inventory di Stasiun Kerja Bottleneck Menggunakan Pendekatan Theory Of Constraint (TOC)

Abstract. CV. Pustaka Setia is a company engaged in publishing and printing books. The obstacle experienced by CV Pustaka Setia is the occurrence of accumulation (Work In Process i...

Improving MapReduce Performance on Clusters

Amélioration des performances de MapReduce sur grappe de calcul Beaucoup de disciplines scientifiques s'appuient désormais sur l'analyse et la fouille de masses gig...

OPTIMIZATION OF WORK LOAD USING MAP REDUCE FRAMEWORK: Review Study

The term Optimize is “to make perfect”. It’s means choosing the best element from some set of available alternatives. Within the past few years, organizations in diverse industries...

Constraining simulation uncertainties in a hydrological model of the Congo River Basin including a combined modelling approach for channel-wetland exchanges

Compared to other large river basins of the world, such as the Amazon, the Congo River Basin appears to be the most ungauged and less studied. This is partly because the basin lack...

DPTM: An Adaptive Scheduler Design Utilizing Timeslot Matching and Release Methods for Concurrent and Multi-task Interleaved Pipelining-oriented CGRA

Coarse-grained reconfigurable architectures (CGRAs) are increasingly employed as domain-specific accelerators due to their efficiency and flexibility. However, the existing CGRA ar...

Visual versus Tabular Scheduling Programs

Effective scheduling in construction is crucial for ensuring timely project completion and maintaining budget control. Scheduling programs play an important role in this process by...

Reinforcement Learning-Based Framework for Optimal Task Scheduling in Cloud Computing

Cloud computing enables the execution of large-scale computing tasks in a pay-per-use manner, allowing users worldwide to submit diverse workloads to cloud infrastructures. In this...

Email:
Password:

Email:

Multi-constraint scheduling of MapReduce workloads

Related Results