Adaptive Workflow Scheduling in Heterogeneous GPU Clusters via Deep Reinforcement Learning
The proliferation of heterogeneous Graphics Processing Unit (GPU) clusters has introduced unprecedented computational capabilities for workflow execution across diverse scientific and industrial domains. However, the inherent heterogeneity of GPU resources, coupled with dynamic workload characteristics and complex workflow dependencies, makes efficient scheduling difficult. Traditional heuristic scheduling algorithms such as Heterogeneous Earliest Finish Time (HEFT) and First-In-First-Out with Duplication and Earliest Finish Time (FIFO-DEFT) often fail to adapt to rapidly changing cluster states and evolving workload patterns. This paper proposes an adaptive workflow scheduling framework that uses Deep Reinforcement Learning (DRL) to allocate workflow tasks to heterogeneous GPU resources. The approach employs a Deep Q-Network (DQN) with prioritized experience replay to learn scheduling policies through continuous interaction with the cluster environment. The framework models workflow scheduling as a Markov Decision Process (MDP) in which the agent learns to minimize makespan, maximize resource utilization, and maintain quality-of-service guarantees. Experimental evaluations show that the DRL-based scheduler significantly outperforms baseline algorithms including HEFT, FIFO-DEFT, and other state-of-the-art schedulers. The proposed method adapts to varying cluster configurations and workflow characteristics, maintaining robust performance across diverse execution scenarios while reducing average makespan and improving the schedule length ratio (SLR).
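To illustrate the prioritized experience replay mentioned in the abstract, the following is a minimal Python sketch of a proportional prioritized replay buffer. It is not the authors' implementation, which the abstract does not describe. The class name `PrioritizedReplayBuffer`, the list-based storage, and the default values of `alpha`, `beta`, and `eps` are assumptions chosen for illustration. Transitions are treated as opaque objects; in a scheduling setting each one might hold a cluster state, a chosen GPU, a reward, and the next state.

```python
import random
from dataclasses import dataclass, field


@dataclass
class PrioritizedReplayBuffer:
    """Proportional prioritized replay (hypothetical sketch, O(n) sampling)."""
    capacity: int
    alpha: float = 0.6   # assumed default; 0 would give uniform sampling
    eps: float = 1e-3    # assumed default; keeps updated priorities positive
    items: list = field(default_factory=list)
    priorities: list = field(default_factory=list)
    _next: int = 0       # ring-buffer write position

    def add(self, transition):
        # New transitions receive the current maximum priority (1.0 if empty),
        # so they are not starved before their TD error is known.
        p = max(self.priorities, default=1.0)
        if len(self.items) < self.capacity:
            self.items.append(transition)
            self.priorities.append(p)
        else:
            # Buffer full: overwrite the oldest slot.
            self.items[self._next] = transition
            self.priorities[self._next] = p
        self._next = (self._next + 1) % self.capacity

    def probabilities(self):
        # P(i) = p_i^alpha / sum_k p_k^alpha
        scaled = [p ** self.alpha for p in self.priorities]
        total = sum(scaled)
        return [s / total for s in scaled]

    def sample(self, batch_size, beta=0.4, rng=random):
        probs = self.probabilities()
        n = len(self.items)
        idx = rng.choices(range(n), weights=probs, k=batch_size)
        # Importance-sampling weights w_i = (N * P(i))^(-beta),
        # normalized here by the largest weight in the sampled batch.
        weights = [(n * probs[i]) ** (-beta) for i in idx]
        max_w = max(weights)
        weights = [w / max_w for w in weights]
        return idx, [self.items[i] for i in idx], weights

    def update(self, indices, td_errors):
        # After a learning step, priority becomes |TD error| + eps.
        for i, err in zip(indices, td_errors):
            self.priorities[i] = abs(err) + self.eps
```

In a training loop, the agent would call `add` after each scheduling decision, call `sample` to draw a batch for the DQN update, scale each sample's loss by its returned weight, and then pass the new temporal-difference errors to `update`. A production buffer would typically replace the lists with a sum tree so that sampling costs O(log n) instead of O(n).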
Related Results
Learning Approaches to Dynamic Workflow Scheduling based on Genetic Programming and Deep Reinforcement Learning
Dynamic workflow scheduling (DWS) in cloud computing is a critical yet challenging problem, involving assigning numerous workflow tasks to heterogeneous virt...
Workflow Scheduling Based on Mobile Cloud Computing Machine Learning
In recent years, cloud workflow task scheduling has always been an important research topic in the business world. Cloud workflow task scheduling means that the workflow tasks subm...
Heat transfer in supercritical fluids: computational approaches & studies
(English) This thesis delves into investigating the complexities of heat transfer in supercritical fluids through the application of advanced theoretical and computational methodol...
STRENGTH OF BUTT WELDED BUTT JOINT OF REINFORCEMENT OF CLASS A500C
The paper presents the results of experimental studies of the strength of cross-shaped welded joints of types К1-Кт and К3-Рр [1] of thermomechanically hardened reinforcement of cl...
Reinforcement Learning-Based Framework for Optimal Task Scheduling in Cloud Computing
Cloud computing enables the execution of large-scale computing tasks in a pay-per-use manner, allowing users worldwide to submit diverse workloads to cloud infrastructures. In this...
Parallel metaheuristics on GPU
Real-world optimization problems are often complex and NP-hard. Their modeling is in constant evo...
EDQWS: an enhanced divide and conquer algorithm for workflow scheduling in cloud
A workflow is an effective way for modeling complex applications and serves as a means for scientists and researchers to better understand the details of applications. Clou...

