Adaptive Workflow Scheduling in Heterogeneous GPU Clusters via Deep Reinforcement Learning

The proliferation of heterogeneous Graphics Processing Unit (GPU) clusters has introduced unprecedented computational capabilities for workflow execution across diverse scientific and industrial domains. However, the inherent heterogeneity of GPU resources, coupled with dynamic workload characteristics and complex workflow dependencies, presents substantial challenges for efficient scheduling. Traditional heuristic-based scheduling algorithms such as Heterogeneous Earliest Finish Time (HEFT) and First-In-First-Out with Duplication and Earliest Finish Time (FIFO-DEFT) often fail to adapt to rapidly changing cluster states and evolving workload patterns. This paper proposes an adaptive workflow scheduling framework leveraging Deep Reinforcement Learning (DRL) to intelligently allocate workflow tasks to heterogeneous GPU resources. The proposed approach employs a Deep Q-Network (DQN) architecture integrated with prioritized experience replay to learn optimal scheduling policies through continuous interaction with the cluster environment. The framework models workflow scheduling as a Markov Decision Process (MDP) where the agent learns to minimize makespan, maximize resource utilization, and maintain quality-of-service guarantees. Extensive experimental evaluations demonstrate that the DRL-based scheduler achieves significant performance improvements compared to baseline algorithms including HEFT, FIFO-DEFT, and other state-of-the-art schedulers. The proposed method exhibits superior adaptability to varying cluster configurations and workflow characteristics, maintaining robust performance across diverse execution scenarios while reducing average makespan and improving scheduling length ratio metrics.
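The two metrics named at the end of the abstract, makespan and scheduling length ratio (SLR), can be illustrated with a minimal sketch. Everything below is a hypothetical toy example, not the paper's setup: the four-task workflow, the two GPU types, and the run times are invented for illustration, and communication costs are ignored. The SLR denominator used here is the longest path through the workflow with every task charged at its minimum cost, which is one common lower-bound convention; the abstract does not state which definition the authors use.

```python
# Hypothetical toy workflow: task -> list of predecessor tasks (a DAG).
PREDS = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}

# Hypothetical run times (seconds) of each task on two GPU types.
COST = {
    "A": {"fast": 2.0, "slow": 4.0},
    "B": {"fast": 3.0, "slow": 5.0},
    "C": {"fast": 4.0, "slow": 6.0},
    "D": {"fast": 1.0, "slow": 2.0},
}


def makespan(order, assignment):
    """Finish time of the last task.

    `order` must be a topological order of the workflow; `assignment`
    maps each task to a GPU. Each GPU runs one task at a time and
    communication cost is ignored (a simplifying assumption).
    """
    gpu_free = {}  # time at which each GPU becomes idle
    finish = {}    # finish time of each scheduled task
    for task in order:
        gpu = assignment[task]
        ready = max((finish[p] for p in PREDS[task]), default=0.0)
        start = max(ready, gpu_free.get(gpu, 0.0))
        finish[task] = start + COST[task][gpu]
        gpu_free[gpu] = finish[task]
    return max(finish.values())


def critical_path_lower_bound(order):
    """Longest path through the DAG with each task at its minimum cost."""
    best = {}
    for task in order:
        longest_pred = max((best[p] for p in PREDS[task]), default=0.0)
        best[task] = longest_pred + min(COST[task].values())
    return max(best.values())


def slr(order, assignment):
    """Scheduling length ratio: makespan divided by the lower bound (>= 1)."""
    return makespan(order, assignment) / critical_path_lower_bound(order)


ORDER = ["A", "B", "C", "D"]
ALL_FAST = {"A": "fast", "B": "fast", "C": "fast", "D": "fast"}
MIXED = {"A": "fast", "B": "fast", "C": "slow", "D": "fast"}

print(makespan(ORDER, ALL_FAST))      # 10.0
print(makespan(ORDER, MIXED))         # 9.0
print(round(slr(ORDER, MIXED), 3))    # 1.286
```

In this toy case, placing task C on the slower GPU lets it run in parallel with B and shortens the makespan from 10.0 to 9.0 seconds. An SLR closer to 1 means the schedule is closer to the critical-path lower bound.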

International Study Counselor

Zixuan Li, Yuefeng Chen, Thomas Gallagher

Multidisciplinary Research in Computing Information Systems

2026

Dynamic workflow scheduling (DWS) in cloud computing is a critical yet challenging problem, involving assigning numerous workflow tasks to heterogeneous virt...

Workflow Scheduling Based on Mobile Cloud Computing Machine Learning

In recent years, cloud workflow task scheduling has always been an important research topic in the business world. Cloud workflow task scheduling means that the workflow tasks subm...

Heat transfer in supercritical fluids: computational approaches & studies

This thesis delves into investigating the complexities of heat transfer in supercritical fluids through the application of advanced theoretical and computational methodol...

STRENGTH OF BUTT WELDED BUTT JOINT OF REINFORCEMENT OF CLASS A500C

The paper presents the results of experimental studies of the strength of cross-shaped welded joints of types К1-Кт and К3-Рр [1] of thermomechanically hardened reinforcement of cl...

Reinforcement Learning-Based Framework for Optimal Task Scheduling in Cloud Computing

Cloud computing enables the execution of large-scale computing tasks in a pay-per-use manner, allowing users worldwide to submit diverse workloads to cloud infrastructures. In this...

R-GPU

Over the last decade, Graphics Processing Unit (GPU) architectures have evolved from a fixed-function graphics pipeline to a programmable, energy-efficient compute accelerator for ...

Parallel metaheuristics on GPU

Real-world optimization problems are often complex and NP-hard. Their modeling is in constant evo...

EDQWS: an enhanced divide and conquer algorithm for workflow scheduling in cloud

A workflow is an effective way for modeling complex applications and serves as a means for scientists and researchers to better understand the details of applications. Clou...

Related Results