
Hybrid-Grained Dynamic Load Balanced GEMM on NUMA Architectures

The Basic Linear Algebra Subprograms (BLAS) are a fundamental numerical software library, and GEneral Matrix Multiply (GEMM) is the most important computational kernel routine in BLAS. On multi-core and many-core processors, the whole GEMM workload is partitioned and scheduled across multiple threads to exploit the parallel hardware. Generally, the workload is partitioned equally among threads, and all threads are expected to finish their work in roughly the same time. However, this is not the case on Non-Uniform Memory Access (NUMA) architectures. The NUMA effect may cause threads to run at different speeds, and the overall execution time of GEMM is determined by the slowest thread. In this paper, we propose a hybrid-grained dynamic load-balancing method that reduces the harm of the NUMA effect by allowing fast threads to steal work from slow ones. We evaluate the proposed method on Phytium 2000+, an emerging 64-core high-performance processor based on Arm’s AArch64 architecture. Results show that our method reduces the synchronization overhead by 51.5% and improves GEMM performance by 1.9%.
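
The abstract does not spell out how the stealing is implemented, so the following is only a generic illustration of the idea of fast threads taking work from slow ones, not the paper's method. It is a minimal sketch under several assumptions: the function name `gemm_work_stealing`, the parameters `num_threads` and `tile_rows`, the row-tile granularity, and the single global lock are all hypothetical choices made for readability. Each thread first receives an equal static share of the rows of C, split into small tiles. A thread that runs out of its own tiles steals one from the thread with the most tiles left.

```python
import threading
from collections import deque


def gemm_work_stealing(A, B, num_threads=4, tile_rows=2):
    """Compute C = A x B with one task deque per thread and simple stealing.

    Hypothetical illustration only; names and granularity are assumptions.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]

    # Static phase: every thread gets an equal contiguous share of rows,
    # pre-split into small tiles so that part of a share can be stolen later.
    queues = [deque() for _ in range(num_threads)]
    share = (m + num_threads - 1) // num_threads
    for t in range(num_threads):
        lo, hi = t * share, min((t + 1) * share, m)
        for r in range(lo, hi, tile_rows):
            queues[t].append((r, min(r + tile_rows, hi)))

    lock = threading.Lock()       # one global lock keeps the sketch simple
    stolen = [0] * num_threads    # number of tiles each thread stole

    def next_task(tid):
        with lock:
            if queues[tid]:
                # Own work first, taken from the front of the local deque.
                return queues[tid].popleft()
            # Otherwise steal from the back of the most loaded deque.
            victim = max(range(num_threads), key=lambda v: len(queues[v]))
            if queues[victim]:
                stolen[tid] += 1
                return queues[victim].pop()
        return None               # no work left anywhere

    def worker(tid):
        while True:
            task = next_task(tid)
            if task is None:
                return
            r0, r1 = task
            # Each row tile of C is written by exactly one thread.
            for i in range(r0, r1):
                for j in range(n):
                    C[i][j] = sum(A[i][p] * B[p][j] for p in range(k))

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return C, stolen
```

Calling `gemm_work_stealing(A, B)` on two nested-list matrices returns the product together with a per-thread count of stolen tiles. CPython threads do not run this arithmetic in parallel, so the sketch only illustrates the scheduling logic; a real implementation would use native threads, per-queue synchronization, and NUMA-aware data placement.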
