
Hybrid-Grained Dynamic Load Balanced GEMM on NUMA Architectures

The Basic Linear Algebra Subprograms (BLAS) are a fundamental numerical software library, and GEneral Matrix Multiply (GEMM) is the most important computational kernel routine in BLAS. On multi-core and many-core processors, the GEMM workload is partitioned and scheduled across multiple threads to exploit the parallel hardware. Generally, the workload is partitioned equally among threads, and all threads are expected to finish their work in roughly the same time. However, this is not the case on Non-Uniform Memory Access (NUMA) architectures. The NUMA effect may cause threads to run at different speeds, and the overall execution time of GEMM is determined by the slowest thread. In this paper, we propose a hybrid-grained dynamic load-balancing method that mitigates the NUMA effect by allowing fast threads to steal work from slow ones. We evaluate the proposed method on Phytium 2000+, an emerging 64-core high-performance processor based on Arm's AArch64 architecture. Results show that our method reduces the synchronization overhead by 51.5% and improves GEMM performance by 1.9%.
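The general idea of letting fast threads steal work from slow ones can be illustrated with a minimal Python sketch. It is not the paper's hybrid-grained method: it assumes a single granularity (row blocks of C), one global lock, and a "steal from the fullest queue" policy, all chosen here for brevity. The names `gemm_work_stealing`, `matmul_block`, `block_rows`, and `stolen` are hypothetical, and because of Python's global interpreter lock the sketch shows the scheduling logic only, not a real speedup.

```python
import threading
from collections import deque


def matmul_block(A, B, C, row_start, row_end):
    """Compute rows [row_start, row_end) of C = A * B with plain loops."""
    n_inner = len(B)
    n_cols = len(B[0])
    for i in range(row_start, row_end):
        a_row = A[i]
        c_row = C[i]
        for j in range(n_cols):
            total = 0.0
            for k in range(n_inner):
                total += a_row[k] * B[k][j]
            c_row[j] = total


def gemm_work_stealing(A, B, num_threads=4, block_rows=2):
    """Multiply A by B; idle threads steal row blocks from busy ones.

    Assumption: a single lock guards all queues, which is simple but
    would itself become a bottleneck on real hardware.
    """
    m = len(A)
    C = [[0.0] * len(B[0]) for _ in range(m)]

    # Static phase: split C into row blocks and give each thread an
    # equal contiguous share, as in a conventional equal partitioning.
    blocks = [(s, min(s + block_rows, m)) for s in range(0, m, block_rows)]
    queues = [deque() for _ in range(num_threads)]
    per_thread = -(-len(blocks) // num_threads)  # ceiling division
    for idx, blk in enumerate(blocks):
        queues[idx // per_thread].append(blk)

    lock = threading.Lock()
    stolen = [0] * num_threads  # blocks each thread took from others

    def next_block(tid):
        with lock:
            if queues[tid]:
                return queues[tid].popleft()  # own work first
            # Dynamic phase: steal from the tail of the fullest queue.
            victim = max(range(num_threads), key=lambda t: len(queues[t]))
            if queues[victim]:
                stolen[tid] += 1
                return queues[victim].pop()
            return None  # no work left anywhere

    def worker(tid):
        while True:
            blk = next_block(tid)
            if blk is None:
                return
            matmul_block(A, B, C, blk[0], blk[1])

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return C, stolen
```

Each block is removed from a queue under the lock, so it is computed exactly once, and threads write to disjoint rows of C. The `stolen` counters show how much work migrated between threads in a given run; their values depend on thread timing and are not deterministic.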
