Hybrid-Grained Dynamic Load Balanced GEMM on NUMA Architectures

The Basic Linear Algebra Subprograms (BLAS) are a fundamental numerical software library, and GEneral Matrix Multiply (GEMM) is the most important computational kernel routine in BLAS. On multi-core and many-core processors, the whole GEMM workload is partitioned and scheduled across multiple threads to exploit the parallel hardware. Generally, the workload is partitioned equally among threads, and all threads are expected to finish their work in roughly the same time. However, this is not the case on Non-Uniform Memory Access (NUMA) architectures. The NUMA effect may cause threads to run at different speeds, and the overall execution time of GEMM is determined by the slowest thread. In this paper, we propose a hybrid-grained dynamic load-balancing method that reduces the harm of the NUMA effect by allowing fast threads to steal work from slow ones. We evaluate the proposed method on Phytium 2000+, an emerging 64-core high-performance processor based on Arm’s AArch64 architecture. Results show that our method reduces the synchronization overhead by 51.5% and improves GEMM performance by 1.9%.
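The idea of letting fast threads steal work from slow ones can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: the function name `gemm_work_stealing`, the row-block granularity, the per-thread queues, and the "steal from the fullest queue" policy are all assumptions made for illustration. The sketch first splits row blocks evenly among threads (a coarse, static step) and then lets a thread that has emptied its own queue take single blocks from other threads (a fine, dynamic step).

```python
import threading
from collections import deque


def gemm_work_stealing(A, B, n_threads=4, block_rows=2):
    """Compute C = A @ B on nested lists, using hypothetical per-thread
    queues of row blocks plus work stealing. Returns (C, blocks_done_per_thread)."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]

    # Coarse-grained step (assumed): split row blocks evenly, contiguously.
    blocks = [(r, min(r + block_rows, m)) for r in range(0, m, block_rows)]
    queues = [deque() for _ in range(n_threads)]
    for i, blk in enumerate(blocks):
        queues[i * n_threads // len(blocks)].append(blk)

    locks = [threading.Lock() for _ in range(n_threads)]
    done_by = [0] * n_threads

    def take(tid):
        # Prefer the thread's own queue, taking from the front.
        with locks[tid]:
            if queues[tid]:
                return queues[tid].popleft()
        # Fine-grained step (assumed policy): try victims from fullest to
        # emptiest and steal one block from the tail.
        victims = sorted(range(n_threads), key=lambda v: -len(queues[v]))
        for victim in victims:
            with locks[victim]:
                if queues[victim]:
                    return queues[victim].pop()
        return None  # no work left anywhere

    def worker(tid):
        while (blk := take(tid)) is not None:
            r0, r1 = blk
            for i in range(r0, r1):  # each block owns distinct rows of C
                for j in range(n):
                    C[i][j] = sum(A[i][p] * B[p][j] for p in range(k))
            done_by[tid] += 1

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return C, done_by


if __name__ == "__main__":
    A = [[1, 2], [3, 4], [5, 6]]
    B = [[7, 8], [9, 10]]
    C, done_by = gemm_work_stealing(A, B, n_threads=2, block_rows=1)
    print(C)
    print(done_by)  # how many blocks each thread ended up computing
```

Because every block is removed from a queue under that queue's lock, each block is computed exactly once regardless of which thread takes it. The per-thread counts in `done_by` vary from run to run, which is the point: threads that run faster end up with more blocks than their initial share.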

MDPI AG

Xing Su, Fei Lei

Electronics

2018


