Quantitative Performance Analysis of BLAS Libraries on GPU Architectures
Basic Linear Algebra Subprograms (BLAS) are a set of linear algebra routines commonly used in machine learning and scientific computing applications. BLAS libraries with optimized implementations of these routines offer high performance by exploiting the parallel execution units of the target computing system. With their very large number of cores, graphics processing units (GPUs) deliver high performance on computationally heavy workloads. Recent BLAS libraries use the parallel cores of GPU architectures efficiently by exploiting the data parallelism inherent in these routines. In this study, we analyze GPU-targeted functions from two BLAS libraries, cuBLAS and MAGMA, and evaluate their performance on a single-GPU NVIDIA architecture, taking architectural features and limitations into account. We collect architectural performance metrics and explore resource utilization characteristics. Our work aims to help researchers and programmers understand the performance behavior and GPU resource utilization of the BLAS routines implemented by these libraries.
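As a concrete reference point for the kind of routine the abstract refers to, the sketch below shows the standard BLAS GEMM operation, C = alpha * A * B + beta * C, in plain Python, together with the conventional 2 * m * n * k floating-point operation count that is often used to turn a measured run time into GFLOP/s. This is an illustrative assumption, not the study's method: the function names (`gemm`, `gemm_flops`, `gflops`) are hypothetical, the code runs on the CPU, and it neither calls cuBLAS or MAGMA nor collects the architectural metrics the study describes.

```python
def gemm(alpha, a, b, beta, c):
    """Reference GEMM on nested lists: returns alpha * (a @ b) + beta * c.

    a is m x k, b is k x n, c is m x n (row-major lists of lists).
    Hypothetical helper for illustration only; not a library API.
    """
    m, k, n = len(a), len(b), len(b[0])
    return [
        [
            alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
            for j in range(n)
        ]
        for i in range(m)
    ]


def gemm_flops(m, n, k):
    """Conventional GEMM operation count: one multiply and one add per term."""
    return 2 * m * n * k


def gflops(m, n, k, seconds):
    """Throughput in GFLOP/s for an m x n x k GEMM that took `seconds`."""
    return gemm_flops(m, n, k) / seconds / 1e9


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c = [[1, 1], [1, 1]]
    print(gemm(2, a, b, 1, c))
    print(gflops(1000, 1000, 1000, 1.0))
```

The same operation count applies whichever library executes the routine, so a throughput figure computed this way can be compared across implementations, provided the timings are measured consistently.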
Deu Muhendislik Fakultesi Fen ve Muhendislik
Related Results
Performant Automatic BLAS Offloading on Unified Memory Architecture with OpenMP First-Touch Style Data Movement
BLAS is a fundamental building block of advanced linear algebra libraries and many modern scientific computing applications. GPU is known for its strong arithmetic computing capabi...
Vina-GPU 2.1: towards further optimizing docking speed and precision of AutoDock Vina and its derivatives
AutoDock Vina and its derivatives have established themselves as a prevailing pipeline for virtual screening in contemporary drug discovery. Our Vina-GPU method leverages t...
SYCL-BLAS: Combining Expression Trees and Kernel Fusion on Heterogeneous Systems
Support for heterogeneous platforms requires multiple specialised devices to collaborate to execute an application. The SYCL standard, published by Khronos, providing a C++ abstract...
Unlocking the Power of Parallel Computing: GPU technologies for Ocean Forecasting
Abstract. Operational ocean forecasting systems are complex engines that must execute ocean models with high performance to provide timely products and datasets. Significant comput...
Vina-GPU 2.0: further accelerating AutoDock Vina and its derivatives with GPUs
Modern drug discovery typically faces large virtual screens from huge compound databases where multiple docking tools are involved for meeting various real scenes or improving the ...
GPU-I-TASSER: a GPU accelerated I-TASSER protein structure prediction tool
Motivation: Accurate and efficient predictions of protein structures play an important role in understanding their funct...
Accelerated hydrologic modeling: ParFlow GPU implementation
ParFlow is known as a numerical model that simulates the hydrologic cycle from the bedrock to the top of the plant canopy. The original codebase pro...
Enabling Real-Time High-Resolution Flood Forecasting for the Entire State of Berlin Through RIM2D’s Multi-GPU Processing
Abstract. Urban areas are increasingly experiencing more frequent and intense pluvial flooding due to the combined effects of climate change and rapid urbanization—a trend expected...

