The GPU on the Matrix-Matrix Multiply: Performance Study and Contributions

Modern graphics processing units (GPUs) have been at the leading edge of increasing chip-level parallelism over the last ten years, and the CUDA programming model has recently allowed us to exploit their power across many computational domains. Among these, dense linear algebra algorithms emerge as a natural fit for CUDA and the GPU because they are usually inherently parallel and can naturally be expressed as a blocked computation. In this paper, we extensively analyze the GPU programming and performance of one of the fundamental building blocks in numerical linear algebra algorithms: the matrix-matrix multiply. Different programming approaches and optimization techniques have already been published in the literature; we review and analyze them to pursue further optimizations and to unveil the potential of some hardware resources when programming the GPU under CUDA. Experimental results are shown on a GeForce 8800 GTX and a Tesla C870 GPU, with a peak performance of 43 GFLOPS.
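The blocked formulation that makes matrix-matrix multiply a natural fit for the GPU can be sketched on the CPU as follows. This is an illustration in plain Python under stated assumptions, not the authors' CUDA code: the tile size `TILE = 16` and the names `matmul_naive`, `matmul_blocked` and `gflops` are hypothetical choices made for this sketch. Each output tile stands in for the work one CUDA thread block would own, and the inner `bk` loop marks the step where a GPU kernel would typically stage tiles of A and B in shared memory. The `gflops` helper assumes the common convention of counting 2n³ floating-point operations for an n × n product, which is how a throughput figure such as 43 GFLOPS is usually derived.

```python
import random
import time

# Hypothetical tile edge: 16 mirrors a common 16x16 CUDA thread-block shape,
# but it is an illustrative assumption, not a value taken from the paper.
TILE = 16


def matmul_naive(a, b):
    """Reference triple loop: C = A x B for square lists of lists."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += a[i][k] * b[k][j]
            c[i][j] = acc
    return c


def matmul_blocked(a, b, tile=TILE):
    """Blocked C = A x B.

    Each (bi, bj) output tile corresponds to the work one CUDA thread block
    would own; the bk loop walks the shared dimension one tile at a time,
    which is where a GPU kernel would load tiles of A and B into shared
    memory before accumulating into its output tile.
    """
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for bi in range(0, n, tile):
        for bj in range(0, n, tile):
            for bk in range(0, n, tile):
                # min() clips the last tile when n is not a multiple of tile.
                for i in range(bi, min(bi + tile, n)):
                    row_a = a[i]
                    row_c = c[i]
                    for j in range(bj, min(bj + tile, n)):
                        acc = row_c[j]
                        for k in range(bk, min(bk + tile, n)):
                            acc += row_a[k] * b[k][j]
                        row_c[j] = acc
    return c


def gflops(n, seconds):
    """An n x n by n x n product does n**3 multiplies and n**3 adds."""
    return 2.0 * n**3 / (seconds * 1e9)


if __name__ == "__main__":
    n = 40  # deliberately not a multiple of TILE, to exercise edge tiles
    rng = random.Random(0)
    # Small integer values keep every float sum exact, so results compare equal.
    a = [[float(rng.randint(-3, 3)) for _ in range(n)] for _ in range(n)]
    b = [[float(rng.randint(-3, 3)) for _ in range(n)] for _ in range(n)]
    start = time.perf_counter()
    c = matmul_blocked(a, b)
    elapsed = time.perf_counter() - start
    assert c == matmul_naive(a, b)
    print(f"n={n}: {gflops(n, elapsed):.6f} GFLOPS (pure Python, CPU)")
```

Pure Python reports only a tiny fraction of a GFLOPS; the point of the sketch is the loop structure, not the speed. A typical tiled CUDA kernel would map the two outer tile loops onto the grid of thread blocks and the `i`, `j` loops onto the threads inside each block, leaving only the `bk` and `k` loops to run sequentially per thread.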

Related Results

Vina-GPU 2.1: towards further optimizing docking speed and precision of AutoDock Vina and its derivatives
AutoDock Vina and its derivatives have established themselves as a prevailing pipeline for virtual screening in contemporary drug discovery. Our Vina-GPU method leverages t...
Vina-GPU 2.0: further accelerating AutoDock Vina and its derivatives with GPUs
Modern drug discovery typically faces large virtual screens from huge compound databases where multiple docking tools are involved for meeting various real scenes or improving the ...
GPU-I-TASSER: a GPU accelerated I-TASSER protein structure prediction tool
Motivation: Accurate and efficient predictions of protein structures play an important role in understanding their funct...
Enabling Real-Time High-Resolution Flood Forecasting for the Entire State of Berlin Through RIM2D’s Multi-GPU Processing
Urban areas are increasingly experiencing more frequent and intense pluvial flooding due to the combined effects of climate change and rapid urbanization—a trend expected...
Unlocking the Power of Parallel Computing: GPU technologies for Ocean Forecasting
Operational ocean forecasting systems are complex engines that must execute ocean models with high performance to provide timely products and datasets. Significant comput...
Accelerated hydrologic modeling: ParFlow GPU implementation
ParFlow is known as a numerical model that simulates the hydrologic cycle from the bedrock to the top of the plant canopy. The original codebase pro...
Parallel garment drape simulation of triangular mesh using GPU programming
Purpose: The purpose of this paper is to determine the possibility of implementing parallel processing feature of graphic processor unit (GPU) in garment drape simulation. Design/meth...
Matrix Subgridding and Its Effects in Dual Porosity Simulators
Naturally fractured reservoirs are found throughout the world and contain significant amounts of oil reserves. The so-called dual porosity model is one o...
