Search engine for discovering works of Art, research articles, and books related to Art and Culture
ShareThis
Javascript must be enabled to continue!

A CUDA fast multipole method with highly efficient M2L far field evaluation

View through CrossRef
Solving an N-body problem, electrostatic or gravitational, is a crucial task and the main computational bottleneck in many scientific applications. Its direct solution is an ubiquitous showcase example for the compute power of graphics processing units (GPUs). However, the naïve pairwise summation has [Formula: see text] computational complexity. The fast multipole method (FMM) can reduce runtime and complexity to [Formula: see text] for any specified precision. Here, we present a CUDA-accelerated, C++ FMM implementation for multi particle systems with [Formula: see text] potential that are found, e.g. in biomolecular simulations. The algorithm involves several operators to exchange information in an octree data structure. We focus on the Multipole-to-Local (M2L) operator, as its runtime is limiting for the overall performance. We propose, implement and benchmark three different M2L parallelization approaches. Approach (1) utilizes Unified Memory to minimize programming and porting efforts. It achieves decent speedups for only little implementation work. Approach (2) employs CUDA Dynamic Parallelism to significantly improve performance for high approximation accuracies. The presorted list-based approach (3) fits periodic boundary conditions particularly well. It exploits FMM operator symmetries to minimize both memory access and the number of complex multiplications. The result is a compute-bound implementation, i.e. performance is limited by arithmetic operations rather than by memory accesses. The complete CUDA parallelized FMM is incorporated within the GROMACS molecular dynamics package as an alternative Coulomb solver.
Title: A CUDA fast multipole method with highly efficient M2L far field evaluation
Description:
Solving an N-body problem, electrostatic or gravitational, is a crucial task and the main computational bottleneck in many scientific applications.
Its direct solution is an ubiquitous showcase example for the compute power of graphics processing units (GPUs).
However, the naïve pairwise summation has [Formula: see text] computational complexity.
The fast multipole method (FMM) can reduce runtime and complexity to [Formula: see text] for any specified precision.
Here, we present a CUDA-accelerated, C++ FMM implementation for multi particle systems with [Formula: see text] potential that are found, e.
g.
in biomolecular simulations.
The algorithm involves several operators to exchange information in an octree data structure.
We focus on the Multipole-to-Local (M2L) operator, as its runtime is limiting for the overall performance.
We propose, implement and benchmark three different M2L parallelization approaches.
Approach (1) utilizes Unified Memory to minimize programming and porting efforts.
It achieves decent speedups for only little implementation work.
Approach (2) employs CUDA Dynamic Parallelism to significantly improve performance for high approximation accuracies.
The presorted list-based approach (3) fits periodic boundary conditions particularly well.
It exploits FMM operator symmetries to minimize both memory access and the number of complex multiplications.
The result is a compute-bound implementation, i.
e.
performance is limited by arithmetic operations rather than by memory accesses.
The complete CUDA parallelized FMM is incorporated within the GROMACS molecular dynamics package as an alternative Coulomb solver.

Related Results

Fast Fourier Transforms in Electromagnetics
Fast Fourier Transforms in Electromagnetics
This Chapter review the fast Fourier transform (FFT) technique and its application to computational electromagnetics, especially to the fast solver algorithms including the Conjuga...
Harnessing CUDA Dynamic Parallelism for the Solution of Sparse Linear Systems
Harnessing CUDA Dynamic Parallelism for the Solution of Sparse Linear Systems
We leverage CUDA dynamic parallelism to reduce execution time while significantly reducing energy consumption of the Conjugate Gradient (CG) method for the iterative solution of sp...
An error-controlled fast multipole method
An error-controlled fast multipole method
We present a two-stage error estimation scheme for the fast multipole method (FMM). This scheme can be applied to any particle system. It incorporates homogeneous as well as inhomo...
Corrected Article: “An error-controlled fast multipole method” [J. Chem. Phys. 131, 244102 (2009)]
Corrected Article: “An error-controlled fast multipole method” [J. Chem. Phys. 131, 244102 (2009)]
We present a two-stage error estimation scheme for the fast multipole method (FMM). This scheme can be applied to any particle system. It incorporates homogeneous as well as inhomo...
"Best Tradition": CREATE, JCSEE and the Program Evaluation Standards
"Best Tradition": CREATE, JCSEE and the Program Evaluation Standards
Background: Evaluation “is a task in the best tradition of the most abstract theoretical science as well as the most practical applied science” (Scriven, 1968, p .9). The Program E...
Fast Simulations of Highly-Connected Spiking Cortical Models Using GPUs
Fast Simulations of Highly-Connected Spiking Cortical Models Using GPUs
Over the past decade there has been a growing interest in the development of parallel hardware systems for simulating large-scale networks of spiking neurons. Compared to other hig...
Accelerated hydrologic modeling: ParFlow GPU implementation
Accelerated hydrologic modeling: ParFlow GPU implementation
<p>  ParFlow is known as a numerical model that simulates the hydrologic cycle from the bedrock to the top of the plant canopy. The original codebase pro...
Enhanced Fast Rerouting Mechanisms for Protected Traffic in MPLS Networks
Enhanced Fast Rerouting Mechanisms for Protected Traffic in MPLS Networks
Multiprotocol Label Switching (MPLS) fuses the intelligence of routing with the performance of switching and provides significant benefits to networks with a pure IP architecture a...

Back to Top