Automated application-level checkpointing of MPI programs

The running times of many computational science applications, such as protein-folding using ab initio methods, are much longer than the mean-time-to-failure of high-performance computing platforms. To run to completion, therefore, these applications must tolerate hardware failures.

In this paper, we focus on the stopping failure model in which a faulty process hangs and stops responding to the rest of the system. We argue that tolerating such faults is best done by an approach called application-level coordinated non-blocking checkpointing, and that existing fault-tolerance protocols in the literature are not suitable for implementing this approach.

We then present a suitable protocol, which is implemented by a coordination layer that sits between the application program and the MPI library. We show how this protocol can be used with a precompiler that instruments C/MPI programs to save application and MPI library state. An advantage of our approach is that it is independent of the MPI implementation. We present experimental results suggesting that the overhead of using our system can be small.
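To make the idea of application-level checkpointing concrete, here is a minimal single-process sketch in C. It is not the protocol or the precompiler output described in the abstract: it uses no MPI, performs no coordination between processes, and does not save MPI library state. The names `app_state`, `save_checkpoint`, `load_checkpoint`, and `run` are hypothetical, and the `stop_at` parameter is an assumption added only to simulate a stopping failure. The sketch shows only the core step that instrumentation of this kind would have to automate: the program itself writes out the variables it needs to resume, and reloads them on restart.

```c
#include <stdio.h>

/* Hypothetical application state: everything needed to resume the loop. */
typedef struct {
    long next_iter;  /* first iteration that has not been completed yet */
    double sum;      /* running result of the computation */
} app_state;

/* Write the state to a temporary file, then rename it over the previous
 * checkpoint, so a crash in the middle of a write cannot damage the last
 * good checkpoint. Assumes POSIX rename semantics (replaces the target).
 * Returns 0 on success, -1 on failure. */
int save_checkpoint(const char *path, const app_state *s)
{
    char tmp[512];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || n >= (int)sizeof tmp) return -1;

    FILE *f = fopen(tmp, "wb");
    if (!f) return -1;
    int ok = fwrite(s, sizeof *s, 1, f) == 1;
    if (fclose(f) != 0) ok = 0;
    if (!ok) { remove(tmp); return -1; }
    return rename(tmp, path) == 0 ? 0 : -1;
}

/* Load a previously saved state. Returns 0 on success, -1 if there is no
 * usable checkpoint; in that case *s is left unchanged. */
int load_checkpoint(const char *path, app_state *s)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    app_state tmp;
    int ok = fread(&tmp, sizeof tmp, 1, f) == 1;
    fclose(f);
    if (!ok) return -1;
    *s = tmp;
    return 0;
}

/* Sum the integers 0..total-1, checkpointing every `interval` iterations.
 * If stop_at is reached, return early to simulate a process that stops;
 * pass a negative stop_at to run to completion. */
double run(const char *path, long total, long interval, long stop_at)
{
    app_state s = {0, 0.0};
    load_checkpoint(path, &s);  /* resume if possible, else start fresh */

    while (s.next_iter < total) {
        if (s.next_iter == stop_at) return s.sum;  /* simulated failure */
        s.sum += (double)s.next_iter;
        s.next_iter++;
        if (s.next_iter % interval == 0) save_checkpoint(path, &s);
    }
    return s.sum;
}
```

Calling `run("ckpt.bin", 10, 3, 7)` and then `run("ckpt.bin", 10, 3, -1)` interrupts the first run and lets the second resume from the last saved state rather than from iteration 0. In an MPI program the hard part, which this sketch leaves out entirely, is making the checkpoints of all processes mutually consistent while messages are in flight; that is the role the abstract assigns to the coordination layer.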

Related Results

Reorientasi Jurusan Manajemen Pendidikan Islam (MPI) Antara Tenaga Kependidikan dan Tenaga Pendidik
Abstract: This article discusses the problems faced by the majority of MPI students at FTIK IAIN Purwokerto. On the one hand, there is the desire of MPI students to become educators, a...
Bias correction methods for simulated precipitation in the Brazilian Legal Amazon
This study aimed to evaluate precipitation estimates over the Brazilian Legal Amazon (BLA) using high-resolution historical simulations from the MPI-ESM1-2-HR climate model, before...
Towards an "eddy-resolving" climate prediction system
We have developed, implemented and preliminarily evaluated the performance of the first “eddy-resolving” decadal prediction prototype sys...
Fetal myocardial index during labor
Background: The Myocardial Performance Index (MPI) is a Doppler-derived myocardial function tool and can be used to evaluate systolic and diastolic function...
Ocean model formulation influences climate sensitivity
The climate sensitivity is known to be mainly determined by the atmosphere model but here we discover that the ocean model can change a given transient climate res...
Non-equidistant checkpointing and quantitative resilience modeling
Software-intensive systems rely on checkpointing to prevent loss of computation, by performing periodic backups. Non-equidistant checkpointing strategies have been proposed for sp...
The Women Who Don’t Get Counted
Abstract: The current incarceration facilities for the growing number of women are depriving expecting mothers of adequate care cruci...