Automated application-level checkpointing of MPI programs
The running times of many computational science applications, such as protein-folding using ab initio methods, are much longer than the mean-time-to-failure of high-performance computing platforms. To run to completion, therefore, these applications must tolerate hardware failures. In this paper, we focus on the stopping failure model in which a faulty process hangs and stops responding to the rest of the system. We argue that tolerating such faults is best done by an approach called application-level coordinated non-blocking checkpointing, and that existing fault-tolerance protocols in the literature are not suitable for implementing this approach. We then present a suitable protocol, which is implemented by a co-ordination layer that sits between the application program and the MPI library. We show how this protocol can be used with a precompiler that instruments C/MPI programs to save application and MPI library state. An advantage of our approach is that it is independent of the MPI implementation. We present experimental results that argue that the overhead of using our system can be small.
Association for Computing Machinery (ACM)
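To make the idea of application-level checkpointing concrete, here is a minimal single-process sketch in C. It is not the protocol from the paper: it leaves out MPI, the co-ordination layer, and the precompiler entirely, and only illustrates the underlying idea that the program itself saves the state it needs to resume. All names (`app_state`, `save_checkpoint`, `load_checkpoint`, `run`) and the `stop_after` failure simulation are hypothetical, chosen for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical application state: everything needed to resume the loop. */
typedef struct {
    int next_iter;  /* first iteration that has not been executed yet */
    double sum;     /* partial result accumulated so far */
} app_state;

/* Write the state to `path`. Returns 0 on success, -1 on failure.
 * A production system would write to a temporary file and rename it,
 * so that a crash during the write cannot corrupt the last checkpoint. */
static int save_checkpoint(const char *path, const app_state *s) {
    assert(path != NULL && s != NULL);
    FILE *f = fopen(path, "wb");
    if (f == NULL) return -1;
    size_t written = fwrite(s, sizeof *s, 1, f);
    if (fclose(f) != 0 || written != 1) return -1;
    return 0;
}

/* Read the state from `path`. Returns 0 on success, -1 if there is
 * no usable checkpoint (for example on the very first run). */
static int load_checkpoint(const char *path, app_state *s) {
    assert(path != NULL && s != NULL);
    FILE *f = fopen(path, "rb");
    if (f == NULL) return -1;
    size_t got = fread(s, sizeof *s, 1, f);
    fclose(f);
    return got == 1 ? 0 : -1;
}

/* Sum the integers 0..total-1, saving a checkpoint every `interval`
 * iterations. If `stop_after` is non-negative, the function returns
 * early after executing that many iterations in this call, which
 * stands in for a process failure. Calling run() again resumes from
 * the most recent checkpoint instead of from iteration 0. */
static double run(const char *path, int total, int interval, int stop_after) {
    app_state s = {0, 0.0};
    if (load_checkpoint(path, &s) != 0) {
        s.next_iter = 0;  /* no checkpoint found: start from scratch */
        s.sum = 0.0;
    }
    int done = 0;  /* iterations executed in this call */
    while (s.next_iter < total) {
        if (stop_after >= 0 && done == stop_after) {
            return s.sum;  /* simulated failure: unsaved work is lost */
        }
        s.sum += s.next_iter;
        s.next_iter++;
        done++;
        if (s.next_iter % interval == 0) {
            save_checkpoint(path, &s);
        }
    }
    return s.sum;
}
```

Under these assumptions, calling `run("ckpt.bin", 10, 3, 7)` and then `run("ckpt.bin", 10, 3, -1)` makes the second call restart from the last saved state rather than from the beginning. In a parallel MPI program the hard part, which this sketch omits, is coordinating the processes so that their individual checkpoints together form a consistent global state, including messages still in flight.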

