Application-level checkpointing for shared memory programs
Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR.

Our group is investigating the use of compiler technology to instrument codes to make them self-checkpointing and self-restarting, thereby providing an automatic solution to the problem of making long-running scientific applications resilient to hardware faults. Our previous work focused on message-passing programs.

In this paper, we describe such a system for shared-memory programs running on symmetric multiprocessors. This system has two components: (i) a pre-compiler for source-to-source modification of applications, and (ii) a runtime system that implements a protocol for coordinating CPR among the threads of the parallel application. For the sake of concreteness, we focus on a non-trivial subset of OpenMP that includes barriers and locks.

One of the advantages of this approach is that the ability to tolerate faults becomes embedded within the application itself, so applications become self-checkpointing and self-restarting on any platform. We demonstrate this by showing that our transformed benchmarks can checkpoint and restart on three different platforms (Windows/x86, Linux/x86, and Tru64/Alpha). Our experiments show that the overhead introduced by this approach is usually quite small; they also suggest ways in which the current implementation can be tuned to reduce overheads further.
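The general idea of application-level CPR in a multithreaded program can be illustrated with a small sketch. This is not the system or protocol described in the paper: it is a minimal Python example, assuming a simple blocking scheme in which all threads meet at a barrier, one thread writes every thread's state to disk, and a second barrier keeps the others from modifying state until the write has finished. All names (`run`, `worker`, `save_checkpoint`, `load_checkpoint`, `crash_at`) are hypothetical, and the "fault" is simulated by stopping early and discarding the in-memory state.

```python
import json
import os
import threading


def save_checkpoint(states, path):
    """Write all thread states atomically (temp file, then rename)."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(states, f)
    os.replace(tmp, path)


def load_checkpoint(path, n_threads):
    """Return the last saved states, or fresh states if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return [{"step": 0, "acc": 0} for _ in range(n_threads)]


def worker(tid, states, barrier, stop_at, ckpt_every, path):
    st = states[tid]
    while st["step"] < stop_at:
        st["acc"] += st["step"] * (tid + 1)  # stand-in for real work
        st["step"] += 1
        if st["step"] % ckpt_every == 0:
            barrier.wait()  # every thread has finished this step
            if tid == 0:
                save_checkpoint(states, path)
            barrier.wait()  # resume only after the state is on disk


def run(path, n_threads=4, total_steps=100, ckpt_every=10, crash_at=None):
    """Run (or resume) the computation; crash_at simulates a fault."""
    states = load_checkpoint(path, n_threads)
    stop_at = total_steps if crash_at is None else crash_at
    barrier = threading.Barrier(n_threads)
    threads = [
        threading.Thread(
            target=worker,
            args=(t, states, barrier, stop_at, ckpt_every, path),
        )
        for t in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    if crash_at is not None:
        return None  # simulated fault: in-memory state is lost
    return [s["acc"] for s in states]
```

Calling `run(path, crash_at=25)` and then `run(path)` resumes from the checkpoint taken at step 20 rather than from step 0. The two barriers make the saved state consistent across threads at the cost of stalling every thread during the write; a production protocol would likely need to handle locks and non-blocking coordination, which this sketch does not attempt.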
Association for Computing Machinery (ACM)
Related Results
Non-equidistant checkpointing and quantitative resilience modeling
Software intensive systems rely on checkpointing to prevent loss of computation, by performing periodic backups. Non-equidistant checkpointing strategies have been proposed for sp...
The Women Who Don’t Get Counted
ABSTRACT
The current incarceration facilities for the growing number of women are depriving expecting mothers of adequate care cruci...
Shared Histories in Multiethnic Societies: Literature as a Critical Corrective of Cultural Memory Studies
Abstract: The staging of history in literature is engaged in dynamic exchange with society’s memory discourses and in this context, literature is generally seen as playing a creative...
Poster 154: Top Orthopaedic Sports Medicine Fellowship Programs as Perceived by Applicants
Objectives: Despite the high volume of orthopaedic sports medicine fellowship applicants and growing interest in the field, fellowship applicants’ attitudes and preferences towards...
Poster 155: The Prevalence of “Pipelining” at the Top Orthopaedic Sports Medicine Fellowship Programs
Objectives: The term “pipelining” refers to the phenomenon that applicants from certain residency programs frequently match at the same fellowship programs. However, it is unclear ...
Systematic Review of Abstinence-Plus HIV Prevention Programs in High-Income Countries Dr. Sergio Grunbaum Ph.D
Background.
Human immunodeficiency virus (HIV), which causes AIDS, is most often spread through unprotected sex (vaginal, oral, or anal) with an infected partner. Individuals can r...
Automated application-level checkpointing of MPI programs
The running times of many computational science applications, such as protein-folding using ab initio methods, are much longer than the mean-time-to-failure...
The Feasibility and Effectiveness of Web-Based Advance Care Planning Programs: Scoping Review
Background: Advance care planning (ACP) is a process with the overall aim to enhance care in concordance with patients’ preferences. Key elements of ACP are to enable persons to defi...

