Fault-tolerant Parallel Computing
William E. Weihl
April 7, 1990
1 Introduction
I have recently started a new project to develop algorithms and design language and sys-
tem support for fault-tolerant parallel programs. This position paper briefly summarizes
the motivation for the project and my plans for the next few years.
2 Motivation
Current multiprocessors, with the exception of large-scale SIMD machines such as the
Connection Machine, are relatively small. As larger machines are built over the next
decade, I believe that fault-tolerance will become an increasingly important issue. Fault-
tolerance is essential whenever the running time of a computation approaches the mean
time between failures (MTBF) of the overall machine. Since the MTBF of machines
will drop as machines grow, and since people will want to run larger computations on
larger machines (as well as running existing computations more quickly), fault-tolerance
is likely to become much more important over the next few years. Support for fault-
tolerance is also important for parallel programs running on networks of uni- and multi-
processors, where it may be necessary to cope with failures of individual machines and
with reconfiguration resulting from changes in the application load.
Fault-tolerance can be provided, at least in part, in hardware. Indeed, individ-
ual hardware components should be built to be reasonably reliable. Hardware fault-
tolerance alone, however, is not sufficient. First, it is too expensive for large-scale
machines. As the size of a machine grows, the reliability required from individual com-
ponents to achieve a reasonable MTBF for the overall machine also grows, resulting
in an increase in cost for each component. Second, users will invariably want to run
computations that take longer than the MTBF of the machine. Thus, some support for
fault-tolerance must be provided in software. Parallel programs running on large-scale
multiprocessors must be able to detect and recover quickly from failures of machine
components, without taking drastic measures such as restarting an entire computation
from the beginning.
3 Plans
The goal of this work is to develop algorithms for checkpointing and recovery of parallel
programs, and to design system and language primitives for implementing fault-tolerant
parallel programs. Our plan for this research is to study a number of parallel programs,
and develop techniques for making them fault-tolerant. These techniques will serve as
the basis for the design of programming primitives that can be used to write fault-
tolerant parallel programs. We are focusing our efforts on MIMD architectures and
programs.
Based on our work to date, we expect our design to include a set of primitives that
vary in the amount of programming effort required. At one extreme are completely
automatic fault-tolerance mechanisms that handle all the details of checkpointing infor-
mation to backups, detecting failures, and recovering from failures. While such mech-
anisms are easy to use, they also impose substantial overhead that for many programs
may be avoidable. Thus, we also expect to provide lower-level primitives that allow
the programmer to customize the details of checkpointing and recovery to particular
applications.
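As a rough illustration of the two levels, consider the following Python sketch. It is purely hypothetical: none of the names or interfaces below comes from an existing system or from our design; it merely suggests the shape the primitives might take.

    # Hypothetical sketch of two levels of primitives. The names and
    # interfaces are illustrative assumptions only.

    class CheckpointableProcess:
        """Low-level interface: the programmer decides what state must
        survive a failure and how to rebuild the process from it."""

        def checkpoint(self):
            """Return the state to save (e.g., to a backup or to disk)."""
            raise NotImplementedError

        def restore(self, saved_state):
            """Reconstruct the process from a saved state."""
            raise NotImplementedError

    class AutomaticProcess(CheckpointableProcess):
        """High-level mechanism: everything the process might depend on
        is saved, so no per-application effort is needed, but the
        overhead is paid whether or not the application requires it."""

        def __init__(self):
            self.state = {}                  # all mutable state lives here

        def checkpoint(self):
            return dict(self.state)          # save everything, always

        def restore(self, saved_state):
            self.state = dict(saved_state)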
Fault-tolerance methods can be classified based on the degree of optimism of the
methods. At one extreme are methods such as the one developed by Strom and Yemini [SY85], which allow computations to proceed completely asynchronously from the
activities of checkpointing and recovery. When a failure occurs, these highly optimistic
methods may need to roll back a substantial portion of the computation. At the other
extreme are conservative methods that checkpoint every intermediate result (e.g., every
message or every access to a shared memory location). Optimistic methods have the
advantage that they impose less overhead on the computation as long as failures are
rare. However, unconstrained optimism can lead to very long recovery times, as the
number of processors involved in recovering from a failure increases. We expect to focus
on optimistic methods, but to develop techniques for limiting the amount of optimism
so that useful forward progress can be made.
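To make the cost of unconstrained optimism concrete, the following toy sketch models dependency tracking in the style of optimistic recovery protocols such as [SY85]. It is heavily simplified (message logging, in-flight messages, and failure announcements are omitted), but it shows the two costs at issue: each message piggybacks a vector with one entry per process, and a single failure can force every transitively dependent process to roll back.

    # Toy dependency tracking in the style of optimistic recovery
    # protocols such as [SY85]. Heavily simplified: message logging,
    # in-flight messages, and failure announcements are omitted.

    N = 4                                    # number of processes
    interval = [0] * N                       # current state interval of each process
    # dep[p][q]: latest state interval of q that p's state depends on
    dep = [[0] * N for _ in range(N)]

    def send(sender, receiver):
        """Each message carries the sender's whole dependency vector,
        so per-message overhead grows with the number of processes."""
        interval[sender] += 1                # sending starts a new interval
        dep[sender][sender] = interval[sender]
        piggyback = list(dep[sender])
        for q in range(N):                   # receiver merges the vector
            dep[receiver][q] = max(dep[receiver][q], piggyback[q])

    def rollback_set(failed, lost_after):
        """Every process whose state depends on an interval of `failed`
        later than `lost_after` must also roll back."""
        return [p for p in range(N) if dep[p][failed] > lost_after]

    send(0, 1); send(1, 2); send(2, 3)       # a chain of messages
    print(rollback_set(0, lost_after=0))     # rollback spreads: [0, 1, 2, 3]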
In addition to the work by Strom and Yemini, there are several other papers in
the literature describing automatic mechanisms for recovering from faults in message-
passing systems (e.g., see [KT87,SW89,JZ87]). The algorithms in these papers serve as
a starting point for our design of automatic mechanisms. By themselves, however, these
algorithms are not sufficient, primarily because they do not scale. Some of them require
messages whose size grows with the number of processors, and they may require all
processors to participate in recovering from a failure. As a result, the overhead imposed
in the absence of failures can be substantial, and the time taken to recover from a failure
can grow so large that no forward progress is made. We are working on new algorithms
that correct these problems. We expect to perform simulations and analysis to evaluate
our solutions, and to develop a better understanding of the overhead imposed by the
alternatives.

The automatic mechanisms described above require the recovery of a failed process to be coordinated with rollbacks of other processes that communicate with it. For many programs, it is possible to avoid some or all of this coordination, and thus to reduce the overall running time substantially. We are currently studying a number of different parallel algorithms to understand what is required in each case for fault-tolerance. In some cases, the programming effort required is relatively modest, and the overhead imposed is very small.
For example, consider asynchronous iterative relaxation methods, which constitute
one class of techniques for solving simultaneous equations. Unlike synchronous itera-
tive methods, asynchronous methods allow values from any prior iteration to be used
in computing the value of a grid point in a given iteration. This lack of constraints
leads to a very simple strategy for handling failures. Assume that the grid points are
divided among the processors, with each processor responsible for computing the values
of some subset of the grid points. Each processor can checkpoint the values of its grid
points periodically (e.g., to a backup processor, or to disk). When a processor fails,
its computation can be restarted on a backup processor using the latest checkpointed
values for its grid points. Other processors need not be notified of the failure or rolled
back. Thus, checkpointing and recovery can be handled entirely locally at each proces-
sor. In contrast, the automatic methods would roll back all processors that depend on
computation lost at a failed processor; the time to recover would be much longer, and
much more work would be lost.
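The following toy sketch (a simplified, sequential simulation of the strategy just described; the grid, partitioning, failure model, and checkpoint policy are all illustrative assumptions) shows how each simulated processor checkpoints and restores only its own block of grid points:

    import random

    # Toy asynchronous relaxation on a 1-D grid: solve u[i] =
    # (u[i-1] + u[i+1]) / 2 with fixed boundary values. Each "processor"
    # owns a block of interior points and checkpoints only its own block.

    POINTS, PROCS, CKPT_EVERY = 16, 4, 50
    u = [0.0] * POINTS
    u[0], u[-1] = 1.0, 1.0                   # boundary conditions
    block = POINTS // PROCS
    ckpt = [None] * PROCS
    rng = random.Random(1)

    def owned(p):
        return range(max(1, p * block), min(POINTS - 1, (p + 1) * block))

    for step in range(2000):
        for p in range(PROCS):
            if rng.random() < 0.001 and ckpt[p] is not None:
                for i, v in zip(owned(p), ckpt[p]):   # failure: restore
                    u[i] = v                          # only p's own block
                continue
            for i in owned(p):               # relax using whatever values
                u[i] = 0.5 * (u[i - 1] + u[i + 1])    # are currently visible
        if step % CKPT_EVERY == 0:
            for p in range(PROCS):           # periodic local checkpoint
                ckpt[p] = [u[i] for i in owned(p)]

    print(u)                                 # converges toward all ones

When a simulated processor "fails," it resumes from its own last checkpoint while the others continue undisturbed with whatever values are visible, which is exactly the freedom that asynchronous methods provide.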
We have used simulations to study the performance of different checkpointing and
recovery strategies for asynchronous iterative methods. The local strategy outlined
above results in a running time only a few percent more than the running time in the
absence of failures, even with a very low checkpoint rate. Other strategies, such as the
automatic methods discussed above, require a much higher checkpoint rate, and have
much longer running times because they roll back many processors in response to each
failure.
Asynchronous iterative methods are remarkably robust. We intend to study other
numerical programs to understand the extent to which they exhibit similar character-
istics. We also plan to study symbolic programs. While many of these programs will
probably not have the inherent robustness of asynchronous iterative methods, we believe
that many of them can still benefit from tailoring the checkpointing and recovery to the
specific applications. For example, automatic recovery methods must be prepared to
cope with nondeterministic programs, in which the order of message delivery or access
to shared memory affects the outcome of the computation. Many programs, however,
are deterministic: given the same inputs, each process will execute the same sequence
of steps, regardless of the interleaving of steps from different processes. For example,
successive over-relaxation (SOR) algorithms, multigrid algorithms, and many sorting
algorithms have this property. For deterministic programs, any computation performed
by a process is guaranteed to be valid, even if it depends on a computation of another
process that is lost in a failure. Rather than rolling back all dependent computations
after a failure, it suffices for deterministic programs to restart the failed processes;
they will then recompute exactly the same results. To do this, however, mechanisms
are required for saving appropriate states at each process, and for regenerating earlier
messages and values of shared variables when needed by a restarted process.
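The following sketch illustrates the replay idea for a deterministic process, using sender-side logging in the spirit of [JZ87]. It is a simplification: log garbage collection, shared memory, and multiple processes are omitted, and the step function is an arbitrary stand-in.

    # Recovery by replay for a deterministic process: senders log the
    # messages they sent, so after a failure the process restarts from
    # its last checkpoint and is fed the same messages in the same
    # order, recomputing exactly the same results.

    send_log = []                            # sender-side log, delivery order
    checkpoint = (0, 0)                      # (saved state, messages consumed)

    def step(state, m):
        return state * 31 + m                # any deterministic step function

    # Normal operation with one checkpoint partway through.
    state, consumed = 0, 0
    for i, m in enumerate([3, 1, 4, 1, 5, 9, 2, 6]):
        send_log.append(m)                   # sender logs each message
        state, consumed = step(state, m), consumed + 1
        if i == 3:
            checkpoint = (state, consumed)   # save state after 4 messages

    # Failure and recovery: restart from the checkpoint and replay only
    # the logged messages received after it. No other process rolls back.
    rec_state, rec_consumed = checkpoint
    for m in send_log[rec_consumed:]:
        rec_state = step(rec_state, m)
    assert rec_state == state                # identical results recomputed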
There are a variety of other paradigms for organizing parallel programs that may
be amenable to specialized fault-tolerance mechanisms. For example, in a functional
program, a result can always be recomputed if the process responsible for it appears
to be dead. No sophisticated mechanisms are required for coordinating checkpoints
and rollbacks of processes. Similarly, in a program organized as a collection of workers
sharing a task bag, where the computation associated with each task is deterministic,
it may be possible to avoid much of the overhead of automatic mechanisms.
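For instance, the sketch below simulates a bag-of-tasks program with deterministic tasks: a task held by a worker that dies is simply returned to the bag and redone, with no checkpoints of other workers and no rollback. The failure model and task function are illustrative assumptions.

    import random

    # Bag-of-tasks sketch with deterministic tasks: a task taken by a
    # worker that dies is put back in the bag and re-executed, since
    # recomputing it yields the same result.

    def task(n):                             # deterministic: same input,
        return n * n                         # same result, every time

    bag = list(range(10))                    # tasks waiting to be done
    results = {}
    rng = random.Random(0)

    while bag:
        t = bag.pop()                        # a worker takes a task
        if rng.random() < 0.2:               # the worker dies mid-task:
            bag.append(t)                    # return the task to the bag
            continue
        results[t] = task(t)                 # otherwise record the result

    assert results == {n: n * n for n in range(10)}
    print(results)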
4 Conclusion
We are developing techniques for writing fault-tolerant parallel programs. We are cur-
rently studying a variety of algorithms to understand what support is needed to make
each fault-tolerant. Based on this analysis, we will design language and system mech-
anisms to support making parallel programs fault-tolerant. We expect to provide a
range of mechanisms, from completely general automatic mechanisms to lower-level
primitives that can be used by the programmer to incorporate fault-tolerance into his
program. Because automatic mechanisms impose a high cost, it is essential to provide
other mechanisms that allow the programmer to pay only for what is necessary.
References
[JZ87] David B. Johnson and Willy Zwaenepoel. Sender-based message logging. In Proceedings of the 17th International Symposium on Fault-Tolerant Computing, pages 14-19. IEEE Computer Society, July 1987.

[KT87] Richard Koo and Sam Toueg. Checkpointing and rollback-recovery for distributed systems. IEEE Transactions on Software Engineering, SE-13(1):23-31, January 1987.

[SW89] A. Prasad Sistla and Jennifer L. Welch. Efficient distributed recovery using message logging. In Proceedings of the 8th ACM Symposium on Principles of Distributed Computing, pages 223-238. ACM, August 1989.

[SY85] R. E. Strom and S. Yemini. Optimistic recovery in distributed systems. ACM Transactions on Computer Systems, 3(3):204-226, August 1985.