
Fault Tolerant Computing

Fault-tolerant computing aims to build systems that can continue operating despite faults. It incorporates redundancy through additional hardware, software, data, or design diversity. Common approaches to hardware fault tolerance include fault masking using triple modular redundancy and dynamic recovery using spare components. Software faults can be tolerated using acceptance tests and redundant code blocks. Information redundancy adds check bits for error detection and correction. Fault-tolerant systems are validated through modeling and fault simulations. They find applications in safety-critical systems like medical devices and airplanes to avoid failures.

Uploaded by

Mayowa Sunusi

Fault-Tolerant Computing

INTRODUCTION
Fault-tolerant computing is the art and science of building computing systems that continue to
operate satisfactorily in the presence of faults. Fault tolerance is the ability of a system to continue
performing its intended function in spite of faults.

A fault-tolerant system may be able to tolerate one or more fault-types including:

i) Transient, intermittent or permanent hardware faults
ii) Software and hardware design errors
iii) Operator errors, or
iv) Externally induced upsets or physical damage.

An extensive methodology has been developed in this field over the past thirty years, and a number
of fault-tolerant machines have been developed, mostly dealing with random hardware faults, while
a smaller number deal with software, design and operator faults to varying degrees. A large amount
of supporting research has been reported.

Fault tolerance is associated with reliability, with successful operation, and with the absence of
breakdowns. A fault-tolerant system should be able to handle faults in individual hardware or
software components, power failures or other kinds of unexpected disasters and still meet its
specification.

Fault tolerance is needed because it is practically impossible to build a perfect system. The
fundamental problem is that, as the complexity of a system increases, its reliability drastically
deteriorates, unless compensatory measures are taken.
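
The effect of complexity on reliability can be illustrated with a small calculation. The sketch below assumes a simple series model that the text does not state explicitly: every component must work for the system to work, and components fail independently. The function name and the sample figures are illustrative only.

```python
def series_reliability(component_reliability: float, n_components: int) -> float:
    """Reliability of a system that works only if all of its components work.

    Assumes independent failures, so the individual reliabilities multiply.
    """
    if not 0.0 <= component_reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return component_reliability ** n_components


if __name__ == "__main__":
    # Each component is 99.9% reliable; only the component count changes.
    for n in (10, 100, 1000):
        print(n, "components:", series_reliability(0.999, n))
```

Under these assumptions, a system of 1000 components that are each 99.9% reliable has an overall reliability of only about 0.37, which is why compensatory redundancy becomes necessary as systems grow.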

Although designers do their best to have all the hardware defects and software bugs cleaned out of
a system before it goes on the market, history shows that such a goal is not attainable. It is
inevitable that some unexpected environmental factor is not taken into account, or some potential
user mistakes are not foreseen. Thus, even in the unlikely case that a system is designed and
implemented perfectly, faults are likely to be caused by situations out of the control of the
designers.

FAULT TOLERANCE AND REDUNDANCY


There are various approaches to achieve fault-tolerance. Common to all these approaches is a
certain amount of redundancy. Redundancy is the provision of functional capabilities that would be
unnecessary in a fault free environment. This can be a replicated hardware component, an
additional check bit attached to a string of digital data, or a few lines of program code verifying the
correctness of the program’s results.

The idea of incorporating redundancy in order to improve the reliability of a system was pioneered by
John von Neumann in the early 1950s in his work “Probabilistic Logics and the Synthesis of Reliable
Organisms from Unreliable Components”.

Two kinds of redundancy are possible: space redundancy and time redundancy.

Space redundancy provides additional components, functions, or data items that are unnecessary
for fault-free operation. Space redundancy is further classified into hardware, software and
information redundancy, depending on the type of redundant resources added to the system.

In time redundancy, the computation or data transmission is repeated and the result is compared to
a stored copy of the previous result.
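
Time redundancy can be sketched in a few lines. The example below is a minimal illustration, not a prescribed method: the function name, the retry limit, and the rule "accept once two consecutive runs agree" are assumptions made for the sketch.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_time_redundancy(compute: Callable[[], T], max_attempts: int = 3) -> T:
    """Repeat a computation and accept a result once two consecutive runs agree."""
    if max_attempts < 2:
        raise ValueError("at least two attempts are needed for a comparison")
    previous = compute()                 # stored copy of the previous result
    for _ in range(max_attempts - 1):
        current = compute()              # repeat the same computation
        if current == previous:
            return current               # two runs agree: accept the result
        previous = current               # disagreement: keep the newer copy and retry
    raise RuntimeError("results never agreed; the fault may be permanent")


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_sum() -> int:
        """Sums 0..9, but a simulated transient upset corrupts the first run only."""
        calls["n"] += 1
        total = sum(range(10))
        return total + 1 if calls["n"] == 1 else total

    print(run_with_time_redundancy(flaky_sum))
```

Because the same hardware is reused, this approach costs time rather than extra components, and it can only expose faults that do not recur identically on every run.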

Hardware Fault-Tolerance
The majority of fault-tolerant designs have been directed toward building computers that
automatically recover from random faults occurring in hardware components. Hardware
redundancy is provided by incorporating extra hardware into the design to either detect or override
the effects of a failed component. The techniques employed to do this generally involve partitioning
a computing system into modules that act as fault containment regions. For example, instead of
having a single processor, we can use two or three processors, each performing the same function.
Each module is backed up with protective redundancy so that, if the module fails, others can assume
its function. Special mechanisms are added to detect errors and implement recovery.

Two general approaches to hardware fault recovery have been used: fault masking and dynamic
recovery.

Fault masking
Fault masking is a structural redundancy technique that completely masks faults within a set of
redundant modules. A number of identical modules execute the same functions, and their outputs
are voted to remove errors created by a faulty module. For example, instead of having a single
processor, we can use two or three processors, each performing the same function. By having two
processors, we can detect the failure of a single processor; by having three, we can use the majority
output to override the wrong output of a single faulty processor.

Triple modular redundancy (TMR) is a commonly used form of fault masking in which the circuitry is
triplicated and voted. The voting circuitry can also be triplicated so that individual voter failures
are corrected by the voting process as well. A TMR system fails whenever two modules in a redundant
triplet produce errors, because the vote is then no longer valid.
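
The voting step in TMR can be sketched in software. This is a minimal illustration that treats each module as a Python function; real TMR votes in hardware, and the names `majority_vote` and `tmr` are hypothetical.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of the modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: too many modules disagree")
    return value


def tmr(modules: Sequence[Callable[[int], int]], x: int) -> int:
    """Run the same input through three redundant modules and vote on the outputs."""
    if len(modules) != 3:
        raise ValueError("TMR needs exactly three modules")
    return majority_vote([module(x) for module in modules])


if __name__ == "__main__":
    good = lambda x: x * x
    faulty = lambda x: x * x + 1          # simulated faulty module
    print(tmr([good, good, faulty], 4))   # the single wrong output is outvoted
```

Note that the sketch only reports a failure when the faulty modules disagree with each other. If two modules fail in the same way, they outvote the healthy one, which is the failure case described above.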

Hybrid redundancy is an extension of TMR in which the triplicate modules are backed up with
additional spares, which are used to replace faulty modules, allowing more faults to be tolerated.
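
A rough software sketch of hybrid redundancy follows. It assumes, for illustration only, that a module is considered faulty as soon as it disagrees with the majority and that it is then swapped for the next spare. The class name `HybridTMR` and its interface are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence

Module = Callable[[int], int]


class HybridTMR:
    """Three voted modules plus a pool of spares that replace outvoted modules."""

    def __init__(self, modules: Sequence[Module], spares: Sequence[Module]) -> None:
        if len(modules) != 3:
            raise ValueError("exactly three active modules are required")
        self.active: List[Module] = list(modules)
        self.spares: List[Module] = list(spares)

    def run(self, x: int) -> int:
        outputs = [module(x) for module in self.active]
        value, count = Counter(outputs).most_common(1)[0]
        if count < 2:
            raise RuntimeError("no majority among the active modules")
        # Any module that disagreed with the vote is replaced while spares remain.
        for index, output in enumerate(outputs):
            if output != value and self.spares:
                self.active[index] = self.spares.pop(0)
        return value


if __name__ == "__main__":
    good = lambda x: x + 1
    bad = lambda x: -1                     # simulated permanent fault
    system = HybridTMR([good, bad, good], spares=[good])
    print(system.run(1), "spares left:", len(system.spares))
```

After the first vote the faulty module has been replaced, so the triplet is again able to mask one further fault. A plain TMR arrangement would still be carrying the first faulty module at that point.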

Voted systems require more than three times as much hardware as non-redundant systems, but
they have the advantage that computations can continue without interruption when a fault occurs,
allowing existing operating systems to be used.

Dynamic recovery
In the case of dynamic recovery, spare components are activated upon the failure of a currently
active component. Special mechanisms are required to detect faults in the modules, switch out a
faulty module, switch in a spare, and instigate those software actions (rollback, initialization, retry,
restart) necessary to restore and continue the computation.
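
The detect, switch and rollback sequence can be sketched as follows. The sketch assumes that each module has some built-in fault detector, modeled here as a hypothetical `ModuleFault` exception, and that the state is a simple immutable value so a checkpoint is just a saved copy. None of these names come from the text.

```python
from typing import Callable, Iterable, List, Sequence

Module = Callable[[int, int], int]   # (state, input item) -> new state


class ModuleFault(Exception):
    """Raised by a module's (assumed) built-in fault detector."""


def run_with_standby(modules: Sequence[Module], items: Iterable[int], initial_state: int) -> int:
    """Process items on the active module, switching to a spare when it fails."""
    pool: List[Module] = list(modules)    # pool[0] is active, the rest are spares
    state = initial_state
    for item in items:
        checkpoint = state                # rollback point saved before each step
        while True:
            if not pool:
                raise RuntimeError("all modules and spares are exhausted")
            try:
                state = pool[0](checkpoint, item)   # (re)try from the checkpoint
                break
            except ModuleFault:
                pool.pop(0)               # switch out the faulty module, switch in a spare
    return state


def make_failing(after: int) -> Module:
    """Build a module that works for `after` calls and then fails permanently."""
    calls = {"n": 0}

    def module(state: int, item: int) -> int:
        calls["n"] += 1
        if calls["n"] > after:
            raise ModuleFault("module stopped responding")
        return state + item

    return module


if __name__ == "__main__":
    healthy: Module = lambda state, item: state + item
    print(run_with_standby([make_failing(2), healthy], [1, 2, 3, 4], 0))
```

The retry loop is where the computational delay arises: the failed step is repeated on the spare before the computation can continue.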

In single computers, special hardware is required along with software to do this, while in
multicomputers, the function is often managed by the other processors.

Dynamic recovery is generally more hardware-efficient than voted systems, and it is therefore the
approach of choice in resource-constrained (e.g. low-power) systems. Its disadvantage is that
computational delays occur during fault recovery.

Software Fault-Tolerance
Efforts to attain software that can tolerate software design faults (programming errors) have made
use of static and dynamic redundancy approaches similar to those used for hardware faults.
Programs are partitioned into blocks and acceptance tests are executed after each block. If an
acceptance test fails, a redundant code block is executed.
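
The scheme of blocks guarded by acceptance tests can be sketched as below. The sorting task, the deliberately buggy primary block, and all function names are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Callable, List, Sequence


def recovery_block(
    alternates: Sequence[Callable[[List[int]], List[int]]],
    acceptance_test: Callable[[List[int], List[int]], bool],
    data: List[int],
) -> List[int]:
    """Try each alternate in turn until one passes the acceptance test."""
    for alternate in alternates:
        try:
            result = alternate(list(data))   # each alternate works on its own copy
        except Exception:
            continue                         # a crash counts as a failed block
        if acceptance_test(data, result):
            return result
    raise RuntimeError("every alternate failed the acceptance test")


def is_sorted_permutation(data: List[int], result: List[int]) -> bool:
    """Acceptance test: output is ordered and holds exactly the input items."""
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return ordered and Counter(result) == Counter(data)


def buggy_sort(data: List[int]) -> List[int]:
    """Primary block with a design fault: it silently drops duplicates."""
    return sorted(set(data))


def insertion_sort(data: List[int]) -> List[int]:
    """Redundant block: an independently written, simpler implementation."""
    result: List[int] = []
    for item in data:
        position = len(result)
        while position > 0 and result[position - 1] > item:
            position -= 1
        result.insert(position, item)
    return result


if __name__ == "__main__":
    print(recovery_block([buggy_sort, insertion_sort], is_sorted_permutation, [3, 1, 3, 2]))
```

The quality of the acceptance test bounds what this technique can catch: a fault that produces a plausible but wrong result passes undetected.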

Hardware and Software Design Fault Tolerance


To tolerate design faults of both hardware and software, an approach called design diversity
combines hardware and software fault-tolerance by implementing a fault-tolerant computer system
using different hardware and software in redundant channels.

Each channel is designed to provide the same function, and a method is provided to identify if one
channel deviates unacceptably from the others. This is a very expensive technique, but it is used in
very critical aircraft control applications.
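
One way to identify a deviating channel is sketched below. The text does not say which comparison method is used, so the median-with-tolerance rule, the square-root task, and all names here are assumptions for illustration.

```python
import math
from statistics import median
from typing import List, Sequence, Tuple


def sqrt_newton(x: float, iterations: int = 20) -> float:
    """Independently designed channel: Newton's iteration for the square root (x > 0)."""
    guess = x if x > 1 else 1.0
    for _ in range(iterations):
        guess = 0.5 * (guess + x / guess)
    return guess


def sqrt_power(x: float) -> float:
    """Another independently designed channel, using exponentiation."""
    return x ** 0.5


def compare_channels(outputs: Sequence[float], tolerance: float) -> Tuple[float, List[int]]:
    """Return the median output and the indices of channels that deviate from it."""
    middle = median(outputs)
    deviating = [i for i, value in enumerate(outputs) if abs(value - middle) > tolerance]
    return middle, deviating


if __name__ == "__main__":
    channels = [math.sqrt, sqrt_newton, sqrt_power]
    outputs = [channel(2.0) for channel in channels]
    print(compare_channels(outputs, tolerance=1e-9))
```

A tolerance is needed because diverse implementations rarely agree bit for bit, so exact comparison would flag healthy channels as faulty.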

Information Redundancy
The best-known form of information redundancy is error detection and correction coding. Here,
extra bits (called check bits) are added to the original data bits so that an error in the data bits can
be detected or even corrected. The resulting error-detecting and error-correcting codes are widely
used today in memory units and various storage devices to protect against benign failures. Note that
these error codes (like any other form of information redundancy) require extra hardware to process
the redundant data (the check bits).
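
Check bits can be illustrated with the Hamming (7,4) code, which adds three check bits to four data bits and corrects any single-bit error. The text does not name a specific code, so this choice is an assumption, and the sketch is illustrative rather than a production implementation.

```python
from typing import List, Tuple


def hamming_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits as a 7-bit codeword: positions 1, 2 and 4 hold check bits."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4        # covers codeword positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4        # covers codeword positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4        # covers codeword positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming_decode(codeword: List[int]) -> Tuple[List[int], int]:
    """Return (data bits, error position); position 0 means no error was detected."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]      # parity over positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]      # parity over positions 2, 3, 6, 7
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]      # parity over positions 4, 5, 6, 7
    position = s1 + 2 * s2 + 4 * s4     # the syndrome points at the flipped bit
    if position:
        c[position - 1] ^= 1            # correct the single-bit error
    return [c[2], c[4], c[5], c[6]], position


if __name__ == "__main__":
    word = hamming_encode([1, 0, 1, 1])
    word[4] ^= 1                        # simulate a single-bit memory fault
    print(hamming_decode(word))
```

The three syndrome computations correspond to the extra processing hardware mentioned above: the check bits are useless without logic that recomputes and compares them.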

Error-detecting and error-correcting codes are also used to protect data communicated over noisy
channels, which are channels that are subject to many transient failures. These channels can be
either the communication links among widely separated processors (e.g., the Internet) or among
locally connected processors that form a local network. If the code used for data communication is
capable of only detecting the faults that have occurred (but not correcting them), we can retransmit
as necessary, thus employing time redundancy.
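
Detection followed by retransmission can be sketched with a CRC-32 checksum from Python's standard `zlib` module. The framing format, the retry limit, and the simulated noisy channel are assumptions made for the sketch.

```python
import zlib
from typing import Callable, Optional, Tuple

Channel = Callable[[bytes], bytes]


def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 (the check bits) to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame: bytes) -> Optional[bytes]:
    """Return the payload if the CRC matches, otherwise None (error detected)."""
    payload, received_crc = frame[:-4], frame[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != received_crc:
        return None
    return payload


def send_with_retransmit(payload: bytes, channel: Channel, max_tries: int = 5) -> Tuple[bytes, int]:
    """Resend the frame until it arrives intact; returns (payload, attempts used)."""
    for attempt in range(1, max_tries + 1):
        received = check_frame(channel(make_frame(payload)))
        if received is not None:
            return received, attempt
    raise RuntimeError("channel too noisy: retry limit reached")


def make_noisy_channel(corrupt_first: int) -> Channel:
    """Simulated channel that flips one bit in the first `corrupt_first` frames."""
    sent = {"n": 0}

    def channel(frame: bytes) -> bytes:
        sent["n"] += 1
        if sent["n"] <= corrupt_first:
            damaged = bytearray(frame)
            damaged[0] ^= 0x01          # transient single-bit error
            return bytes(damaged)
        return frame

    return channel


if __name__ == "__main__":
    print(send_with_retransmit(b"hello", make_noisy_channel(2)))
```

The code here only detects errors, so the repeated transmissions supply the correction, which is the time redundancy described above.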

VALIDATION OF FAULT-TOLERANCE
One of the most difficult tasks in the design of a fault-tolerant machine is to verify that it will meet
its reliability requirements. This requires creating a number of models. The first model is of the
error/fault environment that is expected.

Other models specify the structure and behavior of the design. It is then necessary to determine
how well the fault tolerance mechanisms work by analytic studies and fault simulations.
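
As a small example of checking an analytic study against a fault simulation, the sketch below evaluates a TMR triplet in two ways. It assumes a perfect voter, independent module failures, and that the system fails whenever fewer than two modules work. Under those assumptions the closed form is 3R^2 - 2R^3, where R is the reliability of one module. The "simulation" here is an exhaustive enumeration of all fault patterns rather than random fault injection.

```python
from itertools import product


def tmr_reliability_analytic(r: float) -> float:
    """Closed-form TMR reliability for module reliability r (perfect voter assumed)."""
    return 3 * r ** 2 - 2 * r ** 3


def tmr_reliability_enumerated(r: float) -> float:
    """Sum the probability of every fault pattern in which at least two modules work."""
    total = 0.0
    for pattern in product([True, False], repeat=3):    # True means the module works
        probability = 1.0
        for works in pattern:
            probability *= r if works else (1.0 - r)
        if sum(pattern) >= 2:                            # the vote is still valid
            total += probability
    return total


if __name__ == "__main__":
    for r in (0.4, 0.9, 0.99):
        print(r, tmr_reliability_analytic(r), tmr_reliability_enumerated(r))
```

Under this model, TMR is more reliable than a single module only when the module reliability exceeds 0.5; below that, triplication makes the system worse.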

FOUR ASPECTS TO FAULT TOLERANCE

APPLICATIONS OF FAULT-TOLERANCE
Following the development of semiconductor technology, hardware components became
intrinsically more reliable and the need to tolerate component defects diminished in
general-purpose applications.

Nevertheless, fault tolerance remained necessary in many safety-, mission- and business-critical
applications.

1. Safety-critical applications are those where loss of life or environmental disaster must be
avoided. Examples are nuclear power plant control systems, computer-controlled radiation
therapy machines, heart pacemakers and military radar systems.

2. Mission-critical applications stress mission completion, as in the case of an airplane or a
spacecraft.

3. Business-critical applications are those in which keeping a business operating is essential.
Examples are bank and stock exchange automated trading systems, web servers and e-commerce.
