Fault-Tolerant Systems
SECOND EDITION
Israel Koren
C. Mani Krishna
Table of Contents
Cover image
Title page
Copyright
Preface to the Second Edition
Acknowledgments
Chapter 1: Preliminaries
1.1. Fault Classification
1.2. Types of Redundancy
1.3. Basic Measures of Fault Tolerance
1.4. Outline of This Book
1.5. Further Reading
References
Chapter 2: Hardware Fault Tolerance
2.1. The Rate of Hardware Failure
2.2. Failure Rate, Reliability, and Mean Time to Failure
2.3. Hardware Failure Mechanisms
2.4. Common-Mode Failures
2.5. Canonical and Resilient Structures
2.6. Other Reliability Evaluation Techniques
2.7. Fault-Tolerance Processor-Level Techniques
2.8. Timing Fault Tolerance
2.9. Tolerance of Byzantine Failures
2.10. Further Reading
2.11. Exercises
References
Chapter 3: Information Redundancy
3.1. Coding
3.2. Resilient Disk Systems
3.3. Data Replication
3.4. Algorithm-Based Fault Tolerance
3.5. Further Reading
3.6. Exercises
References
Chapter 4: Fault-Tolerant Networks
4.1. Measures of Resilience
4.2. Common Network Topologies and Their Resilience
4.3. Fault-Tolerant Routing
4.4. Networks on a Chip
4.5. Wireless Sensor Networks
4.6. Further Reading
4.7. Exercises
References
Chapter 5: Software Fault Tolerance
5.1. Acceptance Tests
5.2. Single-Version Fault Tolerance
5.3. N-Version Programming
5.4. Recovery Block Approach
5.5. Preconditions, Postconditions, and Assertions
5.6. Exception Handling
5.7. Software Reliability Models
5.8. Fault-Tolerant Remote Procedure Calls
5.9. Further Reading
5.10. Exercises
References
Chapter 6: Checkpointing
6.1. What Is Checkpointing?
6.2. Checkpoint Level
6.3. Optimal Checkpointing: an Analytical Model
6.4. Cache-Aided Rollback Error Recovery (CARER)
6.5. Checkpointing in Distributed Systems
6.6. Checkpointing in Shared-Memory Systems
6.7. Checkpointing in Real-Time Systems
6.8. Checkpointing While Using Cloud Computing Utilities
6.9. Emerging Challenges: Petascale and Exascale Computing
6.10. Other Uses of Checkpointing
6.11. Further Reading
6.12. Exercises
References
Chapter 7: Cyber-Physical Systems
7.1. Structure of a Cyber-Physical System
7.2. The Controlled Plant State Space
7.3. Sensors
7.4. The Cyber Platform
7.5. Actuators
7.6. Further Reading
7.7. Exercises
References
Chapter 8: Case Studies
8.1. Aerospace Systems
8.2. NonStop Systems
8.3. Stratus Systems
8.4. Cassini Command and Data Subsystem
8.5. IBM POWER8
8.6. IBM G5
8.7. IBM Sysplex
8.8. Intel Servers
8.9. Oracle SPARC M8 Server
8.10. Cloud Computing
8.11. Further Reading
References
Chapter 9: Simulation Techniques
9.1. Writing a Simulation Program
9.2. Parameter Estimation
9.3. Variance Reduction Methods
9.4. Splitting
9.5. Random Number Generation
9.6. Fault Injection
9.7. Further Reading
9.8. Exercises
References
Chapter 10: Defect Tolerance in VLSI Circuits
10.1. Manufacturing Defects and Circuit Faults
10.2. Probability of Failure and Critical Area
10.3. Basic Yield Models
10.4. Yield Enhancement Through Redundancy
10.5. Further Reading
10.6. Exercises
References
Chapter 11: Fault Detection in Cryptographic Systems
Abstract
11.1. Overview of Ciphers
11.2. Security Attacks Through Fault Injection
11.3. Countermeasures
11.4. Further Reading
11.5. Exercises
References
Index
Copyright
Morgan Kaufmann is an imprint of Elsevier
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
Copyright © 2021 Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any
form or by any means, electronic or mechanical, including
photocopying, recording, or any information storage and retrieval
system, without permission in writing from the publisher. Details on
how to seek permission, further information about the Publisher's
permissions policies and our arrangements with organizations such
as the Copyright Clearance Center and the Copyright Licensing
Agency, can be found at our website:
www.elsevier.com/permissions.
This book and the individual contributions contained in it are
protected under copyright by the Publisher (other than as may be
noted herein).
Notices
Knowledge and best practice in this field are constantly changing.
As new research and experience broaden our understanding,
changes in research methods, professional practices, or medical
treatment may become necessary.
Practitioners and researchers must always rely on their own
experience and knowledge in evaluating and using any
information, methods, compounds, or experiments described
herein. In using such information or methods they should be
mindful of their own safety and the safety of others, including
parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the
authors, contributors, or editors, assume any liability for any injury
and/or damage to persons or property as a matter of products
liability, negligence or otherwise, or from any use or operation of
any methods, products, instructions, or ideas contained in the
material herein.
Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of
Congress
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library
ISBN: 978-0-12-818105-8
For information on all Morgan Kaufmann publications visit our
website at https://www.elsevier.com/books-and-journals
Publisher: Katey Birtcher
Acquisitions Editor: Steve Merken
Editorial Project Manager: Andrae Akeh
Publishing Services Manager: Shereen Jameel
Production Project Manager: Kamatchi Madhavan
Designer: Patrick Ferguson
Cover illustration: Yaron Koren
Typeset by VTeX
Printed in the United States of America
Last digit is the print number: 9 8 7 6 5 4 3 2 1
Preface to the Second Edition
In this second edition of Fault-Tolerant Systems, we have retained
the original structure of the book and added material to most
chapters. References have been updated to reflect recent advances in
the field.
Our principal additions are as follows:
• Chapter 2: A discussion of the principal physical causes of
hardware failure.
• Chapter 3: Coverage of low-density parity coding,
hierarchical RAID, as well as a discussion of RAID with
solid-state devices.
• Chapter 4: Fat-tree topologies, networks-on-a-chip, and
wireless sensor networks.
• Chapter 5: We now cover the rejuvenation of hypervisor-
based systems and introduce the Ostrand–Weyuker–Bell
software fault model.
• Chapter 6: Material on checkpointing with cloud computing
utilities and on checkpointing in petascale and exascale
computing has been added.
• Chapter 7: This is a new chapter on the increasingly
prominent field of cyber-physical systems.
• Chapter 8: Among the additions are sections on aerospace
systems, IBM's POWER8 multicore, Intel's Xeon, Oracle and
NEC servers and cloud computing.
• Chapter 9: The splitting approach to simulation has been
added.
Acknowledgments
Zahava Koren read through the text and made several valuable
suggestions. We would also like to thank the staff at Morgan
Kaufmann for their efforts on behalf of this project. We also thank the
funding agencies that have supported our work over the years; in
particular, the material in Chapter 7 is partially based on work
supported by the National Science Foundation under CNS-1717262.
Chapter 1: Preliminaries
The past 50 years have seen computers move from being expensive
computational engines used by government and big corporations to
becoming an everyday commodity, deeply embedded in practically
every aspect of our lives. Not only are computers visible
everywhere, in desktops, laptops, and smartphones; it is also a
commonplace that they are invisible everywhere, as vital components
of cars, home appliances, medical equipment, aircraft, industrial
plants, and power generation and distribution systems. Computer
systems underpin most of the world's financial systems: current
transaction volumes, trading in the stock, bond, and currency
markets would be unthinkable without them. Our increasing
willingness, as a society, to place computers in life-critical and
wealth-critical applications is largely driven by the increasing
possibilities that computers offer. And yet, as we depend more and
more on computers to carry out all of these vital actions, we are—
implicitly or explicitly—gambling our lives and property on
computers doing their job properly.
Computers (hardware plus software) are quite likely the most
complex systems ever created by human beings. The complexity of
computer hardware is still increasing as designers attempt to exploit
the higher transistor density that new generations of technology
make available to them. Computer software is far more complex still,
and with that complexity comes an increased propensity to failure. It
is probably fair to say that there is not a single large piece of
software or hardware today that is free of bugs. Even the space
shuttle, with software that was developed and tested using some of
the best and most advanced techniques known to engineering at the
time, is now known to have flown with potentially serious bugs.
Computer scientists and engineers have responded to the
challenge of designing complex systems with a variety of tools and
techniques to reduce the number of faults in the systems they build.
However, that is not enough: we need to build systems that will
acknowledge the existence of faults as a fact of life, and incorporate
techniques to tolerate these faults while still delivering an acceptable
level of service. The resulting field of fault-tolerant computing is the
subject of this book.
1.1 Fault Classification
In everyday language, the terms fault, failure, and error are used
interchangeably. In fault-tolerant computing parlance, however, they
have distinctive meanings. A fault (or failure) can be either a
hardware defect or a software/programming mistake (bug). In
contrast, an error is a manifestation of the fault/failure/bug.
As an example, consider an adder circuit, with one output line
stuck at 1; it always carries the value 1 independently of the values
of the input operands. This is a fault, but not (yet) an error. This fault
causes an error when the adder is used, and the result on that line is
supposed to have been a 0, rather than a 1. A similar distinction
exists between programming mistakes and execution errors.
Consider, for example, a subroutine that is supposed to compute
sin(x), but owing to a programming mistake calculates the absolute
value of sin(x) instead. This mistake will result in an execution error
only if that particular subroutine is used and the correct result is
negative.
Both faults and errors can spread through the system. For
example, if a chip shorts out power to ground, it may cause nearby
chips to fail as well. Errors can spread when the output of one unit is
used as input by other units. To return to our previous examples, the
erroneous results of either the faulty adder or the sin(x) subroutine
can be fed into further calculations, thus propagating the error.
To limit such contagion, designers incorporate containment zones
into systems. These are barriers that reduce the chance that a fault or
error in one zone will propagate to another. For example, a fault-
containment zone can be created by ensuring that the maximum
possible voltage swings in one zone are insulated from the other
zones, and by providing an independent power supply to each zone.
In other words, the designer tries to electrically isolate one zone
from another. An error-containment zone can be created, as we will
see in some detail later on, by using redundant units/programs and
voting on their output.
Faults can be classified along multiple dimensions. Some
important dimensions are: their duration, when they were
introduced, whether or not there was conscious intent behind their
introduction, and whether they occurred in hardware or in software.
Let us look at each of these dimensions in turn.
Duration: Duration is an important classification dimension for
hardware faults. These can be classified into permanent, transient, or
intermittent. A permanent fault is just that: it reflects a component
going out of commission permanently. As an example of a
permanent fault, think of a burned-out lightbulb. A transient fault is
one that causes a component to malfunction for some time; it goes
away after that time, and the functionality of the component is fully
restored. As an example, think of a random noise interference
during a telephone conversation. Another example is a memory cell
with contents that are changed spuriously due to some
electromagnetic interference. The cell itself is undamaged: it is just
that its contents are wrong for the time being, and overwriting the
memory cell will make the fault go away. An intermittent fault never
quite goes away entirely; it oscillates between being quiescent and
active. When the fault is quiescent, the component functions
normally; when the fault is active, the component malfunctions. An
example of an intermittent fault is a loose electrical connection.
When they were introduced: Faults can be introduced in various
phases of the system's lifetime. Faulty design decisions cause faults
to be introduced in the design phase. They can be born during
system implementation (e.g., software development). They can occur
during system operation due to hardware degradation, faulty
software updates, or harsh environments (e.g., due to high levels of
radiation or excessive temperatures).
Intent: Faults may be intentional or unintentional. Most software
bugs are unintentional faults. For example, consider the Fortran
instruction doi=1.35 when the programmer meant to type do i=1,35.
The programmer was trying to set up a loop; what the system saw
was an instruction to assign the value of 1.35 to the variable doi.
They may be intentional: consciously undertaken design decisions
may lead to faults in the system design. Such faults can be further
subclassified into nonmalicious and malicious categories. Nonmalicious
design faults are introduced with the best of intentions; often
because of side-effects (often in the way that various modules of the
system interact) that were not foreseen at design time, or because the
operating environment was imperfectly understood. Malicious faults
are introduced with malicious intent: for instance, a programmer
may deliberately insert a weak point in software that permits
unauthorized access to some data structure. We also include in this
subcategory faults that may not be deliberately introduced, but
that behave as if they were the product of nefarious intent. For
example, a component may fail in such a way that it “acts as if
malicious” and sends differently valued outputs to different
receivers. Think of an altitude sensor in an airplane that reports a
1000-foot altitude to one unit, while reporting an 8000-foot altitude
to another unit. Malicious faults are also known as Byzantine faults.
1.2 Types of Redundancy
All of fault tolerance is an exercise in exploiting and managing
redundancy. Redundancy is the property of having more of a
resource than is minimally necessary to do the job at hand. As
failures happen, redundancy is exploited to mask or otherwise work
around these failures, thus maintaining the desired level of
functionality.
There are four forms of redundancy that we will study: hardware,
software, information, and time. Hardware faults are usually dealt
with by using hardware, information, or time redundancy, whereas
software faults (bugs) are mostly protected against by software
redundancy.
Hardware redundancy is provided by incorporating extra
hardware into the design to either detect or override the effects of a
failed component. For example, instead of having a single processor,
we can use two or three processors, each performing the same
function. By having two processors and comparing their results, we
can detect the failure of a single processor; by having three, we can
use the majority output to override the wrong output of a single
faulty processor. This is an example of static hardware redundancy, the
main objective of which is the immediate masking of a failure. A
different form of hardware redundancy is dynamic redundancy, where
spare components are activated upon the failure of a currently active
component. A combination of static and dynamic redundancy
techniques is also possible, leading to hybrid hardware redundancy.
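As a minimal sketch of static redundancy (illustrative Python, not code from the book), a triplicated computation can be voted on as follows; a single wrong output is masked by the majority:

def majority_vote(outputs):
    # Return any value reported by a majority (at least two of three replicas),
    # or None if no majority exists.
    for candidate in outputs:
        if outputs.count(candidate) >= 2:
            return candidate
    return None

# One of the three replicated processors produces a wrong result:
replica_outputs = [42, 41, 42]
print(majority_vote(replica_outputs))   # 42 -- the faulty output is overridden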
Hardware redundancy can thus range from a simple duplication
to complicated structures that switch in spare units when active ones
become faulty. These forms of hardware redundancy incur high
overheads, and their use is therefore normally reserved for critical
systems, where such overheads can be justified. In particular,
substantial amounts of redundancy are required to protect against
malicious faults.
The best-known form of information redundancy is error
detection and correction coding. Here, extra bits (called check bits)
are added to the original data bits so that an error in the data bits can
be detected or even corrected. The resulting error-detecting and
error-correcting codes are widely used today in memory units and
various storage devices to protect against benign failures. Note that
these error codes (like any other form of information redundancy)
may require extra hardware to process the redundant data (the
check bits).
Error-detecting and error-correcting codes are also used to protect
data communicated over noisy channels, which are channels that are
subject to many transient failures. These channels can be the
communication links among either widely separated processors
(e.g., the Internet) or processors that form a local network. If the code
used for data communication is capable of only detecting the faults
that have occurred (but not correcting them), we can retransmit as
necessary, thus employing time redundancy.
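As a concrete (and deliberately minimal) illustration of check bits, the following Python sketch uses a single even-parity bit, the simplest error-detecting code; it can detect, but not correct, any single-bit error, so the receiver would request a retransmission:

def add_parity(data_bits):
    # Append one check bit so that the total number of 1s is even.
    return data_bits + [sum(data_bits) % 2]

def parity_ok(word):
    # An even number of 1s means no (single-bit) error was detected.
    return sum(word) % 2 == 0

word = add_parity([1, 0, 1, 1])   # -> [1, 0, 1, 1, 1]
print(parity_ok(word))            # True: no error detected
word[2] ^= 1                      # a transient fault flips one bit
print(parity_ok(word))            # False: error detected, retransmit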
In addition to transient data communication failures due to noise,
local and wide-area networks may experience permanent link
failures. These failures may disconnect one or more existing
communication paths, resulting in a longer communication delay
between certain nodes in the network, a lower data bandwidth
between certain node pairs, or even a complete disconnection of
certain nodes from the rest of the network. Redundant
communication links (i.e., hardware redundancy) can alleviate these
problems.
Computing nodes can also exploit time redundancy through
reexecution of the same program on the same hardware. As before,
time redundancy is effective against transient faults. Because the
majority of hardware faults are transient, it is unlikely that the
separate executions (if spaced sufficiently apart) will experience the
same fault. Time redundancy can thus be used to detect transient
faults in situations, where such faults may otherwise go undetected.
Time redundancy can also be used when other means for detecting
errors are in place, and the system is capable of recovering from the
effects of the fault and repeating the computation. Compared with
the other forms of redundancy, time redundancy has much lower
hardware and software overhead, but incurs a high performance
penalty.
Software redundancy is used mainly against software failures. It is
a reasonable guess that every large piece of software that has ever
been produced has contained faults (bugs). Dealing with such faults
can be expensive: one way is to independently produce two or more
versions of that software (preferably by disjoint teams of
programmers) in the hope that the different versions will not fail on
the same input. The secondary version(s) can be based on simpler
and less accurate algorithms (and, consequently, less likely to have
faults) to be used only upon the failure of the primary software to
produce acceptable results. Just as for hardware redundancy, the
multiple versions of the program can be executed either
concurrently (requiring redundant hardware as well) or sequentially
(requiring extra time, i.e., time redundancy) upon a failure detection.
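The sequential variant can be sketched in the style of a recovery block (an illustrative Python example; the function names and the square-root versions are hypothetical, not taken from the text): the primary version runs first, its result is checked by an acceptance test, and the simpler secondary version is invoked only if the test fails.

def run_with_fallback(x, primary, secondary, acceptable):
    # Try the primary version; fall back to the secondary one if the
    # primary raises an exception or its result fails the acceptance test.
    try:
        result = primary(x)
        if acceptable(x, result):
            return result
    except Exception:
        pass
    return secondary(x)

def bisection_sqrt(x):
    # Simpler (and slower) alternative version, used only on fallback.
    lo, hi = 0.0, max(1.0, x)
    for _ in range(60):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if mid * mid < x else (lo, mid)
    return lo

fast_sqrt = lambda x: x ** 0.5
acceptable = lambda x, r: abs(r * r - x) < 1e-6
print(run_with_fallback(2.0, fast_sqrt, bisection_sqrt, acceptable))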
1.3 Basic Measures of Fault Tolerance
Since fault tolerance is about making machines more dependable, it
is important to have proper measures (yardsticks) by which to gauge
such dependability. In this section, we will examine some of these
yardsticks and their application.
A measure is a mathematical abstraction that expresses some
relevant facet of the performance of its object. By its very nature, a
measure only captures some subset of the properties of its object.
The trick in defining a suitable measure is to keep this subset large
enough so that behaviors of interest to the user are captured, and yet
not so large that the measure loses focus.
1.3.1 Traditional Measures
We first describe the traditional measures of dependability of a
single computer. These metrics have been around for a long time,
and measure very basic attributes of the system. Two of these
measures are reliability and availability.
The conventional definition of reliability, denoted by R(t), is the
probability (as a function of the time t) that the system has been up
continuously in the time interval [0, t], conditioned on the event that it
was up at time 0. This measure is suitable for applications, in which
even a momentary disruption can prove costly. One example is
computers that control physical processes such as an aircraft, for
which system-wide computer failure could result in catastrophe.
Closely related to reliability are the mean time to failure, denoted by
MTTF, and the mean time between failures, MTBF. The first is the average
time the system operates until a failure occurs, whereas the second is
the average time between two consecutive failures. The difference
between the two is due to the time needed to repair the system
following the first failure. Denoting the mean time to repair by MTTR,
we obtain

    MTBF = MTTF + MTTR
Availability, denoted by A(t), is the average fraction of time over
the interval [0, t] that the system is up. This measure is appropriate
for applications, in which continuous fault-free operation is not vital,
but where it would be expensive to have the system down for a
significant amount of time. An airline reservation system needs to be
highly available, because downtime can put off customers and lose
sales; however, an occasional (very) short-duration failure can be
tolerated.
The long-term availability, denoted by A, is defined as

    A = lim_{t→∞} A(t)

A can be interpreted as the probability that the system will be up at
some random point in time, and is meaningful only in systems that
include repair of faulty components. The long-term availability can
be calculated from MTTF, MTBF, and MTTR as follows:

    A = MTTF / MTBF = MTTF / (MTTF + MTTR)
A related measure, point availability, denoted by A_p(t), is the
probability that the system is up at the particular time instant t.
It is possible for a low-reliability system to have high availability:
consider a system that fails every hour on the average, but comes
back up after only a second. Such a system has an MTBF of just 1 hour
and, consequently, a low reliability; however, its availability is high:
A = 3599/3600 ≈ 0.99972.
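A quick numeric check of these relations (an illustrative Python sketch, not from the text):

def long_term_availability(mttf, mttr):
    # A = MTTF / MTBF, with MTBF = MTTF + MTTR.
    return mttf / (mttf + mttr)

# A system that runs for ~3599 s between failures but is repaired in 1 s:
print(long_term_availability(mttf=3599.0, mttr=1.0))   # ~0.99972: low MTBF, high A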
These definitions assume, of course, that we have a state, in which
the system can be said to be “up”, and another in which it is not. For
simple components, this is a good assumption. For example, a
lightbulb is either good or burned out. A wire is either connected or
has a break in it. However, for even simple systems, such an
assumption can be very limiting. For example, consider a processor
that has one of its several hundreds of millions of gates stuck at logic
value 0. In other words, the output of this logic gate is always 0,
regardless of the input. Suppose the rest of the processor is
functional, and that this failed logic gate only affects the output of
the processor about once in every 25,000 hours of use. For example, a
particular gate in the divide unit when faulty may result in a wrong
quotient if the divisor is within a certain subset of values. Clearly,
the processor is not fault-free, but would one define it as “down”?
The same remarks apply with even greater force to systems that
degrade gracefully. By this, we mean systems with various levels of
functionality. Initially, with all of its components operational, the
system is at its highest level of functionality. As these components
fail, the system degrades from one level of functionality to the next.
Beyond a certain point, the system is unable to produce anything of
use and fails completely. As with the previous example, the system
has multiple “up” states. Is it said to fail when it degrades from full
to partial functionality? Or when it fails to produce any useful
output at all? Or when its functionality falls below a certain
threshold? If the last, what is this threshold, and how is it
determined?
We can therefore see that traditional reliability and availability are
very limited in what they can express. There are obvious extensions
to these measures. For example, we may consider the average
computational capacity of a system with n processors. Let c_i denote
the computational capacity of a system with i operational processors.
This can be a simple linear function of the number of processors,
c_i = i·c_1, or a more complex function of i, depending on the ability of
the application to utilize i processors. The average computational
capacity of the system at time t can then be defined as Σ_{i=1..n} c_i P_i(t),
where P_i(t) is the probability that exactly i processors are operational
at time t. In contrast, the point availability of the system at time t will
be

    A_p(t) = Σ_{i=m..n} P_i(t)
where m is the minimum number of processors necessary for proper
operation of the system.
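The following Python sketch evaluates both quantities under the added assumption (not stated in the text) that processors fail independently with a constant failure rate, so that the number of operational processors at time t is binomially distributed:

from math import comb, exp

def capacity_and_point_availability(n, m, lam, t, c1=1.0):
    # c_i = i * c1 (linear capacity); P_i(t) is binomial, with per-processor
    # survival probability p = e^(-lam * t).
    p = exp(-lam * t)
    avg_capacity = 0.0
    point_avail = 0.0
    for i in range(n + 1):
        p_i = comb(n, i) * p**i * (1 - p)**(n - i)
        avg_capacity += i * c1 * p_i
        if i >= m:                 # at least m processors needed for proper operation
            point_avail += p_i
    return avg_capacity, point_avail

print(capacity_and_point_availability(n=4, m=2, lam=1e-3, t=100.0))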
1.3.2 Network Measures
In addition to the general system measures previously discussed,
there are also more specialized measures, focusing on the network
that connects the processors together. The simplest of these are
classical node and line connectivity, which are defined as the
minimum number of nodes and lines, respectively, that have to fail
before the network becomes disconnected. This gives a rough
indication of how vulnerable a network is to disconnection. For
example, a network that can be disconnected by the failure of just
one (critically positioned) node is potentially more vulnerable than
another that requires at least four nodes to fail before it becomes
disconnected.
Classical connectivity is a very basic measure of network
reliability. Like reliability, it distinguishes between only two
network states: connected and disconnected. It says nothing about
how the network degrades as nodes fail before, or after, becoming
disconnected. Consider the two networks shown in Fig. 1.1. Both
networks have the same classical node connectivity of 1. However, in
a real sense, network N1 is much more “connected” than N2. The
probability that N2 splinters into small pieces is greater than that for
N1.
FIGURE 1.1 Inadequacy of classical connectivity.
To express this type of “connectivity robustness,” we can use
additional measures. Two such measures are the average node-pair
distance, and the network diameter (the maximum node-pair
distance), both calculated given the probability of node and/or link
failure. Such network measures, together with the traditional
measures listed above, allow us to gauge the dependability of
various networked systems that consist of computing nodes
connected through a network of communication links.
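For small topologies, the fault-free versions of these quantities are easy to compute. The sketch below assumes the networkx package is available; the ring and star graphs are illustrative examples, not the networks of Fig. 1.1:

import networkx as nx

ring = nx.cycle_graph(8)   # every node has two neighbors
star = nx.star_graph(7)    # one hub connected to 7 leaves

for name, g in [("ring", ring), ("star", star)]:
    print(name,
          "node connectivity:", nx.node_connectivity(g),             # nodes whose failure disconnects it
          "average distance:", round(nx.average_shortest_path_length(g), 2),
          "diameter:", nx.diameter(g))
# The star is disconnected by a single (hub) failure; the ring requires two node failures.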
1.4 Outline of This Book
The next chapter is devoted to hardware fault tolerance. This is the
most established topic within fault-tolerant computing, and many of
the basic principles and techniques that have been developed for it
have been extended to other forms of fault tolerance. Prominent
hardware failure mechanisms are discussed, as well as canonical
fault-tolerant redundant structures. The notion of Byzantine, or
malicious, failure is introduced. Techniques to evaluate the reliability
and availability of fault-tolerant systems are introduced here,
including the use of Markov models.
Next, several variations of information redundancy are covered,
starting with the most widely used error-detecting and error-
correcting codes. Then, other forms of information redundancy are
discussed, including storage redundancy (RAID systems), data
replication in distributed systems, and the algorithm-based fault-
tolerance technique that tolerates data errors in array computations
using some error-detecting and error-correcting codes.
Many computing systems nowadays consist of multiple
networked processors that are subject to interconnection link
failures, in addition to the already-discussed single node/processor
failures. We therefore present in this book suitable fault tolerance
techniques for these networks and analysis methods to determine
which network topologies are more robust. With the number of
transistors on a chip increasing in every hardware generation,
networks-on-chip are becoming commonplace. Another recent
development is the proliferation of sensor networks. We discuss
fault-tolerance techniques for both of these.
Software mistakes/bugs are, in practice, unavoidable, and
consequently, some level of software fault tolerance is often
necessary. This can be as simple as acceptance tests to check the
reasonableness of the results before using them, or as complex as
running two or more versions of the software (sequentially or in
parallel). Programs also tend to have their state deteriorate after
running for long periods of time and eventually crash. This situation
can be avoided by periodically restarting the program, a process
called rejuvenation. Hypervisor-based systems, where a hardware
platform supports multiple virtual machines, each running its own
operating system, have gained popularity; we discuss fault-tolerance
issues associated with these. Finally, there is the issue of modeling
software reliability. Unlike hardware faults, software bugs are very
hard to model. Still, a few such models have been developed and
several of them are described.
Hardware fault-tolerance techniques can be quite costly to
implement. In applications, in which a complete and immediate
masking of the effect of hardware faults (especially of a transient
nature) is not necessary, checkpointing is an inexpensive alternative.
For programs that run for a long time and for which reexecution
upon a failure might be too costly, the program state can be saved
(once or periodically) during the execution. Upon a failure
occurrence, the system can roll back the program to the most recent
checkpoint and resume its execution from that point. Checkpointing
is especially important in exascale computing, where the computing
base may consist of thousands of processors jointly executing
programs that take hours, days, or even weeks to execute. Various
checkpointing techniques are presented and analyzed in this book,
for both general-purpose computing and real-time systems.
Cyber-physical systems, which consist of physical plants
controlled by computer, have taken off in recent years. Examples
include fly-by-wire aircraft, automobiles, spacecraft, power grids,
chemical reactors, and intelligent highway systems. Such
applications are often life-critical and must therefore be highly
reliable. They consist of sensors to assess the state of the plant and
the operating environment, computers to run the control software,
and actuators to impose their control outputs on the plant. Fault-
tolerance issues associated with each are discussed in this book.
Next is a chapter consisting of a few case studies, which serve to
illustrate the use of many of the fault-tolerance techniques described
previously.
An important part of the design and evaluation process of a fault-
tolerant system is to demonstrate that the system does indeed
function at the advertised level of reliability. Often, the designed
system is too complex to develop analytical expressions of its
reliability. If a prototype of the system has already been constructed,
then fault-injection experiments can be performed and certain
dependability attributes measured. If, however, as is very common, a
prototype does not yet exist, statistical simulation may be the only
option. Simulation programs for complex systems must be carefully
designed to produce accurate results without requiring excessive
amounts of computation time. We discuss the principles that should
be followed when preparing a simulation program, and show how
simulation results can be analyzed to infer system reliability.
We end the book with two specialized fault-tolerance topics:
defect tolerant VLSI (very large-scale integration) design and fault
tolerance in cryptographic devices. The increasing complexity of
VLSI chip design has resulted in a situation, in which manufacturing
defects are unavoidable. If nothing is done to remedy this situation,
the expected yield (the fraction of manufactured chips, which are
operational) will be very low. Thus techniques to reduce the
sensitivity of VLSI chips to defects have been developed, some of
which are very similar to the hardware redundancy schemes.
For cryptographic devices, the need for fault tolerance is twofold.
Not only is it crucial that such devices (e.g., smart cards) operate in a
fault-free manner in whatever environment they are used, but more
importantly, they must stay secure. Fault-injection-based attacks on
cryptographic devices have become the simplest and fastest way to
extract the secret key from the device. Thus the incorporation of fault
tolerance can help to keep cryptographic devices secure.
1.5 Further Reading
Several textbooks and reference books on the topic of fault tolerance
are available. See, for example, [5–7,10,14,18–20]. Fault tolerance
from a software perspective is treated in [3,12,15]. Excellent
classifications of the various approaches taken can be found in [2,17].
The major conference in the field is the Conference on Dependable
Systems and Networks (DSN) [4]; this is a successor to the Fault-
Tolerant Computing Symposium (FTCS).
Fault tolerance in specific applications is treated in a number of
works; for instance embedded/cyber-physical systems in [1,8], cloud
computing [9], and high-performance computing [13].
The concept of computing being invisible everywhere appeared in
[22], in the context of pervasive computing, i.e., computing which
pervades everyday living, without being obtrusive.
The definitions of the basic terms and measures appear in most of
the textbooks mentioned above and in several probability and
statistics books. For example, see [21]. Our definitions of fault and
error are slightly different from those used in some of the references.
One definition of an error is that it is that part of the system state
that leads to system failure. Strictly interpreted, this only applies to a
system with state, i.e., with memory. We use the more encompassing
definition of anything that can be construed as a manifestation of a
fault. This wider interpretation allows purely combinational circuits,
which are stateless, to generate errors.
One measure of dependability that we did not describe in the text
is to consider everything from the perspective of the application.
This approach was taken to define the measure known as
performability. The application is used to define “accomplishment
levels” . Each of these represents a level of quality of
service delivered by the application. Now, the performance of the
computer affects this quality (if it did not, by definition, it would
have nothing to do with the application!). The approach taken by
performability is to link the performance of the computer to the
accomplishment level that this enables. Performability is then a
vector, (P(L_1), P(L_2), …, P(L_n)), where P(L_i) is the probability that the
computer functions well enough to permit the application to reach
up to accomplishment level L_i. For more on performability, see
[11,16].
References
[1] I. Alvarez, A. Ballesteros, M. Barranco, D. Gessner, S.
Djerasevic, J. Proenza, Fault tolerance in highly reliable
ethernet-based industrial systems, Proceedings of the IEEE
June 2019;107(6):977–1010.
[2] A. Avizienis, J.C. Laprie, B. Randell, C. Landwehr, Basic
concepts and taxonomy of dependable and secure
computing, IEEE Transactions on Dependable and Secure
Computing October 2004;1(1):11–33.
[3] B. Baudry, M. Monperrus, The multiple facets of software
diversity: recent developments in the year 2000 and beyond,
ACM Computing Surveys September 2015;48(1), 16.
[4] Dependable Systems and Networks (DSN) Conference
http://www.dsn.org.
[5] E. Dubrova, Fault-Tolerant Design. Springer; 2013.
[6] W.R. Dunn, Practical Design of Safety-Critical Computer
Systems. Reliability Press; 2002.
[7] C.E. Ebeling, An Introduction to Reliability and Maintainability
Engineering. McGraw-Hill; 1997.
[8] C. Edwards, T. Lombaerts, H. Smaili, Fault-Tolerant Flight
Control. Springer; 2009.
[9] B. Furht, A. Escalante, Handbook of Cloud Computing.
Springer; 2010.
[10] J-C. Geffroy, G. Motet, Design of Dependable Computing
Systems. Kluwer Academic Publishers; 2002.
[11] R. Ghosh, K.S. Trivedi, V.K. Naik, D.S. Kim, End-to-end
performability analysis for infrastructure-as-a-service cloud:
an interacting stochastic models approach, Pacific Rim
International Symposium on Dependable Computing. 2010:125–
132.
[12] R.S. Hanmer, Patterns for Fault-Tolerant Software. John Wiley;
2013.
[13] T. Herault, Y. Robert, Fault-Tolerance Techniques for High-
Performance Computing. Springer; 2015.
[14] P. Jalote, Fault Tolerance in Distributed Systems. PTR Prentice
Hall; 1994.
[15] J. Knight, Fundamentals of Dependable Computing for Software
Engineers. Chapman and Hall; 2012.
[16] J.F. Meyer, On evaluating the performability of degradable
computing systems, IEEE Transactions on Computers August
1980;29:720–731.
[17] G. Psychou, D. Rodopoulos, M.M. Sabry, D. Atienza, T.G.
Noll, F. Catthoor, Classification of resilience techniques
against functional errors at higher abstraction layers of
digital systems, ACM Computing Surveys 2017;50(4), 50.
[18] L.L. Pullum, Software Fault Tolerance Techniques and
Implementation. Artech House; 2001.
[19] D.P. Siewiorek, R.S. Swarz, Reliable Computer Systems: Design
and Evaluation. A. K. Peters; 1998.
[20] M.L. Shooman, Reliability of Computer Systems and Networks:
Fault Tolerance, Analysis, and Design. Wiley-Interscience; 2001.
[21] K.S. Trivedi, Probability and Statistics With Reliability,
Queuing, and Computer Science Applications. John Wiley; 2002.
[22] M. Weiser, The computer for the twenty-first century,
Scientific American September 1991;265(3):94–105.
Chapter 2: Hardware Fault
Tolerance
Hardware fault tolerance is the most mature area in fault-tolerant
computing. Many hardware fault-tolerance techniques have been
developed and used in practice in critical applications, ranging from
telephone exchanges to space missions. In the past, the main obstacle
to a wide use of hardware fault tolerance was the cost of the extra
hardware required. With the continued reduction in the cost of
hardware, this is no longer a significant drawback, and the use of
hardware fault-tolerance techniques is expected to increase.
However, other constraints, notably on power consumption, may
continue to restrict the use of massive redundancy in many
applications.
The hardware can broadly be divided into a hierarchy of three
levels. (One can obviously have a more detailed division into
sublevels, but three levels will do for our purposes.) At the top is the
system level. This is the “face” that the entity presents to the
operating environment. An example is the computer hardware that
controls a modern aircraft. At the second level, the system is
composed of multiple modules or components. Examples include
individual processor cores, memory modules, and the I/O
subsystem. Obviously, each of these modules will themselves be
composed of submodules. At the bottom of the hierarchy are
individual nanometer-sized devices.
This chapter first discusses hardware failure at a high level. Then,
we dive to the device level to explain some major hardware failure
mechanisms. After this, we return to the system level to consider
more complex systems consisting of multiple components, describe
various resilient structures (which have been proposed and
implemented), and evaluate their reliability and/or availability. Next,
we describe hardware fault-tolerance techniques that have been
developed specifically for general-purpose processors. Finally, we
discuss malicious faults, and investigate the amount of redundancy
needed for protecting against these.
2.1 The Rate of Hardware Failure
The failure rate of a hardware component depends on its current
age, any voltage or physical shocks that it has suffered, the ambient
temperature, and the technology. The dependence on age is usually
captured by what is known as the bathtub curve (see Fig. 2.1). When
components are very young, their tendency to fail is high. This is
due to the chance that some components with manufacturing defects
slipped through manufacturing quality control and were released.
As time goes on, these components are weeded out, and the
component spends the bulk of its life showing a fairly constant
failure rate (a precise definition of failure rate is presented a little
later, but for now an intuitive understanding is sufficient). As it
becomes very old, aging effects start to take over, and the failure rate
rises again.
FIGURE 2.1 Bathtub curve.
The impact of the other factors can be expressed through the
following empirical failure rate formula for the region where the
failure rate is roughly constant:
    λ = π_L π_Q (C_1 π_T π_V + C_2 π_E)    (2.1)

where the notations represent the following: π_L is the learning factor,
reflecting the maturity of the fabrication process; π_Q the quality factor;
π_T the temperature factor; π_V the voltage stress factor; π_E the
environment shock factor; and C_1, C_2 are complexity factors that
depend on the number of gates in the component and the number of
pins in its package.
Further details can be found in MIL-HDBK-217E, which is a
handbook produced by the U.S. Department of Defense.
Devices operating in space, which is replete with charged particles
and can subject devices to severe temperature swings, can thus be
expected to fail much more often than their counterparts in air-
conditioned offices; so too can computers in automobiles (which
suffer high temperatures and vibration) and industrial applications.
2.2 Failure Rate, Reliability, and Mean Time
to Failure
In this section, we consider a single component of a more complex
system, and show how reliability and mean time to failure (MTTF)
can be derived from the basic notion of failure rate. Consider a
component that is operational at time t = 0 and remains operational
until it is hit by a failure. Suppose for the moment that all failures are
permanent and irreparable. Let T denote the lifetime of the
component (the time until it fails), and let f_T(t) and F_T(t) denote the
probability density function of T and the cumulative distribution
function of T, respectively. These functions are defined for t ≥ 0 only
(because the lifetime cannot be negative) and are related through

    f_T(t) = dF_T(t)/dt    (2.2)

f_T(t) represents (but is not equal to) the momentary probability of
failure at time t. To be exact, for a very small Δt,
Prob{t ≤ T ≤ t + Δt} ≈ f_T(t)Δt. Being a density function, f_T(t) must satisfy

    ∫_0^∞ f_T(t) dt = 1

F_T(t) is the probability that the component will fail at or before time t;
R(t), the reliability of a component (the probability that it will survive
at least until time t), is given by

    R(t) = Prob{T > t} = 1 − F_T(t)    (2.3)
An important quantity is the probability that a good component of
current age t will fail in the next short duration of length dt. This is a
conditional probability, since we know that the component survived
at least until time t. This conditional probability is represented by the
failure rate (also called the hazard rate) of a component at time t,
denoted by the time-dependent function λ(t), which can be
calculated as follows:

    λ(t) = f_T(t) / (1 − F_T(t)) = f_T(t) / R(t)    (2.4)

Since f_T(t) = dF_T(t)/dt = −dR(t)/dt, we obtain

    λ(t) = −(dR(t)/dt) / R(t)    (2.5)

If a component suffers no meaningful level of aging, its failure
rate is constant over time: λ(t) = λ. In this case, dR(t)/dt = −λ R(t),
and the solution of this differential equation (with R(0) = 1) is

    R(t) = e^(−λt)    (2.6)

Therefore a constant failure rate implies that the lifetime T of the
component has an exponential distribution, with a parameter that is
equal to the constant failure rate λ,

    f_T(t) = λ e^(−λt),  F_T(t) = 1 − e^(−λt)  (t ≥ 0)

For an irreparable component, the MTTF is equal to its expected
lifetime, E[T] (where E[·] denotes the expectation or mean of a random
variable),

    MTTF = E[T] = ∫_0^∞ t f_T(t) dt    (2.7)

Substituting f_T(t) = −dR(t)/dt and integrating by parts yields

    MTTF = −∫_0^∞ t (dR(t)/dt) dt = [−t R(t)]_0^∞ + ∫_0^∞ R(t) dt = ∫_0^∞ R(t) dt    (2.8)

(the term t R(t) is equal to zero when t = 0 and when t → ∞,
since R(t) decays to zero faster than 1/t).
For the case of a constant failure rate, for which R(t) = e^(−λt),

    MTTF = ∫_0^∞ e^(−λt) dt = 1/λ    (2.9)
Although a constant failure rate is used in most calculations of
reliability (mainly owing to the simplified derivations), there are
cases for which this simplifying assumption is inappropriate,
especially during the “infant mortality” and “wear-out” phases of a
component's life (Fig. 2.1). In such cases, the Weibull distribution is
often used. This distribution has two parameters, λ and β, and has
the following density function of the lifetime T of a component:
    f_T(t) = λβ(λt)^(β−1) e^(−(λt)^β)    (2.10)

The corresponding failure rate is

    λ(t) = λβ(λt)^(β−1)    (2.11)

This failure rate is an increasing function of time for β > 1, is constant
for β = 1, and is a decreasing function of time for β < 1. This makes it
very flexible, and especially appropriate for the wear-out and infant
mortality phases. The component reliability for a Weibull
distribution is

    R(t) = e^(−(λt)^β)    (2.12)

and the MTTF of the component is

    MTTF = Γ(1 + 1/β) / λ    (2.13)

where Γ(x) = ∫_0^∞ y^(x−1) e^(−y) dy is the Gamma function. The Gamma
function is a generalization of the factorial function to real numbers,
and satisfies
• Γ(x + 1) = x Γ(x) for x > 0
• Γ(1/2) = √π
• for an integer n, Γ(n) = (n − 1)!
Note that the Weibull distribution includes as a special case (β = 1),
the exponential distribution with a constant failure rate λ.
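As a small numerical illustration (a Python sketch, not code from the book) of the exponential and Weibull models:

from math import exp, gamma

def reliability_exponential(lam, t):
    return exp(-lam * t)                  # R(t) = e^(-lambda*t)

def reliability_weibull(lam, beta, t):
    return exp(-(lam * t) ** beta)        # R(t) = e^(-(lambda*t)^beta)

def mttf_exponential(lam):
    return 1.0 / lam

def mttf_weibull(lam, beta):
    return gamma(1.0 + 1.0 / beta) / lam  # Gamma(1 + 1/beta) / lambda

lam = 1e-4   # failures per hour
print(reliability_exponential(lam, 1000.0), mttf_exponential(lam))
print(reliability_weibull(lam, 2.0, 1000.0), mttf_weibull(lam, 2.0))   # beta > 1: wear-out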
2.3 Hardware Failure Mechanisms
The previous sections introduced us to some factors which affect
hardware reliability. In this section, we describe some of the physical
failure mechanisms that are at the base of component failure. These
failure mechanisms will illustrate the dependence of the failure rate
(recall Eq. (2.1)) on the temperature, supply voltage, and the circuit
age.
Note that the failure of a device does not automatically result in
the failure of the circuit in which the device serves: that depends on
the circuit design and usage, which together determine the criticality
of the device to component functionality. Different devices within
the same circuit see different levels of stress; for instance, their duty
cycle (fraction of time for which the device is on) can vary, as can
their temperature. It is therefore very hard to map the lifetime of
individual devices to the lifetime of the overall circuit.
One cannot therefore create an unbroken chain of reasoning that
starts with the individual failure model for each of the billions of
transistors in a system and, step by detailed step, goes up the
hierarchy to accurately model the overall system reliability. Such a
model would take too long to evaluate. We have to use approximate
models based on experimental tests and on simulation, at the chip or
system level.
Rather, device failure models are probably best used to
understand how changes in operating parameters (e.g., current,
applied voltage, and operating temperature) can change the failure
rate. We can then use this knowledge to try to tune the operating
parameters to stave off circuit failures to the extent possible. We can
also use these models to carry out accelerated failure tests. For
example, if we know that failure rates rise exponentially with
temperature, we can carry out high-temperature tests of devices, and
then use the time-to-failure data to estimate their time-to-failure
under normal operating temperatures. If we know that devices can
partially recover from stress, we can schedule rest periods to allow
them to do so. Knowing that devices slow down as they age, we can
clock them more slowly to provide slack time to make up for such
slowdown (up to a point). At the design stage, knowing that the
aging process will be different for different parts of the circuit (since
not all parts are equally stressed), we can size devices differently to
provide greater resilience against aging for the more highly stressed
portions. And so on.
The failure mechanisms we consider here are those which afflict
conventional CMOS devices; such devices are, by far, the most
commonly encountered in computing today. The reader should,
however, always keep in mind that as device feature sizes continue
to drop (e.g., 5 nm devices are being developed at the time of
writing), and as new technologies are introduced, new failure
mechanisms can become prominent, and failure models can change.
2.3.1 Electromigration
When current flows down a wire, there is momentum transfer
between the flowing electrons and the metal atoms that constitute
the wire. There is another effect, in the other direction, due to the
applied electric field, which is usually insufficient to fully counteract
the momentum transfer. As a result, atoms in the metal interconnect
can migrate away from their original position, resulting in a thinning
of the interconnect or outright separation.
The traditional expression for the median time to failure due to
electromigration, denoted by MTTF_EM, was derived in the 1960s and
is known as Black's formula:

    MTTF_EM = A_EM J^(−m) e^(E_a/(kT))    (2.14)

where A_EM is a constant of proportionality; J the current density; m
an exponent, whose value is typically between 1 and 2; E_a is the
activation energy associated with the material of which the
interconnect is made; k is Boltzmann's constant, and T the absolute
temperature. The activation energy is 0.6 eV for
aluminium and 0.9 eV for gold. Adding a small quantity of copper
(less than 4%) to aluminium is sufficient to significantly increase its
activation energy.
Black's formula applies to wires above a certain thickness and
length. A metal interconnect can generally be regarded as a
polycrystalline film consisting of grains. As the wire width drops
below the grain size, the grain boundaries become increasingly
perpendicular to the longitudinal direction of the interconnect, and
the wire becomes highly resistant to electromigration.
Electromigration tends also to be counteracted by mechanical
stress in short wires: below a certain length, called the Blech length,
electromigration effects are not significant. The Blech length
decreases in proportion to the current density: the product of the
Blech length and the applied current density is roughly a constant,
depending on the material used.
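Black's formula is typically used to extrapolate accelerated-test results to operating conditions. The Python sketch below illustrates this use; the exponent, activation energy, and test conditions are assumed example values, not figures from the text:

from math import exp

K_BOLTZMANN = 8.617e-5   # eV/K

def em_lifetime_ratio(j_test, t_test, j_use, t_use, m=1.5, ea=0.7):
    # MTTF is proportional to J^(-m) * exp(Ea/(k*T)), so the ratio
    # MTTF(use) / MTTF(test) follows directly from Black's formula.
    current_term = (j_test / j_use) ** m
    temperature_term = exp(ea / (K_BOLTZMANN * t_use) - ea / (K_BOLTZMANN * t_test))
    return current_term * temperature_term

# Stress test at 400 K and twice the use current density; use conditions at 350 K:
print(em_lifetime_ratio(j_test=2.0, t_test=400.0, j_use=1.0, t_use=350.0))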
2.3.2 Stress Migration
Metal interconnects are deposited on a silicon substrate; the two
materials expand at different rates when heated. This causes
mechanical stress to the interconnects, which triggers migration
among the metal atoms. The mean time to failure caused by stress
migration of metal interconnect is often modeled by the following
expression:
    MTTF_SM = A_SM σ^(−m) e^(E_a/(kT))    (2.15)

where A_SM is a constant of proportionality, σ is the mechanical
strain, E_a is the activation energy for stress migration (often taken
as between 0.6 and 1.0 eV), k the Boltzmann constant, and T the
absolute temperature. The exponent, m, is usually taken as between
2 and 4 for soft metals like Al and Cu; it rises to between 6 and 9 for
strong, hardened materials.
2.3.3 Negative Bias Temperature Instability
Negative bias temperature instability (NBTI) is an increasingly
important failure mechanism in transistors. NBTI can be caused by
negative gate voltages and elevated temperatures; it affects pMOS
transistors. Its counterpart in nMOS transistors is positive bias
temperature instability; we will focus here on NBTI.
When an electric field is applied across the gate oxide of a
transistor, charge tends to get trapped below the transistor gate. The
amount of trapped charge grows with the time for which the field is
applied. The transistor threshold voltage then changes (the threshold
voltage is the level beyond which a channel forms between source
and drain to allow the transistor to turn on and current to flow);
consequently, the current flow across an ON transistor drops, and
the gate delay increases. Beyond a certain point, the device can no
longer function to specifications. The problem gets worse with
increased miniaturization.
Once the gate voltage is removed, this charge can dissipate over
time. Such dissipation can be very slow, however: it can take
thousands of seconds. Furthermore, complete recovery of the device
to its original state tends not to happen; there is always some
degradation that carries over from one stress cycle to another.
(Recovery is what makes modeling the time to failure of this process
so challenging.) Beyond a certain point, timing failures occur, and
the circuit composed of such devices fails.
Experimental work has shown that the threshold voltage change
increases with temperature, the supply voltage, and the duty cycle.
This change grows as a function of the time, t, for which the gate is
stressed: the threshold voltage change is proportional to t^m, where
m is usually modeled as between 1/4 and 1/6.
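As an illustrative sketch of how this power law is used (the measured shift, stress time, critical limit, and exponent below are assumptions, not values from the text), one can extrapolate from a shift measured at one stress time to the time at which the shift becomes unacceptable:

def nbti_shift(t, shift0, t0, m=0.2):
    # Threshold-voltage shift grows roughly as t^m (m between about 1/6 and 1/4).
    return shift0 * (t / t0) ** m

def time_to_critical_shift(shift0, t0, critical, m=0.2):
    # Invert the power law: find t such that nbti_shift(t) equals the critical value.
    return t0 * (critical / shift0) ** (1.0 / m)

# A 10 mV shift measured after 1000 hours of stress; assume 50 mV is the limit:
print(time_to_critical_shift(shift0=0.010, t0=1000.0, critical=0.050))   # hours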
To reduce the rate of NBTI-induced failure, mitigation by means
of using supply voltage scaling and duty cycle reduction can be
attempted. Rest periods, where parts of the chip are turned off by
power gating, can be regularly scheduled to provide devices the
opportunity to partially reverse the voltage threshold shift caused by
NBTI. Body biasing, where a voltage is applied to the body of the
chip (i.e., the substrate), can be used to compensate for changes in
threshold voltage. The supply voltage can be increased to speed up
the device to make up for NBTI-induced slowdown; however, this
has the unfortunate effect of exacerbating NBTI, and also
dramatically increasing the power consumption. In addition, the
increase in circuit delay from long-term NBTI-caused degradation
can be monitored, and the clock frequency reduced when necessary
to prevent timing faults.
2.3.4 Hot Carrier Injection
Charge carriers (electrons for n-channel and holes for p-channel
devices) get accelerated by the high fields in the channel of a device.
A certain fraction of these carriers gain sufficient energy to be
injected into the gate oxide and get trapped there. This can change
the current-voltage characteristics of the transistor. After some time,
transistors subjected to such defects can become too slow.
The mean time to failure expression depends on whether it is an n-
channel or p-channel device; the reason is that electrons and holes
have different mobility characteristics. Commonly suggested
expressions for the MTTF for n- and p-channel devices are as
follows:
MTTF_HCI,n = A_n · (I_sub/w)^{-m} · e^{E_a,HCI/(kT)},   (2.16)

MTTF_HCI,p = A_p · (I_gate/w)^{-m} · e^{E_a,HCI/(kT)},   (2.17)

where A_n and A_p are constants of proportionality; I_sub and I_gate are the
substrate and gate currents, respectively; w is the transistor width;
E_a,HCI is the activation energy associated with the HCI process; m is an
exponent; k is the Boltzmann constant; and T is the
absolute temperature. Note that the gate and substrate currents will
vary with time, based on the device usage. Using their peak values
will provide a lower bound on the MTTF. Also, these expressions are
useful in providing an indication of how changes in operating
conditions will accelerate HCI-based failure.
2.3.5 Time-Dependent Dielectric Breakdown
When voltage is applied across the gate oxide, electrical defects—
called traps—are induced within it. These traps increase leakage
current through the gate oxide. As time goes on, these traps can
move around and cause variation in leakage current and gate delay.
This is the wearout phase, involving soft breakdown.
After some time, the traps may overlap, and a conducting path
through the oxide results. Once this happens, the leakage current
through the gate jumps. This heats up the device; such heating can
contribute to the formation of additional traps. These traps may
widen the conduction path through the oxide, and the leakage
current goes up further. This leads to an increase in temperature,
and a highly damaging positive feedback loop is created. The
dielectric film suffers hard breakdown causing the transistor to fail.
The breakdown of the gate dielectric over time is called time-
dependent dielectric breakdown (TDDB).
There is no consensus on a model for calculating MTTF for this
failure mechanism; there are multiple competing expressions for it.
A commonly used expression for the mean time to time-dependent
dielectric breakdown is the following:
MTTF_TDDB = A_TDDB · (1/V)^{a − bT} · e^{(X + Y/T + ZT)/(kT)},   (2.18)

where A_TDDB is a constant of proportionality, V is the voltage applied
across the gate oxide, T is the absolute temperature in Kelvin (K),
and k is Boltzmann's constant. The remaining parameters a, b, X, Y,
and Z are empirically determined fitting constants.
2.3.6 Putting It All Together
Given the many failure mechanisms that exist, how may we put
them all together? One approach is to treat each failure mechanism
as independent of the others. Assuming a constant failure rate for a
given failure mechanism FM, its rate, denoted by λ_FM, will
equal the inverse of the corresponding mean time to failure (see Eq. (2.9)), i.e.,
λ_FM = 1/MTTF_FM. As will be described in greater detail later in this
chapter, the sum-of-failure-rates model is followed for independent failure
mechanisms. Accordingly, the aggregate rate of
independent failure mechanisms is the sum of the rates of the
individual failure mechanisms, i.e., λ = Σ_FM λ_FM. The overall
mean time to failure of the device is then estimated as
MTTF = 1/λ = 1/(Σ_FM λ_FM). We can then approximate the reliability of the
device over some time t as R(t) = e^{-λt}.
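The following short Python sketch illustrates the sum-of-failure-rates bookkeeping just described; the per-mechanism MTTF values in the example are purely illustrative placeholders, not data from the text.

import math

def combine_mechanisms(mttf_hours_by_mechanism):
    """Combine independent, constant-rate failure mechanisms.

    Each mechanism's rate is 1/MTTF; the aggregate rate is their sum,
    the device MTTF is the inverse of that sum, and R(t) = exp(-lambda*t).
    """
    rates = {name: 1.0 / mttf for name, mttf in mttf_hours_by_mechanism.items()}
    total_rate = sum(rates.values())
    return total_rate, 1.0 / total_rate

def reliability(total_rate_per_hour, t_hours):
    return math.exp(-total_rate_per_hour * t_hours)

# Illustrative (made-up) per-mechanism MTTFs, in hours:
mttfs = {"EM": 2.0e6, "SM": 5.0e6, "NBTI": 1.5e6, "HCI": 3.0e6, "TDDB": 4.0e6}
lam, mttf = combine_mechanisms(mttfs)
print(mttf)                        # overall MTTF; dominated by the worst mechanisms
print(reliability(lam, 5 * 8760))  # probability of surviving 5 years of operation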
There are three points that must be made with respect to this
approach. First, the failure mechanisms are not necessarily
independent; treating them as such and adding up their individual
rates may result in estimates of dubious accuracy. For this reason,
some have suggested calculating the MTTF of each failure
mechanism separately, and taking the minimum of them. Also,
when one failure mechanism is dominant, we can simply ignore the
others.
Second, these expressions implicitly assume a constant
environment. For instance, the temperature dependence in the
various formulas above assumes a constant temperature. We can use
them in a varying environment by means of the following
approximation: divide the time axis into small segments, treat the
operating conditions as constant over each individual segment,
apply the reliability formula over that segment, and then stitch the
results together.
Example: Suppose we have an expression for the reliability as a
function of temperature. That is, suppose the failure rate of a
device is defined as λ(T), where T is the absolute
temperature; this rate has been calculated from a physical
model that assumes a constant temperature. We are given
the temperature profile as a function of time, T(t), and are asked
to estimate the device reliability over the interval [0, t].
Divide the time axis into short intervals of duration Δ each;
interval I_i covers [(i−1)Δ, iΔ), for i = 1, 2, …, n, where nΔ = t. Assume that the temperature
over I_i is constant at some value T_i. The probability that the device does not
fail anytime over I_i, given that it was functional at the
beginning of that interval, is given by e^{-λ(T_i)Δ}.
Hence, the probability that it does not fail up to the end of I_n is
∏_{i=1}^{n} e^{-λ(T_i)Δ} = e^{-Δ Σ_{i=1}^{n} λ(T_i)}. Note that this is an approximation; it assumes that
the failure process in each segment is stochastically
independent.
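A minimal Python sketch of this piecewise-constant approximation follows; the Arrhenius-style rate model and all parameter values in it are illustrative assumptions, not taken from the text.

import math

K_BOLTZMANN_EV = 8.617e-5  # eV/K

def lambda_of_temp(temp_k, lam_ref=1.0e-6, t_ref=358.0, e_a_ev=0.7):
    """Assumed constant-temperature failure rate model (per hour).

    Purely illustrative: an Arrhenius scaling around a reference rate at 358 K.
    """
    return lam_ref * math.exp(e_a_ev / K_BOLTZMANN_EV * (1.0 / t_ref - 1.0 / temp_k))

def reliability_over_profile(temp_profile_k, delta_hours):
    """R(t) for a piecewise-constant temperature profile.

    temp_profile_k[i] is the (assumed constant) temperature over interval I_i
    of length delta_hours; the per-segment survival probabilities multiply.
    """
    exponent = delta_hours * sum(lambda_of_temp(t) for t in temp_profile_k)
    return math.exp(-exponent)

# Example: alternating 12 h at 345 K and 12 h at 370 K for 30 days.
profile = [345.0, 370.0] * 30
print(reliability_over_profile(profile, 12.0))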
Third, it is very difficult to obtain precise estimates for the
variables in the reliability formulas. How, for instance, is one to
obtain good estimates for the temperature or duty cycle? We end up
having to obtain very approximate estimates based, for example, on
circuit simulation.
2.4 Common-Mode Failures
Redundancy to detect or to mask failure works only so long as the
redundant units do not produce incorrect outputs that are all
identical. Unfortunately, common-mode failures, in which the
redundant units do produce identical incorrect outputs, are sometimes
encountered, and they greatly reduce the reliability of such systems.
We will cover common-mode software faults in Chapter
5; here, we focus on hardware.
Common-mode faults occur either at the design/implementation
stage or at the operational stage. If the same mistake is made during
design or the same vulnerability introduced, multiple circuits can
fail identically on the same inputs. At the operational stage, because
multiple circuits operate in the same environment, the same
environmental disturbance (such as a massive dose of radiation or
electromagnetic interference) might introduce identical erroneous
outputs. If circuits are powered from a common supply, then surges
or other abnormalities in the supply might trigger a similar response
from all of them.
Design diversity is used to combat common-mode faults at the
design stage. For example, there are different ways of implementing
adder circuits (e.g., carry lookahead, carry-select, etc.). One can force
one circuit to be implemented using NAND gates, while the
corresponding circuit in its redundant partner is implemented using
NOR gates. We may place different constraints on gate fan-in, or
other circuit parameters in the various circuits, and expect some
diversity to emerge from the designs then produced by computer-
aided design software tools.
Reducing common-mode faults generated while in operation
requires isolation. For example, different circuits may be powered by
independent power supplies, or they may be in different zones
heavily shielded from radiation or electromagnetic interference.
A metric has been introduced to quantify the diversity between two
circuits. Let d_{i,j} be the probability that the fault pair (f_i, f_j), in circuits 1 and 2,
respectively, does not cause identically wrong outputs. If p_{i,j} is the
probability of the occurrence of such a fault pair, the diversity between
these circuits is quantified as D = Σ_{i,j} p_{i,j} d_{i,j}. Unfortunately, this
metric is not easy to calculate in practice. For more on this, see the
Further Reading section.
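A small Python sketch of this diversity computation is given below; the fault-pair probabilities and the per-pair "non-identical output" probabilities are invented placeholders, used only to show how D = Σ p_{i,j} d_{i,j} is evaluated.

def design_diversity(pair_prob, non_identical_prob):
    """Diversity metric D = sum over fault pairs of p_ij * d_ij.

    pair_prob[(i, j)]: probability that fault i occurs in circuit 1 and
                       fault j in circuit 2 (illustrative values).
    non_identical_prob[(i, j)]: probability that this fault pair does NOT
                       produce identical wrong outputs.
    """
    return sum(p * non_identical_prob[pair] for pair, p in pair_prob.items())

# Two faults per circuit, made-up numbers:
p = {(1, 1): 0.4, (1, 2): 0.1, (2, 1): 0.1, (2, 2): 0.4}
d = {(1, 1): 0.2, (1, 2): 0.9, (2, 1): 0.9, (2, 2): 0.3}
print(design_diversity(p, d))  # 0.38 for these placeholder values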
2.5 Canonical and Resilient Structures
Having briefly considered some significant physical hardware
failure mechanisms, we return now to a higher level. We consider
some canonical structures, out of which more complex structures can
be constructed. We start with the basic series and parallel structures,
continue with nonseries/parallel ones, and then describe some of the
many resilient structures that incorporate redundant components
(referred to in what follows as modules).
2.5.1 Series and Parallel Systems
The most basic structures are the series and parallel systems
depicted in Fig. 2.2. A series system is defined as a set of N modules
connected together, so that the failure of any one module causes the
entire system to fail. Note that the diagram in Fig. 2.2A is a reliability
diagram and not always an electrical one; the output of the first
module is not necessarily physically connected to the input of the
second module. The four modules in this diagram can, for example,
represent the instruction decode unit, execution unit, data cache, and
instruction cache in a microprocessor. All four units must be fault-
free for the microprocessor to function, although the way they are
connected does not resemble a series system.
FIGURE 2.2 Series and parallel systems. (A) Series system.
(B) Parallel system.
Assuming that the modules in Fig. 2.2A fail independently of each
other, the reliability of the entire series system is the product of the
reliabilities of its N modules. Denoting by R_i(t) the reliability of
module i and by R_s(t) the reliability of the whole series system,

R_s(t) = ∏_{i=1}^{N} R_i(t).   (2.19)

If module i has a constant failure rate, denoted by λ_i, then, according
to Eq. (2.6), R_i(t) = e^{-λ_i t}, and consequently,

R_s(t) = e^{-λ_s t},   (2.20)

where λ_s = Σ_{i=1}^{N} λ_i. From Eq. (2.20) we see that the series system has a
constant failure rate equal to λ_s (the sum of the individual failure
rates), and its MTTF is therefore 1/λ_s.
A parallel system is defined as a set of N modules connected
together so that the failure of all N modules is required for the
system to fail. This leads to the following expression for the
reliability of a parallel system, denoted by R_p(t):

R_p(t) = 1 − ∏_{i=1}^{N} (1 − R_i(t)).   (2.21)

If module i has a constant failure rate λ_i, then

R_p(t) = 1 − ∏_{i=1}^{N} (1 − e^{-λ_i t}).   (2.22)

As an example, the reliability of a parallel system consisting of two
modules with constant failure rates λ_1 and λ_2 is given by

R_p(t) = e^{-λ_1 t} + e^{-λ_2 t} − e^{-(λ_1 + λ_2) t}.
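The short Python sketch below evaluates Eqs. (2.20) and (2.22) for modules with constant failure rates; the example rates are illustrative placeholders.

import math

def series_reliability(failure_rates_per_hour, t_hours):
    """Eq. (2.20): a series system fails as soon as any one module fails."""
    lambda_s = sum(failure_rates_per_hour)
    return math.exp(-lambda_s * t_hours)

def parallel_reliability(failure_rates_per_hour, t_hours):
    """Eq. (2.22): a parallel system fails only when every module has failed."""
    prod_unreliability = 1.0
    for lam in failure_rates_per_hour:
        prod_unreliability *= 1.0 - math.exp(-lam * t_hours)
    return 1.0 - prod_unreliability

# Illustrative rates (per hour) for four modules, evaluated at one year:
rates = [1e-6, 2e-6, 1.5e-6, 0.5e-6]
print(series_reliability(rates, 8760))    # about 0.957: every module must survive
print(parallel_reliability(rates, 8760))  # essentially 1: one surviving module suffices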