Fault-Tolerant Architectures
Topics covered
• Fault-tolerant architectures
• Programming for reliability
Reliability
Reliability achievement
• Fault avoidance
– Development techniques are used that either minimise the possibility
of mistakes or trap mistakes before they result in the introduction of
system faults.
• Fault detection and removal
– Verification and validation techniques are used that increase the
probability of detecting and correcting errors before the system
goes into service.
• Fault tolerance
– Run-time techniques are used to ensure that system errors do not
lead to system failures.
ATM reliability specification
• Key concerns
– To ensure that ATMs carry out customer services as requested and
that they properly record customer transactions in the account
database.
– To ensure that these ATM systems are available for use when required.
Fault-tolerant architectures
Fault tolerance
• In critical situations, software systems must be fault tolerant.
• Fault tolerance is required where there are high availability
requirements or where system failure costs are very high.
• Fault tolerance means that the system can continue in
operation in spite of software failure.
• Even if the system has been proved to conform to its
specification, it must also be fault tolerant as there may be
specification errors or the validation may be incorrect.
Fault-tolerant system architectures
• Fault-tolerant system architectures are used in situations
where fault tolerance is essential. These architectures are
generally based on redundancy and diversity.
• Examples of situations where dependable architectures are
used:
– Flight control systems, where system failure could threaten the safety of
passengers.
– Reactor systems, where failure of a control system could lead to a
chemical or nuclear emergency.
– Telecommunication systems, where there is a need for 24/7 availability.
Protection systems
• A protection system is a specialized system associated with some
other control system. It can take emergency action if a failure occurs.
– System to stop a train if it passes a red light
– System to shut down a reactor if temperature/pressure are too high
• Protection systems independently monitor the controlled
system and the environment.
• If a problem is detected, the protection system issues commands to
take emergency action to shut down the system and avoid a catastrophe.
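The monitoring behaviour described above can be sketched in a few lines of Python. This is only an illustration: the limit values, the SensorReading type and the command strings are hypothetical and do not come from any real reactor or train system.

```python
from dataclasses import dataclass

# Hypothetical safety limits, chosen only for illustration.
MAX_TEMPERATURE_C = 350.0
MAX_PRESSURE_BAR = 150.0


@dataclass
class SensorReading:
    """Values read by the protection system's own sensors (assumed names)."""
    temperature_c: float
    pressure_bar: float


def protection_check(reading: SensorReading) -> str:
    """Return the command the protection system would issue.

    The check is deliberately independent of the control software:
    it looks only at the monitored values, not at the controller's state.
    """
    if (reading.temperature_c > MAX_TEMPERATURE_C
            or reading.pressure_bar > MAX_PRESSURE_BAR):
        return "SHUTDOWN"
    return "CONTINUE"


print(protection_check(SensorReading(300.0, 120.0)))  # within limits
print(protection_check(SensorReading(360.0, 120.0)))  # temperature too high
```

Keeping the check this small is a design choice: the simpler the protection logic, the easier it is to argue that it will work when it is needed.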
Protection system architecture
Protection system functionality
• Protection systems are redundant because they include
monitoring and control capabilities that replicate those in the
control software.
• Protection systems should be diverse and use different
technology from the control software.
• The aim is to ensure that the protection system has a low
probability of failure on demand.
Self-monitoring architectures
• Multi-channel architectures where the system monitors its own
operations and takes action if inconsistencies are detected.
• The same computation is carried out on each channel and the
results are compared. If the results are identical and are
produced at the same time, then it is assumed that the system
is operating correctly.
• If the results are different, then a failure is assumed and a
failure exception is raised.
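A minimal two-channel sketch of this comparison, in Python, is shown below. The computation (an integer square root), the function names and the ChannelMismatch exception are assumptions made for illustration. The sketch compares result values only; it does not model the requirement that results arrive at the same time.

```python
import math


class ChannelMismatch(Exception):
    """Failure exception raised when the channels disagree (assumed name)."""


def channel_a(x: int) -> int:
    # Channel A: integer square root from the standard library.
    return math.isqrt(x)


def channel_b(x: int) -> int:
    # Channel B: the same specification, implemented by binary search.
    lo, hi = 0, x
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid * mid <= x:
            lo = mid
        else:
            hi = mid - 1
    return lo


def self_monitoring_compute(x: int, channels=(channel_a, channel_b)) -> int:
    """Run the computation on both channels and compare the results."""
    first, second = (channel(x) for channel in channels)
    if first != second:
        raise ChannelMismatch(f"channels disagree: {first} != {second}")
    return first


print(self_monitoring_compute(10))
```

Note that a mismatch tells you only that something has failed, not which channel is wrong, so the caller has to handle the exception by switching to a backup or moving to a safe state.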
Self-monitoring architecture
Self-monitoring systems
• Hardware in each channel has to be diverse so that a
common-mode hardware failure will not lead to each
channel producing the same results.
• Software in each channel must also be diverse, otherwise
the same software error would affect each channel.
• If high availability is required, you may use several
self-checking systems in parallel.
– This is the approach used in the Airbus family of aircraft for their
flight control systems.
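One way to picture several self-checking systems working together is a simple failover loop, sketched below in Python. All names here (UnitFailure, make_unit, run_with_failover) are hypothetical, and the sketch tries the units one after another rather than running them concurrently. It is not a description of how the Airbus system is implemented.

```python
class UnitFailure(Exception):
    """Raised by a self-checking unit whose channels disagree (assumed name)."""


def make_unit(channel_1, channel_2):
    """Build a self-checking unit from two channels computing the same result."""
    def unit(x):
        r1, r2 = channel_1(x), channel_2(x)
        if r1 != r2:
            raise UnitFailure("channel results differ")
        return r1
    return unit


def run_with_failover(units, x):
    """Return the result of the first unit that passes its own check."""
    for unit in units:
        try:
            return unit(x)
        except UnitFailure:
            continue  # this unit has declared itself failed; try the next one
    raise UnitFailure("all self-checking units failed")


# A unit with a faulty second channel, followed by a healthy unit.
faulty = make_unit(lambda x: x * 2, lambda x: x * 2 + 1)
healthy = make_unit(lambda x: x * 2, lambda x: x + x)

print(run_with_failover([faulty, healthy], 21))
```

Because each unit detects its own failures, the system stays available as long as at least one unit still passes its internal comparison.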
Airbus flight control system architecture
Airbus architecture discussion
• The Airbus FCS has 5 separate computers, any one of which can
run the control software.
• Extensive use has been made of diversity
– Primary systems use a different processor from the secondary systems.
– Primary and secondary systems use chipsets from different manufacturers.
– Software in secondary systems is less complex than in the primary systems
and provides only critical functionality.
– Software in each channel is developed in different programming languages
by different teams.
– Different programming languages are used in primary and secondary systems.
N-version programming
• Multiple versions of a software system carry out
computations at the same time. There should be an
odd number of computers involved, typically 3.
• The results are compared using a voting system and
the majority result is taken to be the correct result.
• Approach derived from the notion of triple-modular
redundancy, as used in hardware systems.
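The voting step can be sketched as follows in Python. The three "versions" are stand-ins that all compute the largest value in a list using different algorithms; the names vote, NoMajority and n_version_run are assumptions for illustration. In a real system the versions would run at the same time on separate computers, whereas here they run one after another in a single process.

```python
from collections import Counter


class NoMajority(Exception):
    """Raised when no result is returned by more than half of the versions."""


def vote(results):
    """Return the majority result, or raise NoMajority if there is none."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise NoMajority(f"no majority among {results}")
    return value


# Three independently written versions of the same specification.
def version_1(values):
    return max(values)


def version_2(values):
    return sorted(values)[-1]


def version_3(values):
    largest = values[0]
    for v in values[1:]:
        if v > largest:
            largest = v
    return largest


def n_version_run(versions, data):
    """Run every version on the same input and vote on the outputs."""
    return vote([version(data) for version in versions])


print(n_version_run([version_1, version_2, version_3], [3, 9, 4]))
```

With an odd number of versions, a single faulty version is outvoted; if every version returns a different answer, the voter reports that no majority exists rather than guessing.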
Hardware fault tolerance
• Depends on triple-modular redundancy (TMR).
• There are three replicated identical components that receive
the same input and whose outputs are compared.
• If one output is different, it is ignored and component failure
is assumed.
• Based on two assumptions: most faults result from component
failures rather than design faults, and there is a low
probability of simultaneous component failure.
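The benefit of TMR can be estimated with a short calculation, sketched here in Python. It assumes that the three components fail independently, that each works with the same probability r, and that the comparison (voting) stage never fails. These simplifications are made for the example and are not stated on the slides.

```python
def tmr_reliability(r: float) -> float:
    """Probability that at least two of three components work.

    Assumes independent failures and a perfect voter:
    P(all three work) + P(exactly two work) = r^3 + 3 r^2 (1 - r).
    """
    return r**3 + 3 * r**2 * (1 - r)


for r in (0.5, 0.9, 0.99):
    print(f"component reliability {r:.2f} -> TMR reliability {tmr_reliability(r):.4f}")
```

Under these assumptions TMR helps only when each component is already more reliable than 0.5. At r = 0.9 the result is 0.972, and the gain grows as r approaches 1.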
Triple modular redundancy
N-version programming
Software diversity
• Approaches to software fault tolerance depend on software
diversity where it is assumed that different implementations of
the same software specification will fail in different ways.
• It is assumed that implementations (a) are independent and (b)
do not include common errors.
• Strategies to achieve diversity
– Different programming languages
– Different design methods and tools
– Explicit specification of different algorithms
Problems with design diversity
• Teams are not culturally diverse so they tend to tackle problems in
the same way.
• Characteristic errors
– Different teams make the same mistakes. Some parts of an implementation
are more difficult than others so all teams tend to make mistakes in the
same place.
• Specification errors
– If there is an error in the specification then this is reflected in all
implementations.
– This can be addressed to some extent by using multiple specification
representations.
Improvements in practice
• In principle, if diversity and independence can be
achieved, multi-version programming leads to very
significant improvements in reliability.
• In practice, observed improvements are much less
significant, but the approach seems to lead to reliability
improvements of between 5 and 9 times.
• The key question is whether or not such improvements
are worth the considerable extra development costs for
multi-version programming.