Lecture 3
Fault-Tolerant Design
1
System-on-Chip
EE141 Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1
What is this chapter about?
Gives Overview of Fault-Tolerant Design
Focus on
Basic Concepts in Fault-Tolerant Design
Metrics Used to Specify and Evaluate Dependability
Review of Coding Theory
Fault-Tolerant Design Schemes
– Hardware Redundancy
– Information Redundancy
– Time Redundancy
Examples of Fault-Tolerant Applications in Industry
Fault-Tolerant Design
Introduction
Fundamentals of Fault Tolerance
Fundamentals of Coding Theory
Fault Tolerant Schemes
Industry Practices
Concluding Remarks
Introduction
Fault Tolerance
Ability of system to continue error-free operation in
presence of unexpected fault
Important in mission-critical applications
E.g., medical, aviation, banking, etc.
Errors very costly
Becoming important in mainstream applications
Technology scaling causing circuit behavior to
become less predictable and more prone to failures
Needing fault tolerance to keep failure rate within
acceptable levels
Faults
Permanent Faults
Due to manufacturing defects, early life failures,
wearout failures
Wearout failures due to various mechanisms
– e.g., electromigration, hot carrier degradation, dielectric
breakdown, etc.
Temporary Faults
Only present for short period of time
Caused by external disturbance or marginal design
parameters
Temporary Faults
Transient Errors (Non-recurring errors)
Caused by external disturbance
– e.g., radiation, noise, power disturbance, etc.
Intermittent Errors (Recurring errors)
Caused by marginal design parameters
Timing problems
– e.g., races, hazards, skew
Signal integrity problems
– e.g., crosstalk, ground bounce, etc.
Redundancy
Fault Tolerance requires some form of
redundancy
Time Redundancy
Hardware Redundancy
Information Redundancy
Time Redundancy
Perform Same Operation Twice
See if get same result both times
If not, then fault occurred
Can detect temporary faults
Cannot detect permanent faults
– Would affect both computations
Advantage
Little to no hardware overhead
Disadvantage
Impacts system or circuit performance
Hardware Redundancy
Replicate hardware and compare outputs
From two or more modules
Detects both permanent and temporary faults
Advantage
Little or no performance impact
Disadvantage
Area and power for redundant hardware
Information Redundancy
Encode outputs with error detecting or
correcting code
Code selected to minimize redundancy for
class of faults
Advantage
Less hardware to generate redundant
information than replicating module
Drawback
Added complexity in design
Failure Rate
$\lambda(t)$ = Component failure rate
Measured in FITs (failures per $10^9$ hours)
System Failure Rate
System constructed from components
No Fault Tolerance
Any component fails, whole system fails
$\lambda_{sys} = \sum_{i=1}^{k} \lambda_{c,i}$
Reliability
If component working at time 0
R(t) = Probability still working at time t
$R(t) = e^{-\lambda t}$
Reliability for Series System
Series System
All components need to work for system to
work
[Diagram: components A, B, and C connected in series]
$R_{sys} = R_A \cdot R_B \cdot R_C$
System Reliability with Redundancy
System reliability with component B in
Parallel
Can tolerate one component B failing
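[Diagram: A in series with two parallel copies of B, followed by C]
$R_{sys} = R_A \left[ 1 - (1 - R_B)^2 \right] R_C = R_A \left( 2R_B - R_B^2 \right) R_C$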
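The series and parallel formulas can be checked numerically. A minimal Python sketch, assuming independent component failures; the function names and the numeric reliabilities are illustrative and not taken from the slides:

```python
def series(*reliabilities):
    """All components must work: product of the individual reliabilities."""
    result = 1.0
    for r in reliabilities:
        result *= r
    return result


def parallel(*reliabilities):
    """At least one copy must work: 1 - product of (1 - R)."""
    all_fail = 1.0
    for r in reliabilities:
        all_fail *= (1.0 - r)
    return 1.0 - all_fail


# Illustrative values only (assumption for this sketch).
r_a, r_b, r_c = 0.99, 0.90, 0.99

r_series = series(r_a, r_b, r_c)                      # A, B, C in series
r_redundant = series(r_a, parallel(r_b, r_b), r_c)    # B duplicated in parallel
```

With these values the duplicated B raises system reliability from about 0.88 to about 0.97.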
Mean-Time-to-Failure (MTTF)
Average time before system fails
Equal to area under reliability curve
$MTTF = \int_0^\infty R(t)\,dt = \int_0^\infty e^{-\lambda t}\,dt = \frac{1}{\lambda}$
Maintainability
If system failed at time 0
M(t) = Probability repaired and operational
at time t
System repair time divided into
Passive repair time
– Time for service engineer to travel to site
Active repair time
– Time to locate failing component,
repair/replace, and verify system operational
– Can be improved through designing system so
easy to locate failed component and verify
Repair Rate and MTTR
$\mu$ = rate at which system repaired
Analogous to failure rate
Maintainability often modeled as
$M(t) = 1 - e^{-\mu t}$
Availability
[Figure: system state S versus time t; S = 1 during normal system operation and S = 0 after failures, with transitions at t0 through t4]
System Availability
Fraction of time system is operational
$\text{system availability} = \frac{MTTF}{MTTF + MTTR}$
Availability
Telephone Systems
Required to have system availability of
0.9999 (“four nines”)
High-Reliability Systems
May require 7 or more nines
Fault-Tolerant Design
Needed to achieve such high availability
from less reliable components
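To get a feel for what "four nines" means in practice, availability can be converted to expected downtime per year. A small Python sketch; the MTTF and MTTR figures are made-up illustrative values, and the helper names are assumptions of this sketch:

```python
def availability(mttf_hours, mttr_hours):
    """Fraction of time the system is operational: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_minutes_per_year(avail):
    """Expected downtime in a 365-day year for a given availability."""
    return (1.0 - avail) * 365 * 24 * 60


# Illustrative numbers: one failure every 9999 hours, one hour to repair.
four_nines = availability(mttf_hours=9999.0, mttr_hours=1.0)
minutes_down = downtime_minutes_per_year(four_nines)
```

Under these assumptions an availability of 0.9999 still allows roughly 52.6 minutes of downtime per year.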
Coding Theory
Coding
Using more bits than necessary to represent
data
Provides way to detect errors
– Errors occur when bits get flipped
Error Detecting Codes
Many types
Detect different classes of errors
Use different amounts of redundancy
Ease of encoding and decoding data varies
Block Code
Message = Data Being Encoded
Block code
Encodes m messages with n-bit codeword
$\text{redundancy} = 1 - \frac{\log_2 m}{n}$
If no redundancy
m messages encoded with log2(m) bits
minimum possible
Block Code
To detect errors, some redundancy
needed
Space of $2^n$ distinct blocks partitioned into
codewords and non-codewords
Can detect errors that cause codeword
to become non-codeword
Cannot detect errors that cause
codeword to become another codeword
Separable Block Code
Separable
n-bit blocks partitioned into
– k information bits directly representing message
– (n-k) check bits
Denoted (n,k) Block Code
Advantage
k-bit message directly extracted without
decoding
Rate of Separable Block Code = k/n
Example of Separable Block Code
(4,3) Parity Code
Check bit is XOR of 3 message bits
Message 101 → codeword 1010
Single Bit Parity
$\text{redundancy} = 1 - \frac{\log_2 m}{n} = 1 - \frac{\log_2 (2^k)}{n} = 1 - \frac{k}{n} = \frac{n-k}{n} = \frac{1}{n}$
$\text{rate} = \frac{k}{n} = \frac{n-1}{n}$
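The (4,3) parity code is small enough to try directly. A minimal Python sketch, with bits held as lists of 0/1 integers; the function names are illustrative:

```python
def parity_encode(message_bits):
    """Append one check bit equal to the XOR of the message bits."""
    check = 0
    for bit in message_bits:
        check ^= bit
    return message_bits + [check]


def parity_ok(codeword_bits):
    """A valid codeword has an even number of 1s (XOR of all bits is 0)."""
    total = 0
    for bit in codeword_bits:
        total ^= bit
    return total == 0


codeword = parity_encode([1, 0, 1])          # the message 101 from the slide
single_error = [1, 1, 1, 0]                  # one bit flipped: detected
double_error = [0, 1, 1, 0]                  # two bits flipped: not detected
```

The double-bit error lands on another codeword, which is why single-bit parity cannot detect it.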
Example of Non-Separable Block Code
One-Hot Code
Each Codeword has single 1
Example of 8-bit one-hot
– 10000000, 01000000, 00100000, 00010000
00001000, 00000100, 00000010, 00000001
Redundancy = 1 - log2(8)/8 = 5/8
Linear Block Codes
Special class
Modulo-2 sum of any 2 codewords also
codeword
Null space of $(n-k) \times n$ Boolean matrix
– Called Parity Check Matrix, H
For any n-bit codeword c
$cH^T = 0$
All 0 codeword exists in any linear code
Linear Block Codes
Generator Matrix, G
$k \times n$ Matrix
Codeword c for message m
$c = mG$
$GH^T = 0$
Systematic Block Code
First k-bits correspond to message
Last n-k bits correspond to check bits
For Systematic Code
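$G = [\, I_{k \times k} : P_{k \times (n-k)} \,]$
$H = [\, P^T_{(n-k) \times k} : I_{(n-k) \times (n-k)} \,]$
Example
$G = \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix} \qquad H = \begin{bmatrix} 1 & 1 & 1 & 1 \end{bmatrix}$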
Distance of Code
Distance between two codewords
Number of bits in which they differ
Distance of Code
Minimum distance between any two
codewords in code
If n=k (no redundancy), distance = 1
Single-bit parity, distance = 2
Code with distance d
Detect up to d-1 errors
Correct up to $\lfloor (d-1)/2 \rfloor$ errors
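Code distance can be computed by brute force for small codes. A Python sketch using the eight codewords of the (4,3) parity code; the helper names are illustrative:

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of bit positions in which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def code_distance(codewords):
    """Minimum distance over all pairs of distinct codewords."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))


# All codewords of the (4,3) single-bit parity code.
parity_code = ["0000", "0011", "0101", "0110", "1001", "1010", "1100", "1111"]

d = code_distance(parity_code)
detectable = d - 1            # errors guaranteed to be detected
correctable = (d - 1) // 2    # errors guaranteed to be corrected
```

For single-bit parity this gives distance 2: one error is detected and none can be corrected.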
Error Correcting Codes
Code with distance 3
Called single error correcting (SEC) code
Code with distance 4
Called single error correcting and double
error detecting (SEC-DED) code
Procedure for constructing SEC code
Described in [Hamming 1950]
Any H-matrix with all columns distinct and
no all-0 column is SEC
Hamming Code
For any value of n
SEC code constructed by
– setting each column in H equal to binary
representation of column number (starting from 1)
Number of rows in H equal to $\lceil \log_2(n+1) \rceil$
Example of SEC Hamming Code for n=7
$H = \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 & 1 & 0 & 1 \end{bmatrix}$
Error Correction in Hamming Code
Syndrome, s
$s = Hv^T$ for received vector v
If v is codeword
– Syndrome = 0
If v non-codeword and single-bit error
– Syndrome will match one of columns of H
– Will contain binary value of bit position in error
Example of Error Correction
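For (7,4) Hamming Code
Suppose codeword 0110011 has one-bit
error changing it to 1110011
$s = vH^T = [1110011] \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix} = [001]$
– Syndrome 001 is binary 1, so bit position 1 is in error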
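Because column j of H is the binary representation of j, the syndrome is simply the XOR of the positions (counted from 1) that hold a 1. A minimal Python sketch of single-error correction for the n=7 code; the function names are illustrative:

```python
def syndrome(received):
    """XOR of the 1-based positions holding a 1; equals v * H^T read as a binary number."""
    s = 0
    for position, bit in enumerate(received, start=1):
        if bit:
            s ^= position
    return s


def correct_single_error(received):
    """Flip the bit named by a nonzero syndrome (assumes at most one bit in error)."""
    s = syndrome(received)
    corrected = list(received)
    if s != 0:
        corrected[s - 1] ^= 1
    return corrected, s


codeword = [0, 1, 1, 0, 0, 1, 1]
received = [1, 1, 1, 0, 0, 1, 1]      # bit position 1 flipped
fixed, s = correct_single_error(received)
```

If two bits are in error, this procedure flips a third bit instead, which is the miscorrection that the SEC-DED extension is meant to avoid.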
SEC-DED Code
Make SEC Hamming Code SEC-DED
By adding parity check over all bits
Extra parity bit
– 1 for single-bit error
– 0 for double-bit error
Makes possible to detect double bit error
– Avoid assuming single-bit error and
miscorrecting it
Example of Error Correction
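For (7,3) SEC-DED Hamming Code
Suppose codeword 0110011 has two-bit
error changing it to 1010011
– Syndrome doesn’t match any column in H
$s = vH^T = [1010011] \begin{bmatrix} 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 1 \end{bmatrix} = [0110]$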
Hsiao Code
Weight of column
Number of 1’s in column
Constructing n-bit SEC-DED Hsiao Code
First use all possible weight-1 columns
– Then all possible weight-3 columns
– Then weight-5 columns, etc.
Until n columns formed
Number check bits is log2(n+1)
Minimizes number of 1’s in H-matrix
– Less hardware and delay for computing syndrome
– Disadvantage: Correction logic more complex
Example of Hsiao Code
(7,3) Hsiao Code
Uses weight-1 and weight-3 columns
$H = \begin{bmatrix} 0 & 0 & 0 & 1 & 0 & 1 & 1 \\ 0 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 0 & 1 & 1 & 1 \end{bmatrix}$
Unidirectional Errors
Errors in block of data which only cause
0→1 or 1→0, but not both
Any number of bits in error in one direction
Example
Correct codeword 111000
Unidirectional errors could cause
– 001000, 000000, 101000 (only 1→0 errors)
Non-unidirectional errors
– 101001, 011001, 011011 (both 1→0 and 0→1)
Unidirectional Error Detecting Codes
All unidirectional error detecting (AUED)
Codes
Detect all unidirectional errors in codeword
Single-bit parity is not AUED
– Cannot detect even number of errors
No linear code is AUED
– All linear codes must contain all-0 vector, so
cannot detect all 1→0 errors
Two-Rail Code
Two-Rail Code
One check bit for each information bit
– Equal to complement of information bit
Two-Rail Code is AUED
50% Redundancy
Example of (6,3) Two-Rail Code
Message 101 has Codeword 101010
Set of all codewords
– 000111, 001110, 010101, 011100, 100011,
101010, 110001, 111000
Berger Codes
Lowest redundancy of separable AUED
codes
For k information bits, $\lceil \log_2(k+1) \rceil$ check bits
Check bits equal to binary representation
of number of 0’s in information bits
Example
Information bits 1000101
– $\lceil \log_2(7+1) \rceil = 3$ check bits
– Check bits equal to 100 (4 zeros)
Berger Codes
Codewords for (5,3) Berger Code
00011, 00110, 01010, 01101, 10010,
10101, 11001, 11100
If unidirectional errors
Contain 1→0 errors
– increase 0’s in information bits
– can only decrease binary number in check bits
Contain 0→1 errors
– decrease 0’s in information bits
– can only increase binary number in check bits
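Berger encoding and checking take only a few lines. A Python sketch with codewords as bit strings; the function names are illustrative, and `int.bit_length` is used because `k.bit_length()` equals the number of check bits needed to count up to k zeros:

```python
def berger_encode(info_bits):
    """Append the count of 0s in the information bits as a binary number."""
    k = len(info_bits)
    check_width = k.bit_length()              # enough bits to represent 0..k
    zeros = info_bits.count("0")
    return info_bits + format(zeros, f"0{check_width}b")


def berger_check(word, k):
    """Valid if the check bits still equal the number of 0s in the k information bits."""
    info, check = word[:k], word[k:]
    return info.count("0") == int(check, 2)


all_5_3 = [berger_encode(format(i, "03b")) for i in range(8)]
example = berger_encode("1000101")            # slide example: 4 zeros, check bits 100

# Unidirectional error on 10101: two 1→0 flips give 00100.
detected = not berger_check("00100", 3)
```

The 1→0 flips raise the zero count in the information bits while they can only lower the stored count, so the two can never agree again.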
Berger Codes
If 8 information bits
Berger code requires $\lceil \log_2(8+1) \rceil = 4$ check bits
$\text{redundancy} = 1 - \frac{\log_2 m}{n} = 1 - \frac{\log_2 (2^k)}{n} = 1 - \frac{8}{12} = \frac{1}{3} \approx 33\%$
Example
6-out-of-12 constant weight code
$C_6^{12} = 924$ codewords
Constant Weight Codes
Advantage
Less redundancy than Berger codes
Disadvantage
Non-separable
Need decoding logic
– to convert codeword back to binary message
Burst Error
Burst Error
Common, multi-bit errors tend to be clustered
– Noise source affects contiguous set of bus lines
Length of burst error
– number of bits from first error through last error, inclusive
Burst may wrap around from last to first bit of codeword
Example: Original codeword 00000000
00111100 is burst error length 4
00110100 is burst error length 4
– Any number of errors between first and last error
Cyclic Codes
Special class of linear code
Any codeword shifted cyclically is another
codeword
Used to detect burst errors
Less redundancy required to detect burst
error than general multi-bit errors
– Some distance 2 codes can detect all burst
errors of length 4
– detecting all possible 4-bit errors requires
distance 5 code
Cyclic Redundancy Check (CRC) Code
Most widely used cyclic code
Uses binary alphabet based on GF(2)
CRC code is (n,k) block code
Formed using generator polynomial, g(x)
– called code generator
– degree n-k polynomial (same degree as
number of check bits)
$g(x) = g_{n-k} x^{n-k} + \dots + g_2 x^2 + g_1 x + g_0$
$c(x) = m(x)\,g(x)$
Message  m(x)               g(x)     c(x)                   Codeword
0000     0                  x^2 + 1  0                      000000
0001     1                  x^2 + 1  x^2 + 1                000101
0010     x                  x^2 + 1  x^3 + x                001010
0011     x + 1              x^2 + 1  x^3 + x^2 + x + 1      001111
0100     x^2                x^2 + 1  x^4 + x^2              010100
0101     x^2 + 1            x^2 + 1  x^4 + 1                010001
0110     x^2 + x            x^2 + 1  x^4 + x^3 + x^2 + x    011110
0111     x^2 + x + 1        x^2 + 1  x^4 + x^3 + x + 1      011011
1000     x^3                x^2 + 1  x^5 + x^3              101000
1001     x^3 + 1            x^2 + 1  x^5 + x^3 + x^2 + 1    101101
1010     x^3 + x            x^2 + 1  x^5 + x                100010
1011     x^3 + x + 1        x^2 + 1  x^5 + x^2 + x + 1      100111
1100     x^3 + x^2          x^2 + 1  x^5 + x^4 + x^3 + x^2  111100
1101     x^3 + x^2 + 1      x^2 + 1  x^5 + x^4 + x^3 + 1    111001
1110     x^3 + x^2 + x      x^2 + 1  x^5 + x^4 + x^2 + x    110110
1111     x^3 + x^2 + x + 1  x^2 + 1  x^5 + x^4 + x + 1      110011
CRC Code
Linear block code
Has G-matrix and H-matrix
G-matrix shifted version of generator
polynomial
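$G = \begin{bmatrix} g_{n-k} & \cdots & g_1 & g_0 & 0 & \cdots & 0 \\ 0 & g_{n-k} & \cdots & g_1 & g_0 & \cdots & 0 \\ \vdots & & \ddots & & & \ddots & \vdots \\ 0 & \cdots & 0 & g_{n-k} & \cdots & g_1 & g_0 \end{bmatrix}$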
CRC Code Example
(6,4) CRC code generated by $g(x) = x^2 + 1$
$G = \begin{bmatrix} 1 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 1 \end{bmatrix}$
Systematic CRC Codes
To obtain systematic CRC code
codewords formed using Galois division
– nice because LFSR can be used for performing
division
$c(x) = m(x)\,x^{n-k} + r(x)$
$r(x) = \text{remainder of } \frac{m(x)\,x^{n-k}}{g(x)}$
Galois Division Example
Encode $m(x) = x^2 + x$ with $g(x) = x^2 + 1$
Requires dividing $m(x)\,x^{n-k} = x^4 + x^3$ by $g(x)$
        111
101 ) 11000
      101
       110
       101
        110
        101
         11   remainder
Remainder $r(x) = x + 1$
– $c(x) = m(x)\,x^{n-k} + r(x) = (x^2 + x)(x^2) + x + 1 = x^4 + x^3 + x + 1$
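The same Galois division can be done in software by treating each polynomial as an integer whose bit i is the coefficient of x^i. A minimal Python sketch (a bitwise stand-in for the LFSR, not a model of the hardware); the function names are illustrative:

```python
def gf2_remainder(dividend, divisor):
    """Remainder of polynomial division over GF(2) (subtraction is XOR)."""
    divisor_len = divisor.bit_length()
    while dividend.bit_length() >= divisor_len:
        shift = dividend.bit_length() - divisor_len
        dividend ^= divisor << shift          # cancel the current leading term
    return dividend


def crc_encode(message, generator):
    """Systematic codeword: message shifted left by n-k bits, plus the remainder."""
    check_bits = generator.bit_length() - 1   # degree of g(x) = n - k
    shifted = message << check_bits
    return shifted | gf2_remainder(shifted, generator)


remainder = gf2_remainder(0b11000, 0b101)     # (x^4 + x^3) mod (x^2 + 1)
codeword = crc_encode(0b0110, 0b101)          # m(x) = x^2 + x, g(x) = x^2 + 1
```

A received word is a valid codeword exactly when `gf2_remainder(word, generator)` is 0.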
Message  m(x)               g(x)     r(x)   c(x)                   Codeword
0000     0                  x^2 + 1  0      0                      000000
0001     1                  x^2 + 1  1      x^2 + 1                000101
0010     x                  x^2 + 1  x      x^3 + x                001010
0011     x + 1              x^2 + 1  x + 1  x^3 + x^2 + x + 1      001111
0100     x^2                x^2 + 1  1      x^4 + 1                010001
0101     x^2 + 1            x^2 + 1  0      x^4 + x^2              010100
0110     x^2 + x            x^2 + 1  x + 1  x^4 + x^3 + x + 1      011011
0111     x^2 + x + 1        x^2 + 1  x      x^4 + x^3 + x^2 + x    011110
1000     x^3                x^2 + 1  x      x^5 + x                100010
1001     x^3 + 1            x^2 + 1  x + 1  x^5 + x^2 + x + 1      100111
1010     x^3 + x            x^2 + 1  0      x^5 + x^3              101000
1011     x^3 + x + 1        x^2 + 1  1      x^5 + x^3 + x^2 + 1    101101
1100     x^3 + x^2          x^2 + 1  x + 1  x^5 + x^4 + x + 1      110011
1101     x^3 + x^2 + 1      x^2 + 1  x      x^5 + x^4 + x^2 + x    110110
1110     x^3 + x^2 + x      x^2 + 1  1      x^5 + x^4 + x^3 + 1    111001
1111     x^3 + x^2 + x + 1  x^2 + 1  0      x^5 + x^4 + x^3 + x^2  111100
Generating Check Bits for CRC Code
Use LFSR
With characteristic polynomial equal to g(x)
Append n-k 0’s to end of message
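Example: $m(x) = x^2 + x + 1$ and $g(x) = x^3 + x + 1$
[Figure: 3-bit LFSR starting at 000; shifting in the message followed by the appended 0’s (111000) leaves check bits 010; shifting in the codeword to check (111010) leaves 000]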
Selecting Generator Polynomial
Key issue for CRC Codes
If first and last bit of polynomial are 1
– Will detect burst errors of length n-k or less
If generator polynomial is multiple of (x+1)
– Will detect any odd number of errors
If g(x) = (x+1)p(x) where p(x) primitive of
degree n-k-1 and $n < 2^{n-k-1}$
– Will detect single, double, triple, and odd errors
Commonly Used CRC Generators
CRC code Generator Polynomial
Fault Tolerance Schemes
Adding Fault Tolerance to Design
Improves dependability of system
Requires redundancy
– Hardware
– Time
– Information
Hardware Redundancy
Involves replicating hardware units
At any level of design
– gate-level, module-level, chip-level, board-level
Three Basic Forms
Static (also called Passive)
– Masks faults rather than detects them
Dynamic (also called Active)
– Detects faults and reconfigures to spare hardware
Hybrid
– Combines active and passive approaches
Static Redundancy
Masks faults so no erroneous outputs
Provides uninterrupted operation
Important for real-time systems
– No time to reconfigure or retry operation
Simple and self-contained
– No need to update or rollback system state
Triple Module Redundancy (TMR)
Well-known static redundancy scheme
Three copies of module
Use majority voter to determine final output
Error in one module out-voted by other two
[Figure: Modules 1, 2, and 3 feed a majority voter that produces the final output]
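A majority voter is just a 2-of-3 function applied to each output bit. A minimal Python sketch, modeling each module output as an integer bit vector; the function name and sample values are illustrative:

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority of three module outputs."""
    return (a & b) | (a & c) | (b & c)


good = 0b1010
faulty = 0b0110                       # one module produces a wrong output
voted = majority_vote(good, good, faulty)
```

The faulty module is out-voted regardless of which of the three positions it occupies.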
TMR Reliability and MTTF
TMR works if any 2 modules work
Rm = reliability of each module
Rv = reliability of voter
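$R_{TMR} = R_v \left[ R_m^3 + C_2^3 R_m^2 (1 - R_m) \right] = R_v \left( 3R_m^2 - 2R_m^3 \right)$
$MTTF_{TMR} = \int_0^\infty R_{TMR}\,dt = \frac{3}{2\lambda_m + \lambda_v} - \frac{2}{3\lambda_m + \lambda_v}$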
Comparison with Simplex
Neglecting fault rate of voter
$MTTF_{TMR} = \frac{3}{2\lambda_m} - \frac{2}{3\lambda_m} = \frac{5}{6} \cdot \frac{1}{\lambda_m} = \frac{5}{6}\,MTTF_{simplex}$
Comparison with Simplex
Crossover point
$R_{TMR} = R_{simplex}$
$3e^{-2\lambda_m t} - 2e^{-3\lambda_m t} = e^{-\lambda_m t}$
Solve: $t = \frac{\ln 2}{\lambda_m} \approx 0.7\,MTTF_{simplex}$
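The crossover can be confirmed numerically. A short Python sketch, neglecting the voter as on the slide; the failure rate is an arbitrary illustrative value and the function names are assumptions of this sketch:

```python
import math


def r_simplex(failure_rate, t):
    """Reliability of a single module with constant failure rate."""
    return math.exp(-failure_rate * t)


def r_tmr(failure_rate, t):
    """TMR reliability with a perfect voter: 3R^2 - 2R^3."""
    r = r_simplex(failure_rate, t)
    return 3 * r ** 2 - 2 * r ** 3


lam = 1e-4                             # illustrative failure rate per hour
mttf_simplex = 1.0 / lam
t_cross = math.log(2) / lam            # about 0.69 * MTTF of one module

short_mission = 0.1 * mttf_simplex     # TMR better than simplex here
long_mission = 1.0 * mttf_simplex      # TMR worse than simplex here
```

Before the crossover TMR is more reliable than a single module; after it, the three modules together are more likely to have lost two than the simplex is to have lost one.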
N-Modular Redundancy (NMR)
NMR
N modules along with majority voter
– TMR special case
Number of failed modules masked = (N-1)/2
As N increases, MTTF decreases
– But, reliability for short missions increases
If goal only to tolerate temporary faults
TMR sufficient
Interwoven Logic
Replace each gate
with 4 gates using interconnection pattern
that automatically corrects errors
Traditionally not as attractive as TMR
Requires lots of area overhead
Renewed interest by researchers
investigating emerging nanoelectronic
technologies
Interwoven Logic with 4 NOR Gates
[Figure: a network of four NOR gates (1 to 4) with inputs X and Y, and its interwoven version in which each gate is replaced by four copies (1a-1d, 2a-2d, 3a-3d, 4a-4d)]
Example of Error on Third Y Input
[Figure: the interwoven NOR network (gates 1a-1d through 4a-4d) annotated with signal values for an error on the third Y input]
Dynamic Redundancy
Involves
Detecting fault
Locating faulty hardware unit
Reconfiguring system to use spare fault-free
hardware unit
Unpowered (Cold) Spares
Advantage
Extends lifetime of spares
Equations
Assume spare not failing until powered
Perfect reconfiguration capability
$R_{w/cold\_spare} = (1 + \lambda t)\,e^{-\lambda t}$
$MTTF_{w/cold\_spare} = \frac{2}{\lambda}$
Unpowered (Cold) Spares
One cold spare doubles MTTF
Assuming faults always detected and
reconfiguration circuitry never fails
Drawback of cold spare
Extra time to power and initialize
Cannot be used to help in detecting faults
Fault detection requires either
– periodic offline testing
– online testing using time or information
redundancy
Powered (Hot) Spares
Can use spares for online fault detection
One approach is duplicate-and-compare
If outputs mismatch then fault occurred
– Run diagnostic procedure to determine which
module is faulty and replace with spare
Any number of spares can be used
[Figure: duplicate-and-compare with Module A, Module B, and a spare module; a comparator on the module outputs signals Agree/Disagree]
Pair-and-a-Spare
Avoids halting system to run diagnostic
procedure when fault occurs
[Figure: pair-and-a-spare; Modules A and B form one compared pair and Modules C and D form a second, each with an Agree/Disagree comparator, and a switch selects between the pairs]
TMR/Simplex
When one module in TMR fails
Disconnect one of remaining modules
Improves MTTF while retaining advantages
of TMR when 3 good modules
TMR/Simplex
Reliability always better than either TMR or
Simplex alone
Comparison of Reliability vs Time
[Figure: reliability versus normalized mission time (T/MTTF) for simplex, TMR, and TMR/simplex]
Hybrid Redundancy
Combines both static and dynamic
redundancy
Masks faults like static
Detects and reconfigures like dynamic
TMR with Spares
If TMR module fails
Replace with spare
– can be either hot or cold spare
While system has three working modules
– TMR will provide fault masking for
uninterrupted operation
Self-Purging Redundancy
Uses threshold voter instead of majority
voter
Threshold voter outputs 1 if number of
inputs that are 1 is greater than threshold
– Otherwise outputs 0
Requires hot spares
Self-Purging Redundancy
[Figure: Modules 1 to 5 each connect through an elementary switch to a threshold voter; inset shows an elementary switch built from an S-R flip-flop and an AND gate, with Initialization, Module, and Voter signals]
Self-Purging Redundancy
Compared with 5MR
Self-purging with 5 modules
– Tolerate up to 3 failing modules (5MR cannot)
– Cannot tolerate two modules simultaneously
failing (5MR can)
Compared with TMR with 2 spares
Self-purging with 5 modules
– simpler reconfiguration circuitry
– requires hot spares (3MR w/spares can use
either hot or cold spares)
Time Redundancy
Advantage
Less hardware
Drawback
Cannot detect permanent faults
If error detected
System needs to rollback to known good
state before resuming operation
Repeated Execution
Repeat operation twice
Simplest time redundancy approach
Detects temporary faults occurring during
one execution (but not both)
– Causes mismatch in results
Can reuse same hardware for both
executions
– Only one copy of functional hardware needed
Repeated Execution
Requires mechanism for storing and
comparing results of both executions
In processor, can store in memory or on
disk and use software to compare
Main cost
Additional time for redundant execution
and comparison
Multi-threaded Redundant Execution
Can use in processor-based system that
can run multiple threads
Two copies of thread executed concurrently
Results compared when both complete
Take advantage of processor’s built-in
capability to exploit processing resources
– Reduce execution time
– Can significantly reduce performance penalty
Multiple Sampling of Outputs
Done at circuit-level
Sample once at end of normal clock cycle
Sample again after delay of Δt
Two samples compared to detect mismatch
– Indicates error occurred
Detect fault whose duration is less than Δt
Performance overhead depends on
– Size of Δt relative to normal clock period
Multiple Sampling of Outputs
Simple approach using two latches
[Figure: main latch and shadow latch sampling the same output, with the shadow latch clocked at Clk+Δt]
Multiple Sampling of Outputs
Approach using stability checker at output
[Figure: timing diagram showing each normal clock period followed by a stability checking period Δt, and a stability checker circuit with Signal and Checking Period inputs and an Error output]
Diverse Recomputation
Use same hardware, but perform
computation differently second time
Can detect permanent faults that affect
only one computation
For arithmetic or logical operations
Shift operands when performing second
computation [Patel 1982]
Detects permanent fault affecting only one
bit-slice
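Recomputing with shifted operands can be illustrated with a toy adder. A Python sketch under an assumed fault model (one output bit of the adder stuck at 0); the function names, widths, and fault model are choices made for this sketch and are not taken from [Patel 1982]:

```python
def faulty_add(a, b, width, stuck_bit=None):
    """Adder model whose output bit `stuck_bit` is stuck at 0 (assumed fault model)."""
    total = (a + b) & ((1 << width) - 1)
    if stuck_bit is not None:
        total &= ~(1 << stuck_bit)
    return total


def add_with_shifted_recompute(a, b, width=8, stuck_bit=None):
    """Return (first_result, mismatch_flag).

    The datapath is two bits wider than the operands so the carry out and
    the left-shifted operands still fit.
    """
    datapath = width + 2
    first = faulty_add(a, b, datapath, stuck_bit)
    # Second pass: shift operands left, add on the same hardware, shift back.
    second = faulty_add(a << 1, b << 1, datapath, stuck_bit) >> 1
    return first, first != second


fault_free = add_with_shifted_recompute(5, 6)
with_fault = add_with_shifted_recompute(5, 6, stuck_bit=1)
```

Because the shift moves each result bit to a neighboring bit-slice, the faulty slice corrupts a different result bit in each pass and the comparison exposes it.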
Information Redundancy
Based on Error Detecting and
Correcting Codes
Advantage
Detects both permanent and temporary
faults
Implemented with less hardware overhead
than using multiple copies of module
Disadvantage
More complex design
Error Detection
Error detecting codes used to detect
errors
If error detected
– Rollback to previous known error-free state
– Retry operation
Rollback
Requires adding storage to save
previous state
Amount of rollback depends on latency of
error detection mechanism
Zero-latency error detection
– rollback implemented by preventing system
state from updating
If errors detected after n cycles
– need rollback restoring system to state at least
n clock cycles earlier
Checkpoint
Execution divided into set of operations
Before each operation executed
– checkpoint created where system state saved
If any error detected during operation
– rollback to last checkpoint and retry operation
If multiple retries fail
– operation halts and system flags that
permanent fault has occurred
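The checkpoint, rollback, and retry flow can be sketched in software. A minimal Python sketch, assuming system state is a dictionary and error detection is a caller-supplied check; `PermanentFault`, `run_with_checkpoint`, and the toy workload are hypothetical names for illustration:

```python
import copy


class PermanentFault(Exception):
    """Raised when every retry fails (hypothetical name for this sketch)."""


def run_with_checkpoint(state, operation, error_detected, max_retries=3):
    checkpoint = copy.deepcopy(state)             # save state before the operation
    for _ in range(max_retries):
        operation(state)
        if not error_detected(state):
            return state                          # completed error-free
        state.clear()
        state.update(copy.deepcopy(checkpoint))   # roll back, then retry
    raise PermanentFault("retries exhausted; flag permanent fault")


# Toy workload: increment x and keep a mod-3 residue as the check value.
attempts = {"count": 0}


def increment(state):
    attempts["count"] += 1
    state["x"] += 1
    state["residue"] = (state["residue"] + 1) % 3
    if attempts["count"] == 1:
        state["x"] ^= 0x10                        # simulated transient bit flip, first attempt only


def residue_mismatch(state):
    return state["x"] % 3 != state["residue"]


final_state = run_with_checkpoint({"x": 5, "residue": 2}, increment, residue_mismatch)
```

The first attempt is corrupted and rolled back; the second succeeds, so a transient fault costs one retry while a fault that persists through all retries is reported as permanent.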
Error Detection
Encode outputs of circuit with error
detecting code
Non-codeword output indicates error
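[Figure: m inputs feed both the functional logic, which produces k outputs, and a check bit generator, which produces c check bits; a checker takes the k outputs and c check bits and produces the error indication]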
Self-Checking Checker
Has two outputs
Normal error-free case (1,0) or (0,1)
If equal to each other, then error (0,0) or (1,1)
Cannot have single error indicator output
– Stuck-at 0 fault on output could never be detected
Totally Self-Checking Checker
Requires three properties
Code Disjoint
– all codeword inputs mapped to codeword outputs, and
all non-codeword inputs mapped to non-codeword outputs
Fault Secure
– for all codeword inputs, checker in presence of
fault will either produce correct codeword output
or non-codeword output (not incorrect codeword)
Self-Testing
– For each fault, at least one codeword input gives
error indication
Duplicate-and-Compare
Equality checker indicates error
Undetected error can occur only if
common-mode fault affecting both copies
Only faults after stems detected
Over 100% overhead (including checker)
[Figure: primary inputs fan out at stems to two copies of the functional logic; an equality checker compares their outputs and produces the error indication]
Single-Bit Parity Code
Totally self-checking checker formed by
removing final gate from XOR tree
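[Figure: functional logic and parity prediction logic feed an XOR tree with its final gate removed, giving two error indication outputs EI0 and EI1]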
Single-Bit Parity Code
Cannot detect errors affecting an even number of bits
Can ensure no even-bit errors by
generating each output with independent
cone of logic
– Only single bit errors can occur due to single
point fault
– Typically requires a lot of overhead
Parity-Check Codes
Each check bit is parity for some set of
output bits
Example: 6 outputs and 3 check bits
                Z1 Z2 Z3 Z4 Z5 Z6 c1 c2 c3
Parity Group 1   1  0  0  1  1  0  1  0  0
Parity Group 2   0  1  1  0  0  0  0  1  0
Parity Group 3   0  0  0  0  0  1  0  0  1
Parity-Check Codes
For c check bits and k functional
outputs
$2^{ck}$ possible parity check codes
Can choose code based on structure of
circuit to minimize undetected error
combinations
Fanouts in circuit determine possible error
combinations due to single-point fault
Checker for Parity-Check Codes
Constructed from single-bit parity
checkers and two-rail checkers
[Figure: three parity checkers, one per parity group (Z1, Z4, Z5, c1; Z2, Z3, c2; Z6, c3), combined by two two-rail checkers into error indication outputs E0 and E1]
Two-Rail Checkers
Totally self-checking two-rail checker
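[Figure: two-rail checker with input pairs (A0, A1) and (B0, B1) and output pair (C0, C1), built from four AND gates and two OR gates]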
Berger Codes
Inverter-free circuit
Inverters only at primary inputs
Can be synthesized using only algebraic
factoring [Jha 1993]
Only unidirectional errors possible for
single point faults
– Can use unidirectional code
– Berger code gives 100% coverage
Constant Weight Codes
Non-separable with lower redundancy
Drawback: need decoding logic to convert
codeword back to its original binary value
Can use for encoding states of FSM
– No need for decoding logic
Error Correction
Information redundancy can also be
used to mask errors
Not as attractive as TMR because logic for
predicting check bits very complex
However, very good for memories
– Check bits stored with data
– Errors do not propagate in memories as in logic
circuits, so SEC-DED usually sufficient
Error Correction
Memories very dense and prone to errors
Especially due to single-event upsets (SEUs)
from radiation
SEC-DED check bits stored in memory
32-bit word, SEC-DED requires 7 check bits
– Increases size of memory by 7/32=21.9%
64-bit word, SEC-DED requires 8 check bits
– Increases size of memory by 8/64=12.5%
Memory ECC Architecture
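[Figure: memory ECC architecture; on a write, check bits are generated from the data word in and written to memory with the data word; on a read, check bits are recalculated from the read data word for use with the read check bits]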
Hamming Code for ECC RAM
[Figure: ECC RAM with a RAM core of N words and Z+c+1 bits/word; the generate side has a Hamming check bit generator (c bits) and a parity bit generator on the input data; the detect/correct side has a Hamming check bit generator, a parity generator, and a bit error correction circuit producing the output data]
                Z1 Z2 Z3 Z4 Z5 Z6 Z7 Z8 c1 c2 c3 c4
Parity Group 1   1  1  0  1  1  0  1  0  1  0  0  0
Parity Group 2   1  0  1  1  0  1  1  0  0  1  0  0
Parity Group 3   0  1  1  1  0  0  0  1  0  0  1  0
Parity Group 4   0  0  0  0  1  1  1  1  0  0  0  1
Memory Scrubbing
Every location in memory read on
periodic basis
Reduces chance of multiple errors
accumulating in a memory word
Can be implemented by having memory
controller cycle through memory during idle
periods
Multiple-Bit Upsets (MBU)
Can occur due to single SEU
Typically occur in adjacent memory cells
Memory interleaving used
To prevent MBUs from resulting in multiple
bit errors in same word
Memory
Word1 Word2 Word3 Word4 Word1 Word2 Word3 Word4 Word1 Word2 Word3 Word4
Bit1 Bit1 Bit1 Bit1 Bit2 Bit2 Bit2 Bit2 Bit3 Bit3 Bit3 Bit3
Type: Mainstream Low-Cost Systems
Issues: Reasonable Level of Failures Acceptable
Goal: Meet Failure Rate Expectations at Low Cost
Examples: Consumer Electronics; Personal Computers
Techniques: Often None; Memory ECC; Bus Parity; Changing as Technology Scales
Concluding Remarks
Many different fault-tolerant schemes
Choosing scheme depends on
Types of faults to be tolerated
– Temporary or permanent
– Single or multiple point failures
– etc.
Design constraints
– Area, performance, power, etc.
Concluding Remarks
As technology scales
Circuits increasingly prone to failure
Achieving sufficient fault tolerance will be
major design issue