Lecture 3

The chapter discusses fault-tolerant design including basic concepts, metrics for dependability, coding theory, fault-tolerant schemes using hardware redundancy, information redundancy and time redundancy, and examples of fault-tolerant applications in industry.

Uploaded by

Ahmed Saeed
Chapter 3

Fault-Tolerant Design

1
System-on-Chip
EE141 Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1
What is this chapter about?
 Gives Overview of Fault-Tolerant Design
 Focus on
 Basic Concepts in Fault-Tolerant Design
 Metrics Used to Specify and Evaluate Dependability
 Review of Coding Theory
 Fault-Tolerant Design Schemes
– Hardware Redundancy
– Information Redundancy
– Time Redundancy
 Examples of Fault-Tolerant Applications in Industry
Fault-Tolerant Design

 Introduction
 Fundamentals of Fault Tolerance
 Fundamentals of Coding Theory
 Fault Tolerant Schemes
 Industry Practices
 Concluding Remarks

Introduction
 Fault Tolerance
 Ability of system to continue error-free operation in
presence of unexpected fault
 Important in mission-critical applications
 E.g., medical, aviation, banking, etc.
 Errors very costly
 Becoming important in mainstream applications
 Technology scaling causing circuit behavior to
become less predictable and more prone to failures
 Needing fault tolerance to keep failure rate within
acceptable levels
Faults
 Permanent Faults
 Due to manufacturing defects, early life failures,
wearout failures
 Wearout failures due to various mechanisms
– e.g., electromigration, hot carrier degradation, dielectric
breakdown, etc.
 Temporary Faults
 Only present for short period of time
 Caused by external disturbance or marginal design
parameters

Temporary Faults
 Transient Errors (Non-recurring errors)
 Caused by external disturbance
– e.g., radiation, noise, power disturbance, etc.
 Intermittent Errors (Recurring errors)
 Caused by marginal design parameters
 Timing problems
– e.g., races, hazards, skew
 Signal integrity problems
– e.g., crosstalk, ground bounce, etc.
Redundancy
 Fault Tolerance requires some form of
redundancy
 Time Redundancy
 Hardware Redundancy
 Information Redundancy

Time Redundancy
 Perform Same Operation Twice
 See if get same result both times
 If not, then fault occurred
 Can detect temporary faults
 Cannot detect permanent faults
– Would affect both computations
 Advantage
 Little to no hardware overhead
 Disadvantage
 Impacts system or circuit performance
Hardware Redundancy
 Replicate hardware and compare outputs
 From two or more modules
 Detects both permanent and temporary faults
 Advantage
 Little or no performance impact
 Disadvantage
 Area and power for redundant hardware

Information Redundancy
 Encode outputs with error detecting or
correcting code
 Code selected to minimize redundancy for
class of faults
 Advantage
 Less hardware to generate redundant
information than replicating module
 Drawback
 Added complexity in design
Failure Rate
 λ(t) = Component failure rate
 Measured in FITs (failures per 10^9 hours)

System Failure Rate
 System constructed from components
 No Fault Tolerance
 Any component fails, whole system fails

$\lambda_{sys} = \sum_{i=1}^{k} \lambda_{c,i}$

Reliability
 If component working at time 0
 R(t) = Probability still working at time t

 Exponential Failure Law
 If failure rate assumed constant
– Good approximation if past infant mortality period

$R(t) = e^{-\lambda t}$
Reliability for Series System
 Series System
 All components need to work for system to
work

[Figure: components A, B, and C connected in series]

$R_{sys} = R_A R_B R_C$

System Reliability with Redundancy
 System reliability with component B in
Parallel
 Can tolerate one component B failing

[Figure: A in series with two parallel copies of B, followed by C]

$R_{sys} = R_A \left[1 - (1 - R_B)^2\right] R_C = R_A (2R_B - R_B^2) R_C$
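As a rough numeric check of the series and parallel formulas, the following Python sketch evaluates both configurations. The reliability values (0.99 and 0.90) are made-up illustrations, and the helper names are hypothetical.

```python
import math

def reliability(failure_rate: float, t: float) -> float:
    """Exponential failure law R(t) = exp(-lambda * t); assumes a constant failure rate."""
    return math.exp(-failure_rate * t)

def series(*rs: float) -> float:
    """All components must work, so reliabilities multiply."""
    return math.prod(rs)

def parallel(*rs: float) -> float:
    """The group fails only if every redundant copy fails."""
    return 1.0 - math.prod(1.0 - r for r in rs)

# Hypothetical component reliabilities at some fixed mission time.
r_a, r_b, r_c = 0.99, 0.90, 0.99

r_series = series(r_a, r_b, r_c)                      # A - B - C
r_redundant = series(r_a, parallel(r_b, r_b), r_c)    # B duplicated in parallel

print(f"series: {r_series:.4f}, with redundant B: {r_redundant:.4f}")
```

Duplicating only the weakest component is where this kind of calculation tends to show the largest gain.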

Mean-Time-to-Failure (MTTF)
 Average time before system fails
 Equal to area under reliability curve

$MTTF = \int_0^\infty R(t)\,dt$

 For Exponential Failure Law

$MTTF = \int_0^\infty e^{-\lambda t}\,dt = \frac{1}{\lambda}$

Maintainability
 If system failed at time 0
 M(t) = Probability repaired and operational
at time t
 System repair time divided into
 Passive repair time
– Time for service engineer to travel to site
 Active repair time
– Time to locate failing component,
repair/replace, and verify system operational
– Can be improved through designing system so
easy to locate failed component and verify
Repair Rate and MTTR
 μ = rate at which system repaired
 Analogous to failure rate λ
 Maintainability often modeled as

$M(t) = 1 - e^{-\mu t}$

 Mean-Time-to-Repair (MTTR) = 1/μ

Availability
[Figure: system state S versus time t; S = 1 during normal system operation and S = 0 after failures, with time instants t0 through t4 marked]

 System Availability
 Fraction of time system is operational

$\text{system availability} = \frac{MTTF}{MTTF + MTTR}$
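The availability formula can be sketched in a few lines of Python. The MTTF and MTTR figures below are hypothetical, chosen only so that the result lands on "four nines"; the function names are assumptions for illustration.

```python
def availability(mttf: float, mttr: float) -> float:
    """Fraction of time the system is operational: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)

def max_mttr_for(target: float, mttf: float) -> float:
    """Largest MTTR that still meets a target availability for a given MTTF."""
    return mttf * (1.0 - target) / target

# Hypothetical numbers: 9999 hours between failures, 1 hour to repair.
a = availability(9999.0, 1.0)
print(f"availability = {a:.6f}")
print(f"MTTR budget for 0.9999 = {max_mttr_for(0.9999, 9999.0):.3f} hours")
```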
Availability
 Telephone Systems
 Required to have system availability of
0.9999 (“four nines”)
 High-Reliability Systems
 May require 7 or more nines
 Fault-Tolerant Design
 Needed to achieve such high availability
from less reliable components

Coding Theory
 Coding
 Using more bits than necessary to represent
data
 Provides way to detect errors
– Errors occur when bits get flipped
 Error Detecting Codes
 Many types
 Detect different classes of errors
 Use different amounts of redundancy
 Ease of encoding and decoding data varies
Block Code
 Message = Data Being Encoded
 Block code
 Encodes m messages with n-bit codeword
log 2 m 
redundancy  1 
n
 If no redundancy
 m messages encoded with log2(m) bits
 minimum possible
Block Code
 To detect errors, some redundancy
needed
 Space of 2^n distinct blocks partitioned into
codewords and non-codewords
 Can detect errors that cause codeword
to become non-codeword
 Cannot detect errors that cause
codeword to become another codeword

Separable Block Code
 Separable
 n-bit blocks partitioned into
– k information bits directly representing message
– (n-k) check bits
 Denoted (n,k) Block Code
 Advantage
 k-bit message directly extracted without
decoding
 Rate of Separable Block Code = k/n
Example of Separable Block Code
 (4,3) Parity Code
 Check bit is XOR of 3 message bits
 message 101 → codeword 1010
 Single Bit Parity
log 2 m  log 2 (2k ) k nk 1
redundancy  1  1 1  
n n n n n

k n 1
rate  
n n
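A minimal Python sketch of the single-bit parity code follows, using bit strings for readability. The function names are hypothetical; the encoding rule (check bit is the XOR of the message bits) is the one stated on the slide.

```python
def parity_encode(message: str) -> str:
    """Append one check bit equal to the XOR of the message bits."""
    check = sum(int(b) for b in message) % 2
    return message + str(check)

def parity_ok(word: str) -> bool:
    """A valid codeword always contains an even number of 1s."""
    return sum(int(b) for b in word) % 2 == 0

codeword = parity_encode("101")      # the slide's message
single_error = "1110"                # one bit of 1010 flipped
double_error = "0110"                # two bits of 1010 flipped

print(codeword, parity_ok(codeword), parity_ok(single_error), parity_ok(double_error))
```

The double-bit error passes the check, which illustrates that a distance-2 code only guarantees detection of single-bit errors.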
Example of Non-Separable Block Code
 One-Hot Code
 Each Codeword has single 1
 Example of 8-bit one-hot
– 10000000, 01000000, 00100000, 00010000
00001000, 00000100, 00000010, 00000001
 Redundancy = 1 - log2(8)/8 = 5/8

log 2 m  log 2 (n)


redundancy  1  1
n n

Linear Block Codes
 Special class
 Modulo-2 sum of any 2 codewords also
codeword
 Null space of (n-k)xn Boolean matrix
– Called Parity Check Matrix, H
 For any n-bit codeword c
 cH^T = 0
 All 0 codeword exists in any linear code

Linear Block Codes
 Generator Matrix, G
 kxn Matrix
 Codeword c for message m
 c = mG

 GH^T = 0

Systematic Block Code
 First k-bits correspond to message
 Last n-k bits correspond to check bits
 For Systematic Code
 G = [I_{k×k} : P_{k×(n-k)}]
 H = [P^T_{(n-k)×k} : I_{(n-k)×(n-k)}]
 Example

$G = \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix}, \quad H = \begin{bmatrix} 1 & 1 & 1 & 1 \end{bmatrix}$
Distance of Code
 Distance between two codewords
 Number of bits in which they differ
 Distance of Code
 Minimum distance between any two
codewords in code
 If n=k (no redundancy), distance = 1
 Single-bit parity, distance = 2
 Code with distance d
 Detect up to d-1 errors
 Correct up to ⌊(d-1)/2⌋ errors
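To make the distance definition concrete, this Python sketch computes the minimum distance of two small codes by brute force: the unencoded 3-bit words and the (4,3) parity code. The helper names are hypothetical, and the pairwise search is only practical for small codes.

```python
from itertools import combinations

def hamming_distance(a: str, b: str) -> int:
    """Number of bit positions in which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))

def code_distance(codewords) -> int:
    """Minimum distance over all pairs of distinct codewords."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))

messages = [format(i, "03b") for i in range(8)]                  # n = k, no redundancy
parity_code = [m + str(m.count("1") % 2) for m in messages]      # (4,3) parity code

d_plain = code_distance(messages)
d_parity = code_distance(parity_code)
detectable = d_parity - 1            # errors guaranteed detected
correctable = (d_parity - 1) // 2    # errors guaranteed corrected

print(d_plain, d_parity, detectable, correctable)
```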
Error Correcting Codes
 Code with distance 3
 Called single error correcting (SEC) code
 Code with distance 4
 Called single error correcting and double
error detecting (SEC-DED) code
 Procedure for constructing SEC code
 Described in [Hamming 1950]
 Any H-matrix with all columns distinct and
no all-0 column is SEC
Hamming Code
 For any value of n
 SEC code constructed by
– setting each column in H equal to binary
representation of column number (starting from 1)
 Number of rows in H equal to ⌈log2(n+1)⌉
 Example of SEC Hamming Code for n=7

$H = \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 & 1 & 0 & 1 \end{bmatrix}$
Error Correction in Hamming Code
 Syndrome, s
 s = Hv^T for received vector v
 If v is codeword
– Syndrome = 0
 If v non-codeword and single-bit error
– Syndrome will match one of columns of H
– Will contain binary value of bit position in error

Example of Error Correction
 For (7,4) Hamming Code
 Suppose codeword 0110011 has one-bit error changing it to 1110011

$s = vH^T = [1\;1\;1\;0\;0\;1\;1] \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix} = [0\;0\;1]$

SEC-DED Code
 Make SEC Hamming Code SEC-DED
 By adding parity check over all bits
 Extra parity bit
– 1 for single-bit error
– 0 for double-bit error
 Makes possible to detect double bit error
– Avoid assuming single-bit error and
miscorrecting it

Example of Error Correction
 For (7,3) SEC-DED Hamming Code
 Suppose codeword 0110011 has two-bit error changing it to 1010011
– Doesn't match any column in H

$s = vH^T = [1\;0\;1\;0\;0\;1\;1] \begin{bmatrix} 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 1 \end{bmatrix} = [0\;1\;1\;0]$

Hsiao Code
 Weight of column
 Number of 1’s in column
 Constructing n-bit SEC-DED Hsiao Code
 First use all possible weight-1 columns
– Then all possible weight-3 columns
– Then weight-5 columns, etc.
 Until n columns formed
 Number of check bits is ⌈log2(n)⌉ + 1
 Minimizes number of 1’s in H-matrix
– Less hardware and delay for computing syndrome
– Disadvantage: Correction logic more complex
Example of Hsiao Code
 (7,3) Hsiao Code
 Uses weight-1 and weight-3 columns

0 0 0 1 0 1 1
0 0 1 0 1 0 
1
H 
0 1 0 0 1 1 0
 
1 0 0 0 1 1 1

Unidirectional Errors
 Errors in block of data which only cause
0→1 or 1→0, but not both
 Any number of bits in error in one direction
 Example
 Correct codeword 111000
 Unidirectional errors could cause
– 001000, 000000, 101000 (only 1→0 errors)
 Non-unidirectional errors
– 101001, 011001, 011011 (both 1→0 and 0→1)

Unidirectional Error Detecting Codes
 All unidirectional error detecting (AUED)
Codes
 Detect all unidirectional errors in codeword
 Single-bit parity is not AUED
– Cannot detect even number of errors
 No linear code is AUED
– All linear codes must contain all-0 vector, so
cannot detect all 1→0 errors

Two-Rail Code
 Two-Rail Code
 One check bit for each information bit
– Equal to complement of information bit
 Two-Rail Code is AUED
 50% Redundancy
 Example of (6,3) Two-Rail Code
 Message 101 has Codeword 101010
 Set of all codewords
– 000111, 001110, 010101, 011100, 100011,
101010, 110001, 111000
Berger Codes
 Lowest redundancy of separable AUED
codes
 For k information bits, ⌈log2(k+1)⌉ check bits
 Check bits equal to binary representation
of number of 0's in information bits
 Example
 Information bits 1000101
– ⌈log2(7+1)⌉ = 3 check bits
– Check bits equal to 100 (4 zeros)
Berger Codes
 Codewords for (5,3) Berger Code
 00011, 00110, 01010, 01101, 10010,
10101, 11001, 11100
 If unidirectional errors
 Contain 1→0 errors
– increase 0's in information bits
– can only decrease binary number in check bits
 Contain 0→1 errors
– decrease 0's in information bits
– can only increase binary number in check bits
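A small Python sketch of Berger encoding and checking is given below, using bit strings. The function names are hypothetical. The check-bit width uses `k.bit_length()`, which equals ⌈log2(k+1)⌉ for k ≥ 1 and avoids floating-point logarithms.

```python
def berger_encode(info: str) -> str:
    """Append the binary count of 0s in the information bits."""
    width = len(info).bit_length()       # equals ceil(log2(k + 1)) for k >= 1
    zeros = info.count("0")
    return info + format(zeros, f"0{width}b")

def berger_ok(word: str, k: int) -> bool:
    """Valid if the check bits equal the number of 0s among the first k bits."""
    info, check = word[:k], word[k:]
    return info.count("0") == int(check, 2)

print(berger_encode("1000101"))          # the slide's 7-bit example
print(berger_encode("101"))              # one codeword of the (5,3) code

# Unidirectional 1->0 errors applied to codeword 10101:
print(berger_ok("00101", 3))             # error in an information bit only
print(berger_ok("00100", 3))             # errors in both information and check bits
```

Both corrupted words fail the check because 1→0 errors push the zero count up while the stored count can only go down.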
Berger Codes
 If 8 information bits
 Berger code requires ⌈log2(8+1)⌉ = 4 check bits

$\text{redundancy} = 1 - \frac{\log_2 m}{n} = 1 - \frac{\log_2(2^k)}{n} = 1 - \frac{8}{12} = \frac{1}{3} \approx 33.3\%$

 (16,8) Two-Rail Code


 Requires 50% redundancy
 Redundancy advantage of Berger Code
 Increases as k increased
Constant Weight Codes
 Constant Weight Codes
 Non-separable, but lower redundancy than
Berger
 Each codeword has same number of 1’s
 Example 2-out-of-3 constant weight code
 110, 011, 101
 AUED code
 Unidirectional errors always change number
of 1’s
Constant Weight Codes
 Number codewords in m-out-of-n code

$C_m^n = \binom{n}{m}$

 Codewords maximized when m as close to n/2 as possible
 n/2-out-of-n when n even
 (n/2-0.5 or n/2+0.5)-out-of-n when n odd
 Minimizes redundancy of code

Example
 6-out-of-12 constant weight code

$C_6^{12} = 924$ codewords

$\text{redundancy} = 1 - \frac{\log_2 m}{n} = 1 - \frac{\log_2(924)}{12} = 17.9\%$

 12-bit Berger Code
 Only 2^8 = 256 codewords

$\text{redundancy} = 1 - \frac{\log_2 m}{n} = 1 - \frac{\log_2(2^8)}{12} = 33.3\%$
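The two redundancy figures can be checked numerically with the short Python sketch below. The function name is hypothetical; `math.comb` supplies the binomial coefficient.

```python
import math

def redundancy(num_codewords: int, n: int) -> float:
    """redundancy = 1 - log2(m) / n for a block code with m codewords of n bits."""
    return 1.0 - math.log2(num_codewords) / n

constant_weight_words = math.comb(12, 6)     # 6-out-of-12 code
berger_words = 2 ** 8                        # 12-bit Berger code: 8 information bits

print(constant_weight_words)
print(f"{redundancy(constant_weight_words, 12):.1%}")
print(f"{redundancy(berger_words, 12):.1%}")
```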

Constant Weight Codes
 Advantage
 Less redundancy than Berger codes

 Disadvantage
 Non-separable
 Need decoding logic
– to convert codeword back to binary message

Burst Error
 Burst Error
 Common, multi-bit errors tend to be clustered
– Noise source affects contiguous set of bus lines
 Length of burst error
– number of bits between first and last error
 Wrap around from last to first bit of codeword
 Example: Original codeword 00000000
 00111100 is burst error length 4
 00110100 is burst error length 4
– Any number of errors between first and last error

Cyclic Codes
 Special class of linear code
 Any codeword shifted cyclically is another
codeword
 Used to detect burst errors
 Less redundancy required to detect burst
error than general multi-bit errors
– Some distance 2 codes can detect all burst
errors of length 4
– detecting all possible 4-bit errors requires
distance 5 code
Cyclic Redundancy Check (CRC) Code
 Most widely used cyclic code
 Uses binary alphabet based on GF(2)
 CRC code is (n,k) block code
 Formed using generator polynomial, g(x)
– called code generator
– degree n-k polynomial (same degree as
number of check bits)

$g(x) = g_{n-k}x^{n-k} + \dots + g_2x^2 + g_1x + g_0$

$c(x) = m(x)\,g(x)$
Message | m(x) | g(x) | c(x) | Codeword
0000 | 0 | x^2 + 1 | 0 | 000000
0001 | 1 | x^2 + 1 | x^2 + 1 | 000101
0010 | x | x^2 + 1 | x^3 + x | 001010
0011 | x + 1 | x^2 + 1 | x^3 + x^2 + x + 1 | 001111
0100 | x^2 | x^2 + 1 | x^4 + x^2 | 010100
0101 | x^2 + 1 | x^2 + 1 | x^4 + 1 | 010001
0110 | x^2 + x | x^2 + 1 | x^4 + x^3 + x^2 + x | 011110
0111 | x^2 + x + 1 | x^2 + 1 | x^4 + x^3 + x + 1 | 011011
1000 | x^3 | x^2 + 1 | x^5 + x^3 | 101000
1001 | x^3 + 1 | x^2 + 1 | x^5 + x^3 + x^2 + 1 | 101101
1010 | x^3 + x | x^2 + 1 | x^5 + x | 100010
1011 | x^3 + x + 1 | x^2 + 1 | x^5 + x^2 + x + 1 | 100111
1100 | x^3 + x^2 | x^2 + 1 | x^5 + x^4 + x^3 + x^2 | 111100
1101 | x^3 + x^2 + 1 | x^2 + 1 | x^5 + x^4 + x^3 + 1 | 111001
1110 | x^3 + x^2 + x | x^2 + 1 | x^5 + x^4 + x^2 + x | 110110
1111 | x^3 + x^2 + x + 1 | x^2 + 1 | x^5 + x^4 + x + 1 | 110011
CRC Code
 Linear block code
 Has G-matrix and H-matrix
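 G-matrix shifted version of generator
polynomial

$G = \begin{bmatrix} g_{n-k} & \cdots & g_1 & g_0 & 0 & \cdots & 0 \\ 0 & g_{n-k} & \cdots & g_1 & g_0 & \cdots & 0 \\ \vdots & & \ddots & & & \ddots & \vdots \\ 0 & \cdots & 0 & g_{n-k} & \cdots & g_1 & g_0 \end{bmatrix}$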

CRC Code Example
 (6,4) CRC code generated by g(x)=x^2+1

$G = \begin{bmatrix} 1 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 1 \end{bmatrix}$

Systematic CRC Codes
 To obtain systematic CRC code
 codewords formed using Galois division
– nice because LFSR can be used for performing
division

$c(x) = m(x)\,x^{n-k} + r(x)$

$r(x) = \text{remainder of } \frac{m(x)\,x^{n-k}}{g(x)}$

Galois Division Example
 Encode m(x)=x^2+x with g(x)=x^2+1
 Requires dividing m(x)x^{n-k} = x^4+x^3 by g(x)

          111
    101 ) 11000
          101
           110
           101
            110
            101
             11   remainder

 Remainder r(x)=x+1
– c(x) = m(x)x^{n-k}+r(x) = (x^2+x)(x^2)+x+1 = x^4+x^3+x+1
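The division above can be sketched in Python by storing each polynomial as an integer whose bit i is the coefficient of x^i. This is a software illustration of Galois division, not a model of the LFSR hardware; the function names are hypothetical.

```python
def gf2_remainder(dividend: int, divisor: int) -> int:
    """Remainder of polynomial division over GF(2), with polynomials as bit masks."""
    if divisor == 0:
        raise ValueError("divisor polynomial must be non-zero")
    divisor_len = divisor.bit_length()
    while dividend.bit_length() >= divisor_len:
        shift = dividend.bit_length() - divisor_len
        dividend ^= divisor << shift     # subtraction over GF(2) is XOR
    return dividend

def crc_encode(message: int, generator: int) -> int:
    """Systematic codeword: c(x) = m(x) x^(n-k) + r(x)."""
    check_bits = generator.bit_length() - 1      # n - k, the degree of g(x)
    shifted = message << check_bits
    return shifted | gf2_remainder(shifted, generator)

codeword = crc_encode(0b0110, 0b101)     # m(x) = x^2 + x, g(x) = x^2 + 1
print(format(codeword, "06b"))
print(gf2_remainder(codeword, 0b101))    # a valid codeword leaves remainder 0
```

Because the remainder has degree below n-k, it fits entirely in the low-order bits that the shift leaves empty, so a bitwise OR is enough to combine the two parts.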
Message | m(x) | g(x) | r(x) | c(x) | Codeword
0000 | 0 | x^2 + 1 | 0 | 0 | 000000
0001 | 1 | x^2 + 1 | 1 | x^2 + 1 | 000101
0010 | x | x^2 + 1 | x | x^3 + x | 001010
0011 | x + 1 | x^2 + 1 | x + 1 | x^3 + x^2 + x + 1 | 001111
0100 | x^2 | x^2 + 1 | 1 | x^4 + 1 | 010001
0101 | x^2 + 1 | x^2 + 1 | 0 | x^4 + x^2 | 010100
0110 | x^2 + x | x^2 + 1 | x + 1 | x^4 + x^3 + x + 1 | 011011
0111 | x^2 + x + 1 | x^2 + 1 | x | x^4 + x^3 + x^2 + x | 011110
1000 | x^3 | x^2 + 1 | x | x^5 + x | 100010
1001 | x^3 + 1 | x^2 + 1 | x + 1 | x^5 + x^2 + x + 1 | 100111
1010 | x^3 + x | x^2 + 1 | 0 | x^5 + x^3 | 101000
1011 | x^3 + x + 1 | x^2 + 1 | 1 | x^5 + x^3 + x^2 + 1 | 101101
1100 | x^3 + x^2 | x^2 + 1 | x + 1 | x^5 + x^4 + x + 1 | 110011
1101 | x^3 + x^2 + 1 | x^2 + 1 | x | x^5 + x^4 + x^2 + x | 110110
1110 | x^3 + x^2 + x | x^2 + 1 | 1 | x^5 + x^4 + x^3 + 1 | 111001
1111 | x^3 + x^2 + x + 1 | x^2 + 1 | 0 | x^5 + x^4 + x^3 + x^2 | 111100
Generating Check Bits for CRC Code
 Use LFSR
 With characteristic polynomial equal to g(x)
 Append n-k 0’s to end of message
 Example: m(x)=x^2+x+1 and g(x)=x^3+x+1

[Figure: LFSR initialized to 0 0 0; the message 111 followed by the appended 0's (111000) is shifted in, leaving final state 0 1 0]

Final state after shifting equals remainder


Checking CRC Codeword
 Checking Received Codeword for Errors
 Shift codeword into LFSR
– with same characteristic polynomial as used to
generate it
 If final state of LFSR non-zero, then error

[Figure: LFSR initialized to 0 0 0 with the codeword to check, 111010, shifted in]

Selecting Generator Polynomial
 Key issue for CRC Codes
 If first and last bit of polynomial are 1
– Will detect burst errors of length n-k or less
 If generator polynomial is multiple of (x+1)
– Will detect any odd number of errors
 If g(x) = (x+1)p(x) where p(x) primitive of
degree n-k-1 and n < 2^{n-k-1}
– Will detect single, double, triple, and odd errors

Commonly Used CRC Generators
CRC code | Generator Polynomial
CRC-5 (USB token packets) | x^5+x^2+1
CRC-12 (Telecom systems) | x^12+x^11+x^3+x^2+x+1
CRC-16-CCITT (X25, Bluetooth) | x^16+x^12+x^5+1
CRC-32 (Ethernet) | x^32+x^26+x^23+x^22+x^16+x^12+x^11+x^10+x^8+x^7+x^5+x^4+x^2+x+1
CRC-64 (ISO) | x^64+x^4+x^3+x+1

Fault Tolerance Schemes
 Adding Fault Tolerance to Design
 Improves dependability of system
 Requires redundancy
– Hardware
– Time
– Information

Hardware Redundancy
 Involves replicating hardware units
 At any level of design
– gate-level, module-level, chip-level, board-level
 Three Basic Forms
 Static (also called Passive)
– Masks faults rather than detects them
 Dynamic (also called Active)
– Detects faults and reconfigures to spare hardware
 Hybrid
– Combines active and passive approaches
Static Redundancy
 Masks faults so no erroneous outputs
 Provides uninterrupted operation
 Important for real-time systems
– No time to reconfigure or retry operation
 Simple self-contained
– No need to update or rollback system state

Triple Module Redundancy (TMR)
 Well-known static redundancy scheme
 Three copies of module
 Use majority voter to determine final output
 Error in one module out-voted by other two
[Figure: Modules 1, 2, and 3 each feed the Majority Voter, which produces the final output]
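A bitwise majority voter can be sketched in one line of Python. This is a software illustration of the voting function only, assuming the three module outputs are available as equal-width integers; the function name is hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Each output bit takes the value held by at least two of the three inputs."""
    return (a & b) | (b & c) | (a & c)

good = 0b1010
faulty = 0b0110                          # simulated erroneous module output

voted = majority_vote(good, good, faulty)
print(format(voted, "04b"))
```

If two modules fail in the same bit position, the voter follows the faulty pair, which is why TMR only masks a single failing module.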

TMR Reliability and MTTF
 TMR works if any 2 modules work
 Rm = reliability of each module
 Rv = reliability of voter
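$R_{TMR} = R_v \left[ R_m^3 + C_2^3 R_m^2 (1 - R_m) \right] = R_v (3R_m^2 - 2R_m^3)$

 MTTF for TMR

$MTTF_{TMR} = \int_0^\infty R_{TMR}\,dt = \int_0^\infty R_v (3R_m^2 - 2R_m^3)\,dt = \int_0^\infty e^{-\lambda_v t} (3e^{-2\lambda_m t} - 2e^{-3\lambda_m t})\,dt = \frac{3}{2\lambda_m + \lambda_v} - \frac{2}{3\lambda_m + \lambda_v}$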
Comparison with Simplex
 Neglecting failure rate of voter

$MTTF_{TMR} = \frac{3}{2\lambda_m} - \frac{2}{3\lambda_m} = \frac{5}{6}\left(\frac{1}{\lambda_m}\right) = \frac{5}{6}\,MTTF_{simplex}$

 TMR has lower MTTF, but


 Can tolerate temporary faults
 Higher reliability for short mission times

Comparison with Simplex
 Crossover point
$R_{TMR} = R_{simplex}$

$3e^{-2\lambda_m t} - 2e^{-3\lambda_m t} = e^{-\lambda_m t}$

Solve $\Rightarrow t = \frac{\ln 2}{\lambda_m} \approx 0.7\,MTTF_{simplex}$

 R_TMR > R_simplex when
 Mission time shorter than 70% of MTTF
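The TMR versus simplex comparison can be checked numerically with the Python sketch below. It assumes a perfect voter (R_v = 1) for the reliability curves and uses a normalized module failure rate of 1, so times are in units of the simplex MTTF; all function names are hypothetical.

```python
import math

def r_simplex(failure_rate: float, t: float) -> float:
    """Single-module reliability under the exponential failure law."""
    return math.exp(-failure_rate * t)

def r_tmr(failure_rate: float, t: float) -> float:
    """TMR reliability with a perfect voter: 3 R^2 - 2 R^3."""
    r = r_simplex(failure_rate, t)
    return 3 * r ** 2 - 2 * r ** 3

def mttf_tmr(module_rate: float, voter_rate: float = 0.0) -> float:
    """Closed-form TMR MTTF: 3 / (2 lm + lv) - 2 / (3 lm + lv)."""
    return 3 / (2 * module_rate + voter_rate) - 2 / (3 * module_rate + voter_rate)

lam = 1.0                                # normalized, so MTTF_simplex = 1
crossover = math.log(2) / lam            # about 0.69 of the simplex MTTF

for t in (0.1, crossover, 1.0):
    print(f"t={t:.2f}  simplex={r_simplex(lam, t):.4f}  TMR={r_tmr(lam, t):.4f}")
print(f"MTTF_TMR = {mttf_tmr(lam):.4f}")
```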

N-Modular Redundancy (NMR)
 NMR
 N modules along with majority voter
– TMR special case
 Number of failed modules masked = ⌊(N-1)/2⌋
 As N increases, MTTF decreases
– But, reliability for short missions increases
 If goal only to tolerate temporary faults
 TMR sufficient

Interwoven Logic
 Replace each gate
 with 4 gates using interconnection pattern
that automatically corrects errors
 Traditionally not as attractive as TMR
 Requires lots of area overhead
 Renewed interest by researchers
investigating emerging nanoelectronic
technologies

Interwoven Logic with 4 NOR Gates
[Figure: a network of four NOR gates (1 through 4) with inputs X and Y, shown next to its interwoven version in which each gate is replaced by four copies (1a-1d, 2a-2d, 3a-3d, 4a-4d)]
Example of Error on Third Y Input
[Figure: the interwoven NOR-gate network (gates 1a through 4d) annotated with logic values for an error on the third Y input]
Dynamic Redundancy
 Involves
 Detecting fault
 Locating faulty hardware unit
 Reconfiguring system to use spare fault-free
hardware unit

Unpowered (Cold) Spares
 Advantage
 Extends lifetime of spares
 Equations
 Assume spare not failing until powered
 Perfect reconfiguration capability

$R_{w/cold\_spare} = (1 + \lambda t)\,e^{-\lambda t}$

$MTTF_{w/cold\_spare} = \frac{2}{\lambda}$

Unpowered (Cold) Spares
 One cold spare doubles MTTF
 Assuming faults always detected and
reconfiguration circuitry never fails
 Drawback of cold spare
 Extra time to power and initialize
 Cannot be used to help in detecting faults
 Fault detection requires either
– periodic offline testing
– online testing using time or information
redundancy
Powered (Hot) Spares
 Can use spares for online fault detection
 One approach is duplicate-and-compare
 If outputs mismatch then fault occurred
– Run diagnostic procedure to determine which
module is faulty and replace with spare
 Any number of spares can be used
[Figure: Module A drives the output; the outputs of Module A and Module B feed a comparator that reports Agree/Disagree; a spare module stands by]

Pair-and-a-Spare
 Avoids halting system to run diagnostic
procedure when fault occurs
[Figure: two duplicate-and-compare pairs, Modules A and B and Modules C and D, each with its own comparator reporting Agree/Disagree; a switch selects which pair drives the output]

TMR/Simplex
 When one module in TMR fails
 Disconnect one of remaining modules
 Improves MTTF while retaining advantages
of TMR when 3 good modules
 TMR/Simplex
 Reliability always better than either TMR or
Simplex alone

Comparison of Reliability vs Time
[Figure: reliability (0.3 to 1) versus normalized mission time T/MTTF (0 to 1) for Simplex, TMR, and TMR/Simplex]

Hybrid Redundancy
 Combines both static and dynamic
redundancy
 Masks faults like static
 Detects and reconfigures like dynamic

TMR with Spares
 If TMR module fails
 Replace with spare
– can be either hot or cold spare
 While system has three working modules
– TMR will provide fault masking for
uninterrupted operation

Self-Purging Redundancy
 Uses threshold voter instead of majority
voter
 Threshold voter outputs 1 if number of
inputs that are 1 is greater than threshold
– Otherwise outputs 0
 Requires hot spares

Self-Purging Redundancy
[Figure: Modules 1 through 5, each connected through an elementary switch to a threshold voter; an inset shows the elementary switch, built around an S-R flip-flop with Initialization, Voter, and Module signals]

Self-Purging Redundancy
 Compared with 5MR
 Self-purging with 5 modules
– Tolerate up to 3 failing modules (5MR cannot)
– Cannot tolerate two modules simultaneously
failing (5MR can)
 Compared with TMR with 2 spares
 Self-purging with 5 modules
– simpler reconfiguration circuitry
– requires hot spares (3MR w/spares can use
either hot or cold spares)
Time Redundancy
 Advantage
 Less hardware
 Drawback
 Cannot detect permanent faults
 If error detected
 System needs to rollback to known good
state before resuming operation

Repeated Execution
 Repeat operation twice
 Simplest time redundancy approach
 Detects temporary faults occurring during
one execution (but not both)
– Causes mismatch in results
 Can reuse same hardware for both
executions
– Only one copy of functional hardware needed

Repeated Execution
 Requires mechanism for storing and
comparing results of both executions
 In processor, can store in memory or on
disk and use software to compare
 Main cost
 Additional time for redundant execution
and comparison

Multi-threaded Redundant Execution
 Can use in processor-based system that
can run multiple threads
 Two copies of thread executed concurrently
 Results compared when both complete
 Take advantage of processor’s built-in
capability to exploit processing resources
– Reduce execution time
– Can significantly reduce performance penalty

Multiple Sampling of Outputs
 Done at circuit-level
 Sample once at end of normal clock cycle
 Sample again after delay of t
 Two samples compared to detect mismatch
– Indicates error occurred
 Detect fault whose duration is less than t
 Performance overhead depends on
– Size of t relative to normal clock period

Multiple Sampling of Outputs
 Simple approach using two latches

[Figure: the signal feeds a main latch clocked by Clk and a shadow latch clocked by Clk+t; the two latch outputs are compared to produce Error]

Multiple Sampling of Outputs
 Approach using stability checker at output
[Figure: timing diagram in which each normal clock period is followed by a stability checking period of length t, together with a stability checker circuit that takes the signal and the checking-period indicator and produces Error]
Diverse Recomputation
 Use same hardware, but perform
computation differently second time
 Can detect permanent faults that affect
only one computation
 For arithmetic or logical operations
 Shift operands when performing second
computation [Patel 1982]
 Detects permanent fault affecting only one
bit-slice
Information Redundancy
 Based on Error Detecting and
Correcting Codes
 Advantage
 Detects both permanent and temporary
faults
 Implemented with less hardware overhead
than using multiple copies of module
 Disadvantage
 More complex design
Error Detection
 Error detecting codes used to detect
errors
 If error detected
– Rollback to previous known error-free state
– Retry operation

Rollback
 Requires adding storage to save
previous state
 Amount of rollback depends on latency of
error detection mechanism
 Zero-latency error detection
– rollback implemented by preventing system
state from updating
 If errors detected after n cycles
– need rollback restoring system to state at least
n clock cycles earlier
Checkpoint
 Execution divided into set of operations
 Before each operation executed
– checkpoint created where system state saved
 If any error detected during operation
– rollback to last checkpoint and retry operation
 If multiple retries fail
– operation halts and system flags that
permanent fault has occurred
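The checkpoint, rollback, and retry sequence can be sketched as follows. This is a minimal software model, assuming the system state is a dictionary and that a detected error surfaces as a `ValueError`; `run_with_checkpoints`, `make_flaky`, and `PermanentFaultError` are hypothetical names.

```python
import copy

class PermanentFaultError(Exception):
    """Raised when an operation keeps failing after all retries."""

def run_with_checkpoints(state, operations, max_retries=3):
    for op in operations:
        checkpoint = copy.deepcopy(state)          # save state before the operation
        for _ in range(max_retries):
            try:
                op(state)
                break                              # operation completed without error
            except ValueError:                     # stand-in for an error-detection signal
                state.clear()
                state.update(copy.deepcopy(checkpoint))   # roll back to the checkpoint
        else:
            raise PermanentFaultError("retries exhausted; assuming permanent fault")
    return state

def make_flaky(failures):
    """Hypothetical operation that fails `failures` times, then adds 1 to x."""
    remaining = [failures]
    def op(state):
        state["x"] += 100                          # partial update undone on rollback
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ValueError("transient error detected")
        state["x"] -= 99
    return op

final = run_with_checkpoints({"x": 0}, [make_flaky(0), make_flaky(2)])
```

The partial update inside `op` shows why the saved state is needed: without the rollback, each failed attempt would leave `x` corrupted.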

Error Detection
 Encode outputs of circuit with error
detecting code
 Non-codeword output indicates error

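[Figure: the m inputs feed both the functional logic, which produces k outputs, and a check bit generator, which produces c check bits; a checker takes the k outputs and c check bits and produces the error indication]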

Self-Checking Checker
 Has two outputs
 Normal error-free case (1,0) or (0,1)
 If equal to each other, then error (0,0) or (1,1)
 Cannot have single error indicator output
– Stuck-at 0 fault on output could never be detected

Totally Self-Checking Checker
 Requires three properties
 Code Disjoint
– all codeword inputs mapped to codeword outputs, and all non-codeword inputs mapped to non-codeword outputs
 Fault Secure
– for all codeword inputs, checker in presence of
fault will either produce correct codeword output
or non-codeword output (not incorrect codeword)
 Self-Testing
– For each fault, at least one codeword input gives
error indication
Duplicate-and-Compare
 Equality checker indicates error
 Undetected error can occur only if
common-mode fault affecting both copies
 Only faults after stems detected
 Over 100% overhead (including checker)
[Figure: primary inputs fan out at stems to two copies of the functional logic; an equality checker compares the two sets of outputs and produces the error indication]

Single-Bit Parity Code
 Totally self-checking checker formed by
removing final gate from XOR tree


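[Figure: outputs of the functional logic and of the parity prediction logic feed two XOR trees that produce the two-rail error indication EI0, EI1]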

Single-Bit Parity Code
 Cannot detect errors in an even number of bits
 Can rule out even-bit errors by
generating each output with independent
cone of logic
– Then a single-point fault can cause only a
single-bit error
– Typically requires a lot of overhead
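A short example shows why a single parity bit misses errors in an even number of bits. The word and the flipped positions are arbitrary illustrative values.

```python
def parity(bits):
    """XOR of all bits: 1 if the number of 1s is odd."""
    p = 0
    for b in bits:
        p ^= b
    return p

def has_parity_error(bits, check_bit):
    return parity(bits) != check_bit

word = [1, 0, 1, 1]
check = parity(word)                 # parity bit predicted for the fault-free word

single_error = [0, 0, 1, 1]          # bit 0 flipped
double_error = [0, 1, 1, 1]          # bits 0 and 1 flipped

detected_single = has_parity_error(single_error, check)
detected_double = has_parity_error(double_error, check)
```

Here the single-bit error changes the parity and is flagged, while the two flips cancel and go unnoticed.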

Parity-Check Codes
 Each check bit is parity for some set of
output bits
 Example: 6 outputs and 3 check bits

                Z1  Z2  Z3  Z4  Z5  Z6  c1  c2  c3
Parity Group 1   1   0   0   1   1   0   1   0   0
Parity Group 2   0   1   1   0   0   0   0   1   0
Parity Group 3   0   0   0   0   0   1   0   0   1

Parity-Check Codes
 For c check bits and k functional
outputs
 2^(ck) possible parity-check codes (each check bit can cover any subset of the k outputs)
 Can choose code based on structure of
circuit to minimize undetected error
combinations
 Fanouts in circuit determine possible error
combinations due to single-point fault
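The 6-output, 3-check-bit example can be written out directly. The sketch below uses the parity groups from that example (group 1 covers Z1, Z4, Z5; group 2 covers Z2, Z3; group 3 covers Z6); the function names and the test vector are illustrative assumptions.

```python
# Rows are parity groups, columns are outputs Z1..Z6.
PARITY_GROUPS = [
    [1, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
]

def check_bits(outputs):
    """Each check bit is the parity of the outputs in its group."""
    return [sum(g & z for g, z in zip(group, outputs)) % 2
            for group in PARITY_GROUPS]

def is_codeword(outputs, checks):
    return check_bits(outputs) == checks

outputs = [1, 0, 1, 1, 0, 1]
checks = check_bits(outputs)

# Two-bit error in different groups (Z1 and Z2): detected.
err_different_groups = [0, 1, 1, 1, 0, 1]
# Two-bit error inside one group (Z2 and Z3): not detected.
err_same_group = [1, 1, 0, 1, 0, 1]

# Number of possible codes for c = 3 check bits and k = 6 outputs.
num_codes = 2 ** (len(PARITY_GROUPS) * len(outputs))
```

This is why the grouping should follow the circuit structure: outputs that can fail together from one fault should not form an even-sized error within the same group.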

Checker for Parity-Check Codes
 Constructed from single-bit parity
checkers and two-rail checkers
[Figure: three parity checkers, over (Z1, Z4, Z5, c1), (Z2, Z3, c2), and (Z6, c3), whose outputs are combined by two two-rail checkers into the error indication (E0, E1)]

Two-Rail Checkers
 Totally self-checking two-rail checker
[Figure: AND-OR implementation of a two-rail checker with input pairs (A0, A1) and (B0, B1) and output pair (C0, C1)]
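The two-rail checker can be modeled with the commonly used AND-OR equations C0 = A0·B0 + A1·B1 and C1 = A0·B1 + A1·B0. These equations are assumed to match the figure, which does not survive in this text. The sketch checks the code-disjoint property exhaustively.

```python
from itertools import product

def two_rail_checker(a, b):
    """Combine two two-rail pairs into one two-rail pair."""
    a0, a1 = a
    b0, b1 = b
    c0 = (a0 & b0) | (a1 & b1)
    c1 = (a0 & b1) | (a1 & b0)
    return c0, c1

def is_two_rail(pair):
    """A valid two-rail codeword has complementary values."""
    return pair[0] != pair[1]

# Code-disjoint: output is a codeword exactly when both inputs are codewords.
code_disjoint = all(
    is_two_rail(two_rail_checker(a, b)) == (is_two_rail(a) and is_two_rail(b))
    for a in product((0, 1), repeat=2)
    for b in product((0, 1), repeat=2)
)
```

Because the output is itself a two-rail pair, such cells can be cascaded into a tree that reduces many pairs to one error indication.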

Berger Codes
 Inverter-free circuit
 Inverters only at primary inputs
 Can be synthesized using only algebraic
factoring [Jha 1993]
 Only unidirectional errors possible for
single point faults
– Can use unidirectional code
– Berger code gives 100% coverage
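The 100% coverage claim for unidirectional errors can be checked by brute force on a small code. The sketch assumes the usual Berger construction, in which the check bits hold the binary count of 0s in the data bits; `K` and the helper names are illustrative.

```python
from itertools import product

K = 4                       # data bits
C = K.bit_length()          # check bits needed to count 0..K zeros

def encode(data):
    """Append the binary count of zeros in `data` as check bits."""
    zeros = list(data).count(0)
    check = tuple((zeros >> i) & 1 for i in reversed(range(C)))
    return tuple(data) + check

def is_codeword(word):
    return encode(word[:K]) == tuple(word)

def unidirectional_errors(word, src):
    """Yield every word made by flipping a non-empty set of bits equal to `src`."""
    positions = [i for i, b in enumerate(word) if b == src]
    for mask in range(1, 2 ** len(positions)):
        corrupted = list(word)
        for j, pos in enumerate(positions):
            if (mask >> j) & 1:
                corrupted[pos] ^= 1
        yield tuple(corrupted)

# src = 1 models 1-to-0 errors, src = 0 models 0-to-1 errors.
all_detected = all(
    not is_codeword(bad)
    for data in product((0, 1), repeat=K)
    for src in (0, 1)
    for bad in unidirectional_errors(encode(data), src)
)
```

The check passes because 1-to-0 errors can only raise the zero count while lowering the stored count, and 0-to-1 errors do the opposite, so the two can never agree again.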

Constant Weight Codes
 Non-separable with lower redundancy
 Drawback: need decoding logic to convert
codeword back to its original binary value
 Can use for encoding states of FSM
– No need for decoding logic
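As an illustration of using a constant weight code for FSM state encoding, the sketch below assigns 2-of-4 codewords to four states. The state names are hypothetical; the point is that the states are used directly in encoded form, so nothing has to be decoded.

```python
from itertools import combinations

def constant_weight_codewords(n, m):
    """All n-bit words containing exactly m ones (an m-of-n code)."""
    words = []
    for ones in combinations(range(n), m):
        words.append(tuple(1 if i in ones else 0 for i in range(n)))
    return words

def is_valid(word, m):
    return sum(word) == m

codewords = constant_weight_codewords(4, 2)

# Hypothetical state assignment for a four-state machine.
STATE_CODES = dict(zip(["IDLE", "LOAD", "RUN", "DONE"], codewords))

# Any single-bit error changes the weight and is therefore detected.
single_errors_detected = all(
    not is_valid(word[:i] + (word[i] ^ 1,) + word[i + 1:], 2)
    for word in codewords
    for i in range(4)
)
```

The checker for such an encoding only has to verify the weight of the state register.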

Error Correction
 Information redundancy can also be
used to mask errors
 Not as attractive as TMR because logic for
predicting check bits very complex
 However, very good for memories
– Check bits stored with data
– Errors do not propagate in memories as in logic
circuits, so SEC-DED usually sufficient

Error Correction
 Memories very dense and prone to errors
 Especially due to single-event upsets (SEUs)
from radiation
 SEC-DED check bits stored in memory
 32-bit word, SEC-DED requires 7 check bits
– Increases size of memory by 7/32=21.9%
 64-bit word, SEC-DED requires 8 check bits
– Increases size of memory by 8/64=12.5%
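The check bit counts above follow from the Hamming bound plus one extra parity bit. The sketch assumes the standard construction: the smallest c with 2^c >= k + c + 1 gives single-error correction, and one more overall parity bit adds double-error detection.

```python
def sec_ded_check_bits(k):
    """Number of check bits for SEC-DED protection of a k-bit word."""
    c = 1
    while 2 ** c < k + c + 1:    # syndrome must name every bit or "no error"
        c += 1
    return c + 1                 # extra overall parity bit for double-error detection

bits_32 = sec_ded_check_bits(32)
bits_64 = sec_ded_check_bits(64)
overhead_32 = bits_32 / 32
overhead_64 = bits_64 / 64
```

The relative overhead shrinks as the word gets wider, which is one reason ECC is usually applied to wide memory words.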

Memory ECC Architecture
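[Figure: on a write, check bits are generated from the incoming data word and stored in memory alongside it; on a read, check bits are recalculated from the read data word and combined with the read check bits to generate a syndrome, which is used to correct the outgoing data word]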

Hamming Code for ECC RAM
[Figure: ECC RAM core of N words with Z+c+1 bits per word. On the generate side, a Hamming check bit generator produces c check bits from the Z input data bits and a parity bit generator adds one parity bit. On the detect/correct side, the Hamming check bits and the parity bit are regenerated from the stored word and checked, and a bit error correction circuit produces the Z output data bits]

                Z1  Z2  Z3  Z4  Z5  Z6  Z7  Z8  c1  c2  c3  c4
Parity Group 1   1   1   0   1   1   0   1   0   1   0   0   0
Parity Group 2   1   0   1   1   0   1   1   0   0   1   0   0
Parity Group 3   0   1   1   1   0   0   0   1   0   0   1   0
Parity Group 4   0   0   0   0   1   1   1   1   0   0   0   1

Error Type                     Condition
No bit error                   Hamming check bits match, no parity error
Single-bit correctable error   Hamming check bits mismatch, parity error
Double-bit error detection     Hamming check bits mismatch, no parity error
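The parity groups and the error-type conditions above can be combined into a small encoder and decoder. This is a sketch under stated assumptions: the overall parity bit is chosen so the stored word has even parity, and the case "check bits match, parity error" (only the parity bit flipped) is handled although the table does not list it. Function names and status strings are illustrative.

```python
# Rows are parity groups, columns are data bits Z1..Z8.
GROUPS = [
    [1, 1, 0, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 0],
    [0, 1, 1, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
]

def hamming_checks(data):
    return [sum(g & d for g, d in zip(group, data)) % 2 for group in GROUPS]

def encode(data):
    """Return data + 4 Hamming check bits + 1 overall parity bit (13 bits)."""
    checks = hamming_checks(data)
    overall = sum(data + checks) % 2        # makes total parity even
    return data + checks + [overall]

def decode(word):
    data, checks = list(word[:8]), list(word[8:12])
    syndrome = [a ^ b for a, b in zip(hamming_checks(data), checks)]
    parity_error = sum(word) % 2 == 1
    if not any(syndrome):
        # Assumed extra case: only the overall parity bit was flipped.
        return ("parity-bit error" if parity_error else "no error"), data
    if not parity_error:
        return "double-bit error detected", None
    columns = [[group[j] for group in GROUPS] for j in range(8)]
    if syndrome in columns:
        data[columns.index(syndrome)] ^= 1  # flip the data bit named by the syndrome
    elif sum(syndrome) != 1:
        return "uncorrectable error detected", None
    return "single-bit error corrected", data   # weight-1 syndrome: a check bit flipped

stored = encode([1, 0, 1, 1, 0, 0, 1, 0])
```

Correction works because every data column of the table is distinct and has at least two 1s, so a syndrome identifies the failing bit uniquely.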
Memory ECC
 SEC-DED generally very effective
 Memory bit-flips tend to be independent
and uniformly distributed
 If bit-flip occurs, gets corrected next time
memory location accessed
 Main risk is if memory word not accessed for
long time
– Multiple bit-flips could accumulate

Memory Scrubbing
 Every location in memory read on
periodic basis
 Reduces chance of multiple errors
accumulating in a memory word
 Can be implemented by having memory
controller cycle through memory during idle
periods
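The effect of scrubbing can be illustrated with a deliberately simple model. It assumes SEC-DED per word, each upset hitting a different bit, and a scrub pass that fixes every word holding at most one bit-flip; the event list and names are hypothetical.

```python
def uncorrectable_words(upsets, num_words, scrub_interval=None):
    """Count words that accumulate two or more bit-flips before correction.

    `upsets` is a time-sorted list of (time, word_index) events.
    """
    pending = [0] * num_words          # uncorrected bit-flips per word
    failed = set()
    next_scrub = scrub_interval
    for time, word in upsets:
        while scrub_interval is not None and next_scrub <= time:
            # Scrub pass: reading a word corrects a single bit-flip.
            pending = [0 if n <= 1 else n for n in pending]
            next_scrub += scrub_interval
        pending[word] += 1
        if pending[word] >= 2:
            failed.add(word)
    return len(failed)

events = [(5, 0), (40, 1), (120, 0), (300, 1)]
without_scrubbing = uncorrectable_words(events, num_words=2)
with_scrubbing = uncorrectable_words(events, num_words=2, scrub_interval=100)
```

In this model the scrub interval only has to be short compared with the time between upsets in the same word.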

Multiple-Bit Upsets (MBU)
 Can occur due to single SEU
 Typically occur in adjacent memory cells
 Memory interleaving used
 To prevent MBUs from resulting in multiple
bit errors in same word

Memory (physical order of adjacent cells):

Word1 Bit1 | Word2 Bit1 | Word3 Bit1 | Word4 Bit1 | Word1 Bit2 | Word2 Bit2 | Word3 Bit2 | Word4 Bit2 | Word1 Bit3 | Word2 Bit3 | Word3 Bit3 | Word4 Bit3
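The interleaved layout can be expressed as an address mapping. The sketch assumes four words per row as in the layout above and uses zero-based indices; the function names are illustrative.

```python
def physical_column(word, bit, words_per_row):
    """Interleaved layout: all bit-0 cells first, then all bit-1 cells, and so on."""
    return bit * words_per_row + word

def words_hit(start_column, width, words_per_row):
    """Word index of each cell hit by an upset spanning `width` adjacent columns."""
    return [col % words_per_row
            for col in range(start_column, start_column + width)]

WORDS_PER_ROW = 4

# An upset covering 3 adjacent cells lands in 3 different words.
small_upset = words_hit(2, 3, WORDS_PER_ROW)
# An upset wider than the interleaving factor must hit some word twice.
large_upset = words_hit(2, 5, WORDS_PER_ROW)
```

As long as an upset spans no more cells than there are interleaved words, each word sees at most one bit error, which SEC-DED can correct.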

Type | Issues | Goal | Examples | Techniques
Long-Life Systems | Difficult or Expensive to Repair | Maximize MTTF | Satellites; Spacecraft; Implanted Biomedical | Dynamic Redundancy
Reliable Real-Time Systems | Error or Delay Catastrophic | Fault Masking Capability | Aircraft; Nuclear Power Plant; Air Bag Electronics; Radar | TMR
High Availability Systems | Downtime Very Costly | High Availability | Reservation System; Stock Exchange; Telephone Systems | No Single Point of Failure; Self-Checking Pairs; Fault Isolation
High Integrity Systems | Data Corruption Very Costly | High Data Integrity | Banking; Transaction Processing; Database | Checkpointing, Time Redundancy; ECC; Redundant Disks
Mainstream Low-Cost Systems | Reasonable Level of Failures Acceptable | Meet Failure Rate Expectations at Low Cost | Consumer Electronics; Personal Computers | Often None; Memory ECC; Bus Parity; Changing as Technology Scales

Concluding Remarks
 Many different fault-tolerant schemes
 Choosing scheme depends on
 Types of faults to be tolerated
– Temporary or permanent
– Single or multiple point failures
– etc.
 Design constraints
– Area, performance, power, etc.

Concluding Remarks
 As technology scales
 Circuits increasingly prone to failure
 Achieving sufficient fault tolerance will be
major design issue
