CERN-THESIS-2018-030
High-Energy Physics Fault Tolerance Metrics and Testing Methodologies for SRAM-Based FPGAs
Master Thesis
A case study based on the Xilinx Triple Modular Redundancy (TMR) Subsystem
09/04/2018
Advisor: prof. Marco Parvis
Co-Advisor: prof. Michelangelo Agnello
Candidate: Emanuele Canessa
Contents
1 Introduction 1
4 Metrics for Fault Tolerance 27
4.1 Dependability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1.1 Dependability Attributes . . . . . . . . . . . . . . . . . . . . . 28
4.1.2 Dependability Threats . . . . . . . . . . . . . . . . . . . . . . 33
4.1.3 Dependability Means . . . . . . . . . . . . . . . . . . . . . . . 35
4.2 Fault Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3 Metrics for Single Event Effects on FPGAs . . . . . . . . . . . . . . . 36
4.3.1 Cross Section . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.3.2 Measurement of SEE Sensitivity . . . . . . . . . . . . . . . . . 37
4.3.3 SEU Sensitivity on FPGAs . . . . . . . . . . . . . . . . . . . . 38
6.3.3 Mean Time to Failure Evaluation . . . . . . . . . . . . . . . . 57
6.4 Testing Procedure and Architecture . . . . . . . . . . . . . . . . . . . 58
6.4.1 Single Module Testing . . . . . . . . . . . . . . . . . . . . . . 58
6.4.2 Tabletop Testing . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.4.3 Ground Testing . . . . . . . . . . . . . . . . . . . . . . . . . . 61
List of Figures 97
List of Tables 101
Acronyms 103
Bibliography 107
Chapter 1
Introduction
Field-Programmable Gate Arrays have become more and more attractive to the
developers of mission-critical and safety-critical systems. Thanks to their reconfigurability,
as well as their I/O capabilities, these devices are often employed
as core logic in many different applications, such as:
• ASIC Prototyping;
• Audio;
• Automotive;
• Broadcast;
• Consumer Electronics;
• Data Center;
• High-Energy Physics;
• Industrial;
• Medical;
• Scientific Instruments;
• Security systems;
• Wired Communications;
• Wireless Communications.
On top of that, the use of soft microcontrollers can reduce the complexity of some
of the control logic of these devices, allowing new features to be developed
without having to redesign most of the control logic involved.
However, for safety-critical and mission-critical applications like aerospace and
High-Energy Physics, these devices require a further analysis of radiation effects.
This thesis, developed in collaboration with the Conseil Européen pour la Recherche
Nucléaire (CERN) A Large Ion Collider Experiment (ALICE) for the planned Inner
Tracking System (ITS) Upgrade, discusses the fault tolerance metrics and the testing
methodologies applicable to soft microprocessor cores running on FPGAs.
Chapter 2 discusses the effects of radiation on FPGAs, as well as the main
units of measure involved. Particular attention is then dedicated to the so-called
Single Event Effects.
Chapter 3 discusses the main techniques employed to protect digital designs loaded
onto FPGAs.
Chapter 4 discusses the main metrics available to classify the effects of faults
in these devices, with particular emphasis on the ones employed for Single Event
Effects.
Chapter 5 discusses the available techniques for radiation hardness design
validation. In particular, the working schemes for tabletop testing and ground
testing are presented.
Chapter 6 introduces the metrics and the testing methodologies that have been
used to characterize the Xilinx TMR Subsystem against radiation effects.
Finally, Chapter 7 presents the results of the characterization process and the
conclusions, as well as possible future work associated with this matter.
Chapter 2
Radiation Effects on
Field-Programmable Gate Arrays
Field-Programmable Gate Arrays (FPGAs) are becoming more and more attractive
in many fields of application due to their reconfiguration capabilities. FPGAs,
however, are highly sensitive to ionizing radiation. This weakness makes them very
prone to radiation-induced memory upsets.
• Antifuse-based;
• Flash-based;
• Static RAM (SRAM)-based.
Antifuse-based FPGAs
Flash-based FPGAs
Figure 2.1: Floating Gate NMOS Transistor: the accumulation of charge in the
Floating Gate prevents the transistor from working as expected; eventually, the
value stored is changed when the charge exceeds the threshold.
Finally, SRAM-based FPGAs are characterized by having all the configuration bits
stored in a Static RAM. Although this choice allows a potentially infinite number
of reconfiguration cycles, the memory itself is volatile and is the most susceptible
to radiation effects among the three families. The strength of this family of devices
resides in the technology adopted, which is the most advanced available on the market.
It is also important to note that an external memory has to be present in order to
reprogram the Configuration RAM in case of power loss.
Figure 2.2: SRAM Cell: the effect of a particle striking through one of the M1..M4
transistors could flip the value stored in the memory cell through the positive
feedback of the structure.
Table 2.1: Comparison of FPGA technologies in terms of the main parameters
considered in radiation environments.
• electrons;
• protons;
• α-particles;
• β-particles;
• heavy ions.
α-particles
An α-particle is a helium nucleus, made of two neutrons and two protons, and is
a highly ionizing particle. For this reason α-particles lose their energy over a
short path inside the material and can be easily shielded by a few centimeters of
air or by a thin shielding material like a sheet of paper.
β-particles
β-particles are electrons or positrons emitted by radioactive atoms. Their
energy spectrum can vary from a few keV up to 10 MeV and depends on the emitting
atom. As with other ionizing radiation, a simple and low-cost Geiger counter can
detect beta particles, although without any information about their energy. Beta
particles can be easily stopped in the material: for instance, a 1 MeV beta
particle can be stopped by a thin (∼ 1 mm) aluminum foil.
On the other hand, β-particles crossing materials with high atomic numbers (Z)
can produce Bremsstrahlung radiation (photons) that can easily penetrate the material.
In addition, positrons can annihilate and produce two photons of 0.511 MeV.
Neutrons
Neutrons, having zero electrical charge, are not able to cause direct ionization.
Their interaction with the material, however, can cause recoil of the nuclei present;
the recoiling nuclei can then cause subsequent ionization of other atoms. Having zero
electrical charge, these particles also have a greater penetration capability than
the particles discussed above.
One of the most common units of measure used for radiation is the so-called
absorbed dose or Total Ionizing Dose (TID). This quantity is often measured
in gray (Gy) or, less frequently, in rad (1 Gy = 100 rad). The TID has a direct
correlation with the energy that has been absorbed by the material: in fact, an
absorbed dose of 1 Gy corresponds to an absorbed energy of 1 J/kg.
The absorbed dose also has a biological significance, but in that case it is necessary
to take into account the type of radiation considered. This requires the
definition of a weighting factor, wr, for each radiation type. Each pair of radiation
type and energy is multiplied by the corresponding weighting factor, therefore
obtaining a "weighted" absorbed dose, called equivalent dose and measured in
sievert (Sv).
LET = dE/dx    (2.1)

Where dE represents the quantity of energy transferred and dx represents the
distance traveled in the material. Although it can be expressed in newton (N),
the unit most often used for this quantity is MeV/cm.
Different particles have different Linear Energy Transfer. For instance, α-particles
are often referred to as high-LET, while others –like β-particles– are defined as
low-LET.
Finally, the Effective Linear Energy Transfer (LETeff) is often used when the LET
has already been characterized using a beam perpendicular to the material. This
quantity is expressed as follows:

LETeff = LET / cos(θ)    (2.2)

For example, a beam tilted at θ = 60° with respect to the normal doubles the
effective LET, since cos(60°) = 0.5.
Fluence
Fluence is the flux integrated over a period of time. The particle fluence defines the
number of particles passing through a spherical surface during a specified period of
time ∆T.

Φ = ∫∆T φ dt    (2.3)

For a constant flux φ, this reduces to Φ = φ · ∆T.
Single Event Effect (SEE) is a generic term that describes the types of effects that
can be caused by a single particle striking a silicon device. A necessary condition for
a Single Event Effect to occur is that the penetrating particle has a sufficient
LET to cause ionization.
Figure 2.3: Single Event Effects on transistors: the effect of striking particles can
activate the transistors. (a) Heavy ion striking through a transistor and creating an
ionization path; (b) proton inducing nuclear reactions in a transistor.
The first two families of errors are often referred to as Soft Errors; the term
derives from the fact that this type of error can be cleared by power cycling the
circuit.
The last five families, instead, are examples of Hard Errors: these errors lead to
a permanent misbehavior of the circuit; to recover from a hard error it is often
necessary to replace the whole device.
Single Event Upsets are a special form of Single Event Effects: they model the effect
of a striking particle that hits a memory element in a sequential circuit and flips its
value. Among the other types of SEEs, SEUs are the least destructive events that
can be caused by striking particles.
These errors manifest themselves with a high probability in devices that contain
large memory elements: this is a common denominator in FPGAs.
A Single Event Functional Interrupt (SEFI) is a particular type of SEU that takes place
when one of the basic functionalities of the circuit is interrupted due to the upset.
Common examples of SEFIs are particles that hit the clock tree configuration bits
in FPGAs.
The term Multiple-Cell Upset (MCU) identifies a special type of SEU that changes the
state of two or more logic cells. These cells are usually physically adjacent, so that
a single particle can affect all of them.
A particular case of MCU is represented by the Multiple-Bit Upset (MBU): in this
case the cells whose values have been flipped by the particle belong to the same
word. These effects are very destructive in terms of functional behavior of the
circuit: in fact, error correction codes are usually not able to correct more than
one bit flip per word. For this reason, many hardware manufacturers produce memories
where the cells of a word are interleaved with cells of other words, so that the
possibility of having an MBU is greatly reduced.
Another special kind of SEE is represented by the SET. This type of soft error
models a change in the timing of a signal. The circuit behavior induced by a SET
can be modeled as a glitch in a signal propagating through the circuit. [4]
If the voltage transient caused by the particle striking through a node in the
combinational logic is captured by a storage element, it can lead to a state change.
In this case the SET results in a SEU in a memory element.
This category includes all the SEEs that can cause permanent damage to the
integrated circuits on which they arise.
Single Event Induced Burnouts usually affect the power transistors present in a
circuit. They correspond to a triggering of the parasitic bipolar structure of the
transistor, which is followed by a positive feedback. The feedback rapidly increases
the current flowing, therefore producing a burnout in the affected transistor.
Figure 2.4: Single Event Induced Burnout on a MOS Transistor: the parasitic
bipolar structure is excited, followed by a positive feedback that increases the
temperature.
A Single Event Gate Rupture, also called Single Event Dielectric Rupture (SEDR),
represents the destructive rupture of a dielectric present in a transistor (usually
the gate oxide). The rupture of the dielectric causes the formation of a conducting
path; in the case of a SEGR a permanent gate leakage current is added.
Figure 2.5: Single Event Gate Rupture on a MOS transistor: the dielectric present
in the gate of the transistor is pierced.
These hard errors are always destructive, and the only way to protect a component
against these effects is to force electrical conditions such that their generation is not
possible.
Figure 2.6: Single Event Latch-Up on a PNPN thyristor structure: the particle
excites the implicit thyristor structure that starts conducting due to the positive
feedback.
Single Event Snap-Back is very similar to the effect produced by a SEL, the only
difference being that it occurs within a single MOS structure. Similarly, a high current
is generated between the drain and source regions, amplified by the intrinsic bipolar
transistor placed between them. The high current, as in the other cases, generates
a localized heating that could lead to permanent damage if the device is not powered down.
Finally, the outcome caused by a Single Event Hard Error is very similar to a SEU: a
memory cell's bit flips its value. The difference is that the change is semi-permanent
or permanent; for this reason a SEHE is often called a stuck-bit error or hard fault.
As discussed in Section 2.1, the different technologies used for FPGAs are characterized
by different tolerances to radiation. The limited capabilities of antifuse- and
flash-based devices force the usage of SRAM-based FPGAs to implement complex
designs with strict requirements. These devices, though, present a strong susceptibility
to Single Event Effects.
The most common type of SEE present in SRAM-based FPGAs, as discussed in
Section 2.3.1, is indeed the Single Event Upset. This special type of soft error
can result in a number of error modes in different parts of the FPGA.
In an SRAM-based FPGA, the component most sensitive to SEUs is surely the
Configuration RAM (CRAM). This memory holds information about:
• Routing configuration.
It is important to note that not all SEUs lead to errors: for instance, there may
be some configuration bits that are either not used or even disabled. Xilinx defined
a set of special bits, called essential bits, as a subset containing only the bits that are
essential for the specific design loaded onto the FPGA. Flipping the value of a Xilinx
essential bit leads to misbehavior(s) in the design.
Finally, errors affecting this memory are often called static errors because they
will not disappear until actively corrected by either scrubbing or complete reconfig-
uration.
LUTs are used to configure the logic function of combinational logic inside the
FPGA. Every bit in these memory elements defines the output of the combinational
block given a particular input. In case of an upset, the logic function implemented
changes, modifying the behavior of the circuit described.
Figure 2.7: Single Event Upset in a LUT: the logic function implemented is changed
In an FPGA, the user data memory is composed of all the memory elements that are
used inside the design. These memory elements usually include:
• D Flip-Flop (DFF);
• Block RAM (BRAM);
• Distributed RAM (DRAM).
The contents of these components can change at any time during the operating
time of the design; for this reason, errors affecting these bits are not considered
permanent. To correct errors present in the user data space it is necessary to
employ design mitigation techniques, discussed in Chapter 3.
FPGAs have many different I/O configuration capabilities. Usually, all the available
pins are configurable as input, output, or bidirectional buffers. An error affecting
the configuration bits responsible for this feature can potentially lead to permanent
damage of the device: for instance, a pin that was previously configured as an input
could be reconfigured as an output, leading to short circuits.
• Multiplexers;
• Buffers.
Multiplexers are one of the most widely used components in FPGAs. Similarly
to PIPs, their select signal is driven by one or more configuration bits. The effect
of an error on one of these elements could lead to an undefined behavior of the
design: for example, an upset could change a MUX configuration in such a way that
it now selects the input from an unused, unconnected component.
Finally, buffers are similar to PIPs. Buffers are often used to propagate clocks or
for I/O purposes. An error on one of these elements could cause a variety of different
effects, ranging from the interruption of the clock distribution to more severe errors,
like a wire driven by two buffers at the same time (short circuit).
Chapter 3
• Hardened Technologies;
• Mitigation Techniques.
The following sections briefly describe some of the system-level mitigation tech-
niques that are applicable to a digital design in order to increase its reliability; those
techniques are not mutually exclusive, meaning that they can be mixed in order to
obtain better results.
The replication leads not only to increased area, but also to power consumption
and routing difficulty overheads.
It is important to notice that the weak element of this configuration is the
comparator/voter placed at the end of the two hardware blocks: the presence of a fault
in this component completely defeats the purpose of duplicating the design. With
that said, in the vast majority of digital designs, the cross section (i.e. the probability
that a single event can manifest) of this component is several orders of magnitude
smaller than the probability of a fault in the original hardware block.
Depending on the level of replication, these techniques may provide:
• Error Detection;
• Error Detection and Error Correction.
Figure 3.1: Duplicate with Comparison scheme: the systems are duplicated and
their outputs compared for errors
Figure 3.2: N-Modular Redundancy scheme: the systems are replicated N times,
the voter decides the correct output using a majority voter scheme
The error detection and correction capabilities, combined with the smallest area
overhead among the majority-voting architectures, have made TMR the most
common technique used for design mitigation.
For the sake of simplicity, the following considerations are focused on this
particular kind of architecture, but they can be easily extended to any value of N.
To introduce the first problem of this approach, which is in general the main issue of
all error correction techniques, let us first introduce the simpler version.
Its working principle is straightforward: the input data is triplicated and feeds all
the blocks, then their outputs are voted by the majority voter.
Figure 3.3: Block TMR scheme: a block is triplicated, including its memory ele-
ments, and then a voter decides the correct output using a majority voter scheme
The main problem of this scheme shows itself when the various replicas of the
hardware block contain registers, a common scenario in all synchronous designs.
The presence of a memory element in the replicated block implies an internal state;
in case of an error, the internal state may drift from the correct state, leading to
permanent errors at the output.
Having one block that always provides wrong results completely defeats the Error
Correction capabilities of the system: in case another error occurs in one of the
non-faulty blocks, the voter will no longer be able to mask its presence.
A direct consequence can be highlighted by modeling the non-protected system
and the one that encompasses Block TMR. The reliability of the non-protected
system can be expressed as follows:

R(t) = e−λt

Where R(t) is the reliability as a function of time t and λ is the failure rate of the system.
The plot shown in Figure 3.4 highlights the problem: as time passes, the
probability of a fault that alters the state of one memory element increases,
thus reducing the reliability of the system. Block TMR is valid if the system is
periodically reconfigured and reset; otherwise, after a given amount of time τ(λ), the
non-protected system will offer a greater reliability than the triplicated one.
Figure 3.4: Reliability comparison between a single system and a system triplicated
with Block TMR (reliability plotted against time): the triplicated system is more
likely to fail after a certain period of time.
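To make the comparison of Figure 3.4 concrete, the sketch below (an illustrative Python fragment added here, not part of the original work) evaluates the single-block reliability R(t) = e^(−λt) against the standard reliability of a non-repaired Block TMR system with a perfect voter, R_TMR(t) = 3R(t)² − 2R(t)³, and prints values around the crossover time t = ln 2 / λ after which triplication becomes a disadvantage. The failure rate is an arbitrary example value.

```python
import math

def r_single(t, lam):
    """Reliability of a single, non-mitigated block: R(t) = exp(-lambda * t)."""
    return math.exp(-lam * t)

def r_block_tmr(t, lam):
    """Block TMR with a perfect voter and no repair: the system survives
    while at least 2 of the 3 replicas are still fault-free."""
    r = r_single(t, lam)
    return 3 * r**2 - 2 * r**3

lam = 1e-3                        # assumed failure rate [1/h], purely illustrative
crossover = math.log(2) / lam     # time at which R_TMR(t) drops below R(t)

for t in (100, 500, crossover, 1000, 2000):
    print(f"t = {t:7.1f} h   single = {r_single(t, lam):.4f}   "
          f"BTMR = {r_block_tmr(t, lam):.4f}")
```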
The solution to the problem presented above comes with additional overhead in terms
of area and routing difficulty.
Every flip-flop is triplicated, as well as the combinational logic; voters are added
after every tuple of memory elements to vote and restore their state. With this
method, the fault is masked internally, and the state of the hardware block is always
restored on the next clock cycle by the feedback network.
Distributed TMR always offers a greater reliability compared with the non-mitigated
single block system.
Figure 3.5: Distributed TMR scheme: the block is triplicated at each step, the
inputs to the next step are always voted and the correct state is always restored
from the voted output
The process that stands behind all the information redundancy techniques uses:
• an encoding function F (D) that takes as input the original data D and returns
the encoded value K;
• a decoding function F −1 (K) that takes as input the encoded value K and
returns the original data D.
The data is stored only in its encoded version K. The function F(D) is tuned
to maximize the possibility of identifying an error in the stored data.
• Even Parity: the parity bit is asserted when the number of ones present in the
data word, excluding the parity bit itself, is odd.
• Odd Parity: the parity bit is asserted when the number of ones present in the
data word, excluding the parity bit itself, is even.
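A minimal illustration of the two parity conventions above (an added Python toy example; the sample data word is arbitrary):

```python
def even_parity_bit(word: int) -> int:
    """Parity bit that makes the total number of ones (data + parity) even:
    it is asserted when the data word alone has an odd number of ones."""
    return bin(word).count("1") % 2

def odd_parity_bit(word: int) -> int:
    """Parity bit that makes the total number of ones (data + parity) odd."""
    return 1 - even_parity_bit(word)

data = 0b10110010                 # 4 ones -> even parity bit = 0, odd parity bit = 1
print(even_parity_bit(data), odd_parity_bit(data))
```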
Cyclic Redundancy Check (CRC) is a generalized version of the parity code technique,
where more than one redundant bit is used. Like its simpler version, CRC is only able
to provide error detection capabilities.
The technique defines a simple hash function that is designed to maximize the
error detection capabilities. Unlike parity code, using more than one redundant bit
allows different families of errors to be detected.
Cyclic codes are in general easy to implement, with a relatively low hardware
overhead, making them a preferable solution in many applications. However, the lack
of error correction capability does not make CRC suitable for time-critical applications,
where data recovery by retry is not possible.
Hamming Code, also called Error Correction Code (ECC), was introduced by Richard
Hamming in 1950. This redundant technique allows both error detection and error
correction in the non-redundant bits.
As of today, Hamming Code refers to a specific (7,4) code that uses 3 redundant
parity bits to encode 4 data bits in a word of 7 bits. In this particular configuration,
often called Hamming(7,4)-code, the additional bits are capable of Single Error
Correction (SEC).
While discussing the Hamming Code, it is important to introduce a new concept:
the Hamming Distance. The Hamming Distance between two strings s1 and s2,
of equal length, is defined as the number of positions at which the corresponding
symbols are different.
In other words, the Hamming(7,4)-code is able to detect and correct errors up
to a Hamming distance of one.
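The sketch below (an added illustrative Python example using the classic parity-bit positions 1, 2 and 4; the data nibble and the injected error position are arbitrary) encodes four data bits, flips one codeword bit, and shows how the recomputed parity checks form a syndrome that points directly at the corrupted position.

```python
def hamming74_encode(d):
    """d: list of 4 data bits [d1, d2, d3, d4] -> 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4            # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4            # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4            # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]   # index i holds position i + 1

def hamming74_correct(c):
    """Recompute the checks; the syndrome is the 1-based position of a single flipped bit."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:                          # non-zero -> flip the offending bit back
        c[syndrome - 1] ^= 1
    return c, syndrome

code = hamming74_encode([1, 0, 1, 1])
code[5] ^= 1                              # inject a single-bit upset at position 6
fixed, pos = hamming74_correct(code)
print("error detected at position", pos, "->", fixed)
```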
Although Hamming Code is able to detect and correct single bit errors, the original
implementation is not able to detect whether more than one error is present in the
original word. Note that if an error correction is attempted on a word that presents
two errors, the result of this operation is still an incorrect word. To overcome this
problem, various extensions to the original Hamming Code have been presented over
the years that enable the Double Error Detection (DED) capability. The most common
one adds a parity bit to the original (7,4) code to enable this feature.
This family of extensions to the original Hamming Code is called Single Error
Correction/Double Error Detection (SECDED) and it is often employed in memory
designs.
Figure 3.7: Time Redundancy in Software: the same operation is repeated multiple
times and then the result is compared and voted
• The entire hardware is replicated N times and the clock is delayed properly,
such that the N independent hardware blocks are queried at different time
instants.
Temporal Redundancy techniques are often not employed due to the high com-
putational time and area overhead.
Figure 3.8: Time Redundancy in Hardware: delays are added to repeat the same
operation at different time instants
Chapter 4
A fundamental task when working on fault tolerance is the definition of the so-called
metrics: standards of measurement that can provide information on how well the
system is performing.
Before defining the metrics for fault tolerance it is necessary to describe
the mission of a product, that is, in short, its purpose. The mission can be
characterized by:
The following sections present some of the main metrics that are available
to classify systems and provide standardized benchmarks.
4.1 Dependability
Dependability is one of the key parameters used to assess the quality of a product.
Dependability is the property that characterizes a dependable system, and it is defined
as:
This property is used in many different fields, and it can be defined using three
different classes of parameters:
• Reliability;
• Maintainability;
• Availability;
• Safety.
R(t) = Pworking(τ > t) = ∫_t^∞ f(x) dx    (4.1)

Where Pworking(t) represents the probability of being in a working state at time
t, τ is a random variable and f(t) represents the failure probability density function.
Another unit of measure to quantify the reliability of a system is defined by the
Mean Time to Failure (MTTF). This quantity represents the average time before a
failure occurs in the system.
MTTF = E[τ] = ∫_0^∞ t · f(t) dt    (4.2)
Where E[τ ] is the expected value of the random variable τ , defined in Equation
4.1.
Finally, failure rate (also called hazard rate) is defined as the number of failures
over a period of time.
λ = #FAILURES / ∆t    (4.3)
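As a consistency check between Equations 4.2 and 4.3 (an added illustrative sketch, not part of the original work): for a constant failure rate λ, i.e. the exponential model R(t) = e^(−λt) already used in Section 3.1, the failure density is f(t) = λe^(−λt) and the integral of Equation 4.2 evaluates to MTTF = 1/λ. The short Python fragment below verifies this numerically with an arbitrary example rate.

```python
import math

lam = 2.0e-4                      # assumed constant failure rate [1/h]

# Numerically approximate MTTF = integral of t * f(t) dt with f(t) = lam * exp(-lam * t)
dt = 1.0
mttf_numeric = sum(t * lam * math.exp(-lam * t) * dt for t in range(0, 200_000))
print(mttf_numeric, "~", 1.0 / lam)   # both close to 5000 h
```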
Figure 4.2: The product life cycle of a repairable system: it transitions from working
to failure state and vice versa using failure and repair transitions
The following attributes can be defined when dealing with a subset of all the possible
systems, called repairable systems. A repairable system is characterized by the ability
of being repaired, and its life cycle can be modeled as a diagram with two states:
working state and failure state.
The transitions between these two states are regulated by the alternation of two
processes: repair-to-failure and failure-to-repair. The former is regulated by the
random variable τ (defined in Equations 4.1 and 4.2), that represents the time-to-
failure of the system; the latter is instead related to another random variable, θ,
that represents the time-to-repair.
Similarly to non-repairable systems, repairable systems can be characterized using a
quantity similar to the MTTF, the Mean Time Between Failures (MTBF): it represents
the average amount of time between a failure and the subsequent one.
This capability can have consequences depending on the truthfulness of the
assumption "system as good as new after repair". If the assumption is not considered
true, another parameter has to be accounted for: the Mean Time to First Failure (MTTFF).
From now on, the aforementioned assumption will be considered true, therefore the
following condition holds.
Where E[θ] is the expected value of the random variable θ and m(t) is the
repairability probability density function (m(t) = dM(t)/dt).
Another useful parameter used is represented by the repair rate, defined as the
number of repairs over a period of time.
µ = #REPAIRS / ∆t    (4.7)
Availability
Where p0 and p represent, respectively, the current and the next state of the
Markov chain and Q is the transition matrix.

Q = | −λ    λ |
    |  µ   −µ |    (4.10)
Figure 4.3: Markov chain representation of a repairable system: the transition prob-
abilities are defined by the failure rate (λ) and the repair rate (µ)
Substituting the value of Q into Equation 4.9 leads to the following system of
equations.

dpw(t)/dt = −λ · pw(t) + µ · pf(t)
dpf(t)/dt = λ · pw(t) − µ · pf(t)    (4.11)
Where pw (t) = A(t) and pf (t) = 1 − A(t) = U (t), also called Unavailability. The
initial conditions for the above system of equations assume that the Markov chain
starts from the working state, so pw (0) = 1 and pf (0) = 0.
Solving the system of Equation 4.11 allows the Availability, A(t), and the
Unavailability, U(t), to be expressed as functions of the failure and repair rates.

A(t) = pw(t) = µ/(λ+µ) + λ/(λ+µ) · e−(λ+µ)t = A∞ + Atrans    (4.12)

U(t) = pf(t) = λ/(λ+µ) − λ/(λ+µ) · e−(λ+µ)t = U∞ − Atrans    (4.13)
These two equations are characterized by a constant term, often called steady-
state term, and a transient one, that is multiplied by an exponential. A common
condition for a repairable system is that the time required to repair it is negligible
compared to the time required to experience a failure.
For this reason, Equations 4.12 and 4.13 can be simplified as follows.

A(t) = A∞ = µ/(λ+µ) = MTTF / (MTTF + MTTR)    (4.15)
U(t) = U∞ = λ/(λ+µ) = MTTR / (MTTF + MTTR)    (4.16)
Finally, availability is often also expressed as the ratio between the uptime
and the total time elapsed.

A(t) = UPTIME / (UPTIME + DOWNTIME)    (4.17)
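A small numerical illustration of Equations 4.12–4.16 (an added Python sketch; the MTTF and MTTR values are invented for the example): with MTTF = 1000 h and MTTR = 2 h the steady-state availability is 1000/1002 ≈ 0.998, and the transient term dies out after a few multiples of 1/(λ+µ).

```python
import math

mttf, mttr = 1000.0, 2.0          # assumed values [h], purely illustrative
lam, mu = 1.0 / mttf, 1.0 / mttr  # failure and repair rates

def availability(t):
    """Equation 4.12: steady-state term plus an exponentially decaying transient."""
    return mu / (lam + mu) + lam / (lam + mu) * math.exp(-(lam + mu) * t)

a_inf = mttf / (mttf + mttr)      # Equation 4.15: steady-state availability
for t in (0.0, 1.0, 5.0, 20.0):
    print(f"A({t:5.1f} h) = {availability(t):.6f}   (A_inf = {a_inf:.6f})")
```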
Threats are phenomena that can affect the mission of a system by interfering with
its components. First of all, it is important to define the possible outcomes of a threat
present in a digital circuit. In fact, a threat can manifest itself differently depending on
its type, on the structure of the circuit and on the mission accomplished by the
application. The following list provides the four possible outcomes that can be
related to the presence of a threat.
• Fault;
• Error;
• Misbehavior;
• External Effect.
Figure 4.4: Fault life cycle: the fault is activated into an error, the error is prop-
agated into a misbehavior, depending on the type of the misbehavior the external
effects can be different
Faults
Errors
Errors represent an internal discrepancy between the expected behavior and the actual
one. Their presence is dictated by the activation of a fault present in the circuit.
An example of an error could be an internal component whose state has drifted away
from the correct one. Errors can be observed using specialized mechanisms, like a
hardware debugger.
Similarly to what has been said for faults, errors may or may not be propagated
into an actual misbehavior.
Misbehaviors
External Effects
The external effects are caused by the presence of a failure in a system. Depending
on its severity, the impact on the service delivered can be different: for instance,
an error at the output that is not distinguishable from a correct one can
severely impact the mission, while a detectable misbehavior can be identified
and corrected, therefore reducing its effect.
Latency of a fault is defined as the amount of time between its occurrence and
its manifestation as a misbehavior on the system. This quantity can be influenced
by many different factors:
• The time of occurrence: a fault that occurs during an active time of a compo-
nent has higher chances of being propagated.
• The observation level: depending on how the component is observed, the fault
propagation may be delayed.
• Fault Prevention, that defines techniques adopted to prevent faults from oc-
curring;
• Fault Removal, that defines techniques used to remove a fault from the system;
• Fault Tolerance, that defines techniques utilized to deal with and mask the presence
of a fault (discussed in Chapter 3);
• Type: that identifies the class the fault belongs to; for example, a fault that
changes the value of a memory location is called a memory fault, while a fault
that modifies the logic function of a block is called a logic fault.
• Locality: that is the location in which the fault is placed. Faults in critical
components can impact severely the mission of the system.
• Latency: the interval of time between its occurrence and its manifestation as a
misbehavior, as discussed in Section 4.1.2.
• Frequency: that represents how often the same fault occurs on average.
• Severity: that is the magnitude of the fault’s effect on the system’s mission.
This parameter is strongly dependent on the fault type and the fault locality.
• Critical fault: that represents a fault that prevents the system's mission from being
carried out until the repair is completed. The frequency of this category of
faults has to be very low or non-existent.
• Major fault: this type of fault is very similar to a critical fault, with the
difference that a temporary workaround can be applied in order to avoid strong
consequences on the mission. Major faults can manifest themselves with a
slightly higher frequency compared to the critical ones.
• Minor fault: this category includes all the faults that have few secondary
effects on the system's behavior and usually do not affect the mission carried out.
As discussed in Chapter 2 (Section 2.4), Single Event Effects are a common denominator
in FPGAs that have to work in radiation environments. There is therefore
the need to classify the sensitivity of the device to SEEs.
These properties are strongly dependent on the technology parameters used to produce
the integrated circuit on which the Configuration RAM is implemented. In
the following sections some of the main quantities to consider are explained and
discussed.
σion(LET) = #EVENTS / Φion    (4.18)

σn,p(E) = #EVENTS / Φn,p    (4.19)
σ = Navg / (Φ · cos(θ))    (4.20)

Where Navg is the average number of events per device, Φ is the fluence and θ is
the incidence angle (0° if the beam is perpendicular to the device).
Influencing Factors
Equation 4.20 takes into account only a few of the affecting factors; the cross section
is actually influenced by many more parameters:
• Particle Energy;
• Temperature;
• Operational Mode;
• Clock Frequency;
• Current-limiting conditions;
• Reset conditions;
From this calculation, it is possible to evaluate the number of Single Event Upsets
as a function of the fluence.

U = Φ · σdevice    (4.22)

The upset rate is useful for calculating the various requirements in terms of
correction rate.
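Putting Equations 4.20 and 4.22 together (an added worked sketch in Python; all numbers are invented for illustration, not measured values from this work): a beam test with a known fluence yields the device cross section, which can then be used to predict the upset count for a different mission fluence.

```python
import math

# --- Hypothetical beam-test data (illustrative only) ---
n_avg   = 120          # average number of observed events per device
fluence = 1.0e10       # particles / cm^2 delivered during the test
theta   = 0.0          # beam perpendicular to the device

# Equation 4.20: cross section per device
sigma_device = n_avg / (fluence * math.cos(math.radians(theta)))   # cm^2

# Equation 4.22: expected upsets for a given mission fluence
mission_fluence = 5.0e8                 # particles / cm^2
expected_upsets = mission_fluence * sigma_device

print(f"sigma_device = {sigma_device:.2e} cm^2")
print(f"expected upsets over the mission = {expected_upsets:.1f}")
```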
Chapter 5
Once the design phase is completed and all the mitigation techniques have been
implemented in the design, it is important to validate it to ensure a correct behavior
in radiation environments. There are multiple techniques to simulate the effect of
a particle beam hitting an FPGA.
Before discussing the mechanism behind the process of fault injection it
is necessary to explain the internal structure of the Configuration RAM present in
FPGAs.
The CRAM is organized as an array of frames, similarly to a wide Static RAM.
Each frame is subdivided into words, which are usually 32 bits each. Each bit present
in these words represents a specific configuration bit used to configure the various
parts of the FPGA.
When performing fault injection, it is usual to randomize the frames, words, and
bits in order to better simulate the effect of particles hitting the FPGA without a
defined pattern, which is close to what happens if the device is put under a particle
beam.
With this technique, however, it is also possible to force an ECC error by properly
selecting two bits of the same word, simulating a worst-case scenario where the bits
are not correctable automatically.
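A possible way to randomize injection targets following the frame/word/bit organization described above (an added illustrative Python sketch; the frame count, the words-per-frame value and the inject_bit_flip helper mentioned in the comment are assumptions or hypothetical placeholders, not an actual tool API):

```python
import random

N_FRAMES        = 20_000   # assumed CRAM size, device dependent
WORDS_PER_FRAME = 123      # assumed frame length in 32-bit words
BITS_PER_WORD   = 32

def random_target():
    """Pick a uniformly random configuration bit as (frame, word, bit)."""
    return (random.randrange(N_FRAMES),
            random.randrange(WORDS_PER_FRAME),
            random.randrange(BITS_PER_WORD))

def random_double_bit_target():
    """Two distinct bits of the same word, to defeat single-error-correcting ECC."""
    frame, word, _ = random_target()
    b1, b2 = random.sample(range(BITS_PER_WORD), 2)
    return (frame, word, b1), (frame, word, b2)

# inject_bit_flip(frame, word, bit) would be provided by the injection platform (hypothetical)
print(random_target())
print(random_double_bit_target())
```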
With that said, taking into account the fact that these memory elements represent
a small percentage of the total configuration RAM size, this technique is able
to predict the behavior under beam with a sufficient confidence level.
The SEM IP, among its other features, has the ability to classify the faults that
are present in the CRAM of the device. This is a proprietary technology of Xilinx,
called Xilinx Essential Bits Technology, that uses an algorithm to identify which
are the essential bits for a design. Essential bits are, in short, a subset of all the
configuration bits available: they are essential in the sense that changing the value
of one of these bits changes the function implemented by the design. [5]
Xilinx also defines the so-called prioritized essential bits, a subset of the essential
bits that are weighted by metrics defined by the user. An example of this could be
the configuration bits of a device with a high utilization rate. On top of that, there
are the critical bits: those are bits whose change is likely to kill the entire design,
like the configuration bits for the clock distribution. [5]
The Soft Error Mitigation IP supports error classification using an external Read-
Only Memory (ROM), interfaced through the built-in SPI interface, that contains a
list of only the essential bits for the design. In this way, in case of an uncorrectable
error, it is possible to identify whether the error involves essential bits or not, and act
accordingly.
Figure 5.1: SEM IP block description: all the input and outputs ports are listed
• Mitigation only;
• Detect only;
• Emulation;
Figure 5.2: Xilinx Essential Bits: configuration bits can be classified based on their
priority levels
• Monitoring only.
Working Principle
Assuming the SEM IP has been generated with the most feature-rich mode, Mitigation
and Testing, it is possible to:
• detect errors;
• correct errors;
• inject errors.
All of these operations are possible thanks to a dedicated interface to the Configuration
RAM, the Internal Configuration Access Port (ICAP). This interface enables
a direct, fast communication from the FPGA fabric to the configuration memory. For
this reason the SEM IP can detect errors with a latency between 22 ms and 58 ms,
depending on the size of this memory. [13]
To ensure error detection and correction capabilities, the addressable memory
locations are continuously read as fast as possible. In case one frame presents a CRC
error, correction has to be performed: in the case of a single bit error with ECC
enabled, the correction is automatic, and the IP only takes care of rewriting the
correct value in the corrupted memory location. If, instead, a multiple bit error is
present, the Hamming codes implemented are not able to correct its value: in this
case it is necessary to classify the bits affected using the error classification capability,
if enabled. If one or more bits are classified as essential, or if the classification is
disabled, the IP will show an uncorrectable error message stating that recovery
is impossible: in this case it is necessary to reprogram either the configuration frame
or, directly, the entire device, in order to restore the original configuration.
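The decision flow just described can be summarized with the following pseudo-logic (an added illustrative Python sketch of the behavior as described in this section, not the actual SEM IP interface; the function and parameter names are hypothetical):

```python
def handle_frame_error(single_bit: bool, ecc_enabled: bool,
                       classification_enabled: bool, involves_essential_bits: bool) -> str:
    """Mirror, in plain code, the error-handling behavior described in the text."""
    if single_bit and ecc_enabled:
        # Single-bit error: the ECC identifies the bit and the frame is rewritten.
        return "corrected automatically"
    if classification_enabled and not involves_essential_bits:
        # Multi-bit error touching only non-essential bits: the design is unaffected.
        return "uncorrectable but harmless"
    # Essential bits hit, or no classification available: full recovery is needed.
    return "uncorrectable: reconfigure the frame or the whole device"

print(handle_frame_error(single_bit=False, ecc_enabled=True,
                         classification_enabled=True, involves_essential_bits=True))
```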
Fault injection, on the other hand, can be performed using two different types of
addressing:
The difference between the two is that the former has the property of being
linear, from 0 to a maximum value that depends on the size of the FPGA CRAM.
The latter, instead, is closer to the actual cell placement: in fact, internally, LFAs
are translated into PFAs. Removing this level of abstraction creates intrinsic "holes"
in the address space; an attempt to inject an error at these locations is simply discarded,
and no action is taken.
Using an external device that manages the fault injection on the configuration bits
presents various advantages with respect to an internal peripheral.
First of all, since the internal peripheral is able to inject faults on virtually the whole
device, during sessions of random fault injection there is the possibility
that the injected error breaks the peripheral itself, forcing a reconfiguration of the
FPGA to regain control. This problem does not exist in the case of external
platforms like the JCM; the advantage, then, is that it is possible to inject an
arbitrary number of faults without having to continuously check the working status
of the fault injector, therefore easing the process.
Secondly, the external platform does not consume resources on the FPGA:
implementing an internal scrubber requires the utilization of internal resources like
LUTs, D Flip-Flops, Block RAMs and Distributed RAMs. This point is also important
since the additional utilization of resources could make a difference in terms of
routing difficulty, which in the specific case of fault injection can alter the results.
This approach also has drawbacks: similarly to what has been discussed for
blind scrubbing, the interface speed can be a problem, even though the usual rates for
fault injection are slow enough not to notice any difference using the JTAG interface.
complete their radioactive decay process. During this period only a small group of
specialized people can access that room; for this reason it is usually necessary to
wait until the radioactive decay can be considered complete. A usual period for
this process is around two weeks (14 days), after which it is possible to retrieve the
prototype(s).
Chapter 6
The main matter of this thesis is the characterization of the Xilinx
Triple Modular Redundancy (TMR) Subsystem, an Intellectual Property developed
by Xilinx that is designed to increase the dependability of their soft microprocessor
core, the Microblaze. The following sections present the main structure of
the TMR Subsystem, the testing procedure and the testing architecture employed
to characterize and compare its performance against radiation effects.
• 40 MHz Clock;
• No Instruction and Data Cache;
• No Branch Target Cache;
• No Memory Management Unit (MMU);
• No Barrel Shifter;
This core serves as the base to build the Triple Modular Redundancy Subsystem,
which is, in short, a set of IPs developed by Xilinx designed to automatically manage
and mask the presence of the faults that affect the Microblaze soft core. [16]
Similarly to what has been said for the Microblaze embedded core, there are many
configuration options that can be used to generate the Subsystem, which has been
configured to triplicate the soft core and its peripherals, without a watchdog counter
and without the Soft Error Mitigation interface. The reason for omitting this interface
is that the SEM IP shows a susceptibility to soft errors comparable to that of the
core itself, therefore making its presence not useful for testing purposes.
An important role in the TMR Subsystem IP is played by the TMR Manager
component. This is the core component of the subsystem: it handles the presence
of faults by continuously analyzing the comparator statuses. In case one of them
presents a mismatch, a special interrupt-like signal, called Break, is asserted and the
Microblazes present in the design are forced to start the recovery process.
During this process the cores are forced to perform the following list of operations:
[16]
1. The software is interrupted by the break signal, which causes the call of the
software break handler function.
2. The break handler stores all the internal registers to the data RAM. During
this process, the data is automatically corrected by the voters present at the
output of each processor.
3. The break handler resets all the Microblaze cores present by executing a special
instruction that also resets the status of the TMR Manager.
4. After reset, the values of the registers placed in RAM are read and restored.
Figure 6.1: TMR Subsystem block diagram: the Microblaze core is triplicated as
well as its peripherals and its memory, the outputs are voted
5. A special return resumes the execution exactly at the place where the break
occurred.
During this process, the processor subsystem is unavailable. The shortness of the
recovery process ensures high levels of availability for the core. For real-time
applications there is also the possibility of masking the break signal during time-critical
parts of the executed code, de facto delaying the restore of the subsystem, which in
the meantime works with a scheme similar to Duplicate with Comparison (Lockstep).
Figure 6.2: TMR Manager state transition in case of an error: starting from Voting
mode, an error moves the state to Lockstep mode, where only two out of three
processors are working with a Duplicate with Comparison scheme.
• Fast: the benchmark should be able to highlight the presence of errors in the
microprocessors on which it runs as soon as possible, therefore reducing
the fault latency. This requires the continuous production of results to compare
against.
Excluding the last requirement, the first three are easily covered if an algorithm
to detect stuck-at faults is employed for this type of test. Although this type of
algorithm is proven to give the best results in terms of speed, exhaustiveness and
size, it is strongly affected by the architecture of the microprocessor tested and is
definitely not portable to other architectures.
After careful consideration, the Advanced Encryption Standard (AES) algorithm was
chosen as the radiation benchmark. The reasons for this choice are presented in the
following paragraphs.
First of all, it is a fast algorithm that works on small chunks (128 bit) of data
at a time and continuously produces results to be compared against. It works by
reading a block of data to be encoded, and it outputs it right after the encoding;
the reduced size of the block allows it to produce results at a high rate.
Secondly, for the purpose of testing an integer-only microprocessor it is exhaustive:
this is the case for the configured Microblaze core. The operations performed
are able to exercise most of the core on which the algorithm runs, allowing the
presence of errors in these components to be easily highlighted. In a cryptographic
algorithm every bit stored in a register is used and matters for the success of the encoding.
Finally, there are many open source implementations of this algorithm available
online, some of them already optimized to run on microprocessors. AES can be
found implemented in many different programming languages, including the omnipresent
C language used for microprocessor programming, which are easily portable
and architecture-independent.
• Active and Working – that represents the nominal conditions: the processor
is both in running state and it is producing the correct results at the outputs.
• Active and Not Working – that represents the conditions where the processor
is in running state but it is not producing the correct results.
• Not Active and Working – that represents the conditions where the processor
is not in running state (i.e. it is not producing any output) but it was working
correctly up to that point.
• Not Active and Not Working – that represents the conditions where the pro-
cessor was already not producing the correct results, and it stopped producing
any. This state is not really meaningful for the analysis performed but it is
indeed one of the possible outcomes.
not correct. This status is definitely the most hazardous among all of them: the
fact that the core produces what seem to be correct results, while in reality they are
not, can have serious consequences on the mission carried out. In fact, the only method
to verify the correctness of the data produced by a core is to have the expected
values already stored, or to produce them at runtime with an error-free circuit.
These two solutions completely defeat the purpose of using a microprocessor in
the first place. As said, the consequences of a bad output can be catastrophic in
systems that require a high reliability of their components; for instance, a core
could be used to control the opening of an airplane door: a wrong output value
could potentially open it during a flight, causing a catastrophe.
On the other hand, the Not Active and Working status can be detected more easily
and usually there is also the possibility of performing corrective actions before the
system's mission is affected; for these reasons, the severity associated with this status is
lower than the previous one. The detection of this kind of status can be implemented
both at hardware level and at software level. The first consists in the monitoring
of a so-called heartbeat signal coming from the processor: this is a non-constant
periodic signal that indicates the normal operation of the core. When this signal
stops, corrective actions have to be performed in order to restore the working state.
The second, instead, makes use of a watchdog counter: this counter is usually
designed to generate a reset when it reaches zero; the task of the software running on
the core is to periodically reset it to the original value.
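A toy model of the watchdog mechanism just described (an added illustrative Python sketch; tick counts and reload values are arbitrary): the software running on the core periodically reloads the counter, and a missing reload is interpreted as a stalled core.

```python
class Watchdog:
    """Down-counter that signals a reset request when the core stops kicking it."""
    def __init__(self, reload_value: int):
        self.reload_value = reload_value
        self.counter = reload_value

    def kick(self):
        """Called periodically by the software running on the core."""
        self.counter = self.reload_value

    def tick(self) -> bool:
        """Advance one tick; return True when a reset should be generated."""
        self.counter -= 1
        return self.counter <= 0

wd = Watchdog(reload_value=5)
for cycle in range(12):
    if cycle < 6:          # the core is alive for the first 6 cycles and kicks the watchdog
        wd.kick()
    if wd.tick():
        print(f"cycle {cycle}: watchdog expired -> core assumed stalled, reset issued")
        break
```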
The average number of upsets needed to change the operational status of a core can
be evaluated as:

Uavg = (Σi=0..c−1 Ei) / c    (6.1)
The average time between upsets and the corresponding mean time to failure then
follow as:

tavg^upset = 1 / (φavg · σdevice)    (6.2)

tavg^failure = tavg^upset · Uavg = (Σi=0..c−1 Ei) / (c · φavg · σdevice) = MTTF    (6.3)
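The sketch below (an added illustrative Python fragment) applies Equations 6.1–6.3 to a hypothetical data set, assuming Ei is the number of upsets accumulated before the i-th observed core changed its operational status and c is the number of such observations; the flux and cross section are placeholder values, not the figures measured in this work.

```python
# Hypothetical upset counts before each of c observed status changes
E = [310, 270, 455, 390, 205]
c = len(E)

u_avg = sum(E) / c                       # Equation 6.1: average upsets to failure

phi_avg      = 1.0e3                     # assumed average flux [particles / cm^2 / s]
sigma_device = 1.5e-7                    # assumed device cross section [cm^2]

t_upset   = 1.0 / (phi_avg * sigma_device)       # Equation 6.2: mean time between upsets [s]
t_failure = t_upset * u_avg                      # Equation 6.3: MTTF [s]

print(f"U_avg = {u_avg:.1f} upsets,  t_upset = {t_upset:.0f} s,  MTTF = {t_failure:.0f} s")
```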
After the definition of the metrics used to evaluate the performance of the embedded
cores against radiation effects, it is necessary to define a testing procedure to
follow.
The testing procedure has slight differences depending on its type, tabletop
testing or ground testing. The former, as described in Section 5.1, is in general faster
than the latter due to the possibility of injecting faults at an almost arbitrary rate.
The latter, instead, as described in Section 5.2, is able to produce more accurate
results. The differences in the procedures reflect those in the architectures; for
this reason two different architectures have been developed.
components are implemented using a triplicated BRAM with ECC enabled: this
ensures the ability to work at the operational speed of each core.
As said before, the source data is a set of the input patterns required by the
core; in the specific case of the AES algorithm, these are the data blocks to be
encrypted. In addition, two counters are associated with each core:
• a Loop Counter;
• an Error Counter.
The former is incremented every time a comparison between the result from
the processor and its respective "golden" result is performed. By continuously reading
this counter it is possible to identify whether a core is active or not, based on the
history of the values. When a core is active this counter counts up until it saturates,
while if the core is stalled it remains constant over time. Given the speed of the
interface and the frequency of the output of the algorithm, it is impossible to read
two identical values consecutively.
The latter is instead incremented every time a comparison is performed but
there is a mismatch between the values at the output. This counter holds the
value 0 until an error occurs; in this case there are two possible outcomes:
• the error counter starts counting up, which is representative of the Active and
Not Working status;
• the error counter increases and then stops counting, which is representative
of a non-permanent error.
With these two counters it is then possible to evaluate the four possible operational
statuses of a core in real time, by continuously reading the values that they
hold.
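A possible host-side classification of the four statuses from two consecutive counter read-outs (an added illustrative Python sketch; the status names follow the list in Section 6.3, while the read-out mechanism itself is abstracted away):

```python
def classify(loop_prev: int, loop_now: int, err_prev: int, err_now: int) -> str:
    """Derive the operational status of a core from two consecutive counter samples.
    Simplification: any recorded mismatch marks the core as Not Working; an error
    counter that increased once and then stopped (non-permanent error) could be
    treated separately."""
    active  = loop_now != loop_prev          # loop counter still advancing?
    working = err_now == 0                   # no mismatch recorded so far
    if active and working:
        return "Active and Working"
    if active and not working:
        return "Active and Not Working"
    if not active and working:
        return "Not Active and Working"
    return "Not Active and Not Working"

print(classify(loop_prev=1200, loop_now=1350, err_prev=0, err_now=0))   # nominal case
print(classify(loop_prev=1350, loop_now=1350, err_prev=0, err_now=0))   # stalled core
```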
Finally, there is also the need for an interface used to communicate these values
to the external world, where they are analyzed and stored to complete the
characterization. For this purpose, an SPI interface was used.
Figure 6.3: Single Module architecture: the source data is read by the DUT, and
compared with the golden results. Counters are placed to verify the operational and
working state of the DUT.
The first round of irradiation is performed during tabletop testing. During this
phase a design containing multiple copies (20) of each core under test was loaded
onto the FPGA, and then a fault injection campaign was started. Having multiple
copies of each core improves two aspects of this test:
• The speed of the test – the increased number of cores running in parallel
reduces the number of tests that have to be performed;
• The reduction of the systematic effects given by the placement of the processor
core on the FPGA.
For fault injection purposes, and to ease the complexity and improve the speed
of the tests, an external JCM device was connected to the JTAG port of the FPGA.
On the remote PC, acting as SPI master, a script was in charge of the
following actions (a minimal sketch of this control loop is given after the list):
4. Keeping track of the current status of every core present in the design: when
a core changes its operating status the following operations are performed:
(a) Stop the fault injection;
(b) Save the state of all counters present in the design;
(c) Save the current number of faults injected;
(d) Resume the fault injection;
5. Stopping the procedure and restarting the process when one of the following
conditions was met:
• All the microprocessors present were either in Not Available state or in
Not Working state;
• The testing circuitry failed, meaning that a fault has been registered on
one of the components responsible for the data transferred to the PC.
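A minimal sketch of such a control loop (added for illustration in Python; read_counters, start_injection, stop_injection, faults_injected and save_record are hypothetical placeholders for the SPI read-out and the JCM control, not real APIs):

```python
import time

N_CORES = 20

def monitor_campaign(read_counters, start_injection, stop_injection,
                     faults_injected, save_record, poll_period=0.5):
    """Poll every core, log status changes, and stop when no core is still healthy."""
    status = {core: "Active and Working" for core in range(N_CORES)}
    start_injection()
    while True:
        time.sleep(poll_period)
        for core in range(N_CORES):
            new_status = read_counters(core)          # returns the classified status
            if new_status != status[core]:
                stop_injection()                      # freeze the campaign
                save_record(core, new_status, faults_injected())
                status[core] = new_status
                start_injection()                     # resume
        if all(s != "Active and Working" for s in status.values()):
            stop_injection()                          # every core failed or stalled
            break
```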
Figure 6.4: Tabletop testing block diagram: the single modules are interfaced using
SPI from a remote PC, that controls the fault injection procedure using JCM
Figure 6.5: Ground testing block diagram: the single modules are interfaced using a
custom chip called SCA, that allows SPI communication over optical fiber. A remote
PC serves for data storage, and it uses a Common Readout Unit to communicate
over optical fiber.
The remote PC has a different task here: while before it was actively used to control
the fault injection process and the reconfiguration of the FPGA,
now it only serves the purpose of data storage. The values of the counters are,
in fact, read together with many other values coming from the different components
and peripherals involved in the project.
After the data taking process, the available data is analyzed by cross-checking
with the values of the fluence irradiating the FPGA and the timestamps of counter
values. Starting from this point it is possible, using the equations discussed in
Chapter 4, to evaluate the equivalent number of upsets registered on the device.
Chapter 7
This thesis has presented techniques and architectures that can be employed
for the radiation testing of microprocessors, developed in collaboration with the CERN
ALICE ITS group. The following sections present the results relative to the
characterization of the Xilinx Microblaze and TMR Subsystem.
Furthermore, the same testing procedure and architecture have been employed
for the characterization of an open source microprocessor soft core: the Murax
VexRiscv SoC. On top of this implementation, custom mitigation
techniques have been applied using Synopsys Synplify. The results were finally compared
in terms of reliability, the two alternatives being similar in features, resource
occupation and performance.
In Table 7.1 are presented:
7.1 Results
In this section the results of both tabletop and ground testing are presented, first in terms of the average number of upsets required to change the operational status, and then in terms of Mean Time to Failure, considering the nominal flux for the ALICE Experiment:
Considering the Xilinx Ultrascale XCKU040 FPGA as reference for the cross
section and making use of Equation 6.2, the following average upset period can be
calculated:
$$ t^{upset}_{avg} = \frac{1}{\phi_{avg} \cdot \sigma_{device}} = 3690\ \mathrm{s} \qquad (7.2) $$
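Given the average number of upsets $\bar{N}_{upsets}$ needed to change the operational status of a core (reported in the following), a plausible way to obtain the corresponding Mean Time to Failure, assuming this is how the quoted values are derived, is

$$ \mathrm{MTTF} \simeq \bar{N}_{upsets} \cdot t^{upset}_{avg} = \frac{\bar{N}_{upsets}}{\phi_{avg} \cdot \sigma_{device}} $$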
For tabletop testing purposes, six different designs were produced, one for each core tested as in Table 7.1. Each one of them contained 20 replicas of the same core, as explained in Chapter 6, on which fault injection was performed using JCM.
During these tests, a total of 75881726 faults were injected on the six different designs, which yielded a total of 102550 cores in either the Operational and Not Working (F) or the Not Operational and Working (S) state. The following table presents the results in terms of the average number of upsets required, and the corresponding MTTF with the particle flux of Equation 7.1.
7.2 Conclusions and Future Work
As far as the testing procedure and, consequently, the testing architecture are concerned, possible future improvements include:
• The use of other algorithms for radiation benchmarking; one possible solution may be the development of a dedicated benchmark for this purpose.
• The use of a more sophisticated testing procedure, which has not been implemented due to limitations in terms of data bandwidth.
Appendix A
A Large Ion Collider Experiment
CERN's A Large Ion Collider Experiment (ALICE) is one of the largest experiments in the world devoted to the physics of matter at the smallest scales. Located at the Large Hadron Collider (LHC), this experiment is dedicated to the study of heavy-ion collisions, with a centre-of-mass energy of approximately 5.5 TeV per nucleon pair. The main objective of the experiment is the study of strongly interacting matter at high densities and temperatures.
To achieve this objective, the ALICE detector is composed of two main components: a central barrel, embedded in a large solenoid magnet, which hosts the tracking and particle-identification detectors, and a forward muon spectrometer.
As said in the introduction, the ITS represents the central part of the ALICE de-
tector. It is embedded in a large magnet and it covers ±45° over the full azimuth.
Its basic functions are the following:
• improvement of the momentum and angle measurements for the Time Projec-
tion Chamber (TPC).
The ALICE Experiment is planning to upgrade the ITS during the second LHC long shutdown, in the years 2019-2020. The new ITS will be composed of 7 concentric layers of pixel detectors, up to 1.5 m long, with an outer radius of 40 cm. This arrangement of sensors will create a 12.6 Gpx camera. [19]
Figure A.1: Sensors layout of the upgraded ALICE-ITS: 2 outer layers, 2 middle
layers and 3 inner layers
All seven layers will be equipped with the ALPIDE chip, which embeds the sensitive part and the read-out electronics within the same piece of silicon. These chips are organized in staves, each stave being composed of a different number of aligned ALPIDE chips:
• Inner Barrel stave: 9 ALPIDE chips, organized in a single row;
• Middle and Outer Barrel stave: 14 ALPIDE chips, organized in two half-staves.
Figure A.2: The Inner Barrel stave readout architecture: 9 ALPIDE chips are or-
ganized in a straight line, operating at 1.2 Gbit/s
Figure A.3: The Outer Barrel stave readout architecture: 14 ALPIDE chips are organized in two half-staves, each one composed of 7 sensors operating at 0.4 Gbit/s
To read out the incoming data from the sensors of the ITS, a set of dedicated hardware (Readout Electronics) is placed near them. [20]
The readout electronics plays a fundamental role in this scheme. Being close to the beam collision point, it is strongly affected by radiation effects. The core logic of the readout electronics is the Readout Unit (RU), a custom board that features a Xilinx XCKU060 FPGA. This board takes care of interfacing multiple ALPIDE chips to the Common Readout Unit (CRU), while synchronizing with the triggers coming from the ALICE Trigger System. In particular, a single RU is connected to:
Figure A.4: High level architecture of the upgraded ITS: the sensors are directly
interfaced by the readout electronics, that is connected to Common Readout Units
and synchronized with the ALICE Trigger System
• the Common Readout Unit, which receives the data coming from the sensors;
• the ALICE Trigger System, which sends triggers synchronized with the LHC clock.
Figure A.5: Readout Unit architecture: it interfaces various sensors operating at different speeds with the Common Readout Unit, while being synchronized with the triggers coming from the ALICE Trigger System
Appendix B
Advanced Encryption Standard
Finally, there are several modes of operation for the cipher. The simplest one is called Electronic Codebook (ECB): each block is encrypted separately, applying the same set of operations to every encoded block. An alternative to this solution is called Cipher Block Chaining (CBC): in this case the ciphertext produced by the previous encoding is combined with the plaintext of the next one. [23]
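In formulas, denoting by $E_K$ the block cipher under key $K$, by $P_i$ the $i$-th plaintext block and by $C_i$ the corresponding ciphertext block, the two modes can be summarized as

$$ C_i = E_K(P_i) \qquad \text{(ECB)} $$
$$ C_i = E_K(P_i \oplus C_{i-1}),\quad C_0 = IV \qquad \text{(CBC)} $$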
Figure B.1: AES ECB Encryption: the same set of operations are applied to different
blocks of the input data, treating them separately.
Figure B.2: AES CBC Encryption: the output of the previous encoding is combined
with the block of data encoded at the next step.
B.3 Implemented Algorithm
B.3.1 aes.h
#ifndef _AES_H_
#define _AES_H_

#include <stdint.h>

/* Name definitions:
 * - Nb: The number of columns comprising a state in AES.
 * - Bs: Block size in bytes (AES is a 128-bit block cipher only).
 * - Nk: Number of 32-bit words in a key.
 * - Nr: Number of rounds for encryption.
 * - Ks: Ik size in bytes.
 * - Ke: Expanded Ik size in bytes.
 * - Rk: Round Ik.
 * - Ik: Input Ik.
 * - Iv: Input Vector.
 * - St: State.
 */
#define AES128 1
#define ECB_ENC 1

#define Nb 4
#define Bs (Nb * Nb)

#if AES128
#define Nk 4
#define Nr 10
#define Ks 16
#define Ke 176
#elif AES192
#define Nk 6
#define Nr 12
#define Ks 24
#define Ke 208
#elif AES256
#define Nk 8
#define Nr 14
#define Ks 32
#define Ke 240
#else
#endif /* AES */

#if ECB | ECB_ENC
void ecb_encrypt(uint8_t *dst, const uint8_t *src, const uint8_t *key);
#endif /* ECB | ECB_ENC */
#if ECB | ECB_DEC
void ecb_decrypt(uint8_t *dst, const uint8_t *src, const uint8_t *key);
#endif /* ECB | ECB_DEC */
#if CBC | CBC_ENC
void cbc_encrypt(uint8_t *dst, const uint8_t *src, const uint8_t *key,
                 const uint8_t *iv);
#endif /* CBC | CBC_ENC */
#if CBC | CBC_DEC
void cbc_decrypt(uint8_t *dst, const uint8_t *src, const uint8_t *key,
                 const uint8_t *iv);
#endif /* CBC | CBC_DEC */

#endif // _AES_H_
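As a usage illustration (not part of the original firmware), the following fragment shows how the interface above can be exercised for a single AES-128 ECB encryption; the key value is arbitrary.

#include <stdint.h>
#include "aes.h"

/* Encrypts one 16-byte block in place and returns its first byte. */
int aes_selftest(void) {
    static const uint8_t key[Ks] = {
        0x2b, 0x7e, 0x15, 0x16, 0x28, 0xae, 0xd2, 0xa6,
        0xab, 0xf7, 0x15, 0x88, 0x09, 0xcf, 0x4f, 0x3c
    };
    uint8_t block[Bs] = {0};        /* all-zero plaintext block */

    ecb_encrypt(block, block, key); /* in-place encryption of one block */
    return (int)block[0];
}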
B.3.2 aes_constants.h
#ifndef AES_CONST_H
#define AES_CONST_H 1
#endif /* ifndef AES_CONST_H */
B.3.3 aes.c
// All other round keys are found from the previous round keys.
// i == Nk
for ( ; i < Nb * (Nr + 1); ++i) {
    word[0] = Rk[(i - 1) * 4 + 0];
    word[1] = Rk[(i - 1) * 4 + 1];
    word[2] = Rk[(i - 1) * 4 + 2];
    word[3] = Rk[(i - 1) * 4 + 3];
    if (i % Nk == 0) {
        RotWord(word);
        SubWord(word);
        word[0] = word[0] ^ Rcon[i / Nk];
    }
#if defined(AES256) && (AES256 == 1)
    if (i % Nk == 4) {
        SubWord(word);
    }
#endif
    Rk[i * 4 + 0] = Rk[(i - Nk) * 4 + 0] ^ word[0];
    Rk[i * 4 + 1] = Rk[(i - Nk) * 4 + 1] ^ word[1];
    Rk[i * 4 + 2] = Rk[(i - Nk) * 4 + 2] ^ word[2];
    Rk[i * 4 + 3] = Rk[(i - Nk) * 4 + 3] ^ word[3];
}
}
for (i = 0; i < 4; ++i) {
    for (j = 0; j < 4; ++j) {
        (*St)[i][j] ^= Rk[round * Nb * 4 + i * Nb + j];
    }
}
}

for (i = 0; i < 4; ++i) {
    for (j = 0; j < 4; ++j) {
        (*St)[j][i] = sbox[(*St)[j][i]];
    }
}
}

#if XTIME_AS_FUNC
static uint8_t xtime(uint8_t x) {
    return (x << 1) ^ (((x >> 7) & 0x01) * 0x1B);
}
#else
#define xtime(x) (((x) << 1) ^ ((((x) >> 7) & 0x01) * 0x1B))
#endif /* XTIME_AS_FUNC */
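The xtime() helper implements multiplication by $x$ in the AES field $GF(2^8)$: the value is shifted left by one bit and, if the former most significant bit was set, reduced with the constant 0x1B, which encodes the AES polynomial

$$ m(x) = x^8 + x^4 + x^3 + x + 1, \qquad \mathrm{xtime}(b) = (b \ll 1) \oplus (0x1B \cdot b_7) $$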
for (i = 0; i < 4; ++i) {
    St0 = (*St)[i][0];
    Tmp = (*St)[i][0] ^ (*St)[i][1] ^ (*St)[i][2] ^ (*St)[i][3];

#if MPY_AS_FUNC
#else
#define Mpy(x, y) ( \
    (((y) >> 0 & 0x01) * (x))                           ^ \
    (((y) >> 1 & 0x01) * xtime(x))                      ^ \
    (((y) >> 2 & 0x01) * xtime(xtime(x)))               ^ \
    (((y) >> 3 & 0x01) * xtime(xtime(xtime(x))))        ^ \
    (((y) >> 4 & 0x01) * xtime(xtime(xtime(xtime(x))))) \
)
#endif /* MPY_AS_FUNC */
}
#endif

temp = (*St)[1][2];
(*St)[1][2] = (*St)[3][2];
(*St)[3][2] = temp;

AddRk(St, Rk, round);
round++;

SubBytes(St);
ShiftRows(St);
AddRk(St, Rk, round);
}
#endif

AddRk(St, Rk, round);
round--;
    InvShiftRows(St);
    InvSubBytes(St);
    AddRk(St, Rk, round);
    InvMixColumns(St);
}

InvShiftRows(St);
InvSubBytes(St);
AddRk(St, Rk, round);
}
#endif

#if ECB | ECB_ENC
void ecb_encrypt(uint8_t *dst, const uint8_t *src, const uint8_t *key) {
    state_t *St;
    uint8_t Rk[Ke];

    if (dst != src) {
        cpy(dst, src);
    }
    St = (state_t *) dst;

    IkExpand(Rk, key);
    Encrypt(St, Rk);
}
#endif /* ECB | ECB_ENC */

#if ECB | ECB_DEC
void ecb_decrypt(uint8_t *dst, const uint8_t *src, const uint8_t *key) {
    state_t *St;
    uint8_t Rk[Ke];

    if (dst != src) {
        cpy(dst, src);
    }
    St = (state_t *) dst;

    IkExpand(Rk, key);
    Decrypt(St, Rk);
}
#endif /* ECB | ECB_DEC */

for (i = 0; i < Bs; ++i) {
    block[i] ^= Iv[i];
}
}
#if CBC | CBC_ENC
void cbc_encrypt(uint8_t *dst, const uint8_t *src, const uint8_t *key,
                 const uint8_t *iv) {
    state_t *St;
    uint8_t Rk[Ke];

    if (dst != src) {
        cpy(dst, src);
    }
    St = (state_t *) dst;

    IkExpand(Rk, key);
    XorWithIv(dst, iv);
    Encrypt(St, Rk);
}
#endif /* CBC | CBC_ENC */

#if CBC | CBC_DEC
void cbc_decrypt(uint8_t *dst, const uint8_t *src, const uint8_t *key,
                 const uint8_t *iv) {
    state_t *St;
    uint8_t Rk[Ke];

    if (dst != src) {
        cpy(dst, src);
    }
    St = (state_t *) dst;

    IkExpand(Rk, key);
    Decrypt(St, Rk);
    XorWithIv(dst, iv);
}
#endif /* CBC | CBC_DEC */
Appendix C
Xilinx Microblaze and TMR Subsystem
C.1 Configuration Scripts
The generated block designs follow the naming convention <base_name>_tmr(0|1)_ecc(0|1), depending on whether triplication (TMR) and ECC are enabled.
C.1.1 generate_mb.tcl
proc generate_mb { prj_name bd_name } {
create_bd_design $bd_name
## CONSTANTS ####################################################################
set XIP "xilinx.com:ip"
set MB_ENABLE_UART 1
set MB_ENABLE_GPIO 1
set MB_ENABLE_TIMR 1
## COMPONENT: MicroBlaze ########################################################
set mb [ create_bd_cell -type ip -vlnv $XIP:microblaze microblaze_0 ]
set_property -dict [ list \
  CONFIG.C_ADDR_SIZE 32 \
  CONFIG.C_AREA_OPTIMIZED 1 \
  CONFIG.C_INTERCONNECT 2 \
  CONFIG.C_BASE_VECTORS 0x00000000 \
  CONFIG.C_FAULT_TOLERANT 1 \
  CONFIG.C_LOCKSTEP_SLAVE 0 \
  CONFIG.C_AVOID_PRIMITIVES 3 \
  CONFIG.C_PVR 0 \
  CONFIG.C_PVR_USER1 0x00 \
  CONFIG.C_PVR_USER2 0x00000000 \
  CONFIG.C_D_AXI 0 \
  CONFIG.C_D_LMB 1 \
  CONFIG.C_I_AXI 0 \
  CONFIG.C_I_LMB 1 \
  CONFIG.C_USE_BARREL 0 \
  CONFIG.C_USE_DIV 0 \
  CONFIG.C_USE_HW_MUL 0 \
  CONFIG.C_USE_FPU 0 \
  CONFIG.C_USE_MSR_INSTR 1 \
  CONFIG.C_USE_PCMP_INSTR 1 \
  CONFIG.C_USE_REORDER_INSTR 1 \
  CONFIG.C_UNALIGNED_EXCEPTIONS 0 \
  CONFIG.C_ILL_OPCODE_EXCEPTION 0 \
  CONFIG.C_M_AXI_I_BUS_EXCEPTION 0 \
  CONFIG.C_M_AXI_D_BUS_EXCEPTION 0 \
  CONFIG.C_DIV_ZERO_EXCEPTION 0 \
  CONFIG.C_FPU_EXCEPTION 0 \
  CONFIG.C_OPCODE_0x0_ILLEGAL 0 \
  CONFIG.C_FSL_EXCEPTION 0 \
  CONFIG.C_ECC_USE_CE_EXCEPTION 0 \
  CONFIG.C_USE_STACK_PROTECTION 0 \
  CONFIG.C_IMPRECISE_EXCEPTIONS 0 \
  CONFIG.C_DEBUG_ENABLED 1 \
  CONFIG.C_NUMBER_OF_PC_BRK 1 \
  CONFIG.C_NUMBER_OF_RD_ADDR_BRK 0 \
  CONFIG.C_NUMBER_OF_WR_ADDR_BRK 0 \
  CONFIG.C_DEBUG_EVENT_COUNTERS 5 \
  CONFIG.C_DEBUG_LATENCY_COUNTERS 1 \
  CONFIG.C_DEBUG_COUNTER_WIDTH 32 \
  CONFIG.C_DEBUG_TRACE_SIZE 8192 \
  CONFIG.C_DEBUG_PROFILE_SIZE 0 \
  CONFIG.C_DEBUG_EXTERNAL_TRACE 0 \
  CONFIG.C_DEBUG_INTERFACE 0 \
  CONFIG.C_ASYNC_INTERRUPT 0 \
  CONFIG.C_FSL_LINKS 0 \
  CONFIG.C_USE_EXTENDED_FSL_INSTR 0 \
  CONFIG.C_ICACHE_BASEADDR 0x0000000000000000 \
  CONFIG.C_ICACHE_HIGHADDR 0x000000003FFFFFFF \
  CONFIG.C_USE_ICACHE 0 \
  CONFIG.C_ALLOW_ICACHE_WR 1 \
  CONFIG.C_ICACHE_LINE_LEN 4 \
  CONFIG.C_ICACHE_FORCE_TAG_LUTRAM 0 \
  CONFIG.C_ICACHE_STREAMS 0 \
  CONFIG.C_ICACHE_VICTIMS 0 \
  CONFIG.C_ICACHE_DATA_WIDTH 0 \
  CONFIG.C_ADDR_TAG_BITS 17 \
  CONFIG.C_CACHE_BYTE_SIZE 8192 \
  CONFIG.C_DCACHE_BASEADDR 0x0000000000000000 \
  CONFIG.C_DCACHE_HIGHADDR 0x000000003FFFFFFF \
  CONFIG.C_USE_DCACHE 0 \
  CONFIG.C_ALLOW_DCACHE_WR 1 \
  CONFIG.C_DCACHE_LINE_LEN 4 \
  CONFIG.C_DCACHE_FORCE_TAG_LUTRAM 0 \
  CONFIG.C_DCACHE_USE_WRITEBACK 0 \
  CONFIG.C_DCACHE_VICTIMS 0 \
  CONFIG.C_DCACHE_DATA_WIDTH 0 \
  CONFIG.C_DCACHE_ADDR_TAG 17 \
  CONFIG.C_DCACHE_BYTE_SIZE 8192 \
  CONFIG.C_USE_MMU 0 \
  CONFIG.C_MMU_DTLB_SIZE 4 \
  CONFIG.C_MMU_ITLB_SIZE 2 \
  CONFIG.C_MMU_TLB_ACCESS 3 \
  CONFIG.C_MMU_ZONES 16 \
  CONFIG.C_MMU_PRIVILEGED_INSTR 0 \
  CONFIG.C_USE_INTERRUPT 1 \
  CONFIG.C_USE_EXT_BRK 0 \
  CONFIG.C_USE_EXT_NM_BRK 0 \
  CONFIG.C_USE_NON_SECURE 0 \
  CONFIG.C_USE_BRANCH_TARGET_CACHE 0 \
  CONFIG.C_BRANCH_TARGET_CACHE_SIZE 0 \
] $mb
## AUTOMATION: MicroBlaze ######################################################
apply_bd_automation \
  -rule xilinx.com:bd_rule:microblaze \
  -config { \
    preset "Microcontroller" \
    local_mem "8KB" \
    ecc "None" \
    cache "None" \
    debug_module "None" \
    axi_periph "Enabled" \
## AUTOMATION: AXI #############################################################
foreach axiPeriph $AXI_PERIPH {
  apply_bd_automation \
    -rule xilinx.com:bd_rule:axi4 \
    -config { \
      Master "/microblaze_0 (Periph)" \
      intc_ip "/microblaze_0_axi_periph" \
      Clk_xbar "Auto" \
      Clk_master "Auto" \
      Clk_slave "Auto" \
    } [ get_bd_intf_pins $axiPeriph/S_AXI ]
}
## INTERRUPTS ##################################################################
if {[ llength $IRQ_PERIPH ] > 0} {
  set irqconcat [ get_bd_cells microblaze_0_xlconcat ]
  set_property -dict [ list \
    CONFIG.NUM_PORTS [ llength $IRQ_PERIPH ] \
  ] $irqconcat
  set i 0
  foreach irqPeriph $IRQ_PERIPH {
    connect_bd_net \
      [ get_bd_pins $irqPeriph/interrupt ] \
      [ get_bd_pins $irqconcat/In$i ]
    incr i
  }
}
## EXTERNAL CONNECTIONS ########################################################
set_property NAME ext_clk [ get_bd_ports Clk ]
make_bd_pins_external [ get_bd_pins rst_*/ext_reset_in ]
set_property NAME ext_rst_n [ get_bd_ports ext_reset_in* ]
if { $MB_ENABLE_UART } {
  make_bd_pins_external [ get_bd_pins $uart/rx ]
  set_property NAME mb_uart_rxd [ get_bd_ports rx* ]
  make_bd_pins_external [ get_bd_pins $uart/tx ]
  set_property NAME mb_uart_txd [ get_bd_ports tx* ]
}
if { $MB_ENABLE_GPIO } {
  # make_bd_intf_pins_external [ get_bd_intf_pins $gpio/GPIO ]
  # set_property NAME mb_gpio [ get_bd_intf_ports GPIO ]
  make_bd_pins_external [ get_bd_pins $gpio/GPIO_io_i ]
  set_property NAME mb_gpio_i [ get_bd_ports GPIO_io_i* ]
  make_bd_pins_external [ get_bd_pins $gpio/GPIO_io_t ]
  set_property NAME mb_gpio_t [ get_bd_ports GPIO_io_t* ]
  make_bd_pins_external [ get_bd_pins $gpio/GPIO_io_o ]
  set_property NAME mb_gpio_o [ get_bd_ports GPIO_io_o* ]
}
## GROUP CELLS #################################################################
group_bd_cells $bd_name [ get_bd_cells ]
## SAVE DESIGN #################################################################
regenerate_bd_layout
validate_bd_design
save_bd_design
write_bd_tcl generated_${bd_name}.tcl -force
# Options:
#   bram:   "Local" | "Common With ECC"        (LMB Memory Configuration)
#   wd:     "None" | "Internal"                (Software Watchdog)
#   sem_if: "None" | "Included" | "External"   (SEM Interface)
#   sem_wd: "0" | "1"                          (SEM Heartbeat Watchdog)
#   brk:    "0" | "1"                          (Reconfiguration Delay)
#   mask:   "0" | "1"                          (Comparator Test)
#   inject: "0" | "1"                          (Fault Injection)
apply_bd_automation \
  -rule xilinx.com:bd_rule:tmr \
  -config { \
    bram "Local" \
    wd "None" \
    sem_if "None" \
    sem_wd "0" \
    brk "1" \
    mask "0" \
    inject "0" \
  } $tmr
## EXTERNAL RESET ##############################################################
move_bd_cells [ get_bd_cells / ] [ get_bd_cells ${bd_name}/rst_Clk_100M ]
set_property NAME reset_generator [ get_bd_cells rst_Clk_100M ]
set rst [ get_bd_cells /reset_generator ]
## GENERATE IP FILES ###########################################################
generate_target all $bd_file -quiet
export_simulation \
  -of_objects $bd_file \
  -directory ${prj_name}/${prj_name}.ip_user_files/sim_scripts \
  -ip_user_files_dir ${prj_name}/${prj_name}.ip_user_files \
  -ipstatic_source_dir ${prj_name}/${prj_name}.ip_user_files/ipstatic \
  -lib_map_path [ list \
    { modelsim=${prj_name}/${prj_name}.cache/compile_simlib/modelsim } \
    { questa=${prj_name}/${prj_name}.cache/compile_simlib/questa } \
    { ies=${prj_name}/${prj_name}.cache/compile_simlib/ies } \
    { vcs=${prj_name}/${prj_name}.cache/compile_simlib/vcs } \
    { riviera=${prj_name}/${prj_name}.cache/compile_simlib/riviera } \
  ] \
  -use_ip_compiled_libs -force -quiet
C.2 Firmware
This section presents the main C and assembly files that, together with the AES C files presented in Appendix B, were compiled into an Executable and Linkable Format (ELF) file and then loaded onto the microprocessors.
Depending on the configuration used, single or triplicated Microblaze, the define TMR_ENABLED was set accordingly.
C.2.1 test_micro.c
#include <stdint.h>
#include "aes.h"

#define TMR_ENABLED 1

#define ECB 0
#define ECB_ENC 1
#define ECB_DEC 0

#define READ_REQUEST 0
#define BLOCKS_ENCODED 255

#if TMR_ENABLED
extern void _xtmr_manager_initialize();
#endif

/* get_data() and set_data() exchange data with the testing logic through the
 * GPIO interface; they are defined elsewhere in the project sources. */
int main() {
    uint8_t key[Ks];  // Container for Key
    uint8_t buf[Bs];  // Container for Data

#if TMR_ENABLED
    _xtmr_manager_initialize();
#endif

    while (1) {
        // Get key from GPIOs.
        get_data(key, Ks);
        set_data(key, Ks);

        int i;
        for (i = 0; i < BLOCKS_ENCODED; ++i) {
#if ECB | ECB_ENC
            // Test ECB Encoding
            get_data(buf, Bs);
            ecb_encrypt(buf, buf, key);
            set_data(buf, Bs);
#endif
#if ECB | ECB_DEC
            // Test ECB Decoding
            get_data(buf, Bs);
            ecb_decrypt(buf, buf, key);
            set_data(buf, Bs);
#endif
        }
    }
    return 0;
}
C.2.2 mb_recovery.S
/*******************************************************************************
 * TMR Manager recovery routines:
 *  - Break Handler
 *  - Reset Handler
 *  - Initialize
 ******************************************************************************/
#define BASE_VECTORS 0x00000000

/*
 * _xtmr_manager_initialize - Initialize break and reset vector.
 *
 * Save original cold reset vector to global variables.
 * Set up reset vector to branch to _xtmr_manager_reset.
 * Set up break vector to branch to _xtmr_manager_break.
 *
 */
.global _xtmr_manager_initialize
.section .text
.align 2
.ent _xtmr_manager_initialize
.type _xtmr_manager_initialize, @function
_xtmr_manager_initialize:
	/* Push to Stack */
	addik r1, r1, -16
	swi r6, r1, 0
	swi r7, r1, 4
	swi r8, r1, 8
	swi r9, r1, 12
	/* Clear Registers */
	ori r9, r0, XTMR_CR_MAGIC
	swi r9, r0, XTMR_CR
	swi r0, r0, XTMR_FFR
	swi r0, r0, XTMR_BDIR
	swi r0, r0, XTMR_SEMIMR
	swi r0, r0, XTMR_RFSR
	swi r0, r0, XTMR_CSCR
	/* Initialize break vector */
	ori r6, r0, _xtmr_manager_break
	bsrli r6, r6, 16
	ori r6, r6, 0xb0000000
	ori r7, r0, _xtmr_manager_break
	andi r7, r7, 0xffff
	ori r7, r7, 0xb8080000
	swi r6, r8, 0x14
	swi r7, r8, 0x18

/*
 * _xtmr_manager_break - Handler for recovery break from the TMR Manager.
 *
 * Save stack pointer in global register.
 * Save all registers that represent the processor internal state.
 * Flush or invalidate all internal cached data: D-cache, I-cache, BTC and UTLB.
 * Call break handler in C code.
 * Suspend processor to signal TMR Manager that it should perform a reset.
 *
 * Handler notes:
 *  - There is no need to save exception registers (EAR, ESR, BIP, EDR), since
 *    when the MSR EIP bit is set, break is blocked.
 */
.global _xtmr_manager_break
.section .text
.align 2
.ent _xtmr_manager_break
.type _xtmr_manager_break, @function
_xtmr_manager_break:
	/* Save context to stack */
SAVE_REG ( r1 )
SAVE_REG ( r2 )
SAVE_REG ( r3 )
SAVE_REG ( r4 )
SAVE_REG ( r5 )
SAVE_REG ( r6 )
SAVE_REG ( r7 )
SAVE_REG ( r8 )
SAVE_REG ( r9 )
SAVE_REG ( r10 )
SAVE_REG ( r11 )
SAVE_REG ( r12 )
SAVE_REG ( r13 )
SAVE_REG ( r14 )
SAVE_REG ( r15 )
SAVE_REG ( r16 )
SAVE_REG ( r17 )
SAVE_REG ( r18 )
SAVE_REG ( r19 )
SAVE_REG ( r20 )
SAVE_REG ( r21 )
SAVE_REG ( r22 )
SAVE_REG ( r23 )
SAVE_REG ( r24 )
SAVE_REG ( r25 )
SAVE_REG ( r26 )
SAVE_REG ( r27 )
SAVE_REG ( r28 )
SAVE_REG ( r29 )
SAVE_REG ( r30 )
SAVE_REG ( r31 )
	mfs r1, rmsr
	swi r1, r0, XTMR_Manager_rmsr
/*
 * _xtmr_manager_reset - Handler for recovery reset issued by TMR Manager.
 *
 * Restore stack pointer from global register.
 * Restore MSR to turn on caches.
 * Call reset handler in C code.
 * If C code returns 0, representing cold reset, jump to saved cold reset vector.
 * Restore all registers that represent the processor internal state.
 * Return from break to resume execution.
 *
 */
.global _xtmr_manager_reset
.section .text
.align 2
.ent _xtmr_manager_reset
.type _xtmr_manager_reset, @function
_xtmr_manager_reset:
	/* Turn on caches if they are used */
	lwi r1, r0, XTMR_Manager_rmsr
	mts rmsr, r1
	bri 4
	/* Clear Registers */
	ori r9, r0, XTMR_CR_MAGIC
	swi r9, r0, XTMR_CR
	swi r0, r0, XTMR_FFR
	swi r0, r0, XTMR_BDIR
	swi r0, r0, XTMR_SEMIMR
	swi r0, r0, XTMR_RFSR
	swi r0, r0, XTMR_CSCR
.global XTMR_Manager_r7
.global XTMR_Manager_r8
.global XTMR_Manager_r9
.global XTMR_Manager_r10
.global XTMR_Manager_r11
.global XTMR_Manager_r12
.global XTMR_Manager_r13
.global XTMR_Manager_r14
.global XTMR_Manager_r15
.global XTMR_Manager_r16
.global XTMR_Manager_r17
.global XTMR_Manager_r18
.global XTMR_Manager_r19
.global XTMR_Manager_r20
.global XTMR_Manager_r21
.global XTMR_Manager_r22
.global XTMR_Manager_r23
.global XTMR_Manager_r24
.global XTMR_Manager_r25
.global XTMR_Manager_r26
.global XTMR_Manager_r27
.global XTMR_Manager_r28
.global XTMR_Manager_r29
.global XTMR_Manager_r30
.global XTMR_Manager_r31
XTMR_Manager_ColdResetVector:
	.long 0
	.long 0
XTMR_Manager_InstancePtr:
	.long 0
XTMR_Manager_rmsr:
	.long 0
XTMR_Manager_r1:
	.long 0
XTMR_Manager_r2:
	.long 0
XTMR_Manager_r3:
	.long 0
XTMR_Manager_r4:
	.long 0
XTMR_Manager_r5:
	.long 0
XTMR_Manager_r6:
	.long 0
XTMR_Manager_r7:
	.long 0
XTMR_Manager_r8:
	.long 0
XTMR_Manager_r9:
	.long 0
XTMR_Manager_r10:
	.long 0
XTMR_Manager_r11:
	.long 0
XTMR_Manager_r12:
	.long 0
XTMR_Manager_r13:
	.long 0
XTMR_Manager_r14:
	.long 0
XTMR_Manager_r15:
	.long 0
XTMR_Manager_r16:
	.long 0
XTMR_Manager_r17:
	.long 0
XTMR_Manager_r18:
	.long 0
XTMR_Manager_r19:
	.long 0
XTMR_Manager_r20:
	.long 0
XTMR_Manager_r21:
	.long 0
XTMR_Manager_r22:
	.long 0
XTMR_Manager_r23:
	.long 0
XTMR_Manager_r24:
	.long 0
XTMR_Manager_r25:
	.long 0
XTMR_Manager_r26:
	.long 0
XTMR_Manager_r27:
	.long 0
XTMR_Manager_r28:
	.long 0
XTMR_Manager_r29:
	.long 0
XTMR_Manager_r30:
	.long 0
XTMR_Manager_r31:
	.long 0
List of Figures
3.1 Duplicate with Comparison scheme: the systems are duplicated and
their outputs compared for errors . . . . . . . . . . . . . . . . . . . . 19
3.2 N-Modular Redundancy scheme: the systems are replicated N times,
the voter decides the correct output using a majority voter scheme . . 19
3.3 Block TMR scheme: a block is triplicated, including its memory ele-
ments, and then a voter decides the correct output using a majority
voter scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.4 Reliability comparison between a single system and a triplicated sys-
tem with BTMR, the triplicated system is more likely to fail after a
period of time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.5 Distributed TMR scheme: the block is triplicated at each step, the
inputs to the next step are always voted and the correct state is always
restored from the voted output . . . . . . . . . . . . . . . . . . . . . 22
3.6 Information Redundancy technique: a redundant part is added to the
original data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.7 Time Redundancy in Software: the same operation is repeated mul-
tiple times and then the result is compared and voted . . . . . . . . . 25
3.8 Time Redundancy in Hardware: delays are added to repeat the same
operation at different time instants . . . . . . . . . . . . . . . . . . . 26
5.1 SEM IP block description: all the input and outputs ports are listed . 44
5.2 Xilinx Essential Bits: configuration bits can be classified based on
their priority levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.3 Single Module architecture: the source data is read by the DUT, and
compared with the golden results. Counters are placed to verify the
operational and working state of the DUT. . . . . . . . . . . . . . . . 60
6.4 Tabletop testing block diagram: the single modules are interfaced us-
ing SPI from a remote PC, that controls the fault injection procedure
using JCM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.5 Ground testing block diagram: the single modules are interfaced using
a custom chip called SCA, that allows SPI communication over optical
fiber. A remote PC serves for data storage, and it uses a Common
Readout Unit to communicate over optical fiber. . . . . . . . . . . . . 62
B.1 AES ECB Encryption: the same set of operations are applied to
different blocks of the input data, treating them separately. . . . . . . 72
B.2 AES CBC Encryption: the output of the previous encoding is com-
bined with the block of data encoded at the next step. . . . . . . . . 72
List of Tables
Acronyms
DFF D Flip-Flop.
I/O Input/Output.
IP Intellectual Property.
PC Personal Computer.
Bibliography
[1] F. Brosser and E. Milh, «SEU mitigation techniques for advanced reprogrammable FPGA in space», 128, Master’s thesis, 2014.
[2] European Cooperation for Space Standardization, ECSS-Q-HB-60-02A – Techniques for radiation effects mitigation in ASICs and FPGAs handbook, 2016. [Online]. Available: http://ecss.nl/get_attachment.php?file=2016/09/ECSS-Q-HB-60-02A1September2016.pdf.
[3] ——, ECSS-E-HB-10-12A – Calculation of radiation and its effects and margin policy handbook, 2013. [Online]. Available: http://ecss.nl/get_attachment.php?file=handbooks/ecss-e-hb/ECSS-E-HB-10-12A17December2010.pdf.
[4] V. Ferlet-Cavrois, L. W. Massengill, and P. Gouker, «Single event transients in digital CMOS – a review», IEEE Transactions on Nuclear Science, vol. 60, no. 3, pp. 1767–1790, Jun. 2013, issn: 0018-9499. doi: 10.1109/TNS.2013.2255624.
[5] Xilinx Inc., Soft error mitigation using prioritized essential bits, XAPP538 (v1.0) April 4, 2012. [Online]. Available: https://www.xilinx.com/support/documentation/application_notes/xapp538-soft-error-mitigation-essential-bits.pdf.
[6] D. G. Mavis and P. H. Eaton, «Soft error rate mitigation techniques for modern
microcircuits», in 2002 IEEE International Reliability Physics Symposium.
Proceedings. 40th Annual (Cat. No.02CH37320), 2002, pp. 216–225. doi: 10.
1109/RELPHY.2002.996639.
[7] M. Pignol, «DMT and DT2: Two fault-tolerant architectures developed by CNES
for COTS-based spacecraft supercomputers», in 12th IEEE International On-
Line Testing Symposium (IOLTS’06), 2006, pp. 1–10. doi: 10.1109/IOLTS.
2006.24.
[8] B. Kannan and L. E. Parker, «Metrics for quantifying system performance in
intelligent, fault-tolerant multi-robot teams», in 2007 IEEE/RSJ International
Conference on Intelligent Robots and Systems, Oct. 2007, pp. 951–958. doi:
10.1109/IROS.2007.4399530.
[9] IFIP Working Group 10.4, IFIP Working Group 10.4 on Dependable Computing and Fault Tolerance, 1980. [Online]. Available: http://www.dependability.org/wg10.4/.
[10] SEBoK Wiki, Reliability, availability, and maintainability, 2018. [Online]. Available: http://www.sebokwiki.org/wiki/Reliability,_Availability,_and_Maintainability.
[11] S. Y. Wang, S. E. Meyer, D. R. Saxena, and M. Y. Lai, «Applications of
software fault tolerance testing methodology», in Global Telecommunications
Conference, 1996. GLOBECOM ’96. ’Communications: The Key to Global
Prosperity, vol. 1, Nov. 1996, 670–674 vol.1. doi: 10 . 1109 / GLOCOM . 1996 .
594446.
[12] Xilinx Inc., Soft error mitigation controller v4.1, PG036 September 30, 2015.
[Online]. Available: www.xilinx.com/support/documentation/ip_documentation/
sem/v4_1/pg036_sem.pdf.
[13] ——, Ultrascale architecture soft error mitigation controller, PG187 December 20, 2017. [Online]. Available: https://www.xilinx.com/support/documentation/ip_documentation/sem_ultra/v3_1/pg187-ultrascale-sem.pdf.
[14] A. Gruwell, P. Zabriskie, and M. Wirthlin, «High-speed FPGA configuration and testing through JTAG», in 2016 IEEE AUTOTESTCON, Sep. 2016, pp. 1–8. doi: 10.1109/AUTEST.2016.7589601.
[15] Xilinx Inc., Microblaze processor reference guide, UG984 (v2017.4) December 20, 2017. [Online]. Available: https://www.xilinx.com/support/documentation/sw_manuals/xilinx2017_4/ug984-vivado-microblaze-ref.pdf.
[16] ——, Microblaze triple modular redundancy (TMR) subsystem v1.0, PG268 October 4, 2017. [Online]. Available: https://www.xilinx.com/support/documentation/ip_documentation/tmr/v1_0/pg268-tmr.pdf.
[17] H. Quinn, W. H. Robinson, P. Rech, M. Aguirre, A. Barnard, M. Desogus,
L. Entrena, M. Garcia-Valderas, S. M. Guertin, D. Kaeli, F. L. Kastensmidt,
B. T. Kiddie, A. Sanchez-Clemente, M. S. Reorda, L. Sterpone, and M. Wirth-
lin, «Using benchmarks for radiation testing of microprocessors and FPGAs»,
IEEE Transactions on Nuclear Science, vol. 62, no. 6, pp. 2547–2554, Dec.
2015, issn: 0018-9499. doi: 10.1109/TNS.2015.2498313.
[18] A. Caratelli, S. Bonacini, K. Kloukinas, A. Marchioro, P. Moreira, R. De Oliveira, and C. Paillard, «The GBT-SCA, a radiation tolerant ASIC for detector control and monitoring applications in HEP experiments», vol. 10, pp. C03034–C03034, Mar. 2015.
[19] S. Kushpil and A. Collaboration, «Upgrade of the ALICE Inner Tracking System»,
Journal of Physics: Conference Series, vol. 675, no. 1, p. 012 038, 2016. [On-
line]. Available: http://stacks.iop.org/1742-6596/675/i=1/a=012038.
[20] A. Szczepankiewicz, «Readout of the upgraded ALICE-ITS», Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, vol. 824, pp. 465–469, 2016, Frontier Detectors for Frontier Physics: Proceedings of the 13th Pisa Meeting on Advanced Detectors, issn: 0168-9002. doi: https://doi.org/10.1016/j.nima.2015.10.056. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0168900215012681.
[21] Advanced Encryption Standard, Advanced encryption standard — Wikipedia,
the free encyclopedia, 2018. [Online]. Available: https://en.wikipedia.org/
wiki/Advanced_Encryption_Standard.
[22] Substitution–permutation network, Substitution–permutation network — Wikipedia,
the free encyclopedia, 2018. [Online]. Available: https://en.wikipedia.org/
wiki/Substitution%E2%80%93permutation_network.
[23] Block cipher mode of operation, Block cipher mode of operation — Wikipedia,
the free encyclopedia, 2018. [Online]. Available: https://en.wikipedia.org/
wiki/Block_cipher_mode_of_operation.
[24] D. P. Siewiorek and P. Narasimhan, «Fault-tolerant architectures for space
and avionics applications», Mar. 2018.
[25] Xilinx Inc., Single-event upset mitigation selection guide, XAPP987 (v1.0) March 18, 2008. [Online]. Available: https://www.xilinx.com/support/documentation/application_notes/xapp987.pdf.
[26] M. Mager, «ALPIDE, the monolithic active pixel sensor for the ALICE ITS upgrade», Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, vol. 824, pp. 434–438, 2016, Frontier Detectors for Frontier Physics: Proceedings of the 13th Pisa Meeting on Advanced Detectors, issn: 0168-9002. doi: https://doi.org/10.1016/j.nima.2015.09.057. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0168900215011122.
109