
Anany 2019

This document discusses a technique called design disjunction that enables fast online diagnosis and recovery of faults in reconfigurable logic fabrics like FPGAs. Design disjunction uses partial reconfiguration to achieve self-recovery from faults in O(log T) steps, where T is the number of logic resources. It forms groups of resources at design time such that up to f faults can be resolved with certainty, where f is the number of faults the method can handle. Experimental results show the technique can accurately isolate faults at the individual logic slice level and achieve millisecond recovery times with increased fault coverage compared to conventional redundancy approaches.

IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 10, OCTOBER 2016

Fast Online Diagnosis and Recovery of Reconfigurable Logic Fabrics Using Design Disjunction
Ahmad Alzahrani, Member, IEEE, and Ronald F. DeMara, Senior Member, IEEE

Abstract—Design disjunction is developed to offer a broad coverage, high resolution, and low overhead approach to online diagnosis and recovery of reconfigurable fabrics. Design disjunction leverages the condensed diagnosability of T logic resources to achieve self-recovery using partial reconfiguration in O(log T) steps. Reconfiguration is guided by the constructive property of f-disjunctness which forms O(log T) resource groups at design-time. Resolution of f simultaneous resource faults is shown to be guaranteed when the resource groups are mutually f-disjunct. This extends run-time fault resilience to a large resource space with certainty for up to f faults using a decision-free resolution process that also provides a high likelihood of identifying the fault's location to a fine granularity. Finally, design disjunction is parameterized to accommodate the low coverage issue of functional testing for which inarticulate tests can otherwise impair fault isolation. Experimental results for MCNC and ISCAS benchmarks on a Xilinx 7-series field programmable gate array (FPGA) demonstrate f-diagnosability at the individual slice level with a minimum average isolation accuracy of 96.4 percent (94.4 percent) for f = 1 (f = 2). Results have also demonstrated millisecond order recovery with a minimum increase of 83.6 percent in fault coverage compared to N-modular redundancy (NMR) schemes. Recovery is achieved while incurring an average critical path delay impact of only 1.49 percent and energy cost roughly comparable to conventional two-MR approaches.

Index Terms—Reconfigurable logic devices, field programmable gate arrays, autonomous fault handling, fault-tolerant systems, run-time
fault diagnosis and recovery, online test, design space exploration

1 INTRODUCTION

CONTINUED scaling of transistor feature size has exacerbated reliability concerns such as process variation, aging degradations, latent faults, and temporary failures in integrated circuits (ICs). Consequently, the need for IC fault tolerance has received increasing interest over the last decade. Moreover, the pervasive use of embedded computing systems realized by field-programmable gate arrays (FPGAs) has elevated the importance of FPGA availability requirements corresponding to the proportion of time that their operation can be sustained. A common requirement is to provide high availability (HA) operation defined by 99.999 percent ("five nines") that correlates to five minutes of downtime per year, or greater availability such as 99.999999 percent ("eight nines") that correlates to 316 milliseconds of downtime per year [1]. High availability operation is crucial whenever unavailability could result in potential harm or inconvenience, violation of a service-level agreement, or a loss of revenue, mission, and/or safety.

Traditionally, availability requirements can be achieved through spatial resource redundancy to mask or replace faulty elements. Availability depends on rapid fault recovery to incur minimal downtime via autonomous fault resolution. As opposed to FPGAs, application-specific integrated circuits (ASICs) use fixed redundancy configurations which preclude fine-grained resource remapping. Whereas FPGAs can enable dynamic fine-grained resiliency, a novel online technique is developed using rapid self-organization to attain HA objectives.

Reconfigurable hardware's capacity to self-organize can fulfill anticipated roles in designing future dependable hardware systems [2]. At present, the most widely adopted reconfigurable architectures are SRAM-based FPGAs whose capacity can exceed a million logic cells which can be leveraged to enable resilience. SRAM-based FPGAs are ubiquitous in application-specific embedded systems, high performance computing centers as well as safety-impacting, mission-critical, and commerce-enabling systems. The FPGA devices within these systems can significantly impact the overall system reliability [3]. Fortunately, run-time partial reconfiguration capabilities of contemporary FPGAs can be utilized to maintain degraded-mode operation while enabling rapid recovery from a variety of faults.

Over the last two decades, a significant body of research has focused on realizing FPGA-based systems that are robust to permanent and transient failures. Permanent failures constitute any irreversible damage to the physical resources, whereas transient failures are short-duration events induced by external sources such as charged particles [4]. Particle-induced transient faults, or soft errors, cause single event upsets (SEUs) which can alter SRAM configuration bits and lead to a functional failure. Conventional resilience techniques for soft-errors and single permanent faults are based on fault-masking via majority voting [5], [6]. Voting methods such as N-modular redundancy (NMR) incur N-fold power and area overheads to tolerate temporary and permanent faults in up to $\lfloor (N-1)/2 \rfloor$ modules. Techniques such as re-execution and reconfiguration scrubbing [7], [8] can provide low overhead recovery for temporary failures.

Alternatively, dynamic remapping of a single design implementation at the module or logic-tile level can be employed to deallocate the use of damaged resources [9]. However, the existing techniques for remapping of FPGA resources at run-time can significantly increase the time complexity of recovery, and thus the downtime. The recovery overhead includes run-time remapping entailing on-board execution of FPGA design processes, such as place and route (PAR), which are time consuming. A single implementation of an FPGA-based design can require minutes to hours using a high-end multi-core processor [10]. Although execution time for remapping can be substantially decreased using incremental PAR if locations of faulty elements are known, it is still a difficult computational workload for embedded processing cores [11]. Thus, conventional dynamic remapping techniques typically require faulty systems to be taken offline for an undesirable interval of time.

In this paper, a new deterministic design space exploration (DSE) [12], [13] method is used to realize FPGA fault tolerance that achieves the availability and reliability objectives shown in Fig. 1. The design space, and thus the fault-resolution space, need only be explored at design-time by creating a small library of alternative design configurations (DCs) with f-disjunct resource usage. DCs are created using the mosaic convergence algorithm developed such that at least one DC in the library evades any occurrence up to d resource faults, where d is lower-bounded by f. The f-disjunction of resources among alternative DCs enables run-time fault localization by a non-adaptive group testing (NGT) technique. This realizes a novel low overhead fault localization/fault isolation capability along with rapid fault recovery from temporary and permanent faults in reconfigurable fabrics while incurring minimal area, power, and perturbation to normal system throughput. We show that the combinatorial properties of f-disjunctness, along with FPGA dynamic partial reconfiguration, enable fault resilience against extensive fault scenarios by reusing a subset of the DCs to ensure continual execution with minimal recovery time.

Fig. 1. Objectives of proposed design disjunction approach.

Overall, the contributions of this work include:
• the first approach to utilize design disjunction for condensed diagnostic analysis of reconfigurable hardware,
• an explicit fine-grained approach to determine the optimal number of DCs at design-time using the property of f-disjunctness for recovery from multiple logic and interconnect failures during the system lifetime,
• an extension of NGT to overcome the low coverage of online functional testing, and
• improvement in crucial metrics including availability, provability of recovery, fault coverage, fault isolation accuracy, and area efficiency.

The remainder of this paper begins with a review of the related work in Section 2. Section 3 provides an introduction to group testing and the property of f-disjunctness along with illustrations. Section 4 discusses design for resource disjunction using the developed mosaic convergence algorithm. Section 5 explains fault isolation and recovery schemes for reconfigurable fabrics using design disjunction. Evaluation results for several case studies are provided and discussed in Section 6. A comparison between the proposed work and modular redundancy schemes is presented in Section 7. Finally, Section 8 presents a brief conclusion.

• A. Alzahrani is with the Department of Computer Engineering, Umm Al-Qura University, Makkah 21955, Saudi Arabia. E-mail: [email protected].
• R.F. DeMara is with the Department of Electrical and Computer Engineering, University of Central Florida, Orlando, FL 32816. E-mail: [email protected].
Manuscript received 2 Mar. 2015; revised 20 Nov. 2015; accepted 24 Nov. 2015. Date of publication 31 Dec. 2015; date of current version 14 Sept. 2016.
Recommended for acceptance by P. Eles.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TC.2015.2513762
0018-9340 © 2015 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

2 RELATED WORK
For contemporary reconfigurable devices, low-level hardware support for testing can incur a significant area overhead due to uncertainty in the logic and interconnect usage of the target applications. In some cases, the goals of testing have been limited to verifying the collective health of reconfigurable fabrics, whereas in the case of diagnostic testing, locations of faulty elements are also identified. Reconfigurability has been leveraged in various ways to enable online testing strategies which examine correctness throughout the system lifetime.

Table 1 summarizes features of related approaches along with the proposed scheme. Previous online diagnostic test schemes for reconfigurable fabrics [14], [20] provide fine resolution, although they require that the system be halted or become unprotected for extended periods before individual faulty elements can be identified. The ability to rapidly obtain information about faulty resources is a critical factor in realizing efficient self-repair. It facilitates fault evasion whereby faulty resources are avoided, or partially damaged resources are reassigned to other useful functionalities. Online fault localization techniques often consider the structural heterogeneity of contemporary reconfigurable hardware. Testing and fault isolation schemes for structures such as programmable logic, interconnect, and RAM have been developed through the years, based on the nature of each structure. For example, RAM-based testing has been extensively studied and the well-known MARCH algorithms [21] have been proven effective for diagnosis of RAM cells by applying a sequence of tests to each element in succession. Previous online fault isolation and recovery

approaches for FPGA logic using dynamic reconfiguration have relied on built-in self-test (BIST) [14], [22]. However, dedicated BIST structures including test pattern generators (TPGs) and output response analyzers (ORAs) are typically not available for FPGA platforms [22]. Modern FPGA architectures are also not entirely scan-ready. Thus, scan chains, TPGs, and ORAs are frequently implemented directly in the fabric using look-up tables (LUTs) and shift registers. As a consequence, BIST-inspired methods can increase FPGA resource requirements by up to 50 percent [23].

TABLE 1
Comparison of Design Disjunction with Related Approaches

| Approach | Run-time Fault Isolation | Resource Coverage: Resolution | Provable Multiple-fault Coverage | Error-tolerant Fault Isolation | Recovery Latency | Intrinsic Wear Leveling | PAR at Run-time | Advantage |
| STARs [14] | Yes | Logic: LUT | Yes | No | Exhaustive BIST Overhead | No | Required | Resource Recycling |
| R3TOS [15] | Yes | Logic: LUT | Yes | Yes | Exhaustive BIST Overhead | No | Required | Robust Control Mechanism |
| Module Diversity [16] | No | Logic: CLB | No | No | μs → ms | Yes | Unnecessary | Effective Aging Mitigation |
| Hahanov et al. [17] | No | Logic: CLB | Yes | No | Routing Overhead | No | Required | Provable Coverage |
| AGT [18] | No | Logic: slice | No | No | PAR Overhead | No | Required | Intrinsic Adaptation |
| Consensus-Based Evaluation [19] | Yes | Logic: slice | No | No | PAR Overhead | No | Required | Outlier Identification |
| Design Disjunction (approach herein) | Yes | Logic: slice & Interconnect: PIPs | Yes | Yes | μs → ms | Yes | Unnecessary | Condensed Diagnosis |

The BIST-based roving STARs test scheme in [14] partitions the reconfigurable fabric into tiles, and continuous online testing is carried out by roving a BISTer from one tile to another while the resources not used by the BISTer structure are dynamically reconfigured to maintain availability. Although failures are resolved at a fine resolution, data throughput must be suspended to copy state values prior to each tile movement. Resource recycling is also facilitated; however, fault isolation and recovery depend on the latency of BISTers to rove the device before encountering faulty elements. Another recent BIST-based fault-tolerant FPGA approach is illustrated by the reliable reconfigurable real-time operating system (R3TOS) [15] wherein a hardware microkernel (HWuK) provides a task scheduler, an allocator to manage FPGA resources for tile placement, and a configuration manager which converts commands issued by the scheduler and allocator into FPGA reconfiguration operations. To minimize single-point of failure exposures, HWuK components are realized by an 8-bit PicoBlaze processor occupying six block RAMs (BRAMs) and 500 configurable logic blocks (CLBs) protected with selective triple modular redundancy (TMR) and error-correcting code (ECC) bits whose resources also undergo periodic testing. The impact of BIST latency is masked by the use of hardware replication and voting.

To reduce the high complexity and cost of BIST, application-dependent BIST testing [20] focuses on the subset of resources used to maintain design functionality. Thus, exhaustive test vectors generated by a TPG and response analysis carried out by an ORA can be relaxed without continually engaging a dedicated reconfiguration controller to carry out the test. The work in [20] also demonstrates an effective application-dependent diagnosis for FPGA interconnects. Distinct test configurations are applied to modulate application LUT functionalities and study output patterns to discern which nets are faulty. These application-dependent approaches assume the resources undergoing diagnosis procedures are unavailable during diagnosis. Thus, methods which eliminate these limitations on availability are sought.

Alternative approaches that eliminate BIST area and power overheads, referred to as operational testing techniques, conduct functional tests via input data that are simultaneously used for normal throughput [19]. These techniques attain availability by relying on run-time inputs, computational redundancy, and output comparison to assess the subset of resources currently used by an application. Permanent and temporary fault monitoring for operational testing can be realized using concurrent error detection (CED) techniques based on duplication with comparison (DWC) or parity-based methods [24]. DWC that compares the Hamming distance between the outputs of two spatially redundant modules is compatible with recent multi-objective DSE approaches [25] which utilize a cost function that considers area requirements and resource utilization against overhead of reconfiguration time. In [18], another operational testing method based on adaptive group testing (AGT) for diagnosis of reconfigurable fabrics is described under a single-fault assumption. However, since the creation of test designs is adaptive based on outcomes of successive tests, the AGT method is unsuitable for high availability applications. Similar to iterative logic array (ILA) and array-based testing methods [26], most functional testing techniques are mainly used for testing a group of resources and provide no fault localization at a fine resolution. In this work, benefits of operational testing are explored with design disjunction to locate faulty resources while avoiding BIST overheads.

Other previous design-time approaches for run-time fault recovery have used genetic algorithms (GAs) [27] to evolve a pool of best-fit designs that exhibit resilience to various failures. The evolved designs are used at run-time to maintain system functionality. Although GAs can succeed

in finding resilient designs, the number of evolved designs requiring functional evaluation is large, and also being a probabilistic process does not explicitly guarantee convergence. The work in [17] presents an algebraic method for devising an optimal remapping strategy for logic blocks at row and column levels to reduce recovery latency and minimize number of spare rows and columns required to tolerate a large combination of fault locations. Remapping by interchange of device columns and rows is still performed at run-time, which relies on an independent fault diagnosis process to locate faulty cells before identifying which resources to interchange. The consensus-based evaluation (CBE) method described in [19] generates, at design-time, a diverse pool of FPGA designs with alternative device resources. These designs are evaluated against each other using a duplex arrangement. Statistical clustering is used to identify operationally correct designs without the assumption of a golden element. The module diversity approach described in [16] provides yet another method for generating diverse designs at design-time for mitigating aging effects at run-time. The diverse designs can be deployed according to a scheduling policy that results in a steady stress distribution across resources to achieve an extended lifetime. The set of diverse designs also guarantees fault recovery under a single-fault assumption for all possible single CLB faults.

Unfortunately, none of the existing approaches demonstrate provable coverage for multiple faults nor do they allow the use of diverse designs for diagnostic tests to locate faulty resources. In this work, we describe an explicit method for generating the optimal number of DCs that guarantee recovery from multiple faults at fine granularity while providing rapid fault isolation. Broader surveys of recent techniques for fault tolerance, autonomous recovery, and self-healing of FPGA-based systems are presented in [28], [29], and [30], respectively.

3 GROUP TESTING FOR DIAGNOSIS OF RECONFIGURABLE ARCHITECTURES
If a test is used to identify f defectives among T elements, where f is unknown, then a straightforward, albeit suboptimal, procedure is to evaluate each element individually. Assuming all tests are reliable, then the testing time complexity becomes O(T). This cost can be considerably reduced by dividing the T elements into g subsets, or groups. The collective results after testing each group can be interpreted to identify the f defectives. The challenge is to sample the minimum number of groups sufficient to find the defectives. This is the basic idea behind group testing first introduced by Dorfman [31] for screening a large number of blood samples by pooling them together to reduce testing cost. Group testing has been adapted to diverse applications such as testing for manufacturing defects, DNA library screening, coding theory, software testing, and BIST-based diagnosis in digital systems [32]. Based upon how test groups are sampled, most group testing techniques can be classified into adaptive or non-adaptive categories.

3.1 Adaptive Group Testing
When using adaptive group testing, complete knowledge of how groups are sampled before testing begins is not specified. The groups are constructed iteratively based on each successive test outcome during the testing procedure. As testing progresses, the iterative sampling of groups narrows down the suspect set of faulty resources until defectives are identified. The binary search (BS) method described in [33] presents one of the simplest AGT algorithms. At the initial stage of BS, the set of scan cells to be tested, X, are considered suspect. The set X is partitioned into two groups, each of which is collectively tested using two scan chains. The BS technique is applied recursively to any erroneous group until faulty cells are singled out. A modified implementation of this algorithm was first proposed for functional testing of FPGAs in [18] under a single-fault assumption. Each test group is a set of resources which implement a functionally equivalent design. Initially, all resources in the reconfigurable container are deemed suspect. The test starts by dividing suspect resources among different functionally equivalent FPGA designs. The suspect set is narrowed down to those that implement a fault-affected design. The modified suspect set is iteratively divided and utilized by a new generation of test designs. The algorithm terminates when only a single cell remains in the suspect set, thus identifying the defective resource. The operational complexity of this algorithm depends on the maximum number of test designs allowed in every test generation. Thus, an overriding concern with AGT is the downtime needed to generate new test designs by repeatedly invoking the design flow which is infeasible on deployed real-time embedded systems.

3.2 Non-Adaptive Group Testing
In the case of non-adaptive group testing, the sampling procedure for all groups is known a priori to the execution of tests. An intuitive way to model and describe the problem of fault isolation in FPGAs using this class of group testing techniques is through matrix algebra. The following notations are used throughout the paper:
• Design matrix $\mathbf{D}_{g \times T}$ is a binary matrix indicating the subset of resources used by each of g DCs. Rows in this matrix correspond to DCs whereas columns correspond to resources. An entry $k_{i,j}$ of the D matrix is one if resource j is utilized by $DC_i$, and zero otherwise.
• Health vector $\mathbf{h}_{T \times 1}$ is a binary vector of length T representing the health of the T resources, i.e., an entry $h_j$ is one if resource j is defective and zero if resource j is healthy.
• Outcome vector $\mathbf{o}_{g \times 1}$ is a binary vector of length g containing the error detection outcomes of all g DCs, i.e., an entry $o_i$ is one if an erroneous outcome is detected while $DC_i$ is deployed and zero if $DC_i$ sustains correct operation.
• Set $\psi(\mathbf{v})$ is the subset of elements in binary vector $\mathbf{v}$ whose entries are one.
• $\omega(\mathbf{v})$ is the weight of binary vector $\mathbf{v}$, i.e., number of elements whose entries are one.
• $G^{n}_{r}$ is the set of all r-combinations of n elements.

The outcome vector, $\mathbf{o}_{g \times 1}$, can be given as follows:

$$\mathbf{o}_{g \times 1} = \mathbf{D}_{g \times T} \times \mathbf{h}_{T \times 1}. \qquad (1)$$
how groups are sampled before testing begins is not

The objective is to recover the health vector $\mathbf{h}$ given that both the design matrix and the outcome vector are known. The health vector can be efficiently recovered if the design matrix obeys the f-disjunctness property and no more than f resources are defective [34]. The f-disjunctness property constrains how alternative groups are overlapped such that f-diagnosability still holds. It provides an efficient strategy to distribute each possible subset of resources of size up to f among a unique subset of DCs. Therefore, defective resources can be identified by finding the common resources among faulty DCs. The matrix $\mathbf{D}_{g \times T}$ is considered f-disjunct if and only if for any possible combination of columns, S, of size f, every column not in S has at least d row elements whose entries are one and all entries of the columns S are zero [35]. This can be expressed as:

$$\forall S \in G^{T}_{f}: \quad \sum_{i=1}^{g} \left( D_{i,j} = 1 \;\wedge\; \bigcup_{k \in S} D_{i,k} = 0 \right) \ge d, \qquad (2)$$

where $1 \le j \le T$ and $j \notin S$.

The parameter d represents the number of rows that satisfy the left side of inequality in Eq. (2). We refer to this parameter as the disjunction factor. The minimum value of d necessary to ensure f-disjunctness is 1 in which all possible combinations of up to f faulty resources can be identified provided that all tests are reliable, i.e. each faulty DC will generate a detectable erroneous outcome. Fig. 2a shows a two-disjunct matrix and one subset of columns, S, of size 2 that meets the condition given by Eq. (2) for d = 1.

Fig. 2. (a) Example of two-disjunct design matrix. (b) Conventional diagnosis decoder.

The decoding procedure to infer the sparse health vector assuming reliable testing is illustrated through a binary comparison between each column vector, $\mathbf{c}$, of the D matrix and the outcome vector $\mathbf{o}$. If the subset of elements of $\mathbf{c}_k$ having value equal to one is fully contained within the subset of elements of the outcome vector $\mathbf{o}$ having value equal to one, then the resource k must be faulty. Thus, the health vector can be obtained as follows:

$$\mathbf{h} = \left\{ h_k \;\middle|\; h_k = \begin{cases} 1 & \text{if } \psi(\mathbf{c}_k) \subseteq \psi(\mathbf{o}) \\ 0 & \text{otherwise} \end{cases},\; 1 \le k \le T \right\}. \qquad (3)$$

Fig. 2b illustrates how the same two-disjunct matrix is used to single out the two defective resources, four and nine, using the described decoding method. In this example, the sparse health vector is given as:

$$\mathbf{h} = (0\ 0\ 0\ 1\ 0\ 0\ 0\ 0\ 1\ 0)^{T}. \qquad (4)$$

Although the binary decoder is efficient, there are two main challenges to properly exploit this technique for fault isolation of reconfigurable hardware. The first challenge is the well-known limitation of low coverage from functional testing which can introduce a sampling noise to the binary decoding method leading to misdiagnosis. Hence, a suspiciousness ranking metric that classifies resources according to their existence rate in failed DCs is developed instead of binary decoding methods. Additionally, f-disjunctness for d > 1 along with the proposed ranking metric are shown to be effective for surmounting the low coverage issue of functional testing as explained in Section 5.2. Since all DCs implement the same application functionality while utilizing a disjunct set of T resources, each DC requires the same resource count. The second challenge is to construct a constrained f-disjunct design matrix for any given T and with rows of equal weight dictated by the application size, R. Available techniques used to construct f-disjunct matrices stipulate a set of conditions on matrix size and the row weights which preclude the flexibility needed to meet design and resource count constraints of operational testing of reconfigurable fabrics. In this work, a new combinatorial search algorithm is described to achieve f-disjunctness for any given design parameters T, R, and d. In Sections 4 and 5, solutions to these two challenges are discussed with results demonstrating feasibility and advantages of the proposed approach.

4 DESIGN FOR DISJUNCTION ON RECONFIGURABLE ARCHITECTURES
Design disjunction realizes a set of f-disjunct DCs, each of which implements the same application functionality, and then employs them to locate and evade defective resources during system lifetime while maintaining optimal availability. These DCs are produced prior to the test procedure; therefore, only partial reconfiguration overhead of existing DCs is incurred during fault diagnosis and recovery. Fault tolerance is achieved by run-time reconfiguration to load one of the bitfiles from the subset of DCs which does not utilize defective resources. The constructive property of f-disjunctness is shown to be effective for extracting highly fault-resilient DCs against logic and interconnect failures.

In this work, FPGA-based fault scenarios are considered for evaluation of design disjunction since FPGAs are the prominent form of contemporary reconfigurable hardware. Modern FPGAs have multiple levels of logic cell granularity. For instance, basic logic elements such as LUTs and flip-flops of Xilinx FPGAs are organized into logic slices which are

considered the most primitive programmable logic blocks. As such, design disjunction is examined at the slice level. Thus, the columns of the design matrix D correspond to slices while rows represent DCs. We also focus on logic fault localization. However, the proposed work can be combined with other application-dependent interconnect testing such as [20] for fault isolation at the level of interconnect points.

Assuming an application is synthesized to a minimum of R slices, then the weight, i.e. the number of non-zero elements, of every row of the design matrix must equal R. The problem of constructing f-disjunct matrices has been increasingly studied within coding theory literature [34]. For the interest of this work, we empirically evaluate the lower bound on DC count required to reach f-disjunction using the developed mosaic convergence algorithm. Let the notation (T,R,f)-disjunct matrix denote an f-disjunct design matrix whose rows have exactly R non-zero entries out of T. Algorithm 1 shows the pseudocode for the proposed mosaic convergence approach for constructing such a matrix. Starting with an initial row that has R non-zero entries (lines 4-7), each added row represents the best-found row vector that maximizes the accumulative disjunction ratio (lines 36-49). The disjunction ratio is defined as follows:

Definition 4.1. Disjunction ratio (DR) is the proportion of $\Gamma^{T}_{f}$ elements that satisfy the condition stated in Eq. (2).

The binary coverage matrix (line 9) tracks whether each combination $S \in \Gamma^{T}_{f}$ has satisfied the condition in Eq. (2). Every added row is initially a T-dimensional row vector v of weight equals T (line 12). The combinatorial search for optimal v requires two nested sequential loops (lines 17-31) which examine each non-zero element in v and pick the element which, if flipped to zero, yields the largest increment to the disjunction ratio DR. This latter step is repeated until the weight of the vector v is reduced to R. Once an optimal row vector is found, the coverage matrix is updated to include the incremental coverage of each row (lines 36-49). The row-by-row construction of design matrix D terminates once the DR value reaches its maximum value of 1 (line 11).

Algorithm 1. Mosaic Convergence Algorithm for Constructing (T,R,f)-Disjunct Design Matrix
Procedure construct (T,R,f)-disjunct matrix
Input:  T: Total Number of Resources
        R: Required Resources to Implement Application
        f: Number of Defects
        d: Disjunction Factor
Output: Design Matrix, D (g × T)
 1  φ := C(T, f) = T! / (f! (T − f)!)
 2  ε := φ × (T − f)   // binary check count
 3  DR := 0
 4  Generate a random row vector v, s.t.: length(v) = T and ω(v) = R
 5  g := 1   // point to the first row of D
 6  D_g := v   // insert v as the first row of the design matrix
 7  g := g + 1
 8  C := Γ^T_f   // set of all f-combinations out of T
 9  set every entry of the φ × T binary coverage matrix to d   // initialize binary coverage matrix entries to d
10  DR_func(v)   // call function DR_func to update DR after inserting the row vector v
11  while (DR ≠ 1) do
12      v := [1] (1 × T)   // start with a row vector v s.t. length(v) = ω(v) = T
13      S_max := C_{z_max}
14      for each k ∈ S_max do
15          v_k := 0
16      while (ω(v) ≠ R) do
17          max := 0
18          for i := 1 to T do
19              if (v_i ≠ 0) then
20                  t := v
21                  y := row z_max of the coverage matrix
22                  t_i := 0
23                  count := 0
24                  for each S ∈ C s.t. i ∈ S do
25                      for j := 1 to T do
26                          if (t_j = 1 ∧ y_j ≠ 0) then
27                              y_j := y_j − 1
28                              count := count + 1
29                  if (count > max) then
30                      top_entry_index := i
31                      max := count
32          v_{top_entry_index} := 0
33      D_g := v
34      g := g + 1
35      DR_func(v)   // update DR after inserting a new row
36  Function DR_func(a)
37      count := 0
38      max := 0
39      for z := 1 to φ do
40          S := C_z
41          if (∀ k ∈ S, a_k = 0) then
42              for j := 1 to T s.t. j ∉ S do
43                  if (coverage entry (z, j) ≠ 0 ∧ a_j = 1) then
44                      coverage entry (z, j) := coverage entry (z, j) − 1
45                      count := count + 1
46          if (ω(coverage row z) > max) then
47              z_max := z
48              max := ω(coverage row z)
49      DR := DR + count / (ε d)

The complexity of the binary search for each new row is largely determined by T and the cardinality of set $\Re \subseteq \Gamma^{T}_{f}$ that have not yet satisfied the condition expressed in Eq. (2). The cardinality of $\Re$ decreases exponentially as number of rows in the D matrix increase. For search of the first few rows, the search space for optimal v is still large, which rapidly decreases as more rows are added to the D matrix. To decrease the execution time of the algorithm, one option is to limit the combinatorial search to a randomly selected subset of $\Re$. This will increase the speed of the construction algorithm at the expense of obtaining a suboptimal v in each row iteration. The effect of this suboptimality appears in the final solution as an increase in g, or number of required DCs to achieve f-disjunctness. In this work, we utilized exhaustive combinatorial search to capture the lower bound on number of DCs needed to achieve the discussed FT objectives, although search can be relaxed in practice. The constructed design matrix is then used to define the set of placement constraints supplied to the design tools to implement disjunct DCs.
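The disjunction ratio of Definition 4.1 can be illustrated with a short Python sketch. It is not the authors' implementation: it recomputes DR from scratch for a given design matrix, following our reading of DR_func in Algorithm 1 (for every f-combination S and every resource j outside S, count the rows that avoid all of S while using j, capped at d). The function name `disjunction_ratio` and the list-of-lists matrix layout are assumptions made for illustration.

```python
from itertools import combinations

def disjunction_ratio(D, f, d=1):
    """Fraction of (S, j) checks covered, each check capped at d covering rows.

    D is a list of 0/1 rows (one row per DC, one column per resource).
    Assumes T > f so that at least one check exists.
    """
    T = len(D[0])
    covered = 0
    total = 0
    for S in combinations(range(T), f):
        # Rows (DCs) that use none of the resources in S.
        avoiding = [row for row in D if all(row[k] == 0 for k in S)]
        for j in range(T):
            if j in S:
                continue
            # Avoiding rows that still use resource j, counted up to d.
            hits = sum(row[j] for row in avoiding)
            covered += min(hits, d)
            total += d
    return covered / total

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
single_row = [[1, 1, 0]]
print(disjunction_ratio(identity, f=1))    # 1.0, i.e. one-disjunct for d = 1
print(disjunction_ratio(single_row, f=1))  # 2 of 6 checks covered
```

A matrix is (T,R,f)-disjunct for the chosen d when this value reaches 1 and every row has weight R; the greedy row search of Algorithm 1 is not reproduced here.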
ALZAHRANI AND DEMARA: FAST ONLINE DIAGNOSIS AND RECOVERY OF RECONFIGURABLE LOGIC FABRICS USING DESIGN DISJUNCTION 3061

Fig. 3. Required number of DCs versus resource count for typical values of f (d = 1).

The mosaic convergence algorithm was implemented on an Intel quad-core processor based PC design station. The number of DCs g required to reach f-disjunctness with respect to T and f is obtained for d = 1. Fig. 3 shows collected g values for f = 1, 2, and 3. The logarithmic trend lines indicate that g grows linearly as resource count increases exponentially. The advantageous logarithmic dependence of g on resource count T obtained by the mosaic convergence procedure is consistent with results from other probabilistic methods for constructing unconstrained disjunct matrices [36], [37]. Fig. 3 also shows the non-linear increase in g for increasing f. The small number of disjunct DCs signifies the advantage of design disjunction to lower testing cost and recovery overhead.

5 DESIGN DISJUNCTION FOR FAULT TOLERANCE

5.1 Fault Diagnosis Using Design Disjunction

The binary decoder described in Section 3.2 provides only binary diagnostic data which can lead to incorrect fault diagnosis in the presence of inarticulate tests. Instead, a ranking scheme that assesses resources according to their existence rate in failed DCs can reveal a more accurate estimate of the failure state of the resources. For each resource, the proportion of failed DCs that utilize the resource is computed and compared with other resources. This ratio is referred to as fault sensing ratio (FSR) and can be expressed as follows:

$$\mathrm{FSR}_i = \frac{\sum_{k=1}^{g} \{ D_{k,i} \mid D_{k,i} = 1 \wedge o_k = 1 \}}{\omega(\mathbf{c}_i)}, \quad 1 \le i \le T, \tag{5}$$

where $\mathbf{c}_i$ is the ith column vector of the design matrix D. A resource with a large FSR has a high likelihood of being faulty. To illustrate how FSR is obtained, the health vector h given by the example described in Section 3.2 can be rewritten using FSR for each cell, as follows, in which faulty resources get the highest FSR values.

$$\mathbf{h} = (\,0.3 \;\; 0.6 \;\; 0.3 \;\; 1 \;\; 0.6 \;\; 0.6 \;\; 0.6 \;\; 0.6 \;\; 1 \;\; 0\,)^{T}$$

Similarly, the cumulative sum of FSR, denoted as CFSR, for all resources used by each DC yields a failure ranking metric for DCs. The CFSR is used to determine the best operational DC if fault isolation at the design configuration level is sought.

We first focus on the case of ideal test coverage in which all fault-affected DCs manifest at least one erroneous functional output. Fig. 4 illustrates an example of a single fault isolation case on a reconfigurable partition of size 20 × 15 = 300 slices for an application mapped to 195 slices. Using the mosaic convergence procedure in Algorithm 1, 16 DCs (indexed 1-16) are found sufficient to achieve one-disjunctness for d = 1 in this example. The resource grouping defined by a (300, 195, 1)-disjunct design matrix is shown by the dark blue cells for each DC. Based on fault detection outcomes after evaluating all the 16 DCs, the FSR value for each slice is computed. The highest observed FSR reveals the location of faulty slice as depicted by the FSR heat map.

Fig. 4. Fault diagnosis using the FSR metric.

To examine the quality of fault isolation using the proposed ranking method, the terms isolation accuracy and fault coverage are defined as follows:

Definition 5.1. Isolation accuracy is the number of non-faulty resources that have lower FSR values than all defectives, divided by the total number of resources.

For instance, given a pool of 1,000 resources having two defects, an isolation accuracy of 95 percent indicates that ⌊998 × 95%⌋ = 948 of non-faulty resources score lower FSR values than the two defects.

Definition 5.2. Fault coverage is the proportion of all combinations of faulty resources of size up to f that attain a specified isolation accuracy.
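The ranking metrics of this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' code: the design matrix is a list of 0/1 rows, the outcome vector holds 1 for a DC whose test failed, and the names `fsr`, `cfsr`, and `isolation_accuracy` are hypothetical. Isolation accuracy follows Definition 5.1 as worded, with the total resource count in the denominator.

```python
def fsr(D, o):
    """FSR_i of Eq. (5): failed DCs using resource i over all DCs using it."""
    T = len(D[0])
    scores = []
    for i in range(T):
        used = sum(row[i] for row in D)
        failed = sum(row[i] for row, failed_flag in zip(D, o) if failed_flag == 1)
        # A resource that no DC uses gets a score of 0 (an assumption).
        scores.append(failed / used if used else 0.0)
    return scores

def cfsr(D, scores):
    """CFSR per DC: sum of FSR over the resources that DC uses."""
    return [sum(s for s, bit in zip(scores, row) if bit) for row in D]

def isolation_accuracy(scores, faulty):
    """Definition 5.1: non-faulty resources ranked below every defective."""
    lowest_defective = min(scores[i] for i in faulty)
    below = sum(1 for i, s in enumerate(scores)
                if i not in faulty and s < lowest_defective)
    return below / len(scores)

# Toy example: 3 DCs over 3 resources, resource 0 is faulty, ideal tests.
D = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
o = [1, 0, 1]                      # DCs 0 and 2 use resource 0 and fail
scores = fsr(D, o)                 # [1.0, 0.5, 0.5]
ranking = cfsr(D, scores)          # [1.5, 1.0, 1.5]
best_dc = ranking.index(min(ranking))
print(scores, ranking, best_dc)
```

In this toy case the faulty resource has the highest FSR and the DC with the lowest CFSR (index 1) is the one that avoids it.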

Fig. 5. Isolation accuracy versus g (T = 1,000, f = 2, d = 1).

Fig. 5 shows the required number of DCs, g, to reach various isolation accuracies and their fault coverage values. The results also demonstrate how Algorithm 1 progresses towards the termination criteria, i.e. DR = 100 percent, as g increases. The resource count T chosen for this analysis equals 1,000 and disjunction parameters are f = 2 and d = 1. In this case, 55 DCs are sufficient to identify all $\binom{1000}{2} + \binom{1000}{1}$ possible fault locations with 100 percent isolation accuracy. The value of g can be considerably reduced while maintaining a high isolation accuracy. A reduction of 36.4 percent (61.8 percent) in g results in a slight decrease in isolation accuracy of 1 percent (5 percent). This tradeoff between isolation accuracy and number of required tests can be conducted based on system reliability goals, e.g., the extent sufficient to achieve fast self-repair. It is important to note again that these simulation results are collected under the conditions of reliable tests. It is expected that g is increased to tolerate inarticulate tests while maintaining equivalent isolation accuracy as demonstrated in Section 5.2.

5.2 Inarticulate Operational Testing

In the preceding analysis, we have assumed that a test outcome generated by a fault detection scheme embedded within each DC is reflective of the actual health state of used resources. However, this assumption for functional testing of digital designs cannot be guaranteed for various reasons. These include low test coverage due to node's controllability and observability constraints, common mode failures, or stuck-at 0 fault conditions in the fault detection logic. Error-resilient NGT was previously investigated through probabilistic and theoretical analysis with direct numerical simulations [37], [38]. In Section 3.2, a discussion was provided for the classical requirement to obtain f-disjunction which states that d must be greater than or equal 1. As d increases beyond 1, the effect of inarticulate tests on the decoding procedure can be masked. In the context of operational testing of reconfigurable hardware, increasing the disjunction factor d results in an increased number of alternative DCs. Since resources are sensitized in a diverse way as the device is reconfigured to different DCs, diversity among DCs enables a better collective diagnostic coverage to attenuate the chance of false test outcomes during individual tests.

In this work, we study how such an extension affects fault diagnosis using the proposed ranking scheme. The described combinatorial construction method given by the mosaic convergence procedure in Algorithm 1 is also used to realize design disjunction for d > 1. Fig. 6 shows the number of DCs for one-disjunctness and selected d values. It is evident that design disjunction for d > 1 is achieved at modest linear increase in DC count g. For instance, the case of 7,000 resources indicates that d can be increased by an order of magnitude from d = 1 to d = 10 while only roughly tripling the number of DCs required. In Section 6, we evaluate the effect of increasing d on fault diagnosis for various case studies in which we compare the isolation accuracy under the low coverage of operational testing.

Fig. 6. DC count for increasing d (f = 1).

5.3 Fault Recovery Using Design Disjunction

The combinatorial characteristics of f-disjunct design matrices add another advantage for design disjunction. The definition expressed in Eq. (2) implies that any f-disjunct set of DCs should guarantee that for any possible accumulation of f faulty resources there exists at least one DC whose resource set does not include a defective. This implication should not be considered as the upper bound on the number of recoverable defectives. Since hardware utilization ratio R/T can increase or decrease the sparsity of design matrix, it is possible to guarantee fault evasion for larger than f defectives. The normal probability $p_{dc\_nf}(d)$ that up to d defective resources are not used by a DC is given as:

$$p_{dc\_nf}(d) = \prod_{k=1}^{d} \left( 1 - \frac{R}{T - k + 1} \right), \quad d \ge 1. \tag{6}$$

Thus, recovery coverage (RC), defined by the probability of recovery for g DCs, can be computed for any accumulated fault count d as:

$$RC(d) = 1 - \left( 1 - p_{dc\_nf}(d) \right)^{g}, \quad d \ge 1. \tag{7}$$
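The recovery model of Eqs. (6) and (7) is easy to evaluate numerically. The sketch below is an illustration, not the authors' code: the function names are hypothetical, the k-th factor uses the denominator T − k + 1 (our reading of the equation, which makes the product equal to the chance that d distinct defects all fall outside the R resources of one DC), and Eq. (7) treats the g DCs as independent draws.

```python
from math import comb, isclose

def p_dc_nf(d, R, T):
    """Eq. (6): probability that one DC using R of T resources avoids d defects."""
    if d > T - R:
        return 0.0          # more defects than unused resources: cannot avoid
    p = 1.0
    for k in range(1, d + 1):
        p *= 1.0 - R / (T - k + 1)
    return p

def recovery_coverage(d, g, R, T):
    """Eq. (7): probability that at least one of g DCs avoids all d defects."""
    return 1.0 - (1.0 - p_dc_nf(d, R, T)) ** g

# Parameters of Fig. 7 (T = 100, R = 30); the DC count g = 10 is illustrative.
for d in (1, 2, 3, 5):
    print(d, round(recovery_coverage(d, g=10, R=30, T=100), 4))

# Sanity check: the product equals the hypergeometric ratio C(T-R, d) / C(T, d).
assert isclose(p_dc_nf(2, 30, 100), comb(70, 2) / comb(100, 2))
```

Lowering the utilization ratio R/T or adding DCs raises RC(d), which is the lever the text refers to when it says a target recovery rate can be met by choosing the hardware utilization.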

Fig. 7. Recovery coverage of disjunct DCs (T = 100, R = 30, d = 1).

In order to examine the recovery behavior of the proposed method, three sets of f-disjunct designs for f = 1, 2, and 3 were tested against all possible set of fault locations $\Gamma^{T}_{d}$ for varying accumulated fault count d. Fig. 7 compares simulation results against our model given by Eq. (7). Recovery coverage on the left vertical axis also indicates the proportion of $\Gamma^{T}_{d}$ combinations of defective(s) that were successfully evaded by at least one DC. All three disjunct sets exhibit high fault resilience for fault count d larger than f.

A target recovery rate can be met by choosing the appropriate hardware utilization as indicated in Eq. (6). For practical considerations, the optimal number of DCs for recovery during the system lifetime can be generated at design-time and stored in an off-chip flash memory. The data in the external flash memory can be protected using hardware redundancy or error correction schemes in addition to functional verification by CED which is resident on the FPGA.

5.4 Incidental Disjunction for Interconnect Fault Tolerance

Contemporary reconfigurable devices utilize hundreds of thousands of routing points. For instance, Xilinx 7-series FPGAs fabricated in a 28 nm process allow over 3,500 programmable interconnect points (PIPs) to be defined in each switch tile of the device. This presents a significant challenge for run-time interconnect testing and diagnosis. Specialized functional testing for interconnects based on output pattern analysis as in [20] and [39] has been shown to be effective for diagnosis at the net level of a target design. However, a net in a design can utilize a considerable number of PIPs spanning multiple switch tiles that can prolong the self-repair process. Since allocation of interconnect resources is precipitated by mapping and placement of logic resources [40], a design disjunction in the logic fabric has been demonstrated to also confer significant incidental disjunction in interconnect resources. This property effectively extends fault recovery to routing fabrics as demonstrated in Section 6.2.

6 EVALUATION

6.1 Evaluation Setup

The proposed work is initially evaluated on a set of MCNC and ISCAS benchmarks through hardware simulations to show its applicability to a variety of applications. A modularized AES128 encryption core is selected as a realistic target application for the hardware prototype. The actual hardware demonstration is performed on the commercial Xilinx KC705 FPGA evaluation board. The KC705 board features: 28 nm-based Kintex-7 FPGA, 1 GB DDR3 memory, 128 MB linear flash memory, and a joint test action group (JTAG) interface. For hardware simulation, a software-based CED scheme is utilized to detect failures during simulation. Parity-based and DWC error detection methods are adopted in the hardware prototype. For all case studies, Xilinx 7-series FPGAs using Xilinx design toolsets are used to generate disjunct DCs.

Fig. 8. Framework of demonstration system.

The design flow for the evaluation framework is depicted in Fig. 8. The flow starts from a conventional design in a hardware description language using Xilinx's ISE synthesis tool. The synthesized netlists for target application are imported to Xilinx's PlanAhead to generate the physical implementation of all disjunct DCs. To enable partial reconfiguration support in the PlanAhead tool, a reconfigurable partition (RP) must be floorplanned such that it contains T resources necessary to realize the disjunct DCs. The RP is interfaced with the static region (SR) outside the RP through proxy LUTs. All disjunct DCs must use the same proxy logic for the target application's input and output ports which is possible by locking all port sets with the LOC constraint. Each DC is defined as a distinct reconfigurable module



(RM) inside the RP. Resource allocation for each RM is dictated by the design matrix constructed for the target application according to the design parameters discussed in Section 4. Resource allocation for each DC is added to the design flow by defining the placement AREA_GROUP and CONFIG_PROHIBIT constraints in the user constraints file (UCF) for each RM. The PlanAhead tool then generates Xilinx's native circuit description (NCD) netlist for each RM.

The stuck-at fault (SAF) model is adopted for fault injection in this evaluation. Fault injection is incorporated into the flow using Xilinx's FPGA Editor which can inject SAF into NCD netlists at any randomly chosen location. Resource information for generating appropriate fault injection commands for the FPGA Editor tool are extracted from Xilinx design language (XDL) netlists. For hardware simulation of each benchmark, a post PAR simulation model is generated from each NCD netlist before Xilinx's ISim simulator is invoked to verify functionality of each DC. To drive each simulation case, a subset of random inputs generated from a uniform distribution are used to mimic run-time operational inputs. It is worth noting that operational testing using concurrent error detection schemes employs a functional fault model (FFM) which encompasses SAF and a wide range of failure modes that can alter application functionality.

The considered AES encryption core for the hardware prototype is comprised of non-linear substitution boxes, a key expansion and addition units, and other logic blocks for shifting and mixing columns of the state matrix where input words are arranged. The AES core is decomposed into eight modules each of which has its own embedded error detection domain. Fig. 9 shows a block diagram for the hardware demonstration system on the KC705 FPGA board. Error detection schemes for the AES modules are derived mostly from [41]. An embedded MicroBlaze processor orchestrates execution flow of fault recovery and diagnosis, and constitutes a golden element in this prototype. Partial reconfiguration (PR) using the internal configuration access port (ICAP) is utilized for partial reconfiguration to minimize reconfiguration overhead. Xilinx provides the AXI_HWICAP IP core and a set of basic library functions supplied with the Xilinx's software development kit (SDK) that are used to control partial reconfiguration via the ICAP at the system level. The advanced extensible interface (AXI) bus system is used to interface the processor with the ICAP, memory interfaces, RPs, and other IPs used in the prototype.

Fig. 9. Block diagram of hardware demonstration system.

Design disjunction is evaluated on the hardware platform using high-resolution image data which reside in the external DDR3 during the recovery process. A hardware timer is attached to the developed system bus to accurately capture system throughput and processing time of fault diagnosis flow. Xilinx's IPs which form the processing system (PS) including the MicroBlaze core, memory and communication interfaces, and ICAP reconfiguration logic, reside in the SR of the device. Partial reconfiguration is integrated in this prototype by defining a distinct RP for each AES module. Disjunct RMs are then defined and added for each RP. The design flow of the hardware prototype is extended from the implementation steps of experimental simulation. The static bitfile for the SR and partial bitfiles for each RP are obtained from the NCD netlists using the Xilinx's BitGen tool. The software module running on the embedded processor developed for the prototype using the Xilinx's SDK is combined with the static bitfile using Xilinx's Data2MEM tool before programming the FPGA board through its JTAG interface. Partial bitfiles for all RPs are stored in the off-FPGA flash memory chip before the evaluation begins. When partial reconfiguration is required, the embedded MicroBlaze processor moves each partial bitstream in the flash memory to the DDR3 memory before being written by the ICAP.

The evaluation process including resource allocation for design disjunction, fault injection, and simulation, is carried out by a Python-based software module that automates design and simulation tasks by invoking all required Xilinx tools through external system commands. The Python module also parses post PAR design files to extract delays and build a slice-level netlist using a net connectivity graph with associated functionality and routing resource information. This netlist is used to examine the recovery rate in relation to logic resources and PIPs.

6.2 Design Parameters and Results

For each MCNC and ISCAS benchmark, two f-disjunct sets of DCs are generated for f = 1 and f = 2. Table 2 lists the isolation accuracy results averaged over 1,000 experimental runs on all benchmarks for f = 1 and f = 2. Results include the 95 percent confidence interval (CI) and the area requirements indicated by parameters R and T. In this evaluation, T values are selected such that the area overhead T/R ≈ 2 and T/R ≈ 3 for f = 1 and f = 2, respectively, to demonstrate adaptation to various design parameters. The execution time of the mosaic convergence algorithm, denoted by t_mc, to generate the (T,R,f)-disjunct design matrix for each benchmark is also included. For this evaluation, design disjunction for each benchmark is realized using d = 1 to observe the effect of inarticulate operational testing on fault isolation. As discussed in Section 4, the execution time of the mosaic convergence algorithm depends largely on T and size of $\Gamma^{T}_{f}$. The average execution time of the algorithm for the application set examined in this evaluation is 89.8 ms (61.1 s) for f = 1 (f = 2). Table 2 also shows that the

TABLE 2
Isolation Accuracy Results (d = 1)

                f = 1                                    f = 2
                                 Isolation Accuracy (%)                  Isolation Accuracy (%)
                                        95% CI                                  95% CI
Benchmark    R     T   g t_mc (ms)      μ  lower  upper     T   g t_mc (s)      μ  lower  upper
alu4        73   144  15        41  96.86  96.07  97.65   198  41    12.74  95.78  93.89  97.67
c880        16    30  10         7  95.80  93.85  97.75    45  25    0.057  95.56  93.54  97.57
misex3     103   198  15        98  91.73  89.28  94.18   286  44     51.7  88.16  84.34  91.99
exp5        22    40  11         9  97.17  96.28  98.07    66  29    0.161  93.42  90.19  96.64
vda         43    84  14        13  98.32  97.15  99.50   119  35     1.97  97.13  95.12  99.15
c6288      139   256  15       211  99.14  98.53  99.75   390  48    174.7  97.01  94.69  99.33
seq        132   252  15       205  91.71  89.69  93.74   385  47    170.3  89.90  86.49  93.32
apex4       70   136  14        31  98.56  97.75  99.37   204  41     14.7  97.40  95.87  98.94
des        146   275  16       262  97.31  96.26  98.35   391  48    179.8  92.67  89.55  95.79
c3540       58   112  14        21  97.66  96.31  99.01   162  38     5.97  96.67  95.16  98.19
average      –     –   –      89.8  96.43  95.11  97.74     –   –    61.12  94.37  91.88  96.86

average isolation accuracy over all benchmarks for f = 1 (f = 2) is 96.4 percent (94.4 percent). Although the obtained isolation accuracy results are still promising, it is evident that design disjunction for d > 1 is needed to overcome the impact of low test coverage. Test coverage also depends on the quality of input test patterns; a higher isolation accuracy can be achieved if specialized high-coverage test patterns generated by conventional ATPG tools at design-time are used at run-time.

Design disjunction for d > 1 is also evaluated to demonstrate feasibility to reach optimal fault isolation under inarticulate testing. Table 3 shows how design disjunction for a moderate increase in disjunction factor d results in a greater than 99 percent isolation accuracy for all selected benchmarks. The three selected benchmarks include the misex3 benchmark which gives the worst combined isolation accuracy for f = 1 and f = 2 using d = 1. Nevertheless, isolation accuracy exceeding 99 percent, as given by the upper 95 percent CI, is reached using d = 5. A diminishing return in improving isolation accuracy is also observed as d increases. Thus, the range 1 ≤ d ≤ 11 can be chosen for an optimal tradeoff between isolation accuracy and g. A linear dependency of g on d is also observed that is consistent with the analysis provided in Section 5.

Fig. 10 reports fault recovery results for the exhaustive fault coverage evaluation on logic and PIPs for f = 1 and d = 1. The design parameters for these benchmarks are similar to those listed in Table 2. It is evident that design disjunction allows the ratio of shared PIPs among DCs to be much lower than that of logic resources. This is attributed to the PAR mechanism in the FPGA tool and its reaction to the diverse logic realizations. Also, it translates into an increase in the likelihood of finding at least one DC that avoids all faulty resources as confirmed here for logic slices and PIPs.

To observe the impact of design disjunction on application performance, the timing slacks along critical paths of all DCs are compared to the total slack of baseline design for each benchmark. The baseline design is the conventional physical implementation of an application inside its dedicated RP without resource constraints. For typical implementation, PAR algorithms search for the best placement and routing to meet timing constraints. Total slack s is given by post PAR timing reports as follows:

$$s = t_{target} - t_{total} = t_{target} - [\,t_{cp} - t_{cps} + t_{cu}\,], \tag{8}$$

where $t_{target}$ is target clock period, $t_{total}$ is total delay, $t_{cp}$ is critical path delay, $t_{cps}$ is clock path skew, and $t_{cu}$ is clock uncertainty. $t_{target}$ is set such that the total slack of baseline design is 2 ns. Figure 11 shows s and $t_{cp}$ data for each benchmark. The average increase in $t_{cp}$ compared to the baseline design is 1.49 percent and the average decrease in the ratio of the total slack to the total delay is only 1.78 percent. It is also observed that the top-performing DC can be slightly

TABLE 3
Isolation Accuracy versus d for Selected Benchmarks (f = 1)

     misex3                               c3540                                alu4
              Isolation Accuracy (%)               Isolation Accuracy (%)               Isolation Accuracy (%)
                     95% CI                               95% CI                               95% CI
 d    g t_mc (ms)      μ  lower  upper     g t_mc (ms)      μ  lower  upper     g t_mc (ms)      μ  lower  upper
 1   15        98   91.7   89.3   94.2    14        21   97.7   96.3   99.0    15        41   96.9   96.1   97.7
 3   25       146   96.4   94.7   98.0    23        43   99.7   99.5   99.9    26        75   99.7   98.4   99.5
 5   36       201   97.7   96.0   99.4    33        59   99.8   99.7  100.0    34       101   99.7   99.5   99.9
 7   46       281   98.8   97.6  100.0    42        79   99.9   99.8  100.0    44       142   99.8   99.7  100.0
 9   55       339   98.9   98.0   99.7    51       123  100.0   99.9  100.0    53       179  100.0  100.0  100.0
11   65       426   99.3   98.5  100.0     –         –      –      –      –     –         –      –      –      –

faster than the baseline design due to the stochastic nature of placement and routing algorithms which does not guarantee convergence to the optimal solution, or due to random variation of the timing of logic resources [42].

Fig. 10. Fault recovery coverage (f = 1, d = 1).

Fig. 11. Effect of design disjunction on system performance.

Table 4 lists design parameters, execution time to realize the design matrix, error detection method, and size of partial bitstream for each distinct AES module shown in Fig. 9. A failure in any module triggers the embedded processor to execute diagnosis and recovery service routines. Initially, transient and permanent failures are undistinguished. Thus, articulating inputs are re-issued to ascertain if reconfiguration scrubbing can resolve possible SEUs. If discrepancies persist, then DCs of the respective RP are configured to the FPGA through the ICAP. Reconfiguration occurs while using application throughput to stimulate test sequences and maintain availability. The evaluation window for this prototype is set to 1,000 blocks which can be adapted to maintain a desired throughput rate. If the fault detection signal is asserted at any time within the evaluation window, the fault isolation flow will continue by loading a subsequent DC. The feedback from the fault detection logic is captured by the processor where diagnostic data are decoded to identify faulty resources and the optimal resilient DC based on the ranking scheme described in Section 5.1.

Figs. 12a and 12b show the outlier behavior for FSR and CFSR ranking metrics, respectively, for 15 test cases. For the sake of comparison, FSR and CFSR values for each test case are normalized from 1 to 10. Each test case is conducted by first selecting an AES module at random and then injecting a SAF at a randomly chosen LUT input. Fig. 12a depicts the top 50 resources in ascending order of FSR for each of the 15 test cases. The defective resources indicated by the red dots rank the highest in FSR with a considerable difference to their next lower ranking resources. The normalized CFSR values for DCs for the 15 test cases depicted in Fig. 12b show that faulty DCs accumulate higher CFSR values. Thus, the DC ranking the lowest CFSR for each test case is selected as the optimal fault-resilient candidate DC for recovery.

Fig. 12c shows the encryption time of the AES core during fault-handling routine for a selected test case. The test procedure is triggered after injecting a SAF at a randomly chosen LUT input in one of the 32-bit s-boxes. At the beginning, DC14 is deployed during fault occurrence. The fault recovery procedure reconfigures the device with the partial bitfile of DC14 to rule out SEUs. Since discrepancies persist, diagnosis flow continues by testing the remaining 23 DCs. Execution time is given per 100 plaintext blocks. The encryption core throughput is mainly impacted by the partial reconfiguration overhead t_pr = 4.58 ms and the latency of post-testing decoding phase t_d = 6.14 ms. The entire diagnosis flow completes in a millisecond-order time. Fault recovery is achieved after the second test using DC2 which can be kept in service to maintain availability during time-critical events. The fault diagnosis flow can continue as shown until all DCs are evaluated so that the locations of damaged resources and DC for recovery are determined. Since design disjunction is realized using d = 3 for the hardware prototype, the inarticulate tests of DC12 and DC19 have no impact on the trends given by FSR and CFSR. The obtained optimal resilient DC in this test case is DC6 which is deployed to guarantee sustained recovery.
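The diagnosis routine described above can be summarized as a small control loop. The Python sketch below is only an illustration of that flow, not the embedded software of the prototype: `run_diagnosis` and the `fault_detected` callback are hypothetical names, and the callback stands in for loading a DC through the ICAP and observing the detection signal over one evaluation window.

```python
def run_diagnosis(D, active, fault_detected):
    """Collect one outcome bit per DC and note the first fault-free DC.

    D               design matrix as a list of 0/1 rows (one row per DC)
    active          index of the DC in service when the failure was flagged
    fault_detected  hypothetical callback: loads DC k, runs one evaluation
                    window, returns True if the detection signal was asserted
    Returns (outcomes, interim_dc); outcomes is None for a transient fault.
    """
    # Reload the active DC first: a clean re-run suggests a transient (SEU).
    if not fault_detected(active):
        return None, active
    outcomes = [0] * len(D)
    outcomes[active] = 1
    interim = None
    for k in range(len(D)):
        if k == active:
            continue
        outcomes[k] = 1 if fault_detected(k) else 0
        # The first passing DC can stay in service while testing continues.
        if interim is None and outcomes[k] == 0:
            interim = k
    return outcomes, interim

# Toy run with ideal test coverage: a DC fails exactly when it uses slice 0.
D = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
faulty_slice = 0
outcomes, interim = run_diagnosis(D, 0, lambda k: D[k][faulty_slice] == 1)
print(outcomes, interim)  # [1, 0, 1] 1
```

The returned outcome vector is what the FSR and CFSR ranking of Section 5.1 would then decode to pick the final resilient DC.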

TABLE 4
Design Parameters for AES Modules

Module                                           R    T   d   g  t_mc (ms)  Bitstream Size  Detection Scheme
32-Bit s-boxes                                  60  119   3  24         41                  Parity-based [41]
Mix Columns & Add Round Key                     55  111   3  24         39         57.9 KB
128-bit Rotate/Rcon Logics for Key Expansion    52  102   3  23         32                  DWC

Fig. 12. Execution of isolation phase on an AES module.

7 COMPARISON OF DESIGN DISJUNCTION AND MODULAR REDUNDANCY

Modular redundancy using an NMR method is the most common form of hardware redundancy to tolerate failures. NMR methods can be realized using commercially-available and academic design tools such as Xilinx TMR (XTMR) and BYU-LANL TMR (BL-TMR), respectively. NMR employs N replicas and majority voting which masks failed modules by selecting a majority output. The area and power overheads of this scheme are approximately (N − 1)-fold including overheads incurred by voting logic. A single failure in a module can render that module unusable which compromises failure recoverability besides pre-determining resource use. Failure recoverability, denoted by FR, is defined as the cumulative sum of recovery coverage for all possible combinations of fault locations. This definition can be expressed for a given fault count d as:

$$FR = \sum_{d=1}^{T} RC(d). \tag{9}$$

Let $A_m$ be the minimum resource count required to implement a single module and $m_f$ be the number of failed modules, then recovery coverage for NMR scheme denoted by $RC_{NMR}$ is computed as follows:

$$RC_{NMR}(d) = \frac{\left| \{ x \in \Gamma^{T}_{d} \ \text{s.t.}\ m_f \le \lfloor \tfrac{N-1}{2} \rfloor \} \right|}{\left| \Gamma^{T}_{d} \right|}. \tag{10}$$

For NMR systems where N = 3 and N = 5, $RC_{NMR}$ can be given as $3\,|\Gamma^{A_m}_{d}| / |\Gamma^{T}_{d}|$ and $[\,10\,|\Gamma^{2A_m}_{d}| - 15\,|\Gamma^{A_m}_{d}|\,] / |\Gamma^{T}_{d}|$, respectively. Fig. 13a compares the FR of the proposed work with that of NMR. The area overhead of design disjunction in this comparison includes the overhead of CED based on DWC. Both redundancy methods achieve a linear increase in failure recoverability as more redundant resources are added; however, design disjunction offers a higher linear increase. Designing for a higher disjunction factor d increases g which proportionately results in a higher RC as given by Eq. (7) and thus improves FR.

As depicted in Fig. 13a, due to the provision of fine-grained resource allocation and relocation by design disjunction, a higher FR compared to NMR schemes can be obtained for the same area overhead. For instance, with a similar area overhead to TMR, design disjunction achieves 83.6 percent (143.3 percent) increase in FR over TMR for d = 1 (d = 7). Similarly, design disjunction can provide a comparable FR to that of TMR using a considerably lower area overhead. Fig. 13b reflects the area efficiency of the proposed work compared to modular redundancy. Area efficiency is quantified by the ratio of FR to the total resource count T. Similar to modular redundancy methods, a diminishing return on FR occurs as more hardware resources are considered. The resultant area advantage from using design disjunction is more prominent for larger area overhead. For the lowest design setting, i.e., f = 1 and d = 1, design disjunction still enables a higher FR per area than any NMR setup included in this analysis. It is also worth noting that

Fig. 13. Area efficiency of design disjunction.
the area advantage of design disjunction can be further enhanced by using parity-based error
detection instead of DWC.

The proposed approach can be applied at the reconfigurable logic block level with a
broadened range of design parameters to meet area and power constraints while maintaining
both adequate fault isolation and recovery. The area overhead imposed by design disjunction
is roughly limited to T/R, where R includes the resources required to deploy a CED scheme.
Other components such as the embedded processor and memory controller are often present in
embedded reconfigurable systems, and thus do not incur an additional area cost. The
reliability of these components falls within the scope of embedded system reliability and
can be protected by appropriate techniques [43]. The reconfiguration structure is not
limited to ICAP. For instance, Xilinx has recently introduced the processor configuration
access port (PCAP) interface [44] for ARM-based systems to write configuration bits. Design
disjunction is realized without loss of generality by the regularity and reconfigurability
features of the FPGA device used. Since these features are ubiquitous in contemporary
reconfigurable devices, the proposed approach can be highly compatible with many FPGA
families from different vendors and other classes of reconfigurable ICs, such as complex
programmable logic devices (CPLDs).

8 CONCLUSION

Design disjunction offers a mathematically-rooted, parameterized, multi-fault isolation and
recovery technique for reconfigurable hardware fabrics. Combinatorial construction methods
for disjunction and failure ranking schemes for fault diagnosis are developed using
operational testing techniques. Experimental results for a set of benchmarks on a Xilinx
7-series FPGA have demonstrated f-diagnosability at the individual slice level with a
minimum average isolation accuracy of 96.4 percent (94.4 percent) for f = 1 (f = 2). An
algebraic-based extension was also developed to tolerate inarticulate tests and increase
isolation accuracy to any level deemed adequate for successful recovery and repair. Based on
these favorable properties and low costs, design disjunction is worthy of consideration for
autonomous resiliency in reconfigurable systems demanding high availability.

ACKNOWLEDGMENTS

This research was funded by the Ministry of Education of Saudi Arabia under scholarship
grant no. 64923.

REFERENCES

[1] E. Marcus and H. Stern, Blueprints for High Availability. New York, NY, USA: Wiley, 2003.
[2] J. Henkel, L. Bauer, J. Becker, O. Bringmann, U. Brinkschulte, S. Chakraborty, M. Engel, R. Ernst, H. Hartig, L. Hedrich, et al., “Design and architectures for dependable embedded systems,” in Proc. IEEE 9th Int. Conf. Hardware/Softw. Codes. Syst. Synthesis, Taipei, Taiwan, Oct. 2011, pp. 69–78.
[3] P. S. Ostler, M. P. Caffrey, D. S. Gibelyou, P. S. Graham, K. S. Morgan, B. H. Pratt, H. M. Quinn, and M. J. Wirthlin, “SRAM FPGA reliability analysis for harsh radiation environments,” IEEE Trans. Nucl. Sci., vol. 56, no. 6, pp. 3519–3526, Dec. 2009.
[4] C. Constantinescu, “Trends and challenges in VLSI circuit reliability,” IEEE Micro, vol. 23, no. 4, pp. 14–19, Jul./Aug. 2003.
[5] C. Bolchini, A. Miele, and C. Sandionigi, “A novel design methodology for implementing reliability-aware systems on SRAM-based FPGAs,” IEEE Trans. Comput., vol. 60, no. 12, pp. 1744–1758, Dec. 2011.
[6] C. Carmichael, “Triple module redundancy design techniques for Virtex FPGAs,” Xilinx, San Jose, CA, USA, Application Note XAPP197 (v1.0.1), Jul. 2001.
[7] C. Carmichael and C. W. Tseng, “Correcting single-event upsets in Virtex-4 FPGA configuration memory,” Xilinx, San Jose, CA, USA, Application Note XAPP1088 (v1.0), Oct. 2009.
[8] J. Heiner, B. Sellers, M. Wirthlin, and J. Kalb, “FPGA partial reconfiguration via configuration scrubbing,” in Proc. IEEE Int. Conf. Field Programmable Logic Appl., Prague, Czech Republic, Aug./Sep. 2009, pp. 99–104.
[9] S. Mitra, W.-J. Huang, N. R. Saxena, S.-Y. Yu, and E. J. McCluskey, “Reconfigurable architecture for autonomous self-repair,” IEEE Des. Test. Comput., vol. 21, no. 3, pp. 228–240, Jun. 2004.
[10] Xilinx, “Vivado design suite,” White Paper WP416 (v1.1), Jun. 2012.
[11] W. Zha, “Facilitating FPGA reconfiguration through low-level manipulation,” Ph.D. dissertation, Virginia Polytechnic Inst. State Univ., Blacksburg, VA, USA, Feb. 2014.
[12] C. Bolchini and A. Miele, “Design space exploration for the design of reliable SRAM-based FPGA systems,” in Proc. IEEE Int. Symp. Defect Fault Tolerance VLSI Syst., Boston, MA, USA, Oct. 2008, pp. 332–340.
[13] S. Chakraverty, A. Agarwal, A. Agarwal, A. Kumar, and A. Sikri, “Design space exploration for high availability drFPGA based embedded systems,” in Proc. 1st Int. Conf. Adv. Mach. Learn. Technol. Appl., Cairo, Egypt, Dec. 2012, pp. 234–243.
[14] M. Abramovici, C. Stroud, C. Hamilton, S. Wijesuriya, and V. Verma, “Using roving STARs for on-line testing and diagnosis of FPGAs in fault-tolerant applications,” in Proc. IEEE Int. Test Conf., Atlantic City, NJ, USA, Sep. 1999, pp. 973–982.
[15] X. Iturbe, K. Benkrid, C. Hong, A. Ebrahim, R. Torrego, I. Martinez, T. Arslan, and J. Perez, “R3TOS: A novel reliable reconfigurable real-time operating system for highly adaptive, efficient, and dependable computing on FPGAs,” IEEE Trans. Comput., vol. 62, no. 8, pp. 1542–1556, Aug. 2013.
[16] H. Zhang, L. Bauer, M. A. Kochte, E. Schneider, C. Braun, M. E. Imhof, H.-J. Wunderlich, and J. Henkel, “Module diversification: Fault tolerance and aging mitigation for runtime reconfigurable architectures,” in Proc. IEEE Int. Test Conf., Anaheim, CA, USA, Sep. 2013, pp. 1–10.
[17] V. Hahanov, S. Galagan, V. Olchovoy, and A. Priymak, “Algebra-logical repair method for FPGA logic blocks,” in Proc. IEEE East-West Des. Test Symp., St. Petersburg, Russia, Sep. 2010, pp. 482–487.
[18] C. A. Sharma, A. Sarvi, A. Alzahrani, and R. F. DeMara, “Self-healing reconfigurable logic using autonomous group testing,” Microprocess. Microsyst., vol. 37, no. 2, pp. 174–184, Mar. 2013.
[19] K. Zhang, R. F. DeMara, and C. A. Sharma, “Consensus-based evaluation for fault isolation and on-line evolutionary regeneration,” in Proc. 6th Int. Conf. Evolvable Syst.: From Biol. Hardware, 2005, pp. 12–24.
[20] M. B. Tahoori, “High resolution application specific fault diagnosis of FPGAs,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 10, pp. 1775–1786, Oct. 2011.
[21] A. J. Van De Goor, “Using march tests to test SRAMs,” IEEE Des. Test. Comput., vol. 10, no. 1, pp. 8–14, Mar. 1993.
[22] L. Bauer, C. Braun, M. Imhof, M. Kochte, E. Schneider, H. Zhang, J. Henkel, and H.-J. Wunderlich, “Test strategies for reliable runtime reconfigurable architectures,” IEEE Trans. Comput., vol. 62, no. 8, pp. 1494–1507, Aug. 2013.
[23] M. Renovell, P. Faure, J. M. Portal, J. Figueras, and Y. Zorian, “IS-FPGA: A new symmetric FPGA architecture with implicit scan,” in Proc. IEEE Int. Test Conf., Baltimore, MD, USA, Oct./Nov. 2001, pp. 924–931.
[24] S. Mitra and E. McCluskey, “Which concurrent error detection scheme to choose?” in Proc. IEEE Int. Test Conf., Atlantic City, NJ, USA, Oct. 2000, pp. 985–994.
[25] C. Bolchini, A. Miele, and C. Sandionigi, “Autonomous fault-tolerant systems onto SRAM-based FPGA platforms,” J. Electron. Test., vol. 29, no. 6, pp. 779–793, Nov. 2013.
[26] A. Doumar and H. Ito, “Detecting, diagnosing, and tolerating faults in SRAM-based field programmable gate arrays: A survey,” IEEE Trans. Very Large Scale Integr. Syst., vol. 11, no. 3, pp. 386–405, Jun. 2003.
[27] D. Keymeulen, R. Zebulum, Y. Jin, and A. Stoica, “Fault-tolerant evolvable hardware using field-programmable transistor arrays,” IEEE Trans. Rel., vol. 49, no. 3, pp. 305–316, Sep. 2000.
[28] E. A. Stott, N. P. Sedcole, and P. Y. K. Cheung, “Fault tolerance and reliability in field-programmable gate arrays,” IET Comput. Digital Techn., vol. 4, no. 3, pp. 196–210, May 2010.
[29] M. G. Parris, C. A. Sharma, and R. F. DeMara, “Progress in autonomous fault recovery of field-programmable gate arrays,” ACM Comput. Surveys, vol. 43, no. 4, p. 31, Oct. 2011.
[30] A. Seffrin and A. Biedermann, “Cellular-array implementations of bio-inspired self-healing systems: State of the art and future perspectives,” in Design Methodologies for Secure Embedded Systems. Berlin, Germany: Springer, 2011, vol. 78, pp. 151–170.
[31] R. Dorfman, “The detection of defective members of large populations,” Ann. Math. Statist., vol. 14, no. 4, pp. 436–440, Dec. 1943.
[32] A. B. Kahng and S. Reda, “New and improved BIST diagnosis methods from combinatorial group testing theory,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 25, no. 3, pp. 533–543, Mar. 2006.
[33] J. Ghosh-Dastidar and N. A. Touba, “A rapid and scalable diagnosis scheme for BIST environments with a large number of scan chains,” in Proc. IEEE 18th VLSI Test Symp., Montreal, PQ, Canada, Apr./May 2000, pp. 79–85.
[34] M. Cheraghchi, “Coding-theoretic methods for sparse recovery,” in Proc. IEEE 49th Annu. Allerton Conf. Commun., Control Comput., Monticello, IL, USA, Sep. 2011, pp. 909–916.
[35] A. J. Macula, “A simple construction of d-disjunct matrices with certain constant weights,” Discr. Math., vol. 162, nos. 1–3, pp. 311–312, Dec. 1996.
[36] C. L. Chan, S. Jaggi, V. Saligrama, and S. Agnihotri, “Non-adaptive group testing: Explicit bounds and novel algorithms,” in Proc. IEEE Int. Symp. Inform. Theory, Jul. 2012, pp. 1837–1841.
[37] M. Cheraghchi, A. Hormati, A. Karbasi, and M. Vetterli, “Group testing with probabilistic tests: Theory, design and application,” IEEE Trans. Inf. Theory, vol. 57, no. 10, pp. 7057–7067, Oct. 2011.
[38] E. Knill, W. J. Bruno, and D. C. Torney, “Non-adaptive group testing in the presence of errors,” Discr. Appl. Math., vol. 88, no. 1, pp. 261–290, Nov. 1998.
[39] T. Kumar and F. Lombardi, “A novel heuristic method for application-dependent testing of a SRAM-based FPGA interconnect,” IEEE Trans. Comput., vol. 62, no. 1, pp. 163–172, Jan. 2013.
[40] A. Alzahrani and R. F. DeMara, “Hypergraph-cover diversity for maximally-resilient reconfigurable systems,” in Proc. IEEE 12th Int. Conf. Embedded Softw. Syst., New York, NY, USA, Aug. 2015, pp. 1086–1092.
[41] M. Mozaffari-Kermani and A. Reyhani-Masoleh, “Concurrent structure-independent fault detection schemes for the advanced encryption standard,” IEEE Trans. Comput., vol. 59, no. 5, pp. 608–622, May 2010.
[42] A. Alzahrani and R. F. DeMara, “Process variation immunity of alternative 16nm HK/MG-based FPGA logic blocks,” in Proc. IEEE 58th Int. Midwest Symp. Circuits Syst., Fort Collins, CO, USA, Aug. 2015, pp. 1–4.
[43] H. Kopetz, Real-Time Systems: Design Principles for Distributed Embedded Applications, 2nd ed. Berlin, Germany: Springer, Apr. 2011.
[44] C. Kohn, “Partial reconfiguration of a hardware accelerator on Zynq-7000 all programmable SoC devices,” Xilinx, San Jose, CA, USA, Application Note XAPP1159 (v1.0), Jan. 2013.

Ahmad Alzahrani received the BS degree in electrical engineering from the Umm Al-Qura
University, in 2002. He received the MS degree in computer engineering from the University
of Arkansas, Fayetteville, in 2009, and the PhD degree in computer engineering from the
University of Central Florida, in 2015. His research interests include computer
architecture, fault tolerance, and adaptive reconfigurable computing. He is a member of the
IEEE.

Ronald F. DeMara received the PhD degree in computer engineering from the University of
Southern California, in 1992. Since 1993, he has been a full-time faculty member at the
University of Central Florida where he is a professor and Computer Engineering program
coordinator. His research interests are in computer architecture with emphasis on Evolvable
and Resilient Hardware, on which he has published approximately 175 articles. He has served
on the Editorial Boards of IEEE Transactions on VLSI Systems, ACM Transactions on Embedded
Systems, Journal of Circuits, Systems, and Computers, the journal Microprocessors and
Microsystems, various conference program committees, and is currently an associate editor
of IEEE Transactions on Computers. He received the Joseph M. Biedenbach Outstanding
Engineering Educator Award from the IEEE, in 2008. He is a senior member of the IEEE.
