Anany 2019
Abstract—Design disjunction is developed to offer a broad coverage, high resolution, and low overhead approach to online diagnosis
and recovery of reconfigurable fabrics. Design disjunction leverages the condensed diagnosability of T logic resources to achieve
self-recovery using partial reconfiguration in O(log T) steps. Reconfiguration is guided by the constructive property of f-disjunctness which
forms O(log T) resource groups at design-time. Resolution of f simultaneous resource faults is shown to be guaranteed when the
resource groups are mutually f-disjunct. This extends run-time fault resilience to a large resource space with certainty for up to f faults
using a decision-free resolution process that also provides a high likelihood of identifying the fault’s location to a fine granularity.
Finally, design disjunction is parameterized to accommodate the low coverage issue of functional testing for which inarticulate tests can
otherwise impair fault isolation. Experimental results for MCNC and ISCAS benchmarks on a Xilinx 7-series field programmable gate
array (FPGA) demonstrate f-diagnosability at the individual slice level with a minimum average isolation accuracy of 96.4 percent
(94.4 percent) for f = 1 (f = 2). Results have also demonstrated millisecond order recovery with a minimum increase of 83.6 percent
in fault coverage compared to N-modular redundancy (NMR) schemes. Recovery is achieved while incurring an average critical path
delay impact of only 1.49 percent and energy cost roughly comparable to conventional two-MR approaches.
Index Terms—Reconfigurable logic devices, field programmable gate arrays, autonomous fault handling, fault-tolerant systems, run-time
fault diagnosis and recovery, online test, design space exploration
1 INTRODUCTION
TABLE 1
Comparison of Design Disjunction with Related Approaches
approaches for FPGA logic using dynamic reconfiguration have relied on built-in self-test (BIST) [14], [22]. However, dedicated BIST structures including test pattern generators (TPGs) and output response analyzers (ORAs) are typically not available for FPGA platforms [22]. Modern FPGA architectures are also not entirely scan-ready. Thus, scan chains, TPGs, and ORAs are frequently implemented directly in the fabric using look-up tables (LUTs) and shift registers. As a consequence, BIST-inspired methods can increase FPGA resource requirements by up to 50 percent [23].
The BIST-based roving STARs test scheme in [14] partitions the reconfigurable fabric into tiles, and continuous online testing is carried out by roving a BISTer from one tile to another while the resources not used by the BISTer structure are dynamically reconfigured to maintain availability. Although failures are resolved at a fine resolution, data throughput must be suspended to copy state values prior to each tile movement. Resource recycling is also facilitated; however, fault isolation and recovery depend on the latency of BISTers to rove the device before encountering faulty elements. Another recent BIST-based fault-tolerant FPGA approach is illustrated by the reliable reconfigurable real-time operating system (R3TOS) [15] wherein a hardware microkernel (HWuK) provides a task scheduler, an allocator to manage FPGA resources for tile placement, and a configuration manager which converts commands issued by the scheduler and allocator into FPGA reconfiguration operations. To minimize single-point of failure exposures, HWuK components are realized by an 8-bit PicoBlaze processor occupying six block RAMs (BRAMs) and 500 configurable logic blocks (CLBs) protected with selective triple modular redundancy (TMR) and error-correcting code (ECC) bits whose resources also undergo periodic testing. The impact of BIST latency is masked by the use of hardware replication and voting.
To reduce the high complexity and cost of BIST, application-dependent BIST testing [20] focuses on the subset of resources used to maintain design functionality. Thus, exhaustive test vectors generated by a TPG and response analysis carried out by an ORA can be relaxed without continually engaging a dedicated reconfiguration controller to carry out the test. The work in [20] also demonstrates an effective application-dependent diagnosis for FPGA interconnects. Distinct test configurations are applied to modulate application LUT functionalities and study output patterns to discern which nets are faulty. These application-dependent approaches assume the resources undergoing diagnosis procedures are unavailable during diagnosis. Thus, methods which eliminate these limitations on availability are sought.
Alternative approaches that eliminate BIST area and power overheads, referred to as operational testing techniques, conduct functional tests via input data that are simultaneously used for normal throughput [19]. These techniques attain availability by relying on run-time inputs, computational redundancy, and output comparison to assess the subset of resources currently used by an application. Permanent and temporary fault monitoring for operational testing can be realized using concurrent error detection (CED) techniques based on duplication with comparison (DWC) or parity-based methods [24]. DWC that compares the Hamming distance between the outputs of two spatially redundant modules is compatible with recent multi-objective DSE approaches [25] which utilize a cost function that considers area requirements and resource utilization against overhead of reconfiguration time. In [18], another operational testing method based on adaptive group testing (AGT) for diagnosis of reconfigurable fabrics is described under a single-fault assumption. However, since the creation of test designs is adaptive based on outcomes of successive tests, the AGT method is unsuitable for high availability applications. Similar to iterative logic array (ILA) and array-based testing methods [26], most functional testing techniques are mainly used for testing a group of resources and provide no fault localization at a fine resolution. In this work, benefits of operational testing are explored with design disjunction to locate faulty resources while avoiding BIST overheads.
Other previous design-time approaches for run-time fault recovery have used genetic algorithms (GAs) [27] to evolve a pool of best-fit designs that exhibit resilience to various failures. The evolved designs are used at run-time to maintain system functionality. Although GAs can succeed
3058 IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 10, OCTOBER 2016
in finding resilient designs, the number of evolved designs requiring functional evaluation is large and, being a probabilistic process, the search does not explicitly guarantee convergence. The work in [17] presents an algebraic method for devising an optimal remapping strategy for logic blocks at row and column levels to reduce recovery latency and minimize the number of spare rows and columns required to tolerate a large combination of fault locations. Remapping by interchange of device columns and rows is still performed at run-time, which relies on an independent fault diagnosis process to locate faulty cells before identifying which resources to interchange. The consensus-based evaluation (CBE) method described in [19] generates, at design-time, a diverse pool of FPGA designs with alternative device resources. These designs are evaluated against each other using a duplex arrangement. Statistical clustering is used to identify operationally correct designs without the assumption of a golden element. The module diversity approach described in [16] provides yet another method for generating diverse designs at design-time for mitigating aging effects at run-time. The diverse designs can be deployed according to a scheduling policy that results in a steady stress distribution across resources to achieve an extended lifetime. The set of diverse designs also guarantees fault recovery under a single-fault assumption for all possible single CLB faults.
Unfortunately, none of the existing approaches demonstrate provable coverage for multiple faults nor do they allow the use of diverse designs for diagnostic tests to locate faulty resources. In this work, we describe an explicit method for generating the optimal number of DCs that guarantee recovery from multiple faults at fine granularity while providing rapid fault isolation. Broader surveys of recent techniques for fault tolerance, autonomous recovery, and self-healing of FPGA-based systems are presented in [28], [29], and [30], respectively.

3 GROUP TESTING FOR DIAGNOSIS OF RECONFIGURABLE ARCHITECTURES
If a test is used to identify f defectives among T elements, where f is unknown, then a straightforward, albeit suboptimal, procedure is to evaluate each element individually. Assuming all tests are reliable, then the testing time complexity becomes O(T). This cost can be considerably reduced by dividing the T elements into g subsets, or groups. The collective results after testing each group can be interpreted to identify the f defectives. The challenge is to sample the minimum number of groups sufficient to find the defectives. This is the basic idea behind group testing first introduced by Dorfman [31] for screening a large number of blood samples by pooling them together to reduce testing cost. Group testing has been adapted to diverse applications such as testing for manufacturing defects, DNA library screening, coding theory, software testing, and BIST-based diagnosis in digital systems [32]. Based upon how test groups are sampled, most group testing techniques can be classified into adaptive or non-adaptive categories.

3.1 Adaptive Group Testing
When using adaptive group testing, complete knowledge of how groups are sampled before testing begins is not specified. The groups are constructed iteratively based on each successive test outcome during the testing procedure. As testing progresses, the iterative sampling of groups narrows down the suspect set of faulty resources until defectives are identified. The binary search (BS) method described in [33] presents one of the simplest AGT algorithms. At the initial stage of BS, the set of scan cells to be tested, X, are considered suspect. The set X is partitioned into two groups, each of which is collectively tested using two scan chains. The BS technique is applied recursively to any erroneous group until faulty cells are singled out. A modified implementation of this algorithm was first proposed for functional testing of FPGAs in [18] under a single-fault assumption. Each test group is a set of resources which implement a functionally equivalent design. Initially, all resources in the reconfigurable container are deemed suspect. The test starts by dividing suspect resources among different functionally equivalent FPGA designs. The suspect set is narrowed down to those that implement a fault-affected design. The modified suspect set is iteratively divided and utilized by a new generation of test designs. The algorithm terminates when only a single cell remains in the suspect set, thus identifying the defective resource. The operational complexity of this algorithm depends on the maximum number of test designs allowed in every test generation. Thus, an overriding concern with AGT is the downtime needed to generate new test designs by repeatedly invoking the design flow, which is infeasible on deployed real-time embedded systems.

3.2 Non-Adaptive Group Testing
In the case of non-adaptive group testing, the sampling procedure for all groups is known a priori to the execution of tests. An intuitive way to model and describe the problem of fault isolation in FPGAs using this class of group testing techniques is through matrix algebra. The following notations are used throughout the paper:
Design matrix $\mathbf{D}_{g \times T}$ is a binary matrix indicating the subset of resources used by each of g DCs. Rows in this matrix correspond to DCs whereas columns correspond to resources. An entry $k_{i,j}$ of the D matrix is one if resource j is utilized by $DC_i$, and zero otherwise.
Health vector $\mathbf{h}_{T \times 1}$ is a binary vector of length T representing the health of the T resources, i.e., an entry $h_j$ is one if resource j is defective and zero if resource j is healthy.
Outcome vector $\mathbf{o}_{g \times 1}$ is a binary vector of length g containing the error detection outcomes of all g DCs, i.e., an entry $o_i$ is one if an erroneous outcome is detected while $DC_i$ is deployed and zero if $DC_i$ sustains correct operation.
Set $\psi(\mathbf{v})$ is the subset of elements in binary vector $\mathbf{v}$ whose entries are one.
$\omega(\mathbf{v})$ is the weight of binary vector $\mathbf{v}$, i.e., the number of elements whose entries are one.
$\Gamma^{n}_{r}$ is the set of all r-combinations of n elements.
The Outcome Vector, $\mathbf{o}_{g \times 1}$, can be given as follows:
\[ \mathbf{o}_{g \times 1} = \mathbf{D}_{g \times T} \, \mathbf{h}_{T \times 1}. \tag{1} \]
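As a minimal sketch of how Eq. (1) can be evaluated and then inverted, the following Python fragment assumes that the product of D and h is taken in Boolean arithmetic (an OR of ANDs), which the text does not state explicitly but which keeps o binary. The 4 x 6 design matrix, the single fault location, and the helper names outcome_vector and decode_suspects are hypothetical and chosen only for illustration. The decoder is the generic non-adaptive group-testing rule that clears every resource used by a passing DC, not necessarily the decoding procedure used in this work.

```python
def outcome_vector(D, h):
    """Boolean form of Eq. (1): o_i = 1 if DC i uses at least one defective resource."""
    return [int(any(used and bad for used, bad in zip(row, h))) for row in D]


def decode_suspects(D, o):
    """Clear every resource used by a passing DC; return the indices still suspect."""
    cleared = [False] * len(D[0])
    for row, outcome in zip(D, o):
        if outcome == 0:                      # this DC sustained correct operation
            for j, used in enumerate(row):
                if used:
                    cleared[j] = True
    return [j for j, ok in enumerate(cleared) if not ok]


# Hypothetical design matrix: g = 4 DCs (rows) over T = 6 resources (columns).
# No column is contained in another, so any single fault is uniquely identifiable.
D = [
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1],
]
h = [0, 0, 0, 1, 0, 0]   # hypothetical health vector: resource 3 (0-based) is defective

o = outcome_vector(D, h)
print("outcome vector:", o)
print("suspect resources:", decode_suspects(D, o))
```

In this toy instance only the two DCs that use resource 3 report an error, and the two passing DCs together clear every other resource, so the suspect set collapses to the single defective resource without any adaptive re-testing.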
ALZAHRANI AND DEMARA: FAST ONLINE DIAGNOSIS AND RECOVERY OF RECONFIGURABLE LOGIC FABRICS USING DESIGN DISJUNCTION 3059
\[ \mathbf{h} = (\, 0 \;\; 0 \;\; 0 \;\; 1 \;\; 0 \;\; 0 \;\; 0 \;\; 0 \;\; 1 \;\; 0 \,)^{T}. \tag{4} \]
Fig. 3. Required number of DCs versus resource count for typical values of f (d = 1).
numerical simulations [37], [38]. In Section 3.2, a discussion was provided for the classical requirement to obtain f-disjunction which states that d must be greater than or equal to 1. As d increases beyond 1, the effect of inarticulate tests on the decoding procedure can be masked. In the context of operational testing of reconfigurable hardware, increasing the disjunction factor d results in an increased number of alternative DCs. Since resources are sensitized in a diverse way as the device is reconfigured to different DCs, diversity among DCs enables a better collective diagnostic coverage

\[ p_{dc\_nf}(d) = \prod_{k=1}^{d} \left( 1 - \frac{R}{T - k + 1} \right), \quad d \ge 1. \tag{6} \]
Thus, recovery coverage (RC), defined by the probability of recovery for g DCs, can be computed for any accumulated fault count d as:
\[ RC(d) = 1 - \left( 1 - p_{dc\_nf}(d) \right)^{g}, \quad d \ge 1. \tag{7} \]
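Eqs. (6) and (7) can be evaluated numerically in a few lines of Python. The sketch below is illustrative only: it assumes the denominator in Eq. (6) reads T - k + 1, the function names are hypothetical, T = 100 and R = 30 are borrowed from the caption of Fig. 7, and g = 15 is an arbitrary choice rather than a value reported for that figure. Under that reading, the product in Eq. (6) equals the ratio C(T - R, d) / C(T, d), i.e., the fraction of d-fault placements that miss all R resources of one DC, which the code uses as a cross-check.

```python
from math import comb, prod


def p_dc_nf(d, T, R):
    """Eq. (6): probability that one DC using R of T resources avoids all d faults."""
    return prod(1 - R / (T - k + 1) for k in range(1, d + 1))


def recovery_coverage(d, T, R, g):
    """Eq. (7): probability that at least one of g DCs avoids all d faults."""
    return 1 - (1 - p_dc_nf(d, T, R)) ** g


T, R, g = 100, 30, 15   # T and R from the Fig. 7 caption; g is a hypothetical value

for d in (1, 2, 3, 5, 10):
    closed_form = comb(T - R, d) / comb(T, d)          # combinatorial cross-check
    assert abs(p_dc_nf(d, T, R) - closed_form) < 1e-12
    print(f"d={d:2d}  p_dc_nf={p_dc_nf(d, T, R):.4f}  "
          f"RC={recovery_coverage(d, T, R, g):.4f}")
```

The loop makes the trade-off in Eq. (7) visible: the per-DC evasion probability falls quickly as d grows, while raising g pushes the recovery coverage back toward one.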
Fig. 7. Recovery coverage of disjunct DCs (T = 100, R = 30, d = 1).
In order to examine the recovery behavior of the proposed method, three sets of f-disjunct designs for f = 1, 2, and 3 were tested against all possible sets of fault locations $\Gamma^{T}_{d}$ for varying accumulated fault count d. Fig. 7 compares simulation results against our model given by Eq. (7). Recovery coverage on the left vertical axis also indicates the proportion of $\Gamma^{T}_{d}$ combinations of defective(s) that were successfully evaded by at least one DC. All three disjunct sets exhibit high fault resilience for fault count d larger than f. A target recovery rate can be met by choosing the appropriate hardware utilization as indicated in Eq. (6). For practical considerations, the optimal number of DCs for recovery during the system lifetime can be generated at design-time and stored in an off-chip flash memory. The data in the external flash memory can be protected using hardware redundancy or error correction schemes in addition to functional verification by CED which is resident on the FPGA.

5.4 Incidental Disjunction for Interconnect Fault Tolerance
Contemporary reconfigurable devices utilize hundreds of thousands of routing points. For instance, Xilinx 7-series FPGAs fabricated in a 28 nm process allow over 3,500 programmable interconnect points (PIPs) to be defined in each

6 EVALUATION
6.1 Evaluation Setup
The proposed work is initially evaluated on a set of MCNC and ISCAS benchmarks through hardware simulations to show its applicability to a variety of applications. A modularized AES128 encryption core is selected as a realistic target application for the hardware prototype. The actual hardware demonstration is performed on the commercial Xilinx KC705 FPGA evaluation board. The KC705 board features a 28 nm Kintex-7 FPGA, 1 GB DDR3 memory, 128 MB linear flash memory, and a joint test action group (JTAG) interface. For hardware simulation, a software-based CED scheme is utilized to detect failures during simulation. Parity-based and DWC error detection methods are adopted in the hardware prototype. For all case studies, Xilinx 7-series FPGAs using Xilinx design toolsets are used to generate disjunct DCs.
The design flow for the evaluation framework is depicted in Fig. 8. The flow starts from a conventional design in a hardware description language using Xilinx’s ISE synthesis tool. The synthesized netlists for the target application are imported to Xilinx’s PlanAhead to generate the physical implementation of all disjunct DCs. To enable partial reconfiguration support in the PlanAhead tool, a reconfigurable partition (RP) must be floorplanned such that it contains the T resources necessary to realize the disjunct DCs. The RP is interfaced with the static region (SR) outside the RP through proxy LUTs. All disjunct DCs must use the same proxy logic for the target application’s input and output ports, which is possible by locking all port sets with the LOC constraint. Each DC is defined as a distinct reconfigurable module
TABLE 2
Isolation Accuracy Results (d = 1)

                   f = 1                                       f = 2
                   Isolation Accuracy (%)                      Isolation Accuracy (%)
Benchmark   R      T     g    t_mc    μ       95% CI           T     g    t_mc    μ       95% CI
Circuit                       (ms)            lower   upper               (s)             lower   upper
alu4        73     144   15   41      96.86   96.07   97.65    198   41   12.74   95.78   93.89   97.67
c880        16     30    10   7       95.80   93.85   97.75    45    25   0.057   95.56   93.54   97.57
misex3      103    198   15   98      91.73   89.28   94.18    286   44   51.7    88.16   84.34   91.99
exp5        22     40    11   9       97.17   96.28   98.07    66    29   0.161   93.42   90.19   96.64
vda         43     84    14   13      98.32   97.15   99.50    119   35   1.97    97.13   95.12   99.15
c6288       139    256   15   211     99.14   98.53   99.75    390   48   174.7   97.01   94.69   99.33
seq         132    252   15   205     91.71   89.69   93.74    385   47   170.3   89.90   86.49   93.32
apex4       70     136   14   31      98.56   97.75   99.37    204   41   14.7    97.40   95.87   98.94
des         146    275   16   262     97.31   96.26   98.35    391   48   179.8   92.67   89.55   95.79
c3540       58     112   14   21      97.66   96.31   99.01    162   38   5.97    96.67   95.16   98.19
average     –      –     –    89.8    96.43   95.11   97.74    –     –    61.12   94.37   91.88   96.86
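The average row of Table 2 can be cross-checked by recomputing the column means of the per-benchmark mean isolation accuracies. The snippet below is a small arithmetic check using values transcribed from the table; the variable names are illustrative.

```python
# Mean isolation accuracy (%) per benchmark, in table order (alu4 ... c3540).
mu_f1 = [96.86, 95.80, 91.73, 97.17, 98.32, 99.14, 91.71, 98.56, 97.31, 97.66]
mu_f2 = [95.78, 95.56, 88.16, 93.42, 97.13, 97.01, 89.90, 97.40, 92.67, 96.67]

avg_f1 = sum(mu_f1) / len(mu_f1)   # average over the ten benchmarks, f = 1
avg_f2 = sum(mu_f2) / len(mu_f2)   # average over the ten benchmarks, f = 2

print(f"f=1: {avg_f1:.2f}%  f=2: {avg_f2:.2f}%")
```

Both means agree with the 96.43 and 94.37 percent entries of the average row to two decimal places.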
average isolation accuracy over all benchmarks for f = 1 (f = 2) is 96.4 percent (94.4 percent). Although the obtained isolation accuracy results are still promising, it is evident that design disjunction for d > 1 is needed to overcome the impact of low test coverage. Test coverage also depends on the quality of input test patterns; a higher isolation accuracy can be achieved if specialized high-coverage test patterns generated by conventional ATPG tools at design-time are used at run-time.
Design disjunction for d > 1 is also evaluated to demonstrate feasibility to reach optimal fault isolation under inarticulate testing. Table 3 shows how design disjunction for a moderate increase in disjunction factor d results in a greater than 99 percent isolation accuracy for all selected benchmarks. The three selected benchmarks include the misex3 benchmark which gives the worst combined isolation accuracy for f = 1 and f = 2 using d = 1. Nevertheless, isolation accuracy exceeding 99 percent given by the upper 95 percent CI is reached using d = 5. A diminishing return in improving isolation accuracy is also observed as d increases. Thus, the range 1 ≤ d ≤ 11 can be chosen for an optimal tradeoff between isolation accuracy and g. A linear dependency of g on d is also observed that is consistent with the analysis provided in Section 5.
Fig. 10 reports fault recovery results for the exhaustive fault coverage evaluation on logic and PIPs for f = 1 and d = 1. The design parameters for these benchmarks are similar to those listed in Table 2. It is evident that design disjunction allows the ratio of shared PIPs among DCs to be much lower than that of logic resources. This is attributed to the PAR mechanism in the FPGA tool and its reaction to the diverse logic realizations. Also, it translates into an increase in the likelihood of finding at least one DC that avoids all faulty resources as confirmed here for logic slices and PIPs.
To observe the impact of design disjunction on application performance, the timing slacks along critical paths of all DCs are compared to the total slack of the baseline design for each benchmark. The baseline design is the conventional physical implementation of an application inside its dedicated RP without resource constraints. For a typical implementation, PAR algorithms search for the best placement and routing to meet timing constraints. Total slack s is given by post-PAR timing reports as follows:
\[ s = t_{target} - t_{total} = t_{target} - [\, t_{cp} - t_{cps} + t_{cu} \,], \tag{8} \]
where $t_{target}$ is the target clock period, $t_{total}$ is the total delay, $t_{cp}$ is the critical path delay, $t_{cps}$ is the clock path skew, and $t_{cu}$ is the clock uncertainty. $t_{target}$ is set such that the total slack of the baseline design is 2 ns. Figure 11 shows s and $t_{cp}$ data for each benchmark. The average increase in $t_{cp}$ compared to the baseline design is 1.49 percent and the average decrease in the ratio of the total slack to the total delay is only 1.78 percent. It is also observed that the top-performing DC can be slightly
TABLE 3
Isolation Accuracy versus d for Selected Benchmarks (f = 1)
TABLE 4
Design Parameters for AES Modules
7 COMPARISON OF DESIGN DISJUNCTION AND MODULAR REDUNDANCY
Modular redundancy using an NMR method is the most common form of hardware redundancy to tolerate failures. NMR methods can be realized using commercially-available and academic design tools such as Xilinx TMR (XTMR) and BYU-LANL TMR (BL-TMR), respectively. NMR employs N replicas and majority voting which masks failed modules by selecting a majority output. The area and power overheads of this scheme are approximately (N - 1)-fold including overheads incurred by voting logic. A single failure in a module can render that module unusable which compromises failure recoverability besides pre-determining resource use. Failure recoverability, denoted by FR, is defined as the cumulative sum of recovery coverage for all possible combinations of fault locations. This definition can be expressed for a given fault count d as:
\[ FR = \sum_{d=1}^{T} RC(d). \tag{9} \]
As depicted in Fig. 13a, due to the provision of fine-grained resource allocation and relocation by design disjunction, a higher FR compared to NMR schemes can be obtained for the same area overhead. For instance, with a similar area overhead to TMR, design disjunction achieves 83.6 percent (143.3 percent) increase in FR over TMR for d = 1 (d = 7). Similarly, design disjunction can provide a comparable FR to that of TMR using a considerably lower area overhead. Fig. 13b reflects the area efficiency of the proposed work compared to modular redundancy. Area efficiency is quantified by the ratio of FR to the total resource count T. Similar to modular redundancy methods, a diminishing return on FR occurs as more hardware resources are considered. The resultant area advantage from using design disjunction is more prominent for larger area overhead. For the lowest design setting, i.e., f = 1 and d = 1, design disjunction still enables a higher FR per area than any NMR setup included in this analysis. It is also worth noting that
the area advantage of design disjunction can be further enhanced by using parity-based error detection instead of DWC.
The proposed approach can be applied at the reconfigurable logic block level with a broadened range of design parameters to meet area and power constraints while maintaining both adequate fault isolation and recovery. The area overhead imposed by design disjunction is roughly limited to T/R, where R includes the resources required to deploy a CED scheme. Other components such as the embedded processor and memory controller are often present in embedded reconfigurable systems, and thus do not incur an additional area cost. The reliability of these components falls within the scope of embedded system reliability and can be protected by appropriate techniques [43]. The reconfiguration structure is not limited to ICAP. For instance, Xilinx has recently introduced the processor configuration access port (PCAP) interface [44] for ARM-based systems to write configuration bits. Design disjunction is realized without loss of generality by the regularity and reconfigurability features of the FPGA device used. Since these features are ubiquitous in contemporary reconfigurable devices, the proposed approach can be highly compatible with many FPGA families from different vendors and other classes of reconfigurable ICs, such as complex programmable logic devices (CPLDs).

8 CONCLUSION
Design disjunction offers a mathematically-rooted, parameterized, multi-fault isolation and recovery technique for reconfigurable hardware fabrics. Combinatorial construction methods for disjunction and failure ranking schemes for fault diagnosis are developed using operational testing techniques. Experimental results for a set of benchmarks on a Xilinx 7-series FPGA have demonstrated f-diagnosability at the individual slice level with a minimum average isolation accuracy of 96.4 percent (94.4 percent) for f = 1 (f = 2). An algebraic-based extension was also developed to tolerate inarticulate tests and increase isolation accuracy to any level deemed adequate for successful recovery and repair. Based on these favorable properties and low costs, design disjunction is worthy of consideration for autonomous resiliency in reconfigurable systems demanding high availability.

ACKNOWLEDGMENTS
This research was funded by the Ministry of Education of Saudi Arabia under scholarship grant no. 64923.

REFERENCES
[1] E. Marcus and H. Stern, Blueprints for High Availability. New York, NY, USA: Wiley, 2003.
[2] J. Henkel, L. Bauer, J. Becker, O. Bringmann, U. Brinkschulte, S. Chakraborty, M. Engel, R. Ernst, H. Hartig, L. Hedrich, et al., “Design and architectures for dependable embedded systems,” in Proc. IEEE 9th Int. Conf. Hardware/Softw. Codes. Syst. Synthesis, Taipei, Taiwan, Oct. 2011, pp. 69–78.
[3] P. S. Ostler, M. P. Caffrey, D. S. Gibelyou, P. S. Graham, K. S. Morgan, B. H. Pratt, H. M. Quinn, and M. J. Wirthlin, “SRAM FPGA reliability analysis for harsh radiation environments,” IEEE Trans. Nucl. Sci., vol. 56, no. 6, pp. 3519–3526, Dec. 2009.
[4] C. Constantinescu, “Trends and challenges in VLSI circuit reliability,” IEEE Micro, vol. 23, no. 4, pp. 14–19, Jul./Aug. 2003.
[5] C. Bolchini, A. Miele, and C. Sandionigi, “A novel design methodology for implementing reliability-aware systems on SRAM-based FPGAs,” IEEE Trans. Comput., vol. 60, no. 12, pp. 1744–1758, Dec. 2011.
[6] C. Carmichael, “Triple module redundancy design techniques for virtex FPGAs,” Xilinx, San Jose, CA, USA, Application Note XAPP197(v1.0.1), Jul. 2001.
[7] C. Carmichael and C. W. Tseng, “Correcting single-event upsets in Virtex-4 FPGA configuration memory,” Xilinx, San Jose, CA, USA, Application Note XAPP1088(v1.0), Oct. 2009.
[8] J. Heiner, B. Sellers, M. Wirthlin, and J. Kalb, “FPGA partial reconfiguration via configuration scrubbing,” in Proc. IEEE Int. Conf. Field Programmable Logic Appl., Prague, Czech Republic, Aug./Sep. 2009, pp. 99–104.
[9] S. Mitra, W.-J. Huang, N. R. Saxena, S.-Y. Yu, and E. J. McCluskey, “Reconfigurable architecture for autonomous self-repair,” IEEE Des. Test. Comput., vol. 21, no. 3, pp. 228–240, Jun. 2004.
[10] Xilinx, “Vivado design suite,” White Paper WP416 (v1.1), Jun. 2012.
[11] W. Zha, “Facilitating FPGA reconfiguration through low-level manipulation,” Ph.D. dissertation, Virginia Polytechnic Inst. State Univ., Blacksburg, VA, USA, Feb. 2014.
[12] C. Bolchini and A. Miele, “Design space exploration for the design of reliable SRAM-based FPGA systems,” in Proc. IEEE Int. Symp. Defect Fault Tolerance VLSI Syst., Boston, MA, USA, Oct. 2008, pp. 332–340.
[13] S. Chakraverty, A. Agarwal, A. Agarwal, A. Kumar, and A. Sikri, “Design space exploration for high availability drFPGA based embedded systems,” in Proc. 1st Int. Conf. Adv. Mach. Learn. Technol. Appl., Cairo, Egypt, Dec. 2012, pp. 234–243.
[14] M. Abramovici, C. Strond, C. Hamilton, S. Wijesuriya, and V. Verma, “Using roving STARs for on-line testing and diagnosis of FPGAs in fault-tolerant applications,” in Proc. IEEE Int. Test Conf., Atlantic City, NJ, USA, Sep. 1999, pp. 973–982.
[15] X. Iturbe, K. Benkrid, C. Hong, A. Ebrahim, R. Torrego, I. Martinez, T. Arslan, and J. Perez, “R3TOS: A novel reliable reconfigurable real-time operating system for highly adaptive, efficient, and dependable computing on FPGAs,” IEEE Trans. Comput., vol. 62, no. 8, pp. 1542–1556, Aug. 2013.
[16] H. Zhang, L. Bauer, M. A. Kochte, E. Schneider, C. Braun, M. E. Imhof, H.-J. Wunderlich, and J. Henkel, “Module diversification: Fault tolerance and aging mitigation for runtime reconfigurable architectures,” in Proc. IEEE Int. Test Conf., Anaheim, CA, USA, Sep. 2013, pp. 1–10.
[17] V. Hahanov, S. Galagan, V. Olchovoy, and A. Priymak, “Algebra-logical repair method for FPGA logic blocks,” in Proc. IEEE East-West Des. Test Symp., St. Petersburg, Russia, Sep. 2010, pp. 482–487.
[18] C. A. Sharma, A. Sarvi, A. Alzahrani, and R. F. DeMara, “Self-healing reconfigurable logic using autonomous group testing,” Microprocess. Microsyst., vol. 37, no. 2, pp. 174–184, Mar. 2013.
[19] K. Zhang, R. F. DeMara, and C. A. Sharma, “Consensus-based evaluation for fault isolation and on-line evolutionary regeneration,” in Proc. 6th Int. Conf. Evolvable Syst.: From Biol. Hardware, 2005, pp. 12–24.
[20] M. B. Tahoori, “High resolution application specific fault diagnosis of FPGAs,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 10, pp. 1775–1786, Oct. 2011.
[21] A. J. Van De Goor, “Using march tests to test SRAMs,” IEEE Des. Test. Comput., vol. 10, no. 1, pp. 8–14, Mar. 1993.
[22] L. Bauer, C. Braun, M. Imhof, M. Kochte, E. Schneider, H. Zhang, J. Henkel, and H.-J. Wunderlich, “Test strategies for reliable runtime reconfigurable architectures,” IEEE Trans. Comput., vol. 62, no. 8, pp. 1494–1507, Aug. 2013.
[23] M. Renovell, P. Faure, J. M. Portal, J. Figueras, and Y. Zorian, “IS-FPGA: A new symmetric FPGA architecture with implicit scan,” in Proc. IEEE Int. Test Conf., Baltimore, MD, USA, Oct./Nov. 2001, pp. 924–931.
[24] S. Mitra and E. McCluskey, “Which concurrent error detection scheme to choose?” in Proc. IEEE Int. Test Conf., Atlantic City, NJ, USA, Oct. 2000, pp. 985–994.
[25] C. Bolchini, A. Miele, and C. Sandionigi, “Autonomous fault-tolerant systems onto SRAM-based FPGA platforms,” J. Electron. Test., vol. 29, no. 6, pp. 779–793, Nov. 2013.
[26] A. Doumar and H. Ito, “Detecting, diagnosing, and tolerating faults in SRAM-based field programmable gate arrays: A survey,” IEEE Trans. Very Large Scale Integr. Syst., vol. 11, no. 3, pp. 386–405, Jun. 2003.
[27] D. Keymeulen, R. Zebulum, Y. Jin, and A. Stoica, “Fault-tolerant evolvable hardware using field-programmable transistor arrays,” IEEE Trans. Rel., vol. 49, no. 3, pp. 305–316, Sep. 2000.
[28] E. A. Stott, N. P. Sedcole, and P. Y. K. Cheung, “Fault tolerance and reliability in field-programmable gate arrays,” IET Comput. Digital Techn., vol. 4, no. 3, pp. 196–210, May 2010.
[29] M. G. Parris, C. A. Sharma, and R. F. DeMara, “Progress in autonomous fault recovery of field-programmable gate arrays,” ACM Comput. Surveys, vol. 43, no. 4, p. 31, Oct. 2011.
[30] A. Seffrin and A. Biedermann, “Cellular-array implementations of bio-inspired self-healing systems: State of the art and future perspectives,” in Design Methodologies for Secure Embedded Systems. Berlin, Germany: Springer, 2011, vol. 78, pp. 151–170.
[31] R. Dorfman, “The detection of defective members of large populations,” Ann. Math. Statist., vol. 14, no. 4, pp. 436–440, Dec. 1943.
[32] A. B. Kahng and S. Reda, “New and improved BIST diagnosis methods from combinatorial group testing theory,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 25, no. 3, pp. 533–543, Mar. 2006.
[33] J. Ghosh-Dastidar and N. A. Touba, “A rapid and scalable diagnosis scheme for BIST environments with a large number of scan chains,” in Proc. IEEE 18th VLSI Test Symp., Montreal, PQ, Canada, Apr./May 2000, pp. 79–85.
[34] M. Cheraghchi, “Coding-theoretic methods for sparse recovery,” in Proc. IEEE 49th Annu. Allerton Conf. Commun., Control Comput., Monticello, IL, USA, Sep. 2011, pp. 909–916.
[35] A. J. Macula, “A simple construction of d-disjunct matrices with certain constant weights,” Discr. Math., vol. 162, nos. 1–3, pp. 311–312, Dec. 1996.
[36] C. L. Chan, S. Jaggi, V. Saligrama, and S. Agnihotri, “Non-adaptive group testing: Explicit bounds and novel algorithms,” in Proc. IEEE Int. Symp. Inform. Theory, Jul. 2012, pp. 1837–1841.
[37] M. Cheraghchi, A. Hormati, A. Karbasi, and M. Vetterli, “Group testing with probabilistic tests: Theory, design and application,” IEEE Trans. Inf. Theory, vol. 57, no. 10, pp. 7057–7067, Oct. 2011.
[38] E. Knill, W. J. Bruno, and D. C. Torney, “Non-adaptive group testing in the presence of errors,” Discr. Appl. Math., vol. 88, no. 1, pp. 261–290, Nov. 1998.
[39] T. Kumar and F. Lombardi, “A novel heuristic method for application-dependent testing of a SRAM-based FPGA interconnect,” IEEE Trans. Comput., vol. 62, no. 1, pp. 163–172, Jan. 2013.
[40] A. Alzahrani and R. F. DeMara, “Hypergraph-cover diversity for maximally-resilient reconfigurable systems,” in Proc. IEEE 12th Int. Conf. Embedded Softw. Syst., New York, NY, USA, Aug. 2015, pp. 1086–1092.
[41] M. Mozaffari-Kermani and A. Reyhani-Masoleh, “Concurrent structure-independent fault detection schemes for the advanced encryption standard,” IEEE Trans. Comput., vol. 59, no. 5, pp. 608–622, May 2010.
[42] A. Alzahrani and R. F. DeMara, “Process variation immunity of alternative 16nm HK/MG-based FPGA logic blocks,” in Proc. IEEE 58th Int. Midwest Symp. Circuits Syst., Fort Collins, CO, USA, Aug. 2015, pp. 1–4.
[43] H. Kopetz, Real-Time Systems: Design Principles for Distributed Embedded Applications, 2nd ed. Berlin, Germany: Springer, Apr. 2011.
[44] C. Kohn, “Partial reconfiguration of a hardware accelerator on Zynq-7000 all programmable SoC devices,” Xilinx, San Jose, CA, USA, Application Note XAPP1088(v1.0), Jan. 2013.

Ahmad Alzahrani received the BS degree in electrical engineering from the Umm Al-Qura University, in 2002. He received the MS degree in computer engineering from the University of Arkansas, Fayetteville, in 2009, and the PhD degree in computer engineering from the University of Central Florida, in 2015. His research interests include computer architecture, fault tolerance, and adaptive reconfigurable computing. He is a member of the IEEE.

Ronald F. DeMara received the PhD degree in computer engineering from the University of Southern California, in 1992. Since 1993, he has been a full-time faculty member at the University of Central Florida where he is a professor and Computer Engineering program coordinator. His research interests are in computer architecture with emphasis on Evolvable and Resilient Hardware, on which he has published approximately 175 articles. He has served on the Editorial Boards of IEEE Transactions on VLSI Systems, ACM Transactions on Embedded Systems, Journal of Circuits, Systems, and Computers, the journal Microprocessors and Microsystems, various conference program committees, and is currently an associate editor of IEEE Transactions on Computers. He received the Joseph M. Biedenbach Outstanding Engineering Educator Award from the IEEE, in 2008. He is a senior member of the IEEE.