A Dual-Split 6T SRAM-Based
Computing-in-Memory Unit-Macro With
Fully Parallel Product-Sum Operation for
Binarized DNN Edge Processors
Xin Si, Win-San Khwa, Jia-Jing Chen, Jia-Fang Li, Xiaoyu Sun, Student Member, IEEE, Rui Liu, Student Member, IEEE, Shimeng Yu, Senior Member, IEEE, Hiroyuki Yamauchi, Qiang Li, Senior Member, IEEE, and Meng-Fan Chang, Fellow, IEEE
Abstract— Computing-in-memory (CIM) is a promising approach to reduce the latency and improve the energy efficiency of deep neural network (DNN) artificial intelligence (AI) edge processors. However, SRAM-based CIM (SRAM-CIM) faces practical challenges in terms of area overhead, performance, energy efficiency, and yield against variations in data patterns and transistor performance. This paper employed a circuit-system co-design methodology to develop a SRAM-CIM unit-macro for a binary-based fully connected neural network (FCNN) layer of DNN AI edge processors. The proposed SRAM-CIM unit-macro supports two binarized neural network models: an XNOR neural network (XNORNN) and a modified binary neural network (MBNN). To achieve compact area, fast access time, robust operation, and high energy efficiency, our proposed SRAM-CIM uses a split-wordline compact-rule 6T SRAM and circuit techniques, including a dynamic input-aware reference generation (DIARG) scheme, an algorithm-dependent asymmetric control (ADAC) scheme, a write disturb-free (WDF) scheme, and a common-mode-insensitive small-offset voltage-mode sensing amplifier (CMI-VSA). A fabricated 65-nm 4-Kb SRAM-CIM unit-macro achieved 2.4- and 2.3-ns product-sum access times for a FCNN layer using XNORNN and MBNN, respectively. The measured maximum energy efficiency reached 30.49 TOPS/W for XNORNN and 55.8 TOPS/W for MBNN modes.

Index Terms— Random access memory, computing-in-memory, binarized DNN edge processors, artificial intelligence.

Manuscript received November 18, 2018; revised March 14, 2019 and May 19, 2019; accepted July 1, 2019. This work was supported in part by the Taiwan Semiconductor Research Institute (TSRI), in part by the Taiwan Semiconductor Manufacturing Company-Joint Development Program (TSMC-JDP), in part by the MediaTek-Joint Development Program (MTK-JDP), and in part by the Ministry of Science and Technology (MOST) of Taiwan. This paper was recommended by Associate Editor Y. Ha. (Corresponding author: Meng-Fan Chang.)

X. Si is with the Institute of Integrated Circuits and Systems, University of Electronic Science and Technology of China (UESTC), Chengdu 610054, China, and also with the Department of Electrical Engineering, National Tsing Hua University (NTHU), Hsinchu 30013, Taiwan (e-mail: [email protected]).

W.-S. Khwa was with the Department of Electrical Engineering, National Tsing Hua University (NTHU), Hsinchu 30013, Taiwan. He is now with Taiwan Semiconductor Manufacturing Company (TSMC), Hsinchu 30078, Taiwan.

J.-J. Chen, J.-F. Li, and M.-F. Chang are with the Department of Electrical Engineering, National Tsing Hua University (NTHU), Hsinchu 30013, Taiwan (e-mail: [email protected]).

X. Sun and S. Yu are with the Georgia Institute of Technology, Atlanta, GA 30332 USA.

R. Liu is with Synopsys, San Francisco, CA 94107 USA.

H. Yamauchi is with the Fukuoka Institute of Technology, Fukuoka 811-0295, Japan.

Q. Li is with the Institute of Integrated Circuits and Systems, University of Electronic Science and Technology of China (UESTC), Chengdu 610054, China.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSI.2019.2928043

I. INTRODUCTION

DEEP neural networks (DNNs) are commonly used in artificial intelligence (AI) processors to achieve high-accuracy recognition and prediction functions for a variety of applications [1]–[4]. As shown in Fig. 1(a), DNNs typically comprise a series of convolution (CNN) and fully connected (FC) layers, together with a number of non-linear layers, such as pooling layers and rectified linear unit (ReLU) activation layers. In DNN processors [5]–[10], product-sum (PS) operations dominate the computational workload in both the convolution and fully connected layers. These neural network layers are computationally intensive and require the movement and storage of enormous volumes of data. Thus, the application of DNN processors to AI edge devices usually requires fast inference operations, ultra-low energy consumption, low cost, and sufficient accuracy. The reduced bit precision and memory cost of binary DNNs [11]–[16] make it possible to reduce the computational and hardware costs of AI edge devices; however, conventional all-digital solutions have been unable to resolve the memory bottleneck. In conventional all-digital solutions, process engine (PE) arrays typically exploit parallelized computation; however, they suffer from inefficient single-row SRAM access to weights, and larger SRAM arrays are required to store huge amounts of intermediate data, as shown in Fig. 1(b). Furthermore, the energy required to access data from memory can far exceed the energy required for computing operations using that data [17].

Computing-in-memory (CIM) or process-in-memory (PIM) methods have been proposed to improve computational efficiency by enabling parallel computing within the memory
1549-8328 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
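The product-sum at the heart of these binarized layers has a compact software model. The following minimal Python sketch (illustrative only, not the chip's analog behavior) shows how a ±1 product-sum reduces to an XNOR followed by a popcount once +1/−1 values are encoded as 1/0:

```python
# Minimal software model of a binarized product-sum (illustrative only, not
# the chip's analog behavior).  With inputs and weights in {+1, -1}, encoding
# +1 -> 1 and -1 -> 0 turns each multiply into an XNOR, so the product-sum
# reduces to an XNOR followed by a popcount.

def product_sum(inputs, weights):
    """Reference definition: PSR = sum(IN[i] * W[i]), with IN, W in {+1, -1}."""
    return sum(i * w for i, w in zip(inputs, weights))

def xnor_popcount(in_bits, w_bits):
    """Same result on 0/1 encodings: each XNOR match contributes +1,
    each mismatch contributes -1."""
    n = len(in_bits)
    matches = sum(1 for a, b in zip(in_bits, w_bits) if a == b)
    return 2 * matches - n

inputs  = [+1, -1, +1, +1]
weights = [+1, +1, -1, +1]
in_bits = [1 if x > 0 else 0 for x in inputs]
w_bits  = [1 if x > 0 else 0 for x in weights]

# Both formulations agree (here PSR = 0).
assert product_sum(inputs, weights) == xnor_popcount(in_bits, w_bits) == 0
```

A SRAM-CIM macro evaluates this sum for all columns in parallel on the bitlines, rather than iterating over rows as the loop above does.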
Fig. 1. (a) Typical deep neural network structure. (b) Concept of SRAM-CIM for AI edge processors.

II. BACKGROUND: BINARIZED NEURAL NETWORK AND DUAL-SPLIT 6T SRAM CELL
SI et al.: DUAL-SPLIT 6T SRAM-BASED CIM UNIT-MACRO WITH FULLY PARALLEL PRODUCT-SUM OPERATION 3
TABLE I
COMPARISON TABLE OF VARIOUS BINARIZED NEURAL NETWORK ALGORITHMS
TABLE III
LOOK-UP TABLE (LUT) FOR REFERENCE TUNING OF FCN-LAST
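The reference tuning this LUT drives can be modeled with simple arithmetic, following this section's derivation: RC1 centers the reference at PSR = 0 for any input count n, and each enabled reference WL-tuner (RWLT) shifts the reference by a half step. A sketch using the worked example from the text (the full Table III contents are not reproduced here):

```python
# Sketch of the DIARG reference arithmetic for FCN-last.  RC1 centers the
# reference at PSR = 0 for any number of activated inputs n; each reference
# WL-tuner (RWLT) then shifts the reference by a half step through the shared
# 2x header: EN = '0' adds -1/2 (one IMC-C), EN = '1' adds +1/2 (one IMC-D).
# The mapping from n to RWLT settings lives in the LUT; the values used below
# come from the worked example in the text, not the full Table III.

def reference_psr(n_rwlt_en0, n_rwlt_en1):
    """Reference PSR after fine-tuning by the enabled RWLTs."""
    return 0 + n_rwlt_en0 * (-0.5) + n_rwlt_en1 * (+0.5)

# Worked example: for n = 10 inputs, two RWLTs set to '0' move the
# reference from PSR = 0 to PSR = 0 + 2 x (-1/2) = -1.
assert reference_psr(n_rwlt_en0=2, n_rwlt_en1=0) == -1.0
```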
Fig. 5. Target reference voltage for FCN-last layer.

(PSR = 0 or NIWP=+1 = NIWP=−1). In the case where n inputs are activated for an XNORNN SRAM-CIM, the n corresponding WL-combiners (WLCBs) are enabled. Then, the n pairs of WLL1s and WLR1s of the WLCBs activated for RC1 are asserted as high. As a result, n × IMC−C is generated on BLL and n × IMC−D is generated on BLR. With the combination of RBLL and RBLR (replica BL-selection switch, RBLSW = 1), the reference current is set as n × IMC−C + n × IMC−D, resulting in PSRREF = n × (−1) + n × (+1) = 0. Accordingly, the reference voltage (VREF−XNOR−FC) for product-sum result (PSR) = 0, based on the value of n, is generated across various process, voltage, and temperature (PVT) conditions.

In the last fully connected layer (FCN-last), there is always a winner among the neurons (outputs). Thus, the reference voltage (VREF−LAST) used to read the product-sum result (PSR) of each column should be set between the voltages of the winner and the second candidate, which is different from FCN-(L-1). Fig. 5 presents the target and dynamic input-aware reference generation (DIARG) VREF for the MNIST dataset.

Generating VREF−XNOR−LAST requires circuit blocks additional to reference column 1 (RC1) and the WL-combiner (WLCB): input detect logic (IDL), reference column RC2, a look-up table (LUT), and a reference WL-tuner (RWLT).

Input detect logic (IDL) is used to detect the number of INPUTs (n). Then, based on the value of n from the IDL, the LUT is used to control the enable signals of the reference WL-tuners (RWLTs). Reference column 1 (RC1) is used to generate the input-aware common-mode VREF, which is set at PSR = 0 across various inputs (input-aware). Reference column 2 (RC2) is used to fine-tune the reference voltage (or reference PSR value).

When the enable signal of an RWLT (ENRWLT) is '0', the WLL of the corresponding fixed-zero replica memory cell (F0RC) in reference column RC2 is asserted as high, whereupon 1 × IMC−C is generated on RBLL2. With the combination of RBLL1, RBLR1, RBLL2, and RBLR2 (via RBLSW, as shown in Fig. 5) and the influence of the two reference columns (RC1 and RC2, i.e., a 2× header), the change in the reference PSR value is set as ΔPSR = (1 × IMC−C)/2 = [1 × (−1)]/2 = −1/2. When ENRWLT = '1', the WLR2 of the corresponding F0RC in reference column RC2 is asserted as high, whereupon 1 × IMC−D is generated on RBLR2. Thus, ΔPSR = (1 × IMC−D)/2 = [1 × (+1)]/2 = +1/2.

The LUT is listed in Table III. The set values of this LUT are based on the product-sum-result (PSR) distribution of the MNIST 10K images using the XNORNN model with optimized sensing margin. With an increase in the detected value (n), the PSR of the winner increases (based on analysis of the MNIST dataset). With the assistance of this LUT, the inference accuracy for MNIST reached 96.1% in only one read-out iteration.

For example, if the number of INPUTs is 10 (n = 10), then the reference product-sum-result (PSR) value is first set at PSR = '0' using reference column RC1. This is the same method used for FCN-(L-1). Based on the LUT (as shown in Table III), two ENRWLT are set to '0' to fine-tune the reference PSR value to PSR = 0 + 2 × (−1/2) = −1. Finally, the cell-tracking VREF = 0.53 V is generated under the following conditions: 25°C and the TT corner.

IV. PROPOSED COMPUTING-IN-MEMORY SRAM UNIT-MACRO FOR MBNN

A. Architecture and Basic Operation of MBNN SRAM-CIM

Fig. 6 illustrates the architecture of the proposed modified binary neural network (MBNN) SRAM-CIM macro and its operational waveform, which is similar to that of the XNOR neural network (XNORNN) SRAM-CIM macro, except for the algorithm-dependent asymmetric control (ADAC) and the dynamic input-aware reference generation for MBNN (DIARG-MBNN). This MBNN SRAM-CIM has two operation modes: SRAM and MBNN modes. SRAM mode, which is the same as that in the XNORNN SRAM-CIM, is activated to store (write operation) the trained weights. MBNN mode is for MBNN-based CIM operations. Unlike XNOR operations, the input of MBNN is either "1" or "0".

Fig. 7(a) presents the structure and operational waveform of the MBNN SRAM-CIM using the algorithm-dependent asymmetric control (ADAC) scheme in left-sensing mode (asymmetric flag, AF = 1). When an input (IN[i]) = '+1', its WLL (WLL[i]) is asserted as '1' and its WLR (WLR[i]) is asserted as '0'. When IN[i] = '0', WLL[i] = WLR[i] = 0. The 6T SRAM cell stores the weight "+1" by setting Q = 1, and stores the weight "−1" by setting Q = 0. When the input-weight-product result (IWP) of an MBNN operation between
Fig. 6. (a) Macro structure and (b) waveform of MBNN SRAM-CIM.

TABLE IV
TRUTH TABLE OF MBNN SRAM-CIM

IN[i] and MC[i, j] is '+1', the DSC6T cell generates a charge current (IMC−C) on the BLL. When the input-weight-product IWP = −1, the DSC6T cell generates a discharge current (IMC−D) on BLL. When IWP = 0, the DSC6T cell does not provide any cell current to BLL. At the same time, BLR is disconnected from BLL and remains in a floating state without IMC−C or IMC−D. The total number of input-weight-product (IWP) results associated with each MBNN operation on the activated DSC6T cells is presented on the BLL. As a result, the MBNN count can be digitized by sensing VBLL. A truth table of the MBNN SRAM-CIM is presented in Table IV.

B. Algorithm-Dependent Asymmetric Control (ADAC) Scheme

Fig. 8 presents a data pattern analysis of MNIST test images. We can see an intriguing asymmetry between the number of "input-weight-product (IWP) = +1" (NIWP=+1) and "input-weight-product (IWP) = −1" (NIWP=−1) results on a BL in the last two FCN layers (i.e., FCN-(L-1) and FCN-last) when applied to MNIST. This is a generic characteristic found in many applications and datasets, due to the fact that the product-sum-result (PSR, IN × W) polarizes the last layer to present only the most probable candidate. Furthermore, the asymmetry seen in NIWP=+1/NIWP=−1 is opposed in the two layers. In accordance with these characteristics, the algorithm-dependent asymmetric control (ADAC) scheme (combining the split-WL feature of the DSC6T cell) is enabled to reduce the BL current and power consumption of the macro. This makes it possible to implement two WL/BL access modes (1. left sensing, WLL-BLL and 2. right sensing, WLR-BLR) for the two layers using the same SRAM-CIM unit-macro.

In the ADAC scheme, we employed the following sub-blocks: 1. an asymmetric flag (AF); 2. bitline-selection switches (BLSW); 3. WL-selection switches (WLSW); 4. dual-path output-drivers (DPOD); and 5. a dual-split-control DSC6T cell array.

The asymmetric flag (AF) is pre-defined during training, or configured by the application, to specify whether left-sensing or right-sensing is to be used. We determine the AF using NIWP=+1 and NIWP=−1 of all BLs in the macro. In the case where NIWP=+1 > NIWP=−1, AF is asserted as 1 for left-sensing. WLSW activates left-sensing by asserting only the WLLs of the selected rows (IN = 1), while all WLRs are grounded. Each of the BLLs is connected to its corresponding VSA via BLSW, whereas BLR = VDD is isolated from the VSA. Then, the VSA detects VBLL and directs its output (SAOUT) via the non-inversion path of the dual-path output-drivers (DPOD) to DOUT.

Fig. 7(b) shows that when the asymmetric flag (AF) = 0 (NIWP=+1 < NIWP=−1) and right-sensing is selected, the roles of WLR-BLR and WLL-BLL are switched, and the final SAOUT result of the VSA is sent along the inversion path of the DPOD to DOUT.

This means that ADAC can be combined with DSC6T to reduce BL current and power consumption to below that of a conventional 6T cell. This can be explained by a reduction in the parasitic load on activated WLs (one transistor per cell), a reduction in the BL current on the selected BL, and the blocking of BL current from unselected BLs.

C. Dynamic Input-Aware VREF Generation Scheme for MBNN (DIARG-MBNN)

The proposed DIARG scheme for MBNN (DIARG-MBNN) is similar to the DIARG-XNOR scheme in Section III; however, in this case, VREF is based on the number of Input = 1 (NInput=1), rather than on the total number of all inputs (n) as in DIARG-XNOR. Furthermore, the design of the input detect logic (IDL) for MBNN differs from the design for XNORNN.

As described in the previous section, for FCN-(L-1) the reference voltage for sensing should be set at PSR = 0 or NIWP=+1 = NIWP=−1. In cases where m WLs (NInput=1 = m) are activated for an MBNN SRAM-CIM, the m corresponding WL-combiners (WLCBs) are enabled. Based on the truth table (Table IV), we can see that the m pairs of WLL1s and WLRs associated with the WLCBs activated for RC1 are asserted as high. As a result, m × IMC−C is generated on the BLL and m × IMC−D is generated on the BLR. Using a combination of
Fig. 7. Structure and waveform of MBNN SRAM-CIM using the ADAC scheme: (a) applying AF = 1 (nominal mode) for the FCN-(L-1) layer; and (b) applying AF = 0 (inverting mode) for the FCN-last layer.
Fig. 8. Data pattern analysis of MNIST test images for the last two FCN layers.
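The AF decision that the Fig. 8 statistics motivate can be sketched at the data level. The following behavioral model is illustrative only; the `iwp_counts_per_bl` structure and its dict encoding are assumptions for the sketch, not signals or data structures from the design:

```python
# Behavioral sketch of the ADAC scheme (data-level only): the asymmetric flag
# picks the sensing side from the IWP statistics of the mapped layer, and the
# dual-path output-driver (DPOD) un-inverts the result for right sensing.
# `iwp_counts_per_bl` (a list of {+1: count, -1: count} dicts) is an
# illustrative encoding, not a structure from the paper.

def choose_af(iwp_counts_per_bl):
    """AF = 1 (left sensing, WLL-BLL) when IWP = +1 results dominate across
    the macro's bitlines; AF = 0 (right sensing, WLR-BLR) otherwise."""
    n_pos = sum(c[+1] for c in iwp_counts_per_bl)
    n_neg = sum(c[-1] for c in iwp_counts_per_bl)
    return 1 if n_pos > n_neg else 0

def dout(sa_out, af):
    """DPOD: non-inversion path for left sensing (AF = 1), inversion path
    for right sensing (AF = 0)."""
    return sa_out if af == 1 else 1 - sa_out

# FCN-(L-1)-like statistics (more IWP = +1) select left sensing;
# FCN-last-like statistics (more IWP = -1) select right sensing.
assert choose_af([{+1: 30, -1: 10}, {+1: 25, -1: 12}]) == 1
assert choose_af([{+1: 5, -1: 40}, {+1: 8, -1: 33}]) == 0
```

Selecting the side with the smaller discharge count is what lets the macro sense the minority-current bitline while the other side stays idle.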
RBLL and RBLR (RBLSW = 1), the reference current is set as m × IMC−C + m × IMC−D, such that the reference product-sum result (PSRREF) = m × (−1) + m × (+1) = 0. This is the means by which the reference voltage (VREF−MBNN) is generated for product-sum result (PSR) = 0 (based on the NInput=1 values).

In FCN-last, the reference VREF−MBNN−LAST should be set between the winner and the other candidates, which is different from VREF−MBNN. Similar to XNORNN (as described in Section III), generating VREF−MBNN−LAST also requires six circuit blocks: RC1, RC2, WLCB, IDL, LUT, and RWLT. The difference is that the number of enabled RWLT signals is based on the number of Input = 1 (NInput=1); therefore, we use the IDL here to detect NInput=1.

V. ADDITIONAL CIRCUIT TECHNIQUES USED IN THE DESIGN OF SRAM-CIM

The XNORNN and MBNN SRAM-CIM unit-macro designs pose the same challenges: (1) SRAM cells suffer from write disturb issues when the BL voltage is too low; (2) the minimum sensing margin for reading the PSR (BL voltage) is small; and (3) the BL voltage covers a wide range and induces different input offsets in typical voltage-mode sense amplifiers (VSAs). We tackled the above-mentioned challenges by developing a write disturb-free (WDF) scheme and a common-mode-insensitive small-offset voltage-mode sense amplifier (CMI-VSA).

A. Write Disturb Free (WDF) Scheme

A wide range of product-sum results (PSR) on each bitline imposes a wide range of bitline voltages (VBL), as a function of NIWP=+1 and NIWP=−1, which respectively refer to the number of cells conducting charge (IMC−C) and discharge (IMC−D) currents. IMC−D decreases VBL significantly, whereas IMC−C increases VBL marginally. Thus, VBL is heavily dependent on NIWP=−1, but only slightly dependent on NIWP=+1. A small number of activated WLs (NWL) leads to a higher VBL, due to the fact that NIWP=−1 is small. A large NWL can lead to a very low VBL when NIWP=−1 is large. This can cause a 6T SRAM cell storing "1" (Q = 1) to suffer data flipping to "0" (Q = 0). This is a particular concern for cases with process variation or global process corners that favor write operations, resulting in write disturbance.

This work employed a BL-clamper (BLC) to avoid write disturbance. This scheme was implemented using diode-connected PMOS transistors. The BLC prevents VBL from
TABLE V
LOOK-UP TABLE (LUT) WITH WDF SCHEMES
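The write-disturb hazard that the WDF scheme addresses can be illustrated with a toy bitline-voltage model. The coefficients and clamp level below are arbitrary illustration values, since the text states only the qualitative dependence of VBL on NIWP=−1 and NIWP=+1:

```python
# Toy bitline-voltage model of the write-disturb hazard addressed by the WDF
# scheme.  The paper states only the qualitative dependence: discharge cells
# (N_IWP=-1) pull VBL down strongly, charge cells (N_IWP=+1) lift it slightly.
# VDD, V_CLAMP, and the k_* coefficients are arbitrary illustration values.

VDD = 1.0
V_CLAMP = 0.4  # assumed clamp level set by the diode-connected PMOS BLC

def vbl(n_iwp_pos, n_iwp_neg, k_charge=0.005, k_discharge=0.08):
    """VBL is heavily dependent on N_IWP=-1, only slightly on N_IWP=+1."""
    return max(0.0, VDD + k_charge * n_iwp_pos - k_discharge * n_iwp_neg)

def vbl_clamped(n_iwp_pos, n_iwp_neg):
    """With the BL-clamper, VBL cannot fall below the clamp level, so a
    cell storing Q = 1 is not flipped by a deeply discharged bitline."""
    return max(V_CLAMP, vbl(n_iwp_pos, n_iwp_neg))

# Many activated WLs with mostly IWP = -1 collapse VBL without the clamp:
assert vbl(2, 12) < 0.1
assert vbl_clamped(2, 12) == V_CLAMP
```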
Fig. 11. Simulated (a) input offset and (b) sensing speed of CMI-VSA and conventional (CNV) VSA.

Fig. 12. Simulated (a) variations of bitline voltage and (b) inference accuracy of the last two fully connected layers in SRAM-CIM macros.

Fig. 14. Structure of SRAM-CIM testchip.

As expected, the CMI-VSA provided a uniform sensing delay across a wide range of VCM values. In the conventional VSA, TSA tends to slow down when VCM drops. At VCM = 0.3 V, the TSA of the CMI-VSA was over 10× faster than that of conventional VSAs.

Fig. 12(a) presents the simulated BL voltage distributions of the MBNN SRAM-CIM macro for the winner (VWIN) and the candidates with the 2nd-highest PSR (V2ND) when applied to the MNIST dataset for the last FCN layer. This analysis shows that when using a fixed VREF for BL sensing, erroneous detections can occur in many cases, due to overlap between VWIN and V2ND (> 0.4 V), as shown in Fig. 12(a). This means that even with a perfect VSA, 5-6 sensing iterations are required to approach the baseline accuracy of the model employed for the MNIST dataset. As shown in Fig. 12(b), DIARG achieved a winner-detection accuracy of 96.1% in the first iteration. In contrast, the conventional fixed-VREF approach requires at least four iterations to achieve this level of performance. The proposed scheme was thus shown to reduce latency and energy overhead by over 4×.

Fig. 13 illustrates the current consumption of the FCN-last when applied to the MNIST dataset. When using a conventional 6T SRAM array, both of the pass-gates (PGL and PGR) are simultaneously activated by the same wordline, such that BLL and BLR both consume current for product-sum operations when using either the XNORNN or the MBNN model. The use of DSC6T with only one pass-gate turned on reduced the average current consumption of the DSC6T SRAM-CIM by 46.5%, compared to a conventional 6T SRAM array. Combining DSC6T with the ADAC scheme reduced current consumption even further (by 61.4% compared to conventional 6T SRAMs).

B. Measurement Results

This work implemented a testchip that included four 4-Kb (64 × 64 b) DSC6T SRAM-CIM unit-macros, fabricated using a 65-nm CMOS logic process. The structure of the testchip is presented in Fig. 14. The four DSC6T SRAM-CIM unit-macros were assigned to a modified binary LeNet-5 neural network model (two CNN layers with 5 × 5 kernels and two FC layers: one with 64 × 64 weights and the other with 64 × 10 weights), serving as the XNORNN and MBNN macros. For the fully connected (FC) layers, we placed all of the weights corresponding to a given output in the same column to realize the product-sum (PS) operation, as shown in Fig. 15. The proposed DSC6T SRAM-CIM unit-macros are scalable for a variety of neural networks. First, this scheme allows a larger number of parallel activated rows (m). The maximum m value within a macro represents a tradeoff among the input offset of the sense amplifier, accuracy, area efficiency, and the neural network model that is employed. Generally, a smaller input offset can increase the maximum value of m, albeit at the cost of larger area. Second, additional support for a larger or deeper network can be gained by employing an accelerator chip with multiple unit-macros [33] and modifications to the read-out circuits. When implementing multiple mini-macros for larger networks, there is a tradeoff between the area and power overhead versus the precision required to generate partial sums and the size of the unit-macro. As shown in Fig. 16, when the precision
Fig. 16. Simulated area cost and power consumption under different precision of generated partial sums.

Fig. 17. Die photo of SRAM-CIM testchip.

required to generate partial sums was 3 b, the area overhead was only 6.21% and the power overhead was less than 7.74%. Furthermore, each unit-macro could generate a partial sum, all of which could be added up to obtain the final sum or a larger product sum for binary activation. We employed the DFF-based path-delay exclusion approach [34], [35] to enable the extraction of access times for each SRAM-CIM macro. A die photo is presented in Fig. 17. The XNORNN macros achieved around 1.5% higher classification accuracy, but lower energy efficiency, than the MBNN macros.

1) Measurement Results From XNORNN SRAM-CIM: Fig. 18 presents the captured waveform obtained during the inference operation of the XNORNN-based FCNN-(L-1) (XNOR1 macro). The macro access time (TAC−M) was 2.4 ns for the detection of the MNIST winner at VDD = 1 V. The read access time (TAC−M−2layers) of the integrated last-two XNORNN fully connected layers (XNOR1 and XNOR2 macros) was 5.0 ns for MNIST image identification. When tested using 10k preprocessed MNIST images, the inference accuracy of the XNORNN SRAM-CIM in the last two FCN layers was measured at 96.5%. The measured energy per layer was 134.3 pJ, which is equivalent to an energy efficiency of 30.49 TOPS/W per 4-Kb macro. Note that in this energy derivation, one operation is equal to one XNOR-count operation. Counting each XNOR operation as two separate operations (as in some reports [13]–[15]) would translate to an energy efficiency of 60.98 TOPS/W for our 4-Kb XNORNN SRAM-CIM unit-macro.

2) Measurement Results From MBNN SRAM-CIM: Fig. 19(a) presents the captured waveform of the inference operation of the MBNN-based FCNN-(L-1) (MBNN1 macro). The macro access time (TAC−M) was 2.3 ns for detection of the MNIST winner at VDD = 1 V. The read access time (TAC−M−2layers) of the integrated last-two MBNN-based fully connected layers (MBNN1 and MBNN2 macros) was 4.8 ns for MNIST image identification, as shown in Fig. 19(b).

Fig. 20 presents a measured shmoo plot of the MBNN-mode SRAM-CIM unit-macro in terms of inference time vs. various wordline voltages. The proposed schemes support a wide range of WL voltages (VWL = 0.8 V–1.2 V) without speed degradation at VDD = 1 V. When the WL voltage was at the lower bound (0.7 V–0.8 V), there was insufficient cell current to discharge the BL load quickly. This may account for the observed differences in access times when VWL = 0.7–0.8 V. When the WL voltage was at the higher bound (0.8 V–1 V), there was sufficient cell current to discharge the BL load quickly. In this situation, the BL voltage development time was dominated by voltage-divider behavior and remained nearly constant. As a result, we did not observe any differences in access times when VWL = 0.8–1 V. This macro supports
TABLE VI
COMPARISON TABLE OF SRAM-CIM CHIPS
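The headline efficiency figures compared in Table VI follow from simple arithmetic. A quick check, assuming one fully parallel product-sum pass of a 64 × 64 macro counts as 4096 XNOR-count operations (an operation count that is consistent with the reported numbers, but not stated explicitly in the text):

```python
# Arithmetic check of the reported energy efficiency, assuming one fully
# parallel product-sum pass of a 64 x 64 macro counts as 4096 XNOR-count
# operations (this operation count is an assumption consistent with the
# reported figures, not a number stated explicitly in the text).

ops_per_layer = 64 * 64         # 4096 XNOR-count operations per 4-Kb macro
energy_per_layer_j = 134.3e-12  # measured 134.3 pJ per layer

tops_per_w = ops_per_layer / energy_per_layer_j / 1e12
assert abs(tops_per_w - 30.49) < 0.02   # ~30.49 TOPS/W, as reported

# Counting each XNOR and count as two separate operations doubles the figure:
assert abs(2 * tops_per_w - 60.98) < 0.04
```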
VII. CONCLUSIONS

This work proposes a 65-nm 4-Kb algorithm-dependent SRAM-CIM unit-macro, which supports both XNORNN and our proposed modified BNN (MBNN) networks. This work also proposes dynamic input-aware reference generation (DIARG), algorithm-dependent asymmetric control (ADAC), write disturb-free (WDF), and common-mode-insensitive voltage-mode sensing-amplifier (CMI-VSA) schemes to reduce power consumption while achieving fast and robust inference operations. A testchip with four 65-nm 4-Kb SRAM-CIMs was fabricated to confirm the proposed concepts. The fabricated chip achieved an access time of 2.3 ns and an energy efficiency of 55.8 TOPS/W per layer when applied to MNIST image recognition.

REFERENCES

[1] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556. [Online]. Available: https://arxiv.org/abs/1409.1556
[2] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 1–9.
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[4] K. Bong, S. Choi, C. Kim, S. Kang, Y. Kim, and H.-J. Yoo, "A 0.62 mW ultra-low-power convolutional-neural-network face-recognition processor and a CIS integrated with always-on Haar-like face detector," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2017, pp. 344–346.
[5] M. Price, J. Glass, and A. P. Chandrakasan, "A scalable speech recognizer with deep-neural-network acoustic models and voice-activated power gating," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2017, pp. 244–245.
[6] J. Zhang, Z. Wang, and N. Verma, "A machine-learning classifier implemented in a standard 6T SRAM array," in Proc. IEEE Symp. VLSI Circuits, Jun. 2016, pp. 1–2.
[7] F. Su et al., "A 462 GOPs/J RRAM-based nonvolatile intelligent processor for energy harvesting IoE system featuring nonvolatile logics and processing-in-memory," in Symp. VLSI Technol. Dig. Tech. Papers, Jun. 2017, pp. T260–T261.
[8] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Jan. 2016, pp. 262–263.
[9] Y. Kim, D. Shin, J. Lee, Y. Lee, and H.-J. Yoo, "A 0.55 V 1.1 mW artificial-intelligence processor with PVT compensation for micro robots," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Sep. 2016, pp. 258–259.
[10] S. Park, S. Choi, J. Lee, M. Kim, J. Park, and H.-J. Yoo, "A 126.1 mW real-time natural UI/UX processor with embedded deep-learning core for low-power smart glasses," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Oct. 2016, pp. 254–255.
[11] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1," 2016, arXiv:1602.02830. [Online]. Available: https://arxiv.org/abs/1602.02830
[12] M. Kim and P. Smaragdis, "Bitwise neural networks," in Proc. Int. Conf. Mach. Learn. Workshop Resource-Efficient Mach. Learn. (ICML), Jul. 2015, pp. 6–11.
[13] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 525–542.
[14] M. Courbariaux, Y. Bengio, and J. David, "BinaryConnect: Training deep neural networks with binary weights during propagations," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2015, pp. 3105–3113.
[15] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio, "Neural networks with few multiplications," in Proc. Int. Conf. Learn. Represent. (ICLR), 2016, pp. 1–9.
[16] R. Liu et al., "Parallelizing SRAM arrays with customized bit-cell for binary neural networks," in Proc. Design Automat. Conf. (DAC), Jun. 2018, p. 21.
[17] M. Horowitz, "Computing's energy problem (and what we can do about it)," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2014, pp. 10–14.
[18] Q. Dong et al., "A 0.3 V VDDmin 4+2T SRAM for searching and in-memory computing using 55 nm DDC technology," in Proc. IEEE Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2017, pp. 160–161.
[19] J. Zhang, Z. Wang, and N. Verma, "In-memory computation of a machine-learning classifier in a standard 6T SRAM array," IEEE J. Solid-State Circuits, vol. 52, no. 4, pp. 915–924, Apr. 2017.
[20] A. Agrawal, A. Jaiswal, C. Lee, and K. Roy, "X-SRAM: Enabling in-memory Boolean computations in CMOS static random access memories," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 12, pp. 4219–4232, Dec. 2018.
[21] M. Kang, M.-S. Keel, N. R. Shanbhag, S. Eilert, and K. Curewitz, "An energy-efficient VLSI architecture for pattern recognition via deep embedding of computation in SRAM," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2014, pp. 8326–8330.
[22] S. Jeloka, N. B. Akesh, D. Sylvester, and D. Blaauw, "A 28 nm configurable memory (TCAM/BCAM/SRAM) using push-rule 6T bit cell enabling logic-in-memory," IEEE J. Solid-State Circuits, vol. 51, no. 4, pp. 1009–1021, Apr. 2016.
[23] S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw, and R. Das, "Compute caches," in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2017, pp. 481–492.
[24] A. Biswas and A. P. Chandrakasan, "Conv-RAM: An energy-efficient SRAM with embedded convolution computation for low-power CNN-based machine learning applications," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2018, pp. 488–490.
[25] A. Agrawal et al., “Xcel-RAM: Accelerating binary neural networks in high-throughput SRAM compute arrays,” Jul. 2018, arXiv:1807.00343. [Online]. Available: https://arxiv.org/abs/1807.00343
[26] Z. Jiang, S. Yin, M. Seok, and J.-S. Seo, “XNOR-SRAM: In-memory computing SRAM macro for binary/ternary deep neural networks,” in Proc. IEEE Symp. VLSI Technol., Honolulu, HI, USA, Jun. 2018, pp. 173–174.
[27] S. K. Gonugondla, M. Kang, and N. Shanbhag, “A 42 pJ/decision 3.12 TOPS/W robust in-memory machine learning classifier with on-chip training,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2018, pp. 490–491.
[28] W.-S. Khwa et al., “A 65 nm 4 Kb algorithm-dependent computing-in-memory SRAM unit-macro with 2.3 ns and 55.8 TOPS/W fully parallel product-sum operation for binary DNN edge processors,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2018, pp. 496–498.
[29] M.-F. Chang, C.-F. Chen, T.-H. Chang, C.-C. Shuai, Y.-Y. Wang, and H. Yamauchi, “A 28 nm 256 kb 6T-SRAM with 280 mV improvement in VMIN using a dual-split-control assist scheme,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2015, pp. 1–3.
[30] M.-F. Chang et al., “A compact-area low-VDDmin 6T SRAM with improvement in cell stability, read speed, and write margin using a dual-split-control-assist scheme,” IEEE J. Solid-State Circuits, vol. 52, no. 9, pp. 2498–2514, Sep. 2017.
[31] K. Ando et al., “BRein memory: A single-chip binary/ternary reconfigurable in-memory deep neural network accelerator achieving 1.4 TOPS at 0.6 W,” IEEE J. Solid-State Circuits, vol. 53, no. 4, pp. 983–994, Apr. 2018.
[32] T.-H. Yang, K.-X. Li, Y.-N. Chiang, W.-Y. Lin, H.-T. Lin, and M.-F. Chang, “A 28 nm 32 Kb embedded 2T2MTJ STT-MRAM macro with 1.3 ns read-access time for fast and reliable read applications,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2018, pp. 482–484.
[33] R. Guo et al., “A 5.1 pJ/neuron 127.3 μs/inference RNN-based speech recognition processor using 16 computing-in-memory SRAM macros in 65 nm CMOS,” in IEEE Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2019, pp. 120–121.
[34] M.-F. Chang et al., “A 3T1R nonvolatile TCAM using MLC ReRAM for frequent-off instant-on filters in IoT and big-data processing,” IEEE J. Solid-State Circuits, vol. 52, no. 6, pp. 1664–1679, Jun. 2017.
[35] M.-F. Chang et al., “A ReRAM-based 4T2R nonvolatile TCAM using RC-filtered stress-decoupled scheme for frequent-OFF instant-ON search engines used in IoT and big-data processing,” IEEE J. Solid-State Circuits, vol. 51, no. 11, pp. 2786–2798, Nov. 2016.

Jia-Jing Chen received the B.S. degree in electrical engineering from Chang Gung University, Taoyuan, Taiwan, in 2016. He is currently pursuing the M.S. degree with the Institute of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan. His current research interests include circuit design of SRAM and computing-in-memory.

Jia-Fang Li received the B.S. degree from the Electronic Department, National United University, Taiwan, in 2016. She is currently pursuing the M.S. degree with the Institute of Electronics Engineering, National Tsing Hua University, Hsinchu, Taiwan. Her current research interests include circuit design of SRAM and emerging nonvolatile memory.

Xiaoyu Sun (S’17) received the B.S. degree in electrical engineering from the Huazhong University of Science and Technology, Wuhan, China, in 2014, and the M.S. degree in electrical engineering from Arizona State University in 2016. He is currently pursuing the Ph.D. degree in electrical and computer engineering with the Georgia Institute of Technology, Atlanta, GA, USA. His research interests include SRAM- and NVM-based hardware implementations of neural networks.
Hiroyuki Yamauchi received the Ph.D. degree in engineering from Kyushu University, Fukuoka, Japan, in 1997. In 1985, he joined the Semiconductor Research Center, Panasonic, Japan. From 1985 to 1987, he worked on the scaled sense amplifier for ultrahigh-density DRAMs. From 1988 to 1994, he was engaged in the research and development of 16-Mb CMOS DRAMs, including the battery-operated high-speed 16-Mb CMOS DRAM and the ultralow-power, three-times-longer self-refresh DRAM. He also invented the charge-recycling bus architecture and low-voltage-operated high-speed VLSIs, including a 0.5-V/100-MHz-operated SRAM and the Gate-Over-Driving CMOS architecture. After the development of various embedded memories, eSRAM, eDRAM, eFlash, eFeRAM, and eReRAM, for system LSI at Panasonic as a General Manager, he moved to the Fukuoka Institute of Technology, where he has been a Professor since 2005. His current interests are focused on convolution/deconvolution algorithms for machine-learning-based low-power signal and image processing for IoT sensor applications.

Qiang Li (S’04–M’07–SM’13) received the B.Eng. degree in electrical engineering from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2001, and the Ph.D. degree from Nanyang Technological University (NTU), Singapore, in 2007.
Since 2001, he has been working on analog/RF circuits in both academia and industry, holding positions of Engineer, Project Leader, and Technical Consultant in Singapore and Associate Professor in Denmark. He is currently a Full Professor with the University of Electronic Science and Technology of China (UESTC), heading the Institute of Integrated Circuits and Systems. His research interests include low-voltage and low-power analog/RF circuits, data converters, and mixed-mode circuits for biomedical and sensor interfaces.
Dr. Li was a recipient of the Young Changjiang Scholar Award in 2015, the National Top-Notch Young Professionals Award in 2013, and the UESTC Teaching Excellence Award in 2011. He was the TPC Chair of the IEEE 2018 APCCAS. He serves on the Student Research Preview (SRP) Committee of ISSCC and the TPCs of ESSCIRC and A-SSCC (both in the DC subcommittee). He serves as a Guest Editor of the IEEE Transactions on Circuits and Systems I (TCAS-I). He is the Founding Chair of the IEEE Chengdu CASS/SSCS Joint Chapter.

Meng-Fan Chang (M’05–SM’14–F’19) received the M.S. degree from Pennsylvania State University, USA, and the Ph.D. degree from National Chiao Tung University, Hsinchu, Taiwan.
Before 2006, he worked in industry for over ten years. From 1996 to 1997, he designed memory compilers at Mentor Graphics, NJ, USA. From 1997 to 2001, he designed embedded SRAMs and Flash in the Design Service Division (DSD) at TSMC, Hsinchu, Taiwan. From 2001 to 2006, he was a Co-Founder and the Director of IPLib Company, Taiwan, where he developed embedded SRAM and ROM compilers, flash macros, and flat-cell ROM products. He is currently a Full Professor with National Tsing Hua University (NTHU), Taiwan. His research interests include circuit designs for volatile and nonvolatile memory, ultra-low-voltage systems, 3D-memory, circuit-device interactions, spintronics circuits, memristor logics for neuromorphic computing, and computing-in-memory for artificial intelligence.
Dr. Chang was a recipient of several prestigious national-level awards in Taiwan, including the Outstanding Research Award of MOST-Taiwan, the Outstanding Electrical Engineering Professor Award, the Academia Sinica Junior Research Investigators Award, and the Ta-You Wu Memorial Award. He has been serving as an Associate Editor for the IEEE TVLSI and IEEE TCAD, and as a Guest Editor for the IEEE JSSC, IEEE TCAS-II, and IEEE JETCAS. He has been serving on the technical program committees for ISSCC, IEDM (Ex-Com and MT Chair), DAC (Sub-Com Chair), ISCAS (Track Co-Chair), A-SSCC, and numerous international conferences. He has been a Distinguished Lecturer (DL) for the IEEE Solid-State Circuits Society (SSCS) and the Circuits and Systems Society (CASS), a Technical Committee Member of CASS, and an Administrative Committee (AdCom) Member of the IEEE Nanotechnology Council. He has also been serving as the Program Director of the Micro-Electronics Program of the Ministry of Science and Technology (MOST) in Taiwan (2018–2020) and as an Associate Executive Director for Taiwan’s National Program of Intelligent Electronics (NPIE) and the NPIE Bridge Program (2011–2018).