
This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–I: REGULAR PAPERS 1

A Dual-Split 6T SRAM-Based
Computing-in-Memory Unit-Macro With
Fully Parallel Product-Sum Operation for
Binarized DNN Edge Processors
Xin Si, Win-San Khwa, Jia-Jing Chen, Jia-Fang Li, Xiaoyu Sun, Student Member, IEEE,
Rui Liu, Student Member, IEEE, Shimeng Yu, Senior Member, IEEE, Hiroyuki Yamauchi,
Qiang Li, Senior Member, IEEE, and Meng-Fan Chang, Fellow, IEEE

Abstract— Computing-in-memory (CIM) is a promising approach to reduce the latency and improve the energy efficiency of deep neural network (DNN) artificial intelligence (AI) edge processors. However, SRAM-based CIM (SRAM-CIM) faces practical challenges in terms of area overhead, performance, energy efficiency, and yield against variations in data patterns and transistor performance. This paper employed a circuit-system co-design methodology to develop a SRAM-CIM unit-macro for a binary-based fully connected neural network (FCNN) layer of the DNN AI edge processors. The proposed SRAM-CIM unit-macro supports two binarized neural network models: an XNOR neural network (XNORNN) and a modified binary neural network (MBNN). To achieve compact area, fast access time, robust operations, and high energy-efficiency, our proposed SRAM-CIM uses a split-wordline compact-rule 6T SRAM and circuit techniques, including a dynamic input-aware reference generation (DIARG) scheme, an algorithm-dependent asymmetric control (ADAC) scheme, a write disturb-free (WDF) scheme, and a common-mode-insensitive small offset voltage-mode sensing amplifier (CMI-VSA). A fabricated 65-nm 4-Kb SRAM-CIM unit-macro achieved 2.4- and 2.3-ns product-sum access times for a FCNN layer using XNORNN and MBNN, respectively. The measured maximum energy efficiency reached 30.49 TOPS/W for XNORNN and 55.8 TOPS/W for the MBNN modes.

Index Terms— Random access memory, computing-in-memory, binarized DNN edge processors, artificial intelligence.

Manuscript received November 18, 2018; revised March 14, 2019 and May 19, 2019; accepted July 1, 2019. This work was supported in part by the Taiwan Semiconductor Research Institute (TSRI), in part by the Taiwan Semiconductor Manufacturing Company-Joint Development Program (TSMC-JDP), in part by the MediaTek-Joint Development Program (MTK-JDP), and in part by the Ministry of Science and Technology (MOST) of Taiwan. This paper was recommended by Associate Editor Y. Ha. (Corresponding author: Meng-Fan Chang.)

X. Si is with the Institute of Integrated Circuits and Systems, University of Electronic Science and Technology of China (UESTC), Chengdu 610054, China, and also with the Department of Electrical Engineering, National Tsing Hua University (NTHU), Hsinchu 30013, Taiwan (e-mail: [email protected]).

W.-S. Khwa was with the Department of Electrical Engineering, National Tsing Hua University (NTHU), Hsinchu 30013, Taiwan. He is now with Taiwan Semiconductor Manufacturing Company (TSMC), Hsinchu 30078, Taiwan.

J.-J. Chen, J.-F. Li, and M.-F. Chang are with the Department of Electrical Engineering, National Tsing Hua University (NTHU), Hsinchu 30013, Taiwan (e-mail: [email protected]).

X. Sun and S. Yu are with the Georgia Institute of Technology, Atlanta, GA 30332 USA.

R. Liu is with Synopsys, San Francisco, CA 94107 USA.

H. Yamauchi is with the Fukuoka Institute of Technology, Fukuoka 811-0295, Japan.

Q. Li is with the Institute of Integrated Circuits and Systems, University of Electronic Science and Technology of China (UESTC), Chengdu 610054, China.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSI.2019.2928043

I. INTRODUCTION

DEEP neural networks (DNNs) are commonly used for artificial intelligence (AI) processors to achieve high-accuracy recognition and prediction functions for a variety of applications [1]–[4]. As shown in Fig. 1(a), deep neural networks (DNNs) typically comprise a series of convolution (CNN) and fully-connected (FC) layers, with a number of non-linear layers, such as a pooling layer and a rectified linear unit activation layer (ReLU). In deep neural network (DNN) processors [5]–[10], product-sum (PS) operations dominate the computational workload in both the convolution (CNN) and fully connected (FC) layers. These neural network layers are computationally intensive and require the movement and storage of enormous volumes of data. Thus, the application of deep neural network (DNN) processors to AI edge devices usually requires fast inference operations, ultra-low energy consumption, low cost, and sufficient accuracy. The reduced bit precision and memory cost of binary deep neural networks (DNNs) [11]–[16] make it possible to reduce the computational and hardware costs of AI edge devices; however, conventional all-digital solutions have been unable to resolve the memory bottleneck. In conventional all-digital solutions, process engine (PE) arrays typically exploit parallelized computation; however, they suffer from inefficient single-row SRAM access to weights, and larger SRAM arrays are required to store huge amounts of intermediate data, as shown in Fig. 1(b). Furthermore, the energy required to access data from memory can far exceed the energy required for computing operations using that data [17].

Computing-in-memory (CIM) or process-in-memory (PIM) methods have been proposed to improve computational efficiency by enabling parallel computing within the memory
1549-8328 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

macro or array. The fact that the CIM structure allows for the processing of data within memory provides two main benefits: (1) a reduction in the amount of intermediate data, and (2) improvements in the efficiency of parallel computing in terms of energy consumption and area overhead. The implementation of CIM faces challenges in terms of area overhead, computational performance, energy efficiency, and yield under various process, voltage, and temperature (PVT) conditions. Up to this point, some silicon-verified CIM macros have been reported [18]–[27].

The pioneering works in SRAM-CIM include a 4T + 2T cell for content-addressable memory and two-input logic operations [18], a 10-way error-adaptive classifier for MNIST datasets [19], an X-SRAM for in-memory vector Boolean computation [20], a compute memory for pattern recognition based on multi-row read access and analog signal processing [21], a push-rule 6T SRAM for content-addressable memory and bit-wise logic operations [22], a compute cache for data-centric applications [23], a Conv-RAM for binary-weight neural networks [24], an Xcel-RAM for binary neural networks [25], an XNOR-SRAM for binary/ternary deep neural networks [26], and an in-memory machine learning classifier with on-chip training [27]. These SRAM-CIM works have demonstrated the potential of the CIM structure in achieving faster inference speeds and improving energy efficiency. However, these works face a number of issues: 1) large cell area overhead, due to the use of a larger number of transistors (10T or 12T) and the use of logic rules for cell layout rather than the foundry's compact-rule 6T SRAM cell; 2) high power consumption on BL and BLB with multiple activated WLs; and 3) insufficient signal margin against the input offset of the sense amplifier for robust read operations. To overcome these issues, we employed a compact-rule 6T SRAM bit-cell to reduce the area overhead, an algorithm-dependent asymmetric control (ADAC) scheme to reduce power consumption, and a common-mode-insensitive small offset voltage-mode sensing amplifier (CMI-VSA).

In this work, we implemented fully parallel product-sum operations within an SRAM cell array to improve performance in terms of area, energy efficiency, and yield against variations in data pattern and transistor performance [28]. A 65nm 4Kb algorithm-dependent SRAM-CIM unit-macro for an XNOR neural network (XNORNN) and a modified binary neural network (MBNN) was implemented.

The remainder of the paper is organized as follows: Section II briefly describes the binary-based neural network algorithms. Section III describes the proposed XNOR neural network (XNORNN) SRAM-CIM. Section IV describes the proposed modified binary neural network (MBNN) SRAM-CIM. Section V describes additional circuit techniques to overcome the difficulties in the application of SRAM-CIM to binary neural networks. Section VI outlines performance results, and Section VII presents a summary of this work.

Fig. 1. (a) Typical deep neural network structure. (b) Concept of SRAM-CIM for AI edge processors.

II. BACKGROUND: BINARIZED NEURAL NETWORK AND DUAL-SPLIT 6T SRAM CELL

A. Background of Binary Neural Networks

Binary neural networks (BNNs) have demonstrated potential in reducing the complexity of neural networks to reduce the power consumption and hardware costs of AI edge devices, such as toys with AI functions, which require only moderate accuracy. Early neural networks that use binary weights are referred to as binary-connect (BC) networks [14]. The weights in BC networks are restricted to ±1, whereas the activations are conducted at full resolution. The XNOR neural network (XNORNN) [13] was shown to dramatically reduce hardware requirements by constraining weights and activations to either "+1" or "−1". The inputs and weights in Bitwise Neural Networks [12] can be restricted to either ±1 or 0/1. Restricting the inputs to 0/1 results in a 0.03% loss of accuracy, compared with ±1 inputs [12].

To reduce hardware cost and increase energy efficiency, a binary deep neural network (DNN) with 0/1 neurons and ±1 weights [16] can be simplified as a modified binary neural

network (MBNN). The binarized function for binary activation in MBNN is as follows:

x^b = \mathrm{Sign}(x) = \begin{cases} 1 & \text{if } x \ge 0, \\ 0 & \text{otherwise} \end{cases} \quad (1)

The binarized function for binary weights in MBNN is

w^b = \mathrm{Sign}(w) = \begin{cases} +1 & \text{if } w \ge 0, \\ -1 & \text{otherwise} \end{cases} \quad (2)

where x^b is the binarized activation, x is the real-valued neuron, w^b is the binarized weight, and w is the real-valued weight.
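The binarization in Eqs. (1) and (2) is straightforward to mirror in software. The following is a minimal sketch (the function names are ours, not the authors'); the XNORNN convention with ±1 activations is included for contrast:

```python
# Sketch of the binarization functions of Eqs. (1) and (2).
# MBNN: 0/1 activations and +/-1 weights; XNORNN: +/-1 activations.

def binarize_activation_mbnn(x: float) -> int:
    """Eq. (1): x_b = 1 if x >= 0, else 0."""
    return 1 if x >= 0 else 0

def binarize_weight(w: float) -> int:
    """Eq. (2): w_b = +1 if w >= 0, else -1."""
    return 1 if w >= 0 else -1

def binarize_activation_xnornn(x: float) -> int:
    """XNORNN convention: activations are also constrained to +/-1."""
    return 1 if x >= 0 else -1
```

Note that both sign functions map the boundary value 0 to the positive branch, as specified by Eqs. (1) and (2).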
Using the MNIST dataset, we trained and binarized two neural networks (referred to as inspired LeNet-5) with two convolution layers and two fully connected layers: an XNOR neural network (XNORNN) and a modified binary neural network (MBNN). Table I presents a comparison of various binary neural network algorithms. Clearly, XNORNN achieves the highest classification accuracy, and MBNN achieves classification accuracy similar to that of previous BNN structures when applied to MNIST classification.

B. Dual-Split-Control 6T Cell (DSC6T)


Fig. 2(a) presents a cell schematic of our previously proposed dual-split-control (DSC) 6T SRAM cell [29]. The footprint of this DSC6T cell is basically the same as that of the foundry's compact 6T SRAM cell, but with split wordlines (SWL: WLL and WLR) and split cell-VSS (CVSS) lines (SCVSS: CVSS1 and CVSS2). This DSC6T cell was designed to achieve a compact cell area and low VDDmin through the use of split-wordline (SWL)/split cell-VSS (SCVSS)-based read/write-assist schemes [29]. Fig. 2(c) presents the simplified waveforms of the DSC6T cell during read/write operations. For normal write and read operations, WLL and WLR are activated simultaneously, and the read/write operations of the DSC6T cell are the same as those of conventional 6T SRAM cells. In this work, we employed the split-wordline structure for CIM operations; however, it was not necessary to adopt the split-CVSS or low-voltage assist schemes as in [29].

Fig. 2. (a) Schematic illustration, (b) layout of the employed DSC6T cell, and (c) waveforms of read/write operations.
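The split-wordline behavior described above can be captured in a small behavioral model. This is our own abstraction, not the authors' netlist: it assumes WLL gates the Q side onto BLL and WLR gates the complementary node onto BLR (which reproduces the CIM truth tables given later), and it abstracts a charge current (IMC-C) as +1 and a discharge current (IMC-D) as −1:

```python
# Behavioral sketch of a dual-split-control (DSC) 6T cell: one storage
# node Q with two independently enabled access sides (WLL/BLL, WLR/BLR).
# Assumption (ours): WLL exposes node Q on BLL; WLR exposes the
# complementary node QB on BLR. +1 models IMC-C, -1 models IMC-D.

class DSC6TCell:
    def __init__(self, q: int):
        self.q = q  # stored bit: 1 encodes weight "+1", 0 encodes "-1"

    def bitline_current(self, wll: int, wlr: int) -> dict:
        """Return the cell current contributed to each bitline."""
        i_bll = (1 if self.q == 1 else -1) if wll else 0
        i_blr = (1 if self.q == 0 else -1) if wlr else 0
        return {"BLL": i_bll, "BLR": i_blr}
```

With both wordlines low, the cell contributes no current, which is what allows multiple rows to share a bitline during CIM operations.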
High energy efficiency and a compact area were achieved by using the 6T SRAM in conjunction with the split-wordline feature of DSC6T cells to develop a SRAM-CIM unit-macro that supports both the XNOR neural network (XNORNN) and the modified binary neural network (MBNN). This is further detailed in Sections III and IV.

Fig. 3. (a) Macro structure and (b) waveform of XNORNN SRAM-CIM mode (DIARG-XNOR).
III. PROPOSED SRAM COMPUTING-IN-MEMORY (CIM) UNIT-MACRO FOR XNORNN OPERATIONS

A. Architecture of XNORNN SRAM-CIM

Fig. 3(a) illustrates the architecture of the proposed XNOR neural network (XNORNN) SRAM-CIM macro. This device consists of a dual-split-control 6T (DSC6T) bit-cell array and peripheral circuits for operation in two modes: SRAM and XNORNN modes. SRAM mode is activated to store the trained weights (write operation), whereas XNORNN mode is activated for XNOR-based CIM operations.

In SRAM mode, only one row is activated for read and write operations, as reported in [30] and Section II.B. The SRAM cell array can be accessed with both WLL and WLR on, which is like the read and write operations of a conventional SRAM. In such situations, the trained weights are stored in the dual-split-control 6T (DSC6T) cell array via a write operation in SRAM mode under nominal VDD.
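The mode distinction above comes down to how each input is pre-encoded onto a (WLL, WLR) pair. A hedged sketch of the encodings described in Section III (XNORNN) and later in Section IV (MBNN with ADAC left/right sensing); the function names are ours:

```python
# Sketch of input pre-encoding onto the split wordlines.
# XNORNN: IN = +1 -> (WLL, WLR) = (1, 0); IN = -1 -> (0, 1).
# MBNN (assumption, per Section IV): IN = 1 asserts one side only,
# chosen by the asymmetric flag AF; IN = 0 keeps both wordlines low.

def encode_input_xnornn(inp: int) -> tuple:
    """XNORNN mode: one of the two wordlines is always asserted."""
    return (1, 0) if inp == 1 else (0, 1)

def encode_input_mbnn(inp: int, af: int = 1) -> tuple:
    """MBNN mode: AF = 1 selects left-sensing (WLL/BLL); AF = 0
    selects right-sensing (WLR/BLR) with the wordline roles mirrored."""
    if inp == 0:
        return (0, 0)
    return (1, 0) if af == 1 else (0, 1)
```

In SRAM mode, by contrast, both wordlines of the single selected row are asserted together, so conventional 6T read/write behavior is preserved.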

TABLE I
COMPARISON OF VARIOUS BINARIZED NEURAL NETWORK ALGORITHMS

In XNORNN mode, multiple rows are activated at the same time, and each input data (IN) is pre-encoded onto two wordlines (WLL and WLR). The weights (W) of an m-weight FCNL are stored in m consecutive DSC6T memory cells (MC) in the same column. When WLL or WLR is activated, the read current (IMC) of each activated memory cell represents its input-weight product (IWP = IN × W). All IMC values of the activated MCs are then summed at the bitlines (BLL and BLR are shorted) of the same column. When using a voltage-divider-type sensing scheme through a PMOS header, the BL voltage (VBL-XNOR) represents the summation of all input-weight products (IWP[m−1:0]). To achieve fully parallel computing and high throughput, each BL is sensed in parallel. The analog-to-digital read-outs of all BLs are generated simultaneously by comparing VBL-XNOR with an appropriate reference voltage (VREF), generated using the proposed dynamic input-aware reference generation (DIARG) scheme, through a common-mode-insensitive small-offset voltage-mode sense amplifier (CMI-VSA).

TABLE II
TRUTH TABLE OF PROPOSED XNORNN SRAM-CIM
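The column-wise product-sum read-out described above can be modeled in a few lines. This is an illustrative abstraction (ours, not the silicon): the bitline summation is modeled as an integer sum of ±1 input-weight products, and the sense amplifier's comparison against the PSR = 0 reference as a sign decision:

```python
# Illustrative model of the fully parallel XNORNN product-sum:
# per-cell XNOR products are summed on a column, and the sign is
# resolved against a reference corresponding to PSR = 0.

def xnor_iwp(inp: int, weight: int) -> int:
    """IWP = IN XNOR W for +/-1 operands: +1 when equal, else -1."""
    return 1 if inp == weight else -1

def column_psr(inputs, weights) -> int:
    """PSR = +1 when N(IWP=+1) > N(IWP=-1), else -1."""
    total = sum(xnor_iwp(i, w) for i, w in zip(inputs, weights))
    return 1 if total > 0 else -1
```

For example, inputs (+1, −1, +1) against weights (+1, +1, +1) yield IWPs of (+1, −1, +1), so the column PSR is +1.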

B. Circuits and Operation of XNORNN SRAM-CIM Macro

1) XNOR Computing in a DSC6T SRAM Cell Array: Fig. 3(b) presents the detailed waveform of the SRAM-CIM in XNOR operation. When an input (IN[i]) = '+1', the corresponding WLL (WLL[i]) is asserted as '1' and the corresponding WLR (WLR[i]) is asserted as '0'. When IN[i] = '−1', WLL[i] = 0 and WLR[i] = 1. When a SRAM cell stores the weight "+1", its storage node (Q) stores logic "1". When a DSC6T cell stores the weight "−1", the corresponding Q = 0 and QB = 1. Table II presents a truth table of the XNOR operation. When the input-weight-product result (IWP) of the XNOR operation between IN[i] and MC[i, j] is '+1', the DSC6T cell generates a charge current (IMC-C) on BLL or BLR, in accordance with the value (weight) it stores. When IWP = '−1', the DSC6T cell generates a discharge current (IMC-D) on either BLL or BLR. As a result, the XNOR result is presented on BLL and BLR. By combining BLL and BLR into the dataline (DL) by turning on the BL-selection switch (BLSW = 1), the total count of the XNOR results is presented on the DL, which results in VBL-XNOR. Thus, the XNOR-count results can be digitized by the sensing of VBL-XNOR.

2) Dynamic Input-Aware Reference Generation (DIARG) Scheme for XNORNN (DIARG-XNOR): To generate an appropriate reference voltage (VREF-XNORNN) for the XNORNN SRAM-CIM macros of different FCN layers, we developed a dynamic input-aware reference generation scheme for XNOR-mode operation (DIARG-XNOR), as shown in Fig. 4. The DIARG-XNOR comprises two reference columns (RC1 and RC2) of fixed-zero (Q = 0) replica memory cells (F0RC), a reference BL-header (R-BLH), a WL-combiner (WLCB), a reference-WL-tuner (RWLT), and a replica BL-selection switch (RBLSW). Each of RC1/RC2 has its own reference BL pair (RBLL1/RBLR1 and RBLL2/RBLR2) and WL pair (WLL1/WLR1 and WLL2/WLR2).

Fig. 4. Dynamic input-aware reference generation scheme for XNORNN.

In the second-to-last fully connected layer, referred to as FCN-(L-1), the output of each IO is derived by determining whether the number (NIWP=+1) of input-weight-product (IWP) = '+1' results from the cell-level XNOR (product) operations in the same column is larger than the number (NIWP=−1) of IWP = '−1' results. When NIWP=+1 > NIWP=−1, the product-sum result (PSR) of the column is '+1' (PSR = '+1'). Otherwise, PSR = '−1'. Therefore, the target reference product-sum result (PSRREF) should be zero

(PSR = 0 or NIWP=+1 = NIWP=−1). In the case where n inputs are activated for an XNORNN SRAM-CIM, n corresponding WL-combiners (WLCBs) are enabled. Then, the n pairs of WLL1s and WLR1s of the WLCBs activated for RC1 are asserted as high. As a result, n × IMC-C is generated on BLL and n × IMC-D is generated on BLR. With the combination of RBLL and RBLR (replica BL-selection switch, RBLSW = 1), the reference current is set as n × IMC-C + n × IMC-D, resulting in PSRREF = n × (−1) + n × (+1) = 0. Accordingly, the reference voltage (VREF-XNOR-FC) for product-sum result (PSR) = 0 is generated, based on the value of n, across various process, voltage, and temperature (PVT) conditions.

In the last fully connected layer (FCN-last), there is always a winner among the neurons (outputs). Thus, the reference voltage (VREF-LAST) used to read the product-sum result (PSR) of each column should be set between the voltages of the winner and the second candidate, which is different from FCN-(L-1). Fig. 5 presents the target and dynamic input-aware reference generation (DIARG) VREF for the MNIST dataset. Generating VREF-XNOR-LAST requires circuit blocks additional to reference column RC1 and the WL-combiner (WLCB), including an input detect logic (IDL), reference column RC2, a look-up table (LUT), and a reference WL-tuner (RWLT).

Fig. 5. Target reference voltage for the FCN-last layer.

The input detect logic (IDL) is used to detect the number of inputs (n). Then, based on the value of n from the IDL, the LUT is used to control the enable signals of the reference WL-tuners (RWLTs). Reference column RC1 is used to generate the input-aware common-mode VREF, which is set at PSR = 0 across various inputs (input-aware). Reference column RC2 is used to fine-tune the reference voltage (or the reference PSR value).

When the enable signal of an RWLT (ENRWLT) is '0', the WLL of the corresponding fixed-zero replica memory cell (F0RC) in reference column RC2 is asserted as high, whereupon 1 × IMC-C is generated on RBLL2. With the combination of RBLL1, RBLR1, RBLL2, and RBLR2 (as shown in RBLSW in Fig. 5), and the influence of the two reference columns (RC1 and RC2, i.e., a 2× header), the change in the reference PSR value (ΔPSR) is set as ΔPSR = (1 × IMC-C)/2 = [1 × (−1)]/2 = −1/2. When ENRWLT = '1', the WLR2 of the corresponding F0RC in reference column RC2 is asserted as high, whereupon 1 × IMC-D is generated on RBLR2. Thus, ΔPSR = (1 × IMC-D)/2 = [1 × (+1)]/2 = +1/2.

TABLE III
LOOK-UP TABLE (LUT) FOR REFERENCE TUNING OF FCN-LAST

The LUT is listed in Table III. The set values of this LUT are based on the product-sum-result (PSR) distribution of 10K MNIST images using the XNORNN model with an optimized sensing margin. With an increase in the detected value (n), the PSR of the winner increases (based on analysis of the MNIST dataset). With the assistance of this LUT, the inference accuracy for MNIST reached 96.1% in only one read-out iteration.

For example, if the number of inputs is 10 (n = 10), then the reference product-sum-result (PSR) value is first set at PSR = '0' using reference column RC1. This is the same method used for FCN-(L-1). Based on the LUT (as shown in Table III), two ENRWLT signals are set to '0' to fine-tune the reference PSR value to PSR = 0 + 2 × (−1/2) = −1. Finally, the cell-tracking VREF = 0.53 V is generated under the following conditions: 25°C and the TT corner.

IV. PROPOSED COMPUTING-IN-MEMORY SRAM UNIT-MACRO FOR MBNN

A. Architecture and Basic Operation of MBNN SRAM-CIM

Fig. 6 illustrates the architecture of the proposed modified binary neural network (MBNN) SRAM-CIM macro and its operational waveform, which is similar to that of the XNOR neural network (XNORNN) SRAM-CIM macro, except for the algorithm-dependent asymmetric control (ADAC) and the dynamic input-aware reference generation for modified binary neural network (DIARG-MBNN) schemes. This MBNN SRAM-CIM has two operation modes: SRAM and MBNN modes. SRAM mode, which is the same as that in the XNORNN SRAM-CIM, is activated to store (write operation) the trained weights. MBNN mode is for MBNN-based CIM operations. Unlike XNOR operations, the input of MBNN is either "1" or "0".

Fig. 7(a) presents the structure and operational waveform of the MBNN SRAM-CIM using the algorithm-dependent asymmetric control (ADAC) scheme in left-sensing mode (asymmetric flag, AF = 1). When an input (IN[i]) = '1', its WLL (WLL[i]) is asserted as '1' and its WLR (WLR[i]) is asserted as '0'. When IN[i] = '0', WLL[i] = WLR[i] = 0. The 6T SRAM cell stores the weight "+1" by setting Q = 1, while storing the weight "−1" by setting Q to 0. When the input-weight-product result (IWP) of an MBNN operation between

IN[i] and MC[i, j] is '+1', the DSC6T cell generates a charge current (IMC-C) on the BLL. When IWP = −1, the DSC6T cell generates a discharge current (IMC-D) on the BLL. When IWP = 0, the DSC6T cell does not provide any cell current to the BLL. At the same time, BLR is disconnected from BLL and remains in a floating state without IMC-C or IMC-D. The total number of input-weight-product (IWP) results associated with each MBNN operation on the activated DSC6T cells is presented on the BLL. As a result, the MBNN count can be digitized by the sensing of VBLL. A truth table of the MBNN SRAM-CIM is presented in Table IV.

Fig. 6. (a) Macro structure and (b) waveform of MBNN SRAM-CIM.

TABLE IV
TRUTH TABLE OF MBNN SRAM-CIM

B. Algorithm-Dependent Asymmetric Control (ADAC) Scheme

Fig. 8 presents a data-pattern analysis of the MNIST test images. We can see an intriguing asymmetry between the number of "input-weight-product (IWP) = +1" results (NIWP=+1) and "input-weight-product (IWP) = −1" results (NIWP=−1) on a BL in the last two FCN layers (i.e., FCN-(L-1) and FCN-Last) when applied to MNIST. This is a generic characteristic found in many applications and datasets, due to the fact that the product-sum result (PSR, Σ IN × W) polarizes the last layer to present only the most probable candidate. Furthermore, the asymmetry seen in NIWP=+1/NIWP=−1 is opposed in the two layers. In accordance with these characteristics, the algorithm-dependent asymmetric control (ADAC) scheme (combining the split-WL feature of the DSC6T cell) is enabled to reduce the BL current and the power consumption of the macro. This makes it possible to implement two WL/BL access modes (1. left sensing, WLL-BLL; and 2. right sensing, WLR-BLR) for the two layers using the same SRAM-CIM unit-macro.

In the ADAC scheme, we employed the following sub-blocks: 1. an asymmetric flag (AF); 2. bitline-selection switches (BLSW); 3. WL-selection switches (WLSW); 4. dual-path output-drivers (DPOD); and 5. the dual-split-control DSC6T cell array.

The asymmetric flag (AF) is pre-defined during training, or configured by the application, to specify whether left-sensing or right-sensing is to be used. We determine AF using the NIWP=+1 and NIWP=−1 of all BLs in the macro. In the case where NIWP=+1 > NIWP=−1, AF is asserted as 1 for left-sensing. WLSW activates left-sensing by asserting only the WLLs of the selected rows (IN = 1), while all WLRs are grounded. Each of the BLLs is connected to its corresponding VSA via BLSW, whereas BLR = VDD is isolated from the VSA. Then, the VSA detects VBLL and directs its output (SAOUT) via the non-inversion path of the dual-path output-drivers (DPOD) to DOUT.

Fig. 7(b) shows that when AF = 0 (NIWP=+1 < NIWP=−1) and right-sensing is selected, the roles of WLR-BLR and WLL-BLL are switched, and the final SAOUT result of the VSA is sent along the inversion path of the DPOD to DOUT.

This means that ADAC can be combined with the DSC6T cell to reduce the BL current and power consumption to below those of a conventional 6T cell. This can be explained by a reduction in the parasitic load on activated WLs (one transistor per cell), a reduction in the BL current on the selected BL, and the blocking of BL current from unselected BLs.

C. Dynamic Input-Aware VREF Generation Scheme for MBNN (DIARG-MBNN)

The proposed DIARG scheme for MBNN (DIARG-MBNN) is similar to the DIARG-XNOR scheme in Section III; however, in this case, VREF is based on the number of Input = 1 (NInput=1), rather than on the total number of all inputs (n) as in DIARG-XNOR. Furthermore, the design of the input detect logic (IDL) for MBNN differs from the design for XNORNN.

As described in the previous section, for FCN-(L-1), the reference voltage for sensing should be set at PSR = 0 or NIWP=+1 = NIWP=−1. In cases where m WLs (NInput=1 = m) are activated for an MBNN SRAM-CIM, m corresponding WL-combiners (WLCBs) are enabled. Based on the truth table (Table IV), we can see that m pairs of WLL1s and WLR1s associated with the WLCBs activated for RC1 are asserted as high. As a result, m × IMC-C are generated on the BLL and m × IMC-D are generated on the BLR. Using a combination of

RBLL and RBLR (RBLSW = 1), the reference current is set as m × IMC-C + m × IMC-D, such that the reference product-sum result (PSRREF) = m × (−1) + m × (+1) = 0. This is the means by which the reference voltage (VREF-MBNN) is generated for product-sum result (PSR) = 0 (based on the NInput=1 values).

Fig. 7. Structure and waveform of MBNN SRAM-CIM using the ADAC scheme: (a) applying AF = 1 (nominal mode) for the FCN-(L-1) layer; and (b) applying AF = 0 (inversing mode) for the FCN-last layer.

Fig. 8. Data-pattern analysis of MNIST test images for the last two FCN layers.

In FCN-Last, the reference VREF-MBNN-LAST should be set between the winner and the other candidates, which is different from VREF-MBNN. Similar to XNORNN (as described in Section III), generating VREF-MBNN-LAST also requires six circuit blocks: RC1, RC2, WLCB, IDL, LUT, and RWLT. The difference is that the number of enabled RWLT signals is based on the number of Input = 1 (NInput=1); therefore, we use the IDL here to detect NInput=1.

V. ADDITIONAL CIRCUIT TECHNIQUES USED IN THE DESIGN OF SRAM-CIM

The XNORNN and MBNN SRAM-CIM unit-macro designs impose the same challenges: (1) SRAM cells suffer from write-disturb issues when the BL voltage is too low; (2) the minimum sensing margin for reading the PSR (BL voltage) is small; and (3) the BL voltage covers a wide range and induces different input offsets for typical voltage-mode sense amplifiers (VSAs). We tackled the above-mentioned challenges by developing a write disturb free (WDF) scheme and a common-mode-insensitive small-offset voltage-mode sense amplifier (CMI-VSA).

A. Write Disturb Free (WDF) Scheme

A wide range of product-sum results (PSR) for each bitline imposes a wide range of bitline voltages (VBL), as a function of NIWP=+1 and NIWP=−1, which respectively refer to the number of cells conducting charge (IMC-C) and discharge (IMC-D) currents. IMC-D decreases VBL significantly, whereas IMC-C increases VBL only marginally. Thus, VBL is heavily dependent on NIWP=−1, but only slightly dependent on NIWP=+1. A small number of activated WLs (NWL) leads to a higher VBL, due to the fact that NIWP=−1 is small. A large NWL can lead to a very low VBL when NIWP=−1 is large. This can cause a 6T SRAM cell storing "1" (Q = 1) to suffer data flipping to "0" (Q = 0). This is a particular concern for cases with process variation or global process corners that favor write operations, resulting in write disturbance.

This work employed a BL-clamper (BLC) to avoid write disturbance. This scheme was implemented using diode-connected PMOS transistors. The BLC prevents VBL from

dropping below the threshold voltage, which is usually higher than the write-threshold voltage (VWTH) of 6T SRAM cells.

We also employed an input-aware BL header (BLH) to shift VBL according to NWL to attain a higher sensing margin (VSM). When NWL is small, there is no risk of write disturbance because VBL is usually higher, and a weaker BLH is used to ensure a sufficiently large voltage difference between product-sum result (PSR) values. A larger NWL increases the risk of write disturbance because VBL is usually lower, and a stronger BLH is used to raise VBL-CM for the assignment of the various product-sum result (PSR) values ranging between VDD and VWTH. Table V lists the LUT used in conjunction with the write disturb free (WDF) scheme. Fig. 9 presents the WDF scheme based on this LUT.

Fig. 9. Write disturb free (WDF) scheme for SRAM-CIM macros.

TABLE V
LOOK-UP TABLE (LUT) WITH WDF SCHEMES

B. Common-Mode-Insensitive Small-Offset Voltage-Mode Sense Amplifier (CMI-VSA)

As shown in Fig. 10, we developed a common-mode-insensitive small-offset voltage-mode sense amplifier (CMI-VSA) to provide tolerance for a small BL signal margin (VSM) against the wide VBL common-mode (VBL-CM) range across various PSRs. The CMI-VSA comprises two cross-coupled inverters (INV-L, INV-R), two capacitors (C1, C2), and eight switches (SW1-SW8) for auto-zeroing with margin enhancement. This VSA also uses the capacitors to enable DC isolation to accommodate a wide input VBL-CM range. The CMI-VSA employs three phases (PH1 ∼ PH3) for its sensing operations.

In standby mode, SW1 = SW2 = on and SW3 = SW4 = off, while the CMI-VSA latches the previous result at its internal nodes VAR and VAL. During sensing operations, the WL signals are triggered to develop VBL and VREF. For a given BL developing time (TBL), the CMI-VSA is enabled to implement PH1 (auto-zeroing), with the result that VCL = VAL = VTRP-L and VCR = VAR = VTRP-R. In the meantime, SW5 = SW6 = on and SW7 = SW8 = off ensure that the voltages at nodes INL (VINL) and INR (VINR) are equal to VBL and VREF.

Fig. 10. Schematic illustration and waveform of the common-mode-insensitive small-offset voltage-mode sense amplifier (CMI-VSA).

In PH2 (pre-amplification), setting SW1 ∼ SW4 = off puts VCL/VCR/VAL/VAR in a floating state. Setting SW5 = off and SW8 = on switches VINL from VBL to VREF and then couples (VBL − VREF) to VCL through C1, such that VCL = VTRP-L − (VBL − VREF). Setting SW6 = off and SW7 = on switches VINR from VREF to VBL and then couples (VBL − VREF) to VCR through C2, such that VCR = VTRP-R + (VBL − VREF). Ideally, this increases the voltage difference (ΔVINV) between VCL and VCR to 2 × (VBL − VREF).

In PH3 (amplification), setting SW1 = SW2 = on enables INV1 and INV2 to amplify ΔVINV in order to generate a full swing for VCL and VCR. The operational waveform of the CMI-VSA is shown in Fig. 10. Note that PH1 is hidden in the BL developing time (TBL), while PH3 is the same as in a regular latch-type SA. Only PH2 consumes any additional delay.

VI. PERFORMANCE AND MEASUREMENT RESULTS

A. Performance of Proposed Schemes

Fig. 11(a) illustrates the input offset voltage of the CMI-VSA versus various input common-mode voltages (VCM), as sim-
three phases. ulated using 32K Monte Carlo simulations where VDD =
In PH1 (voltage development), SW3 = SW4 = on and 1V at room temperature. The input offset of CMI-VSA was
SW1 = SW2 = off to force the two inverters (INV-L and 2.5−3.6× smaller than that of a conventional latch-type VSA.
INV-R) into an auto-zero state. This biases the CL and CR Fig. 11(b) compares CMI-VSA and conventional VSAs in
nodes at their respective trigger points (VTRP−L and VTRP−R ), terms of read sensing speed (TSA ) with various VCM values.
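The PH2 margin-doubling behavior described above can be sanity-checked with a small numeric model (an idealized sketch we add here, not circuitry from the paper: ideal switches and unity coupling through C1/C2 are assumed, and the function name is ours):

```python
def cmi_vsa_preamp(v_bl, v_ref, v_trp_l, v_trp_r):
    """Idealized PH1/PH2 model of the CMI-VSA input network.

    PH1 (auto-zero): VCL/VCR sit at the inverter trip points while
    INL/INR are driven to VBL and VREF, respectively.
    PH2 (pre-amplification): the two inputs are swapped, so equal and
    opposite steps couple through C1/C2 onto the floating VCL/VCR nodes.
    """
    v_cl = v_trp_l + (v_ref - v_bl)  # INL: VBL -> VREF couples -(VBL - VREF)
    v_cr = v_trp_r + (v_bl - v_ref)  # INR: VREF -> VBL couples +(VBL - VREF)
    return v_cl, v_cr

# 50 mV of BL signal with matched 0.5 V trip points
v_cl, v_cr = cmi_vsa_preamp(v_bl=0.45, v_ref=0.40, v_trp_l=0.5, v_trp_r=0.5)
print(round(v_cr - v_cl, 3))  # 2 x (VBL - VREF) = 0.1 V: the margin is doubled
```

Because the trip points cancel (to first order) and only the difference VBL − VREF survives, the pre-amplified result is independent of the input common-mode level, which is the property the scheme relies on.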
Fig. 11. Simulated (a) input offset and (b) sensing speed of CMI-VSA and conventional (CNV) VSA.
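The input-aware BLH control of the WDF scheme amounts to a LUT lookup on NWL; a minimal sketch (the break-points and strength codes below are illustrative placeholders, not the actual Table V entries, which are not reproduced in this text):

```python
# Hypothetical WDF look-up table: (max activated WLs, BLH strength code).
# A larger NWL pulls VBL lower, so a stronger BL header is selected to keep
# VBL between VWTH and VDD; a weak header preserves PSR margin when NWL is small.
BLH_LUT = [
    (4, "weak"),
    (16, "medium"),
    (64, "strong"),
]

def select_blh(n_wl):
    """Return the BLH strength setting for NWL activated wordlines."""
    for bound, strength in BLH_LUT:
        if n_wl <= bound:
            return strength
    raise ValueError("NWL exceeds the 64-row macro")

print(select_blh(2), select_blh(40))  # weak strong
```

The real design would pick its break-points from the VBL distributions of Fig. 12(a); the point here is only that the control is a cheap, input-dependent table lookup rather than a fixed setting.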

Fig. 12. Simulated (a) variations of bitline voltage and (b) inference accuracy of the last two fully connected layers in SRAM-CIM macros.

Fig. 13. Simulated current consumption of SRAM-CIM: DSC6T with ADAC scheme vs. conventional 6T SRAM cell.

Fig. 14. Structure of SRAM-CIM testchip.
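As background for the measurements that follow, the fully parallel product-sum performed inside one 64 × 64 macro can be modeled bit-exactly in software (a behavioral sketch of the XNOR-count convention, with ±1 values encoded as 1/0 so that XNOR becomes an equality test; the function names are ours, not from the paper):

```python
def macro_product_sum(inputs, weight_matrix):
    """Evaluate the product-sum result (PSR) of every column for one input vector.

    Each column holds all of the weights of one output neuron (the FC mapping),
    and each cell contributes XNOR(input, weight) to its column's count.  The
    macro evaluates all columns in a single fully parallel access; this loop
    does the same thing sequentially.
    """
    n_cols = len(weight_matrix[0])
    psr = [0] * n_cols
    for x, row in zip(inputs, weight_matrix):
        for c, w in enumerate(row):
            psr[c] += 1 if x == w else 0  # XNOR on {0,1}-encoded +/-1 values
    return psr

# Toy 4-row x 2-column array: column c holds the weights of output neuron c
W = [[1, 0],
     [0, 0],
     [1, 1],
     [0, 0]]
x = [1, 0, 1, 1]
print(macro_product_sum(x, W))  # [3, 2]
```

The winner-detection step then reduces to finding the column whose PSR (and hence whose BL voltage) is largest, which is what the DIARG/CMI-VSA sensing path performs in the analog domain.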

As expected, the CMI-VSA provided uniform sensing delay across a wide range of VCM values. In a conventional VSA, TSA tends to slow down when VCM drops. At VCM = 0.3V, the TSA of the CMI-VSA was 10+× faster than that of conventional VSAs.

Fig. 12(a) presents the simulated BL voltage distributions of the MBNN SRAM-CIM macro for the winner (VWIN) and the candidates with the 2nd-highest PSR (V2ND) when applied to the MNIST dataset for the last FCNL. This analysis shows that when using a fixed VREF for BL sensing, erroneous detections can occur in many cases, due to overlap between VWIN and V2ND (> 0.4V), as shown in Fig. 12(a). This means that even with a perfect VSA, 5-6 sensing iterations are required to approach the baseline accuracy of the model employed for the MNIST dataset. As shown in Fig. 12(b), DIARG achieved winner-detection accuracy of 96.1% in the first iteration. In contrast, the conventional fixed-VREF approach requires at least four iterations to achieve this level of performance. The proposed scheme was thus shown to reduce latency and energy overhead by 4+×.

Fig. 13 illustrates the current consumption of the FCN-last when applied to the MNIST dataset. When using a conventional 6T SRAM array, both of the pass-gates (PGL and PGR) were simultaneously activated by the same wordline, such that BLL and BLR both consumed current for product-sum operations when using the XNORNN or the MBNN models. The use of DSC6T with only one pass-gate turned on reduced the average current consumption of the DSC6T SRAM-CIM by 46.5%, compared to a conventional 6T SRAM array. Combining DSC6T with the ADAC scheme reduced current consumption even further (by 61.4% compared to conventional 6T SRAMs).

B. Measurement Results

This work implemented a testchip that included four 4Kb (64 × 64b) DSC6T SRAM-CIM unit-macros, fabricated using a 65nm CMOS logic process. The structure of the testchip is presented in Fig. 14. The four DSC6T SRAM-CIM unit-macros were respectively assigned to a modified binary LeNet-5 neural network model (2 CNN layers with 5 × 5 kernels and 2 FC layers: one with 64 × 64 weights and the other with 64 × 10 weights) for the XNORNN and MBNN macros. For the fully-connected (FC) layers, we listed in the same column all of the weights corresponding to a given output in realizing the product-sum (PS) operation, as shown in Fig. 15. The proposed DSC6T SRAM-CIM unit-macro is scalable for a variety of neural networks. First, this scheme allows a larger number of parallel activated rows (m). The maximum m value within a macro represents a tradeoff among the input offset of the sense amplifier, accuracy, area efficiency, and the neural network model that is employed. Generally, a smaller input offset can increase the maximum value of m, albeit at the cost of larger area. Second, additional support for a larger or deeper network can be gained by employing an accelerator chip with multiple unit-macros [33] and modifications to the readout circuits. When implementing multiple mini-macros for larger networks, there is a tradeoff between the area and power overhead versus the precision required to generate partial sums and the size of the unit-macro. As shown in Fig. 16, when the precision

required to generate partial sums was 3b, the area overhead was only 6.21% and the power overhead was less than 7.74%. Furthermore, each unit-macro could generate a partial sum, all of which could be added up to obtain the final sum or a larger product sum for binary activation. We employed the DFF-based path-delay exclusion approach [34], [35] to enable the extraction of access times for each SRAM-CIM macro. A die photo is presented in Fig. 17. XNORNN macros achieved around 1.5% higher classification accuracy, but lower energy efficiency, than MBNN macros.

Fig. 15. Mapping method for the fully-connected (FC) layer.

Fig. 16. Simulated area cost and power consumption under different precision of generated partial sums.

Fig. 17. Die photo of SRAM-CIM testchip.

Fig. 18. Captured waveform: (a) XNOR1 macro for FCNN-(L-1), (b) XNOR1 + XNOR2 macros for the last-two FCNNs (FCNN-(L-1) + FCNN-Last).

1) Measurement Results From XNORNN SRAM-CIM: Fig. 18 presents the captured waveform obtained during the inference operation of the XNORNN-based FCNN-(L-1) (XNOR1 macro). The macro access time (TAC−M) was 2.4ns for the detection of the MNIST winner at VDD = 1V.

The read access time (TAC−M−2layers) of the integrated last-two XNORNN fully connected layers (XNOR1 and XNOR2 macros) was 5.0ns for MNIST image identification.

When tested using 10k preprocessed MNIST images, the inference accuracy of the XNORNN SRAM-CIM in the last two FCN layers was measured at 96.5%. The measured energy per layer was 134.3pJ, which is equivalent to an energy efficiency of 30.49TOPS/W per 4Kb macro. Note that in this energy derivation, one operation is equal to one XNOR-count operation. Counting each XNOR operation as two separate operations (as in some reports [13]–[15]) would translate to an energy efficiency of 60.98 TOPS/W for our 4Kb XNORNN SRAM-CIM unit-macro.

2) Measurement Results From MBNN SRAM-CIM: Fig. 19(a) presents the captured waveform of the inference operation of the MBNN-based FCNN-(L-1) (MBNN1 macro). The macro access time (TAC−M) was 2.3ns for detection of the MNIST winner at VDD = 1V.

The read access time (TAC−M−2layers) of the integrated last-two MBNN-based fully connected layers (MBNN1 and MBNN2 macros) was 4.8ns for MNIST image identification, as shown in Fig. 19(b).

Fig. 20 presents a measured shmoo plot of the MBNN-mode SRAM-CIM unit-macro in terms of inference time vs. various wordline voltages. The proposed schemes support a wide range of WL voltages (VWL = 1.2V − 0.8V) without speed degradation at VDD = 1V. When the WL voltage was at the lower bound (0.7V-0.8V), there was insufficient cell current to discharge the BL load quickly. This may account for the observed differences in access times when VWL = 0.7 ∼ 0.8V. When the WL voltage was at the higher bound (1V-0.8V), there was sufficient cell current to discharge the BL load quickly. In this situation, the BL voltage developing time was dominated by voltage-divider behavior and remained nearly constant. As a result, we did not observe any differences in access times when VWL = 1V ∼ 0.8V. This macro supports

the use of VWL = 0.8V with a 2.3ns penalty in inference times.

Fig. 19. Captured waveform: (a) MBNN1 macro for FCNN-(L-1), (b) MBNN1 + MBNN2 macros for the last-two FCNNs (FCNN-(L-1) + FCNN-Last).

Fig. 20. Measured shmoo plot of the proposed MBNN SRAM-CIM.

Fig. 21. Demonstration of SRAM-CIM testchip.

Fig. 22. Measured accuracy based on the MNIST and CIFAR10 datasets.

The measured energy of a 4Kb MBNN SRAM-CIM macro was 73.4pJ, which translates into an energy efficiency of 55.8 TOPS/W (1 operation is equal to 1 multiply-count operation). If each product-sum operation were counted as two operations, the energy efficiency would be 111.6 (55.8 × 2) TOPS/W. In the standard 10k MNIST images test, the inference accuracy was measured at 95.1% when using two MBNN SRAM-CIM unit-macros as the last-two FCN layers.

Fig. 21 illustrates a demonstration system using the four SRAM-CIM macros with XNORNN and MBNN for MNIST testing. Note that this system-level demonstration was conducted using a camera to detect the input image. The captured images were then reduced in size using the same computer used for activation computation in order to generate the inputs required for the fabricated SRAM-CIM macro testchip. Table VI presents a comparison with prior works, both near-memory digital [31] and computing-in-memory (CIM) approaches [19], [24], [27]. The proposed DSC6T SRAM-CIM unit-macro can use various LUT settings for dynamic input-aware reference generation (DIARG), thereby providing support for a variety of applications and network architectures. Fig. 22 illustrates the measured accuracy of the fabricated testchip when applied to the MNIST and CIFAR-10 datasets. In the CIFAR10 experiment, the neural network we employed was a modified binarized VGG-16 with six CNN layers and three FC layers (FCNN1, FCNN2, and FCNN3). The kernel size of the six CNN layers was 3 × 3, whereas the sizes of FCNN1, FCNN2, and FCNN3 were 8192 × 1024, 1024 × 1024, and 1024 × 10, respectively. Note that the network used for CIFAR-10 exceeded the capability of our fabricated SRAM-CIM testchip, which was designed for MNIST applications. We therefore applied SRAM-CIM only to FCNN3, with multiple re-loadings of weights, in order to emulate the scenario of using multiple unit-macros for networks larger than MNIST. Limiting the output of the fabricated SRAM-CIM unit-macro to single-bit precision led to a significant loss of accuracy compared with baseline values. As mentioned before, using a unit-macro with multibit readout capability can improve inference accuracy for the CIFAR-10 dataset. This experiment confirmed that the proposed DSC6T SRAM-CIM unit-macro structure supports a variety of applications and networks using a variety of LUT settings.
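The energy-efficiency figures quoted above can be reproduced from the measured per-access energies, assuming that one fully parallel access of a 64 × 64 macro counts as 64 × 64 = 4096 one-operation multiply-counts (our reading of the stated one-op-per-multiply-count convention; the helper function is ours):

```python
def tops_per_watt(ops_per_access, energy_per_access_joules):
    """Energy efficiency: operations per joule, expressed in TOPS/W (1e12 ops/J)."""
    return ops_per_access / energy_per_access_joules / 1e12

OPS = 64 * 64  # one fully parallel access of a 4-Kb (64 x 64b) unit-macro

xnor = tops_per_watt(OPS, 134.3e-12)  # XNORNN: 134.3 pJ per layer
mbnn = tops_per_watt(OPS, 73.4e-12)   # MBNN:    73.4 pJ per layer
print(xnor, mbnn)            # ~30.5 and ~55.8 TOPS/W, matching the reported values
print(2 * xnor, 2 * mbnn)    # ~61.0 and ~111.6 under the two-ops-per-MAC convention
```

That the computed values land on the reported 30.49 and 55.8 TOPS/W supports the 4096-ops-per-access reading, and doubling the operation count reproduces the 60.98 and 111.6 TOPS/W figures cited for the alternative counting convention.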

TABLE VI
COMPARISON TABLE OF SRAM-CIM CHIP

VII. CONCLUSIONS

This work proposes a 65nm 4Kb algorithm-dependent SRAM-CIM unit-macro, which supports both XNORNN and our proposed modified BNN (MBNN) nets. This work also proposes dynamic input aware reference generation (DIARG), algorithm dependent asymmetric control (ADAC), write disturb free (WDF), and common-mode-insensitive voltage-mode sensing-amplifier (CMI-VSA) schemes to reduce power consumption while achieving fast inference times and robust inference operations. A testchip with four 65nm 4Kb SRAM-CIMs was fabricated to confirm the proposed concepts. The fabricated chip achieved an access time of 2.3ns and an energy efficiency of 55.8 TOPS/W per layer when applied to MNIST image recognition.

REFERENCES

[1] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556. [Online]. Available: https://arxiv.org/abs/1409.1556
[2] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 1–9.
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[4] K. Bong, S. Choi, C. Kim, S. Kang, Y. Kim, and H.-J. Yoo, "A 0.62 mW ultra-low-power convolutional-neural-network face-recognition processor and a CIS integrated with always-on Haar-like face detector," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2017, pp. 344–346.
[5] M. Price, J. Glass, and A. P. Chandrakasan, "A scalable speech recognizer with deep-neural-network acoustic models and voice-activated power gating," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2017, pp. 244–245.
[6] J. Zhang, Z. Wang, and N. Verma, "A machine-learning classifier implemented in a standard 6T SRAM array," in Proc. IEEE Symp. VLSI Circuits, Jun. 2016, pp. 1–2.
[7] F. Su et al., "A 462 GOPs/J RRAM-based nonvolatile intelligent processor for energy harvesting IoE system featuring nonvolatile logics and processing-in-memory," in Symp. VLSI Technol. Dig. Tech. Papers, Jun. 2017, pp. T260–T261.
[8] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Jan. 2016, pp. 262–263.
[9] Y. Kim, D. Shin, J. Lee, Y. Lee, and H.-J. Yoo, "A 0.55 V 1.1 mW artificial-intelligence processor with PVT compensation for micro robots," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Sep. 2016, pp. 258–259.
[10] S. Park, S. Choi, J. Lee, M. Kim, J. Park, and H.-J. Yoo, "A 126.1 mW real-time natural UI/UX processor with embedded deep-learning core for low-power smart glasses," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Oct. 2016, pp. 254–255.
[11] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1," 2016, arXiv:1602.02830. [Online]. Available: https://arxiv.org/abs/1602.02830
[12] M. Kim and P. Smaragdis, "Bitwise neural networks," in Proc. Int. Conf. Mach. Learn. Workshop Resource-Efficient Mach. Learn. (ICML), Jul. 2015, pp. 6–11.
[13] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 525–542.
[14] M. Courbariaux, Y. Bengio, and J. David, "BinaryConnect: Training deep neural networks with binary weights during propagations," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2015, pp. 3105–3113.
[15] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio, "Neural networks with few multiplications," in Proc. Int. Conf. Learn. Represent. (ICLR), 2016, pp. 1–9.
[16] R. Liu et al., "Parallelizing SRAM arrays with customized bit-cell for binary neural networks," in Proc. Design Automat. Conf. (DAC), Jun. 2018, p. 21.
[17] M. Horowitz, "Computing's energy problem (and what we can do about it)," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2014, pp. 10–14.
[18] Q. Dong et al., "A 0.3 V VDDmin 4+2T SRAM for searching and in-memory computing using 55 nm DDC technology," in Proc. IEEE Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2017, pp. 160–161.
[19] J. Zhang, Z. Wang, and N. Verma, "In-memory computation of a machine-learning classifier in a standard 6T SRAM array," IEEE J. Solid-State Circuits, vol. 52, no. 4, pp. 915–924, Apr. 2017.
[20] A. Agrawal, A. Jaiswal, C. Lee, and K. Roy, "X-SRAM: Enabling in-memory Boolean computations in CMOS static random access memories," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 12, pp. 4219–4232, Dec. 2018.
[21] M. Kang, M.-S. Keel, N. R. Shanbhag, S. Eilert, and K. Curewitz, "An energy-efficient VLSI architecture for pattern recognition via deep embedding of computation in SRAM," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2014, pp. 8326–8330.
[22] S. Jeloka, N. B. Akesh, D. Sylvester, and D. Blaauw, "A 28 nm configurable memory (TCAM/BCAM/SRAM) using push-rule 6T bit cell enabling logic-in-memory," IEEE J. Solid-State Circuits, vol. 51, no. 4, pp. 1009–1021, Apr. 2016.
[23] S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw, and R. Das, "Compute caches," in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2017, pp. 481–492.
[24] A. Biswas and A. P. Chandrakasan, "Conv-RAM: An energy-efficient SRAM with embedded convolution computation for low-power CNN-based machine learning applications," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2018, pp. 488–490.

[25] A. Agrawal et al., "Xcel-RAM: Accelerating binary neural networks in high-throughput SRAM compute arrays," Jul. 2018, arXiv:1807.00343. [Online]. Available: https://arxiv.org/abs/1807.00343
[26] Z. Jiang, S. Yin, M. Seok, and J.-S. Seo, "XNOR-SRAM: In-memory computing SRAM macro for binary/ternary deep neural networks," in Proc. IEEE Symp. VLSI Technol., Honolulu, HI, USA, Jun. 2018, pp. 173–174.
[27] S. K. Gonugondla, M. Kang, and N. Shanbhag, "A 42 pJ/decision 3.12 TOPS/W robust in-memory machine learning classifier with on-chip training," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2018, pp. 490–491.
[28] W. S. Khwa et al., "A 65 nm 4 Kb algorithm-dependent computing-in-memory SRAM unit-macro with 2.3 ns and 55.8 TOPS/W fully parallel product-sum operation for binary DNN edge processors," in IEEE ISSCC Dig. Tech. Papers, Feb. 2018, pp. 496–498.
[29] M.-F. Chang, C.-F. Chen, T.-H. Chang, C.-C. Shuai, Y.-Y. Wang, and H. Yamauchi, "A 28 nm 256 kb 6T-SRAM with 280 mV improvement in VMIN using a dual-split-control assist scheme," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2015, pp. 1–3.
[30] M.-F. Chang et al., "A compact-area low-VDDmin 6T SRAM with improvement in cell stability, read speed, and write margin using a dual-split-control-assist scheme," IEEE J. Solid-State Circuits, vol. 52, no. 9, pp. 2498–2514, Sep. 2017.
[31] K. Ando et al., "BRein memory: A single-chip binary/ternary reconfigurable in-memory deep neural network accelerator achieving 1.4 TOPS at 0.6 W," IEEE J. Solid-State Circuits, vol. 53, no. 4, pp. 983–994, Apr. 2018.
[32] T.-H. Yang, K.-X. Li, Y.-N. Chiang, W.-Y. Lin, H.-T. Lin, and M.-F. Chang, "A 28 nm 32 Kb embedded 2T2MTJ STT-MRAM macro with 1.3 ns read-access time for fast and reliable read applications," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2018, pp. 482–484.
[33] R. Guo et al., "A 5.1 pJ/neuron 127.3 μs/inference RNN-based speech recognition processor using 16 computing-in-memory SRAM macros in 65 nm CMOS," in IEEE Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2019, pp. 120–121.
[34] M.-F. Chang et al., "A 3T1R nonvolatile TCAM using MLC ReRAM for frequent-off instant-on filters in IoT and big-data processing," IEEE J. Solid-State Circuits, vol. 52, no. 6, pp. 1664–1679, Jun. 2017.
[35] M.-F. Chang et al., "A ReRAM-based 4T2R nonvolatile TCAM using RC-filtered stress-decoupled scheme for frequent-OFF instant-ON search engines used in IoT and big-data processing," IEEE J. Solid-State Circuits, vol. 51, no. 11, pp. 2786–2798, Nov. 2016.

Xin Si received the B.S. degree in integrated circuit design and integration system from the University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2016, where he is currently pursuing the Ph.D. degree. He is also with National Tsing Hua University, Taiwan. His research interests include analog circuit design, memory, and computing-in-memory circuit designs.

Win-San Khwa received the B.S. degree from the University of California at Los Angeles, Los Angeles, CA, USA, in 2007, the M.S. degree from the University of Michigan, Ann Arbor, MI, USA, in 2010, and the Ph.D. degree in electrical engineering from National Tsing Hua University, Hsinchu, Taiwan, in 2017. He joined Macronix International (MXIC) in 2012. He is currently working at TSMC on emerging memory path finding and IP development.

Jia-Jing Chen received the B.S. degree in electrical engineering from Chang Gung University, Taoyuan, Taiwan, in 2016. He is currently pursuing the M.S. degree with the Institute of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan. His current research interests include circuit design of SRAM and computing in memory.

Jia-Fang Li received the B.S. degree from the Electronic Department, National United University, Taiwan, in 2016. She is currently pursuing the M.S. degree with the Institute of Electronics Engineering, National Tsing Hua University, Hsinchu, Taiwan. Her current research interests include circuit design of SRAM and emerging nonvolatile memory.

Xiaoyu Sun (S'17) received the B.S. degree in electrical engineering from the Huazhong University of Science and Technology, Wuhan, China, in 2014, and the M.S. degree in electrical engineering from Arizona State University in 2016. He is currently pursuing the Ph.D. degree in electrical and computer engineering with the Georgia Institute of Technology, Atlanta, GA, USA. His research interests include SRAM- and NVM-based hardware implementations of neural networks.

Rui Liu (S'16) received the B.S. degree from Xidian University, Xi'an, China, in 2011, the M.S. degree from Peking University, Beijing, China, in 2014, and the Ph.D. degree in electrical engineering from Arizona State University in 2018. Her research interests include emerging non-volatile memory device/architecture design, radiation effects in RRAM devices and array architectures, hardware design for security systems, and new computing paradigm exploration.

Shimeng Yu (M'14–SM'19) received the B.S. degree in microelectronics from Peking University, Beijing, China, in 2009, and the M.S. and Ph.D. degrees in electrical engineering from Stanford University, Stanford, CA, USA, in 2011 and 2013, respectively. He is currently an Associate Professor of electrical and computer engineering with the Georgia Institute of Technology, Atlanta, GA, USA. He has published over 80 journal papers and over 130 conference papers with over 8000 citations and an H-index of 43. His research interests are emerging nano-devices and circuits with a focus on resistive memories for different applications, including machine/deep learning, neuromorphic computing, hardware security, and so on. He was a recipient of the NSF Faculty Early CAREER Award in 2016, the IEEE Electron Devices Society Early Career Award in 2017, and the ACM Special Interest Group on Design Automation (SIGDA) Outstanding New Faculty Award in 2018.

Hiroyuki Yamauchi received the Ph.D. degree in engineering from Kyushu University, Fukuoka, Japan, in 1997. In 1985, he joined the Semiconductor Research Center, Panasonic, Japan. From 1985 to 1987, he worked on the scaled sense amplifier for ultrahigh-density DRAMs. From 1988 to 1994, he was engaged in the research and development of 16-Mb CMOS DRAMs, including the battery-operated high-speed 16-Mb CMOS DRAM and the ultralow-power, three-times-longer self-refresh DRAM. He also invented the charge-recycling bus architecture and low-voltage-operated high-speed VLSIs, including a 0.5-V/100-MHz-operated SRAM and the Gate-Over-Driving CMOS architecture. After the development of various embedded memories (eSRAM, eDRAM, eFlash, eFeRAM, and eReRAM) for system LSI at Panasonic as a General Manager, he moved to the Fukuoka Institute of Technology, where he has been a Professor since 2005. His current interests are focused on convolution/deconvolution algorithms for machine-learning-based low-power signal and image processing for IoT sensor applications.

Qiang Li (S'04–M'07–SM'13) received the B.Eng. degree in electrical engineering from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2001, and the Ph.D. degree from Nanyang Technological University (NTU), Singapore, in 2007. Since 2001, he has been working on analog/RF circuits in both academia and industry, holding positions of Engineer, Project Leader, and Technical Consultant in Singapore and Associate Professor in Denmark. He is currently a Full Professor with the University of Electronic Science and Technology of China (UESTC), heading the Institute of Integrated Circuits and Systems. His research interests include low-voltage and low-power analog/RF circuits, data converters, and mixed-mode circuits for biomedical and sensor interfaces. Dr. Li was a recipient of the Young Changjiang Scholar award in 2015, the National Top-Notch Young Professionals award in 2013, and the UESTC Teaching Excellence Award in 2011. He was the TPC Chair of the IEEE 2018 APCCAS. He serves on the Student Research Preview (SRP) Committee of ISSCC and the TPC of ESSCIRC and A-SSCC (both in the DC subcommittee). He serves as a Guest Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I (TCAS-I). He is the Founding Chair of the IEEE Chengdu CASS/SSCS Joint Chapter.

Meng-Fan Chang (M'05–SM'14–F'19) received the M.S. degree from Pennsylvania State University, USA, and the Ph.D. degree from National Chiao Tung University, Hsinchu, Taiwan. Before 2006, he worked in industry for over ten years. From 1996 to 1997, he designed memory compilers at Mentor Graphics, NJ, USA. From 1997 to 2001, he designed embedded SRAMs and Flash in the Design Service Division (DSD) at TSMC, Hsinchu, Taiwan. From 2001 to 2006, he was a Co-Founder and the Director of IPLib Company, Taiwan, where he developed embedded SRAM and ROM compilers, flash macros, and flat-cell ROM products. He is currently a Full Professor with National Tsing Hua University (NTHU), Taiwan. His research interests include circuit designs for volatile and nonvolatile memory, ultra-low-voltage systems, 3D memory, circuit-device interactions, spintronics circuits, memristor logics for neuromorphic computing, and computing-in-memory for artificial intelligence. Dr. Chang was a recipient of several prestigious national-level awards in Taiwan, including the Outstanding Research Award of MOST-Taiwan, the Outstanding Electrical Engineering Professor Award, the Academia Sinica Junior Research Investigators Award, and the Ta-You Wu Memorial Award. He has been serving as an Associate Editor for the IEEE TVLSI and IEEE TCAD, and as a Guest Editor for the IEEE JSSC, IEEE TCAS-II, and IEEE JETCAS. He has been serving on the technical program committees for ISSCC, IEDM (Ex-Com and MT Chair), DAC (Sub-Com Chair), ISCAS (Track Co-Chair), A-SSCC, and numerous international conferences. He has been a Distinguished Lecturer (DL) for the IEEE Solid-State Circuits Society (SSCS) and the Circuits and Systems Society (CASS), a Technical Committee Member of CASS, and an Administrative Committee (AdCom) Member of the IEEE Nanotechnology Council. He has also been serving as the Program Director of the Micro-Electronics Program of the Ministry of Science and Technology (MOST) in Taiwan (2018–2020) and an Associate Executive Director for Taiwan's National Program of Intelligent Electronics (NPIE) and the NPIE Bridge Program (2011–2018).
