

CIMAT: A Compute-In-Memory Architecture for On-chip Training Based on Transpose SRAM Arrays

Hongwu Jiang, Xiaochen Peng, Shanshi Huang, and Shimeng Yu, Senior Member, IEEE

The authors are with the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332. E-mail: {hjiang318, xpeng76, shuang406}@gatech.edu, [email protected].
Manuscript received 14 Oct. 2019; revised 25 Jan. 2020; accepted 1 Mar. 2020. Date of publication 13 Mar. 2020; date of current version 9 June 2020. (Corresponding author: Shimeng Yu.) Recommended for acceptance by Xuehai Qian and Yanzhi Wang. Digital Object Identifier no. 10.1109/TC.2020.2980533

Abstract—Rapid development in deep neural networks (DNNs) is enabling many intelligent applications. However, on-chip training of DNNs is challenging due to the extensive computation and memory bandwidth requirements. To solve the bottleneck of the memory wall problem, the compute-in-memory (CIM) approach exploits analog computation along the bit lines of the memory array and thus significantly speeds up vector-matrix multiplications. So far, most CIM-based architectures target inference engines with offline training only. In this article, we propose CIMAT, a CIM Architecture for Training. At the bitcell level, we design two versions of transpose SRAM bitcells, 7T and 8T, to implement the bi-directional vector-to-matrix multiplication that is needed for feedforward (FF) and backpropagation (BP). Moreover, we design the periphery circuitry, mapping strategy and data flow for the BP process and weight update to support on-chip training based on CIM. To further improve training performance, we explore the pipeline optimization of the proposed architecture. We utilize the mature and advanced CMOS technology at 7 nm to design the CIMAT architecture with 7T/8T transpose SRAM arrays that support bi-directional parallel read. We explore the 8-bit training performance of ImageNet on ResNet-18, showing that the 7T-based design can achieve 3.38x higher energy efficiency (6.02 TOPS/W), a 4.34x higher frame rate (4,020 fps) and only 50 percent of the chip size compared to the baseline architecture with a conventional 6T SRAM array that supports row-by-row read only. Even better performance is obtained with the 8T-based architecture, which can reach 10.79 TOPS/W and 48,335 fps with 74 percent of the chip area of the baseline.

Index Terms—SRAM, deep neural network, compute-in-memory, on-chip training

1 INTRODUCTION
RECENTLY, DNNs have achieved remarkable improvement for a wide range of intelligent applications, from image classification to speech recognition and autonomous vehicles. The main elements of DNNs are convolutional layers and fully connected layers. To achieve incremental accuracy improvement, state-of-the-art DNNs tend to increase the depth and size of the neural network aggressively, which requires a large amount of computational resources and memory storage for high-precision multiply-and-accumulate (MAC) operations. For example, ResNet-50 [1] can achieve 70 percent accuracy with 25.6M parameters. Although graphics processing units (GPUs) are the most popular hardware for DNN training in the cloud, there have been many efforts from academia and industry on the design of application-specific integrated circuit (ASIC) accelerators for inference or even training on-chip. However, the memory wall problem remains in conventional CMOS ASIC accelerators such as the TPU [2], where the parameters (i.e., weights and activations) are stored in a global buffer and the computation is still performed at digital MAC arrays. Although parallel computation with optimized dataflow is realized across multiple processing elements (PEs), the weights and intermediate data still require inefficient on-chip or off-chip memory access. This drawback is exacerbated for DNN training due to frequent back-and-forth data movement. To alleviate the memory access bottleneck, compute-in-memory (CIM) is a promising solution for DNN hardware acceleration. Weight movement can be eliminated by in-memory computing. CIM can also improve the parallelism within the memory array by activating multiple rows and using the analog readout to conduct multiplication and current summation. However, most of the CIM architectures proposed so far [3], [4], [5] support DNN inference only. Some efforts have been made to accelerate training in CIM [6], [7], [8], but they are based on resistive random access memory (RRAM); the relatively large write latency/energy, and the asymmetry and nonlinearity in the conductance tuning of RRAM, prevent it from being an ideal candidate for extensive weight updates [9]. In addition, for the backpropagation (BP) process in DNN training, the CIM array needs to perform convolution computation with transposed weight matrices. If a regular memory array with row-wise input and column-wise output is used, the parallel analog read-out cannot be achieved when processing the transposed weight matrices. Instead, sequential column-by-column read-out is required during BP, which significantly decreases the throughput and energy efficiency. On the other hand, network compression is also a promising approach to reduce the energy and area cost of the storage.

There have been many efforts in reducing the precision of parameters even to 1-bit during inference (e.g., BNN [10] and XNOR-Net [11]). Due to the incremental accumulation in stochastic gradient descent (SGD) optimization, the precision demand and computational complexity for training are much higher than for inference. Recently, discrete training techniques (e.g., DoReFa-Net [12] and WAGE [13]) have been proposed to process both training and inference with low-bit-width parameters, which could be in favor of on-chip training compared to full-precision floating-point training.

In this paper, we propose a transpose SRAM-based CIM architecture for multi-bit precision DNN training, namely CIMAT, with two different bit-cell designs, and explore the corresponding weight mapping strategies, data flow and pipeline design. SRAM is a mature CMOS technology, and recent silicon prototype chips have demonstrated the efficacy of SRAM based CIM for inference only [14], [15], [16]. Therefore, the next step is to explore the architectural design of SRAM based CIM for training. Compared to other CIM architectures, we make the following key contributions:

1. Novel hardware to support DNN training. To support on-chip training, we present 7T and 8T transpose SRAM bit-cell designs, which can perform bi-directional read access. In addition, we propose a near-memory hardware solution to perform the weight update process.
2. Transpose crossbar structure which can implement both feed-forward and backpropagation computation. With the novel bit-cell designs, we explore the mapping strategy and data flow of the transpose crossbar structure to perform feed-forward and backpropagation on the same memory array.
3. CIM solution for weight gradient calculation. A CIM approach for matrix-to-matrix gradient calculation is proposed to improve the performance of CIM on-chip training.
4. Pipeline optimization for CIM training with the 7T and 8T SRAM designs. We propose a pipeline design to speed up the forward and error calculation processes with the 7T SRAM design. To further improve energy efficiency and processing speed, we optimize the hardware with the 8T SRAM design to pipeline the entire training process.

The rest of the paper is organized as follows: Section 2 introduces the basics of DNN training and the CIM approach. Section 3 presents two novel transpose SRAM cell designs and the peripheral circuitry to support the entire DNN training process. Section 4 presents the proposed CIM architecture, mapping strategies and data flow for all four steps of DNN training, namely feed-forward/inference, error calculation, weight gradient calculation and weight update. Section 5 describes the pipeline design for the 7T and 8T memory cell designs, respectively. In particular, we optimize the architecture for the 8T design to accelerate the pipeline, which obtains significant improvement in energy efficiency and throughput. Section 6 presents specifications of our proposed architecture to implement ResNet-18 training. The chip-level evaluation is performed in the NeuroSim simulator [17] on a 7 nm technology node. We also compare the energy efficiency, frame rate and chip area of the proposed 7T/8T transpose SRAM based CIM array against a baseline design with a conventional 6T SRAM array. Section 7 reviews related work, and Section 8 concludes the paper.

This work is an extension of our prior conference paper [18]. The new materials added include 1) a more advanced 8T transpose design to perform the forward and backward training passes simultaneously; and 2) a deeper exploration of the pipeline optimization of CIMAT to improve training performance.

2 BACKGROUND

2.1 DNN Training Basics
Fig. 1. Basic diagram of DNN training process.
The convolutional neural network (CNN) is one of the most popular DNN models. The training process of a CNN can be divided into four steps, as shown in Fig. 1, namely 1) feed-forward (FF), 2) BP for error calculation, 3) weight gradient ($\Delta W_n$) calculation and 4) weight update. These four steps run in a loop to obtain a well-trained model through iterations.

The FF process takes input data and calculates the error between the predicted output and the label (ground truth). The intermediate activations in each layer need to be stored in a buffer for later usage in step 3). This is different from the inference engine design, where the intermediate activations can be discarded. For a given layer n, the FF operation is shown in Eq. (1),

$$Y_n = f(W_n \ast Y_{n-1} + b_n), \qquad (1)$$

where $f$ is the neuron activation function such as ReLU, $W_n$ is the weight of the current layer, $Y_n$ is the output of the current layer, $Y_{n-1}$ denotes the output from the previous layer, which acts as the activation to the current layer, and $b_n$ represents the bias.

During the BP process, the main goal is to calculate the gradient on the weights of each layer. A method based on stochastic gradient descent (SGD) is used to calculate the gradient layer by layer from the back to the front. For a given layer n, with the chain rule, the error $\partial L/\partial Y_n$ is calculated by the convolution between the error and the transposed weight matrix $W_{n+1}^{T}$ from the next layer, as shown in Eq. (2). A buffer is also needed for storing the error output in BP.

Then the weight gradient $\Delta W_n$ of the current layer (layer n) is obtained by another convolution between the error $\partial L/\partial Y_n$ and the activation $Y_{n-1}$ that is obtained in the FF process, as shown in Eq. (3). Finally, the weights of the


current layer are updated by $\Delta W_n$ modulated by the learning rate (LR), as shown in Eq. (4).

$$\frac{\partial L}{\partial Y_n} = \frac{\partial L}{\partial Y_{n+1}} \cdot \frac{\partial Y_{n+1}}{\partial Y_n} = \frac{\partial L}{\partial Y_{n+1}} \ast W_{n+1}^{T} \qquad (2)$$

$$\Delta W_n = \frac{\partial L}{\partial W_n} = \frac{\partial L}{\partial Y_n} \cdot \frac{\partial Y_n}{\partial W_n} = \frac{\partial L}{\partial Y_n} \ast Y_{n-1} \qquad (3)$$

$$W_n^{t} = W_n^{t-1} - LR \cdot \frac{\partial L}{\partial W_n} \qquad (4)$$

As Eq. (2) suggests, the critical challenge in implementing on-chip training is that the training hardware needs to support the read-out of the transposed weight matrix. Another issue is how to deal with the massive intermediate data that are generated during the entire training process.
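To make the four training steps concrete, the following is a minimal NumPy sketch of one iteration for a single fully-connected layer. The layer sizes, the ReLU/squared-error choice and the learning-rate value are our own illustrative assumptions; convolution layers follow the same four steps with the matrix products replaced by the convolutions of Eqs. (1)-(3).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions for illustration only)
d_in, d_out = 8, 4
W = rng.standard_normal((d_out, d_in)) * 0.1   # weights W_n
b = np.zeros(d_out)                            # bias b_n
lr = 0.01                                      # learning rate (LR)

def relu(x):
    return np.maximum(x, 0.0)

# 1) Feed-forward, Eq. (1): Y_n = f(W_n * Y_{n-1} + b_n)
Y_prev = rng.standard_normal(d_in)             # activation from the previous layer
pre_act = W @ Y_prev + b
Y = relu(pre_act)

# 2) Backpropagation, Eq. (2): the error seen by the previous layer is computed
#    with the *transposed* weight matrix (here W plays the role of W_{n+1}).
target = rng.standard_normal(d_out)            # placeholder label
dL_dY = Y - target                             # gradient of a squared-error loss w.r.t. Y
delta = dL_dY * (pre_act > 0)                  # fold in the ReLU derivative
dL_dY_prev = W.T @ delta                       # error propagated to layer n-1

# 3) Weight gradient, Eq. (3): outer product of error and input activation
dW = np.outer(delta, Y_prev)
db = delta

# 4) Weight update, Eq. (4): W_n^t = W_n^{t-1} - LR * dL/dW_n
W -= lr * dW
b -= lr * db
```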

2.2 Compute-in-Memory Basics
CIM, or in-memory computing, is an attractive solution for the extensive MAC operations in both DNN inference and training, as it combines memory access and computation. In general, a CIM architecture performs mixed-signal computation, i.e., analog current summation along the column or the row, followed by analog-to-digital conversion (ADC) at the edge of the array. Due to the increased parallelism and reduced data movement, CIM is expected to significantly improve the throughput and energy efficiency. As a trade-off, the limited precision of the ADCs and their variations lead to approximate computation results in CIM, which generally result in a slight degradation of the inference accuracy [19].

SRAM has been considered a mature candidate for CIM. The general approach is to modify the SRAM bit-cell and periphery to enable parallel access. For example, the design in [3] expanded 6T cells into 8T to support bitwise XNOR. The VMM is done in a parallel fashion where the input vectors activate multiple rows and the dot-product is obtained as a column voltage or current. The sense amplifier (SA) is also replaced by an ADC to produce a quantized output. Multi-bit inference [20] is possible with SRAM based CIM architectures.

If an FF-only CIM design is used, when it processes the transposed weight matrix in BP, the input vector is applied to the column, and it is only possible to read out the weights in a particular row and perform the summation in digital adders along that row. CIM could not be realized directly. Therefore, it is imperative to design a new CIM architecture to support both FF and BP calculation.

3 HARDWARE SUPPORT

3.1 Transpose Bit-Cell Design

3.1.1 7T Transpose SRAM Design
The 7T transpose SRAM is used as the bit-cell, as shown in Fig. 2a. The feasibility of this 7T transpose SRAM has been validated in silicon chips as in [21]. Such a 7T design only has a very small cell area overhead, while providing bi-directional read access and read-disturb-free access. The regular 6T cell is used for data storage and row-by-row write (controlled by WWL), while the innovation is in the additional transistor (in blue color) for bi-directional read access. As shown in Fig. 2b, this 7T transpose SRAM design has two read modes to support the forward and backward processes, respectively. In forward mode, C_RWL is enabled as the neuron input and C_RBL is used as the bit line for partial sum read-out. In backward mode, these two lines exchange their roles: C_RWL acts as R_RBL, and C_RBL acts as R_RWL. R_RWL is enabled as the neuron input and R_RBL is used as the bit line for partial sum read-out. Both column and row paths have separate sets of WL writers and ADCs. The analog value of the current along C_RBL/R_RBL represents the MAC result, and this partial sum is digitized and quantized by the ADCs.
Fig. 2. (a) 7T transpose SRAM bit cell; (b) Operation modes of 7T SRAM bit cell.

3.1.2 8T Transpose SRAM Design
Although the 7T transpose SRAM design can perform bi-directional read access to support both FF and BP calculation, the weight can only be stored and read out through the Q node, which means FF and BP cannot be performed simultaneously in the same cell. In batch mode, 7T SRAM-based CIM can only realize the pipeline in FF and BP separately. To further improve processing speed and energy efficiency, we propose the 8T transpose SRAM bit-cell structure shown in Fig. 3a. There is an additional PMOS transistor compared with the 7T design. The gate of the PMOS transistor is connected to QB to support read access from both sides of the bit-cell. As shown in Fig. 3b, this 8T SRAM design also has two modes to support the forward and backward processes. In the forward process, the additional NMOS transistor in blue color is activated. C_RWL acts as the activation input and C_RBL is used as the bitline to collect the analog current as the partial sum readout. In backward mode, the additional PMOS transistor in red color is activated. For BP calculation, the error is fed into the cell through R_RWL as input and R_RBL is used as the bitline for the partial sum readout. Since the forward and backward modes have separate wordlines, bitlines and storage nodes for read access, bi-directional read can be performed simultaneously. Thus, with the 8T design, a pipeline can be implemented between FF and error calculation to further speed up the training process.

The 6-transistor part of the 7T/8T design has the same structure as the conventional 6T SRAM design, which can be used for data storage; normal read/write is controlled by WWL as shown in Figs. 2 and 3. The compact foundry design rule could be applied there without any modification.

The additional transistors implement bi-directional read access for in-memory computation. At this stage, these additional transistors could be added using the logic design rule if the foundry has not provided an optimized design rule.
Fig. 3. (a) 8T transpose SRAM bit cell; (b) Operation mode of 8T SRAM bit cell.

3.2 Periphery Circuit Design for Training
As shown in Fig. 4, the transpose 7T/8T SRAM array has the typical periphery circuits of a memory array, including word line (WL) writers for both column and row access, WL decoders for weight write, and the pre-charge circuit. In addition, flash-ADCs (i.e., multilevel sense amplifiers with different references) are employed to quantize the partial sums. A shift-and-add unit accumulates the digitized partial sums from the least significant bit (LSB) input cycle to the most significant bit (MSB) input cycle to support multi-bit input activations. In order to support bi-directional access, two groups of periphery circuits are needed. Besides the transpose SRAM bit-cells, there is also a row of 6T SRAM connected to the same BL/BLB, which is used for weight update, inspired by [21].
Fig. 4. Block diagram of the transpose SRAM sub-array and periphery circuitry.
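A behavioral sketch of the bidirectional access is shown below. This is our own functional model, not a circuit-level description, and the class and method names are hypothetical. The point is that the same stored matrix serves the column-direction read of FF and the row-direction read of BP: with the 7T cell the two modes are time-multiplexed, while the 8T cell allows them to be issued in the same cycle.

```python
import numpy as np

class TransposeCIMArray:
    """Behavioral sketch of one transpose SRAM subarray (not a circuit model).
    The same stored matrix is read along columns for FF and along rows for BP,
    so no physical transposition or row-by-row digital read-out is needed."""

    def __init__(self, weights):
        self.W = np.asarray(weights)       # rows = inputs (IFM side), cols = outputs (OFM side)

    def forward_mac(self, activation_vec):
        # C_RWL drives all rows in parallel; partial sums are collected on the C_RBL columns
        return activation_vec @ self.W

    def backward_mac(self, error_vec):
        # R_RWL drives all columns in parallel; partial sums are collected on the R_RBL rows,
        # which is equivalent to multiplying by the transposed weight matrix
        return self.W @ error_vec

rng = np.random.default_rng(1)
arr = TransposeCIMArray(rng.integers(0, 2, size=(128, 64)))  # one-bit weights, 128x128-class subarray
x = rng.integers(0, 2, size=128)   # one activation bit per row (FF)
e = rng.integers(0, 2, size=64)    # one error bit per column (BP)
y_ff = arr.forward_mac(x)          # feed-forward partial sums, one per column
y_bp = arr.backward_mac(e)         # backpropagation partial sums, one per row
```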

4 PROPOSED ARCHITECTURE
CIMAT, the proposed Compute-In-Memory Architecture for Training, can not only perform DNN inference but also implement the BP calculation and weight update. Section 4.1 shows the mapping strategy of the FF process. Section 4.2 presents the dataflow and approach to implement error calculation with the same memory array as the FF process. The CIM solution for weight gradient calculation is proposed in Section 4.3. Finally, Section 4.4 introduces how to perform the weight update with the designed periphery circuit.

4.1 Feed-Forward Process
Fig. 5. Data flow of the FF process showing one PE that implements the a0 element in the 3 x 3 filter only. (To implement all the elements in the filter, 9 PEs are needed.)
The FF process of the CIM array is shown in Fig. 5. Here M is the number of filters/output feature map (OFM) channels, C is the number of input feature map (IFM)/filter channels, H/W is the IFM plane height/width, and E/F is the OFM plane height/width. The filters are unrolled into a weight matrix that is stored in the memory array. The IFM vector is applied to the rows in parallel, and the OFM vector is obtained from the columns in parallel. In general, there are two types of weight mapping schemes for CIM convolution. One scheme is to unroll all the elements in a filter (e.g., a 3 x 3 filter with 128 channels) into one long vector and put them into the same column (e.g., with a length of 1,152) in one PE [23]. The other scheme is to put each element of a filter into the same column in a subarray and use different PEs for different locations of the elements (e.g., 9 subarrays for a 3 x 3 filter, with each PE having

a column length of 128), and then sum up the partial sums from different PEs through an adder tree [24]. Although both schemes can be used for the transpose CIM architecture, the first one makes the forward and backward operations asymmetric, because the column output in FF is directly the entire partial sum, while the row output in BP is just part of the partial sum. To keep the balance between FF and BP, we choose the second mapping scheme (as shown in Fig. 5) in our transpose CIM architecture. First, weights are pre-written into the 7T/8T SRAM cells by the write-wordlines (WWLs) in regular write mode. Then, for the in-memory MAC operation, activations are fed into the SRAM cells through the read-wordlines (RWLs). One-bit multiplication can be implemented by a NAND gate. Thus, in computing mode, the read-out current from the read-bitline (RBL) represents the product of the multiplication. The summed current along one column/row, namely the partial sum, represents the final value of the MAC operation. It is called a 'partial sum' since this value is only the summed dot-product from one kernel location of the filter. To get the final output feature map, partial sums from different kernel locations need to be summed again. The fully connected layers are treated as a special case of convolution layers with 1 x 1 filters, thus the same mapping scheme can be applied.
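The sketch below illustrates the second mapping scheme in NumPy (our own illustrative reconstruction, not the RTL): each of the nine PEs holds the weights of one filter element as a C x M matrix, receives the correspondingly shifted activations, and the per-PE partial sums are added as the adder tree would, reproducing the full 3 x 3 convolution.

```python
import numpy as np

def conv_via_element_PEs(X, K):
    """X: (C, H, W) input feature map; K: (M, C, 3, 3) filters.
    Returns the (M, E, F) output, computed as 9 per-element matrix products
    whose partial sums are accumulated by an 'adder tree'."""
    M, C, kh, kw = K.shape
    _, H, W = X.shape
    E, F = H - kh + 1, W - kw + 1
    out = np.zeros((M, E, F))
    for i in range(kh):
        for j in range(kw):
            W_ij = K[:, :, i, j]                       # (M, C) weights held by PE (i, j)
            patch = X[:, i:i+E, j:j+F].reshape(C, -1)  # (C, E*F) shifted activations
            out += (W_ij @ patch).reshape(M, E, F)     # partial sums summed across PEs
    return out

# Sanity check against a direct convolution
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8, 8))
K = rng.standard_normal((6, 4, 3, 3))
ref = np.zeros((6, 6, 6))
for m in range(6):
    for e in range(6):
        for f in range(6):
            ref[m, e, f] = np.sum(K[m] * X[:, e:e+3, f:f+3])
assert np.allclose(conv_via_element_PEs(X, K), ref)
```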
4.2 Backpropagation Process
Fig. 6. Weight mapping for the BP process (Eq. (2)) in the PE that implements the a0 element in the 3 x 3 filter.
Fig. 6 shows the details of the error calculation in Eq. (2) for the first channel group (a0 element) of the filters. The backward pass for a convolution operation is also a convolution (but with spatially-flipped filters). For the FF, the product of the input sliding window with the same filter across all the channels is summed up to generate one output, which means all dot products in the same column of the PE are summed up. However, for the BP, the product of the input sliding window and the same channel across different filters is summed up, which means all dot-products in the same row need to be summed up. Essentially, we process the transposed version of the weight matrix in the BP. As shown in Fig. 6, $W_{n+1}$ is the weight matrix in FF while $W_{n+1}^{T}$ is the transposed weight matrix for BP, and they are mapped to the same memory array. In BP, the input vector (i.e., the error in the next layer $\partial L/\partial Y_{n+1}$) is applied to the columns in parallel, and the output vector (i.e., the error in the current layer $\partial L/\partial Y_n$) is obtained from the rows in parallel. With such a transpose architecture, FF and error calculation can be performed within the same PE. No additional memory access is needed in error calculation, which means an improvement in throughput and energy efficiency.

4.3 Weight Gradient Calculation
Fig. 7. Weight mapping of the CIM approach for weight gradient calculation (Eq. (3)); the error is stored in the 6T CIM SRAM array, while the activation $Y_{n-1}$ is loaded as input from cycle to cycle. k = 3 for a 3 x 3 filter.
For the calculation of the weight gradient matrix $\Delta W_n$ in Eq. (3), we propose using a CIM approach with additional 6T non-transpose CIM SRAM arrays to perform the outer dot-product multiplication between the error matrix $\partial L/\partial Y_n$ and the related input activation matrix $Y_{n-1}$. The mapping method for the $\Delta W_n$ calculation is shown in Fig. 7. $\partial L/\partial Y_n$, which was calculated in the previous step by Eq. (2), is first written into the 6T CIM SRAM array as the weight, and then $Y_{n-1}$ is loaded (from off-chip DRAM to the on-chip buffer) as the activation. Each plane of $\partial L/\partial Y_n$ is stretched into one long column whose length equals EF; the number of $\partial L/\partial Y_n$ channels is M, which means there are M columns in total. Thus, $\partial L/\partial Y_n$ can be mapped to a large weight matrix whose height and width equal EF and M. Sliding windows on each plane of the input $Y_{n-1}$ are also unrolled into a group of columns. The activation columns, whose length also equals the height of the weight matrix, are fed into the array cycle by cycle, thereby performing bitwise multiplication and accumulation. Partial sums from the same column over multiple cycles generate all the weight gradients of one filter, while partial sums from different columns form the entire gradient matrix $\Delta W_n$. In the normal batch training mode, $\Delta W_n$ of each image is sent to off-chip DRAM for storage, and at the end of each batch, the weight gradients are loaded back and accumulated on-chip. The averaged $\Delta W_n$ is used as the input of the weight update that is performed in the periphery circuit of the array, as described in the next sub-section.
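A NumPy sketch of this gradient mapping is shown below (an illustrative reconstruction under our own reshaping conventions, not the hardware dataflow): the error tensor is stored as an EF x M matrix, each unrolled activation column is applied as an input, and one array read returns the gradients of all M filters at a single (channel, i, j) position.

```python
import numpy as np

def weight_gradient_cim(err, Y_prev, kh=3, kw=3):
    """err: (M, E, F) = dL/dY_n, stored as the (EF x M) 'weight' matrix of the 6T array;
    Y_prev: (C, H, W) activations streamed in column by column.
    Returns dW: (M, C, kh, kw), following the outer-product mapping of Fig. 7."""
    M, E, F = err.shape
    C, H, W = Y_prev.shape
    G = err.reshape(M, E * F).T                          # (EF, M) matrix held in the array
    dW = np.zeros((M, C, kh, kw))
    for c in range(C):
        for i in range(kh):
            for j in range(kw):
                act_col = Y_prev[c, i:i+E, j:j+F].reshape(-1)  # one unrolled activation column
                dW[:, c, i, j] = act_col @ G                   # one read gives all M gradients
    return dW

# Quick check against the definition dW[m,c,i,j] = sum_{e,f} err[m,e,f] * Y_prev[c,i+e,j+f]
rng = np.random.default_rng(0)
err = rng.standard_normal((2, 4, 4))
Yp = rng.standard_normal((3, 6, 6))
dW = weight_gradient_cim(err, Yp)
assert np.isclose(dW[1, 2, 0, 1], np.sum(err[1] * Yp[2, 0:4, 1:5]))
```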
4.4 Weight Update
Fig. 8 shows the structure of the weight update module. The 6T SRAM row is disabled during FF and BP. When the accumulated weight gradients are ready after one batch, they are fed into a shift register to realize the multiplication with the learning rate by shifting (Eq. (4)), in a read-modify-write scheme that is performed row by row.


Fig. 8. The structure of the weight update modules performing read-modify-write.

First, one row of the $\Delta W_n$ matrix is written into the 6T row of the sub-array. Then, the WLs of the 6T row and of the to-be-updated weight row are activated simultaneously to send the stored information to the weight update module. The weight update module has an adder to calculate the sum of $\Delta W_n$ and $W_n$ using the in-memory computing technique. This produces the new $W_n$, which is then written back to the weight row; the Cout of the adder is forwarded to the next higher significant bit, providing the Cin for the higher-significant-bit update. Since this process is repeated row by row in one PE, we can treat the update of different significant bits as a pipeline to speed it up. The entire weight update process is completed when all the multi-bit weights, from LSB to MSB, are updated.
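The fixed-point sketch below mirrors this read-modify-write update following Eq. (4). The signed 8-bit weight format, the specific shift amount standing in for the learning rate, and the clipping are our assumptions for illustration; the per-bit carry pipelining of the hardware is abstracted into ordinary integer arithmetic here.

```python
import numpy as np

def update_weights_fixed_point(W, dW_avg, lr_shift=7, w_bits=8):
    """Sketch of the row-by-row read-modify-write update of Eq. (4), with the
    learning rate realized as an arithmetic right shift (LR = 2**-lr_shift).
    W and dW_avg are signed integer arrays (illustrative fixed-point values)."""
    lo, hi = -(2 ** (w_bits - 1)), 2 ** (w_bits - 1) - 1
    W_new = np.empty_like(W)
    for r in range(W.shape[0]):                    # one array row per read-modify-write
        delta = dW_avg[r] >> lr_shift              # multiply by LR via shift
        W_new[r] = np.clip(W[r] - delta, lo, hi)   # subtract and write back, clipped to 8 bits
    return W_new

W = np.array([[100, -50], [3, 7]], dtype=np.int16)
dW = np.array([[640, -256], [0, 128]], dtype=np.int16)
W_new = update_weights_fixed_point(W, dW)          # -> [[95, -48], [3, 6]]
```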

5 PIPELINE DESIGN

5.1 7T SRAM Bit-Cell-Based Pipeline Design
Fig. 9. (a) Intra-pipeline design inside FF and BP. (b) Training process timeline of the 7T-based architecture.
The pipeline dataflow for FF and BP with the 7T SRAM bit-cell design is shown in Fig. 9a. As an example of implementing ResNet-18, the forward pass and the BP pass of error calculation are realized using a 7-stage pipeline design. The latency of the 1st to 5th layers to process an entire image is almost the same, while the total latency of the 6th to 17th convolution layers is only half of that of the previous layers. Therefore, the 1st to 5th layers are treated as pipeline stages 1 to 5, respectively. To implement the pipeline for all layers and match the latency of each stage, we group the 6th to 17th layers as one stage (stage 6). The fully-connected layer and other activation function circuits form stage 7.

Fig. 9b shows the entire training process of the 7T design on a timeline. First, for the FF process, one batch of images is fed into the training system stage by stage in the forward direction. After finishing the FF process of one batch (T1), the error calculation starts to operate stage by stage in the backward direction (T2). The intermediate data generated in the FF and BP processes need to be saved off-chip for gradient calculation, since we obtain over a hundred groups of activations and errors in the batch mode, which are too large to be saved on-chip. Typically, 128 images form one batch. For the gradient calculation process, after the errors are obtained for one batch, they are used together with the activations to calculate the gradients (to be applied to the weights) image by image (T3), which means we get 128 groups of weight gradients after 128 runs. The gradient calculation is performed after the batch FF and BP processes. Finally, the 128 groups of weight gradients are averaged across the batch and the weights are updated in one step using the averaged weight gradients (T4).

5.2 8T SRAM Bit-Cell-Based Pipeline Design
Fig. 10. (a) Pipeline structure of 8T SRAM-based CIM training. (b) The state of each stage as a function of time. 15 super clock cycles are illustrated.
As described in Section 3.1.2, the 8T SRAM bit-cell can perform bi-directional reads simultaneously, which means an 8T-based subarray is able to support bi-directional vector-to-matrix MAC calculation synchronously. Thus, we further optimize the pipeline design for the 8T SRAM, as shown in Fig. 10a. The stage configuration of the FF and BP processes is the same as in the 7T pipeline design. However, instead of waiting for the completion


of the FF process for the whole batch, the error calculation process of the 8T design can form a pipeline together with the FF process to increase the throughput significantly. In addition, as long as the activation and error of the 1st image are ready, the weight gradient calculation (WGC) can start to work. The gradient calculation process with the CIM approach in Section 4.3 can be accelerated by duplicating CIM arrays. If the latency of the WGC stage is approximately equal to the latency of an FF/BP stage, gradient calculation can also work in a pipelined fashion together with the FF and BP processes, avoiding off-chip $\Delta W_n$ movement. The state of each stage as a function of time is shown in Fig. 10b. For example, at the 14th cycle, FF/BP stage 7 is performing the FF calculation of image 8 and the error calculation of image 7 simultaneously. Meanwhile, WGC stage 1 is able to calculate the weight gradient for image 6 because the necessary activations and error of image 6 have been available since the 12th cycle and the 13th cycle, respectively.

Compared to the intra-pipeline of the 7T-based architecture, the 8T design can also implement an inter-pipeline between different training processes, which aggressively improves the throughput of training. Besides, the optimized pipeline is also beneficial to energy saving due to the reduced off-chip memory access and standby leakage of SRAM. The overhead of the 8T-based training architecture is that the on-chip buffer has to be large enough to run the pipeline.
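As a toy illustration of this schedule (our own simplified model, consistent with the two data points quoted above but not the actual controller), the following function reports which image each stage is working on at a given super clock cycle.

```python
def pipeline_state(cycle, n_stages=7):
    """Toy model of the 8T inter-pipeline: FF of image i occupies stage s at cycle
    i + s - 1; its error calculation (BP) runs backward and occupies stage s at
    cycle i + n_stages + (n_stages - s); WGC stage 1 starts for image i one cycle
    after both its activations and its stage-7 error are available."""
    ff = {s: cycle - s + 1 for s in range(1, n_stages + 1) if cycle - s + 1 >= 1}
    bp = {s: cycle - 2 * n_stages + s for s in range(1, n_stages + 1)
          if cycle - 2 * n_stages + s >= 1}
    wgc = cycle - n_stages - 1 if cycle - n_stages - 1 >= 1 else None
    return ff, bp, wgc

ff, bp, wgc = pipeline_state(14)
assert ff[7] == 8 and bp[7] == 7 and wgc == 6   # matches the 14th-cycle example above
```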
6 EVALUATION RESULTS
We evaluate our CIMAT architecture design by implementing the ResNet-18 model for on-chip training on ImageNet. We first describe the experimental setup, and then present the simulation results of chip-level performance using a modified NeuroSim [25]. NeuroSim is a circuit macro model that supports flexible CIM array design options with different device technologies (from SRAM to emerging nonvolatile memories) and various peripheral circuitry modules [15], and it has been validated with SPICE simulations and actual device data.

6.1 Experiments Setup
Recent progress in algorithms has shown that DNN training with 8 bits is sufficient to maintain accuracy for large-scale data sets [11], [26]; thus we use 8-bit weights and 8-bit activations in both the CIMAT and baseline settings. To enable the bi-directional access, weight bits with different significance are stored on different tiles (i.e., 8 tiles), with shift-add to combine them together after obtaining their partial sums. The multi-bit activations are sent to the weight arrays from LSB to MSB using eight cycles, and the outputs are accumulated with shift-add as well.

The batch size for training is 128, which means 128 images go through the FF/BP operations and the averaged weight gradients after 128 runs are used to update the weights after the entire batch. The typical weight matrix size of one PE in ResNet-18 could be several hundreds up to thousands, but we use a matrix partition technique and limit the subarray size to 128 x 128, considering the practical maximum SRAM array size when accessed in parallel [3]. Hence, 16 subarrays form one PE and 9 PEs (corresponding to 3 x 3 filters) form one tile, which means any layer in ResNet-18 can fit into 8 tiles (for 8-bit weight precision).
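As a quick sanity check on this capacity claim (our own arithmetic; the layer shape is the standard largest ResNet-18 3 x 3 convolution with 512 input and 512 output channels):

```python
# One tile holds one bit of weight significance, so 8 tiles cover 8-bit weights.
weights_largest_layer = 512 * 512 * 3 * 3      # 2,359,296 weights
cells_per_subarray    = 128 * 128              # 16,384 one-bit cells
cells_per_PE          = 16 * cells_per_subarray
cells_per_tile        = 9 * cells_per_PE       # 2,359,296 cells
assert weights_largest_layer == cells_per_tile
```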


Fig. 11. Top-level CIMAT architecture for one convolution layer.
To simplify the periphery circuit of the transposable subarray, weight bits with different significance are stored on different tiles (i.e., 8 tiles), with shift-add to combine them together after obtaining their partial sums. Fig. 11 shows the top-level CIMAT architecture for one convolution layer, which contains 8 tiles, shift-add units, activation function units, the L2 buffer and the global buffer. A tile contains multiple PEs, adder trees and an L1 buffer. In both FF and BP, the multi-bit inputs are sent to the weight arrays from LSB to MSB using eight cycles, and the outputs are accumulated with shift-add calculation. Then the adder trees accumulate the results over all the subarrays in one PE for one element of the filter. Accumulated results from PEs in the same tile are accumulated again through adder trees outside the PEs to finish the bitwise MAC for the entire filter, but only for a single significance of the weight. Outside the tiles, shift-add is performed again over the eight tiles to obtain the eventual 8-bit output of the OFM.

For the baseline, we use regular 6T SRAM arrays to store weights and do near-memory computation (i.e., row-by-row read-out with digital adders at the edge of the array to accumulate the partial sum). For the BP, since non-transpose SRAM is used, each time we read out a row, we obtain all the elements of one transposed filter. Then a group of 8-bit multipliers is used to get the products of the filter elements and the inputs. These products are accumulated by adders to get the final sum of the error in Eq. (2). For the weight gradient calculation in Eq. (3), the FF activations of each layer are first fetched from off-chip DRAM, and then multipliers and adder trees are used to realize bitwise matrix-to-matrix multiplication and accumulation of the activation and error. All the weight gradients are stored off-chip for the weight update. Finally, $\Delta W_n$ for the batch inputs are accumulated by adders, and the new weights are calculated by digital circuits and then written back to the 6T SRAM arrays. We propose this near-memory computation solution as the baseline to verify the advantages of the CIM approach.

6.2 Circuit-Level Parameters
TABLE 1. 7T CIMAT Parameters
TABLE 2. 8T CIMAT Parameters
We model the CIMAT design according to a state-of-the-art 7 nm high performance (HP) CMOS library and build the CIMAT architecture with a modified NeuroSim framework by adding the considerations of the on-chip buffer and off-chip DRAM access. Table 1 shows circuit-level parameters based on the 7T SRAM design, including the hardware configuration, precision, area and energy for key circuit modules. The energy is given as energy per operation (or per bit); for example, the sub-array total energy is 25.75 pJ/op, which means the total energy for one single sub-array to do one vector-matrix multiplication, and the L1 buffer energy is 0.006 pJ/bit, which is the average energy to write one bit of data. The estimated energy cost of off-chip DRAM access is 4.2 pJ/bit from prior work [27], assuming 3D high-bandwidth memory is used. As described in Section 6.1, the multi-bit activations are sent to the weight arrays from LSB to MSB using eight cycles, and the outputs are accumulated with shift-add. The calculated MAC value of the LSB after the ADC is 4-bit and is first stored in a register. Then, the MAC values of the more significant bits are shifted and added to the stored value in the register. The adder is designed to be 11-bit, which is enough to keep the carry-bit information for each shift-add operation of the 8-bit input.
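The shift-and-add scheme can be summarized with the short sketch below (a functional model only; in hardware each per-cycle column sum is an analog current that the flash ADC digitizes before the shift-and-add register, which the text sizes at 11 bits).

```python
import numpy as np

def bit_serial_dot(acts_u8, w_col_bit):
    """LSB-to-MSB bit-serial input with shift-and-add (Section 6.2).
    acts_u8: 8-bit activations, one per row; w_col_bit: one column of
    single-bit weights (a single-significance tile)."""
    acc = 0
    for b in range(8):                      # input bit significance, LSB first
        in_bits = (acts_u8 >> b) & 1        # apply one activation bit to every row
        psum = int(in_bits @ w_col_bit)     # column partial sum for this input bit
        acc += psum << b                    # shift by the bit significance and add
    return acc

acts = np.array([3, 200, 17, 96], dtype=np.uint8)
w = np.array([1, 0, 1, 1], dtype=np.uint8)
assert bit_serial_dot(acts, w) == int(acts.astype(np.uint16) @ w)
```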
Intermediate data during the training process are massive and hard to store completely in the on-chip SRAM buffer. For the 7T SRAM cell based CIMAT design, the global buffer size is 8 MB. To store the massive generated intermediate data, in FF the activation outputs of each layer $Y_n$ are sent to off-chip DRAM for reuse in the weight gradient calculation. The calculated errors of each layer for the batch input also need to be stored off-chip for gradient calculation. After the weight gradient calculation for each image, $\Delta W_n$ is stored in off-chip DRAM for the weight update. Without including the energy cost of off-chip data transfer, the energy efficiency could reach 20 TOPS/W. Considering DRAM access, the energy efficiency decreases to 6 TOPS/W, as shown in Table 3. To further improve energy efficiency and throughput, we proposed the 8T SRAM cell based CIMAT design to implement the inter-/intra-pipeline. The circuit-level parameters of the 8T transpose SRAM based architecture are shown in Table 2. The 8T memory array is a little larger than the 7T one due to the area overhead of the additional transistors. For the 8T based architecture, since the FF calculation and error calculation can be performed simultaneously, the L1, L2 and output buffers in a tile are enlarged to support concurrent bi-directional computation. Moreover, the global buffer size also needs to be enlarged to support the inter-pipeline among FF, error calculation and gradient calculation. As shown in Table 2, the global buffer of the 8T design is 20 MB compared to 8 MB for the 7T design. Such a large buffer size is feasible, as the TPU uses a 24 MB on-chip buffer [2]. Other hardware of the 8T based chip remains similar to the 7T based chip. In total, we need 357 tiles to store all the 8-bit weights of the ResNet-18 network. Although the 8T-based pipeline can reduce DRAM access during feedforward and error calculation, DRAM access during weight gradient calculation still cannot be eliminated. This observation is confirmed by our evaluation as shown in Table 3. Compared to 6 TOPS/W for the 7T design, the 8T design can achieve 10 TOPS/W. However, the total energy efficiency is still limited by massive off-chip memory access, while 55 TOPS/W can be reached without considering DRAM access.

6.3 Benchmark Evaluation
We compare the proposed CIMAT architectures with a baseline design using a regular 6T SRAM array with row-by-row read. As described in Section 6.1, the baseline architecture is a


TABLE 3. Benchmark Results (TOPS/W is the total number of operations per second per watt; FPS is frames per second.)
near-memory computation solution that places digital computation units at the edge of the memory, while the SRAM array only serves as a weight storage unit. Table 3 shows the comparison of performance between the 7T/8T SRAM based architectures and the baseline. The batch size for training is fixed to 128 ImageNet images per batch. The energy efficiency in terms of TOPS/W, the frame rate in terms of FPS and the chip area are evaluated for FF (Eq. (1)), BP (Eqs. (2), (3)) and weight update (Eq. (4)), respectively. Here, BP performance includes both error calculation and gradient calculation.

According to Table 3, the energy efficiency of CIMAT for both the FF and BP processes is much improved over that of the baseline, which verifies our hypothesis that the CIM architecture can minimize memory access for convolution computation. Besides, with the transpose SRAM design, the area for the BP process is much smaller than the baseline due to the shared CIM arrays and the elimination of additional digital circuits for error calculation. For the BP process, the gradient calculation hardware of 8T CIMAT has 8x the area of the 7T design, which accelerates gradient calculation by duplicating CIM arrays to implement the inter-pipeline in training. As shown in Table 3, compared to the near-memory computation baseline, the 7T SRAM based CIMAT training architecture can achieve an overall 4.34x speedup and 3.38x improvement in energy efficiency with 0.5x the chip area, while the 8T based CIMAT can further increase energy efficiency and throughput by 6.06x and 52.14x, respectively, with 0.74x the chip area.

Fig. 12. Performance benchmark: (a) Throughput comparison. (b) Energy efficiency comparison.
Figure 12 compares the performance and energy efficiency of CIMAT with the GPU-based implementation, the proposed near-memory computation baseline, and Neural Cache [28], which is a state-of-the-art hardware accelerator for DNN inference using an SRAM-based in-memory computing architecture. For the GPU platform, the experiments are performed using PyTorch running on an NVidia Titan RTX. We use the NVidia-SMI tool for GPU power measurement. For Neural Cache, we used 28 TOPS and 52.92 W average power at 22 nm technology as reported in the reference paper [28]. Our evaluation shows that the CIMAT 7T design can achieve on average 7.4x speedup and 100x energy efficiency in training as compared to the GPU-based approach. Compared to Neural Cache, CIMAT provides 78.7x speedup and 53x higher energy efficiency in feed-forward, and it can support training instead of inference only. Figure 12 also compares CIMAT training performance over inference with the 7T and 8T structures. Our evaluation shows that training only achieves 0.25x the energy efficiency of inference due to massive off-chip intermediate data access during training. For throughput, the 8T design provides almost the same speed for both inference and training, while the training speed of the 7T design is limited to less than 1/10 of inference. The higher throughput of the 8T design comes from the inter-pipeline between training processes as shown in Section 5.2.

The TPU v3 (for training) energy efficiency is estimated to be approximately 0.45 TOPS/W [29]. Compared to other custom designed architectures for training: for example, TIME [7] obtains 5.72 TOPS/W by reducing the tuning cost of RRAM with a look-up table policy, and the energy efficiency of PipeLayer [8] is only 0.14 TOPS/W since all intermediate data is written to RRAM arrays. In this work, we employed the fast-write SRAM technology; the 7T design achieves 6.02 TOPS/W and the 8T design achieves 10.79 TOPS/W, showing the benefits of our proposed CIMAT architecture.

7 RELATED WORK
Since training of DNNs, which involves BP and weight update, is more complicated, most prior designs only support inference. A limited number of works present techniques to implement DNN training. We now discuss these works. Song et al. [8] presented an RRAM-based design named PipeLayer. They separate the RRAM-based memory into two types: morphable subarrays and memory arrays. The results of the FF process are stored in the memory array for future backward computation, and the morphable array is used as both compute unit and storage. Their design exploits both intra- and inter-layer parallelism to implement pipelined training. Compared to a GPU, their design achieves improvements in speed and energy efficiency. Imani et al. [30] proposed FloatPIM, a CIM-based DNN training architecture that exploits analog properties of the memory without converting


data into the analog domain. FloatPIM can support both floating-point and fixed-point precision. To overcome the internal data movement issue, the authors design a switch to enable in-parallel data transfer between neighboring blocks. The evaluation results show that FloatPIM can achieve on average 6.3x and 21.6x higher speedup and energy efficiency as compared to PipeLayer. Fujiki et al. [31] proposed Duality Cache, developing a single instruction multiple thread (SIMT) architecture by enabling in-situ floating point and transcendental functions to improve data-parallel acceleration. Their design can improve performance by 3.6x and 4.0x for GPU benchmarks and OpenACC benchmarks, respectively.

8 CONCLUSION
We propose an SRAM-based Compute-In-Memory Training Architecture, namely CIMAT, which can maximize hardware reuse with the transpose array based on novel 7T/8T bit-cell designs. A new CIM solution for error calculation is proposed with low hardware overhead. We focus on the ResNet-18 implementation in this work, but our proposed methodologies could be applied to implement other DNN models.

The experimental results show that 7T SRAM based CIMAT can achieve 3.38x higher energy efficiency (6.02 TOPS/W), a 4.34x higher frame rate (4,020 fps) and only 50 percent of the chip size (81.80 mm2) compared to the baseline architecture with a conventional 6T SRAM array that supports row-by-row read only. With the more advanced 8T bit cell and optimized pipeline design, 8T SRAM based CIMAT further achieves more energy saving (10.79 TOPS/W) and more than 10x higher throughput (48,335 fps) with a tolerable area overhead (121.51 mm2) compared to 7T CIMAT. Our results reveal that CIM is a promising solution to implement on-chip DNN training, which can reduce off-chip data traffic significantly. The limited on-chip buffer will be a constraint for the pipeline implementation of deeper neural networks. Possible replacement of the SRAM buffer with a larger but slower RRAM buffer is worthy of future exploration. Another limitation of the CIM approach is that extremely large DNN models are hard to fully store on-chip with today's silicon technology. On one side, the CIM approach is more attractive for edge devices due to its much improved energy efficiency, and from the algorithm's perspective there are many efforts on developing small networks for edge AI applications. Besides, advanced algorithmic methods like transfer learning also help reduce the load for training. From the hardware perspective, a 5 nm process SRAM cache (512 Mb = 64 MB) [32] has recently become available from the foundry, which could increase the capacity of the CIM solution for larger networks in the near future, considering the possible scaling to the 3 nm node.

ACKNOWLEDGMENTS
This work was supported in part by the Samsung GRO program.

REFERENCES
[1] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770-778.
[2] N. P. Jouppi et al., "In-datacenter performance analysis of a tensor processing unit," in Proc. ACM/IEEE Int. Symp. Comput. Archit., 2017, pp. 1-12.
[3] L. Rui et al., "Parallelizing SRAM arrays with customized bit-cell for binary neural networks," in Proc. IEEE/ACM Design Autom. Conf., 2018, Art. no. 21.
[4] P. Chi et al., "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in Proc. ACM/IEEE Int. Symp. Comput. Archit., 2016, pp. 27-39.
[5] A. Shafiee et al., "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in Proc. ACM/IEEE Int. Symp. Comput. Archit., 2016, pp. 14-26.
[6] B. Li, L. Song, F. Chen, X. Qian, Y. Chen, and H. H. Li, "ReRAM-based accelerator for deep learning," in Proc. ACM/IEEE Des. Autom. Test Europe Conf., 2018, pp. 815-820.
[7] M. Chen et al., "TIME: A training-in-memory architecture for memristor-based deep neural networks," in Proc. ACM/IEEE Des. Autom. Conf., 2017, Art. no. 26.
[8] L. Song, X. Qian, H. Li, and Y. Chen, "PipeLayer: A pipelined ReRAM-based accelerator for deep learning," in Proc. IEEE Int. Symp. High Perform. Comput. Archit., 2017, pp. 541-552.
[9] X. Sun and S. Yu, "Impact of non-ideal characteristics of resistive synaptic devices on implementing convolutional neural networks," IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 9, no. 3, pp. 570-579, Sep. 2019.
[10] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks," in Proc. Int. Conf. Neural Inf. Process. Syst., 2016, pp. 4107-4115.
[11] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 525-542.
[12] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, "DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients," 2016, arXiv:1606.06160.
[13] S. Wu, G. Li, F. Chen, and L. Shi, "Training and inference with integers in deep neural networks," in Proc. Int. Conf. Learn. Representations, 2018.
[14] Z. Jiang, S. Yin, M. Seok, and J. S. Seo, "XNOR-SRAM: In-memory computing SRAM macro for binary/ternary deep neural networks," in Proc. IEEE Symp. VLSI Circuits, 2018, pp. 173-174.
[15] W. S. Khwa et al., "A 65nm 4Kb algorithm-dependent computing-in-memory SRAM unit-macro with 2.3ns and 55.8 TOPS/W fully parallel product-sum operation for binary DNN edge processors," in Proc. IEEE Int. Solid-State Circuits Conf., 2018, pp. 496-498.
[16] J. Zhang, Z. Wang, and N. Verma, "A machine-learning classifier implemented in a standard 6T SRAM array," in Proc. IEEE Symp. VLSI Circuits, 2016, pp. 1-2.
[17] P. Y. Chen, X. Peng, and S. Yu, "NeuroSim: A circuit-level macro model for benchmarking neuro-inspired architectures in online learning," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 37, no. 12, pp. 3067-3080, Dec. 2018.
[18] H. Jiang, X. Peng, S. Huang, and S. Yu, "CIMAT: A transpose SRAM-based compute-in-memory architecture for deep neural network on-chip training," in Proc. ACM Int. Symp. Memory Syst., 2019, pp. 490-496.
[19] X. Sun, S. Yin, X. Peng, R. Liu, J. S. Seo, and S. Yu, "XNOR-RRAM: A scalable and parallel resistive synaptic architecture for binary neural networks," in Proc. ACM/IEEE Des. Autom. Test Europe Conf., 2018, pp. 1423-1428.
[20] X. Si et al., "A twin-8T SRAM computation-in-memory macro for multiple-bit CNN-based machine learning," in Proc. IEEE Int. Solid-State Circuits Conf., 2019, pp. 396-398.
[21] K. Bong, S. Choi, C. Kim, S. Kang, Y. Kim, and H. J. Yoo, "A 0.62 mW ultra-low-power convolutional-neural-network face-recognition processor and a CIS integrated with always-on haar-like face detector," in Proc. IEEE Int. Solid-State Circuits Conf., 2017, pp. 248-249.
[22] J. Wang et al., "A compute SRAM with bit-serial integer/floating-point operations for programmable in-memory vector acceleration," in Proc. IEEE Int. Solid-State Circuits Conf., 2019, pp. 224-226.
[23] T. Gokmen and Y. Vlasov, "Training deep convolutional neural networks with resistive cross-point devices," Front. Neurosci., vol. 11, 2017, Art. no. 538.
[24] X. Peng, R. Liu, and S. Yu, "Optimizing weight mapping and data flow for convolutional neural networks on RRAM based processing-in-memory architecture," in Proc. IEEE Int. Symp. Circuits Syst., 2019, pp. 1-5.
[25] [Online]. Available: https://github.com/neurosim/DNN_NeuroSim_V1.0
[26] R. Banner et al., "Scalable methods for 8-bit training of neural networks," in Proc. 32nd Int. Conf. Neural Inf. Process. Syst., 2018, pp. 5151-5159.

[27] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, "TETRIS: Scalable and efficient neural network acceleration with 3D memory," in Proc. ACM Int. Conf. Archit. Support Program. Lang. Operating Syst., 2017, pp. 751-764.
[28] C. Eckert et al., "Neural cache: Bit-serial in-cache acceleration of deep neural networks," in Proc. ACM/IEEE Int. Symp. Comput. Archit., 2018, pp. 383-396.
[29] P. Teich. [Online]. Available: https://www.nextplatform.com/2018/05/10/tearing-apart-googles-tpu-3-0-ai-coprocessor/
[30] M. Imani, S. Gupta, Y. Kim, and T. Rosing, "FloatPIM: In-memory acceleration of deep neural network training with high precision," in Proc. ACM/IEEE Int. Symp. Comput. Archit., 2019, pp. 802-815.
[31] D. Fujiki, S. Mahlke, and R. Das, "Duality cache for data parallel acceleration," in Proc. ACM/IEEE Int. Symp. Comput. Archit., 2019, pp. 397-410.
[32] G. Yeap et al., "5nm CMOS production technology platform featuring full-fledged EUV, and high mobility channel FinFETs with densest 0.021um2 SRAM cells for mobile SoC and high performance computing applications," in IEEE Int. Electron Devices Meeting, 2019, pp. 36.7.1-36.7.4.

Hongwu Jiang received the BS degree from the Dalian University of Technology, in 2012, and the MS degree in electrical engineering from Arizona State University, in 2014. He is currently working towards the PhD degree in electrical and computer engineering at the Georgia Institute of Technology in Atlanta, Georgia. His research interests include SRAM-/eNVM-based hardware architecture and accelerator design for deep learning.

Xiaochen Peng received the BS degree in automatic systems from the Hefei University of Technology, in 2014, and the MS degree in electrical engineering from Arizona State University, in 2016. She is currently working toward the PhD degree in electrical and computer engineering at the Georgia Institute of Technology in Atlanta, Georgia. Her research interests include the development of device-to-system benchmarking frameworks for machine learning accelerators, and the design of emerging-device-based hardware implementations for neural networks.

Shanshi Huang received the BS degree in communication engineering from the Beijing Institute of Technology, in 2012, and the MS degree in electrical engineering from Arizona State University, in 2014. She is currently working toward the PhD degree in electrical and computer engineering at the Georgia Institute of Technology in Atlanta, Georgia. Her current research interests include deep learning algorithm and hardware co-design and deep learning security.

Shimeng Yu (Senior Member, IEEE) received the BS degree in microelectronics from Peking University, Beijing, China, in 2009, and the MS and PhD degrees in electrical engineering from Stanford University, Stanford, California, in 2011 and 2013, respectively. He is currently an associate professor of electrical and computer engineering at the Georgia Institute of Technology in Atlanta, Georgia. From 2013 to 2018, he was an assistant professor of electrical and computer engineering at Arizona State University, Tempe, Arizona. His research interests are nanoelectronic devices and circuits for energy-efficient computing systems. His expertise is on emerging non-volatile memories (e.g., RRAM, ferroelectrics) for different applications, such as machine/deep learning accelerators, neuromorphic computing, monolithic 3D integration, and hardware security. He was a recipient of the NSF Faculty Early CAREER Award in 2016, the IEEE Electron Devices Society (EDS) Early Career Award in 2017, the ACM Special Interest Group on Design Automation (SIGDA) Outstanding New Faculty Award in 2018, and the Semiconductor Research Corporation (SRC) Young Faculty Award.
