CIMAT: A Compute-In-Memory Architecture for On-Chip Training Based on Transpose SRAM Arrays
Hongwu Jiang, Xiaochen Peng, Shanshi Huang, and Shimeng Yu
IEEE Transactions on Computers, vol. 69, no. 7, July 2020
The authors are with the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332. E-mail: {hjiang318, xpeng76, shuang406}@gatech.edu, [email protected]. (Corresponding author: Shimeng Yu.)
Manuscript received 14 Oct. 2019; revised 25 Jan. 2020; accepted 1 Mar. 2020. Date of publication 13 Mar. 2020; date of current version 9 June 2020. Recommended for acceptance by Xuehai Qian and Yanzhi Wang. Digital Object Identifier no. 10.1109/TC.2020.2980533.
Abstract—Rapid development in deep neural networks (DNNs) is enabling many intelligent applications. However, on-chip training of DNNs is challenging due to the extensive computation and memory bandwidth requirements. To address the memory wall problem, the compute-in-memory (CIM) approach exploits analog computation along the bit lines of the memory array and thus significantly speeds up vector-matrix multiplication. So far, most CIM-based architectures target inference engines only, with training performed offline. In this article, we propose CIMAT, a CIM Architecture for Training. At the bitcell level, we design two versions of transpose SRAM, 7T and 8T, to implement the bi-directional vector-matrix multiplication that is needed for feedforward (FF) and backpropagation (BP). Moreover, we design the peripheral circuitry, mapping strategy, and data flow for the BP process and weight update to support on-chip training based on CIM. To further improve training performance, we explore the pipeline optimization of the proposed architecture. We use advanced 7 nm CMOS technology to design the CIMAT architecture with 7T/8T transpose SRAM arrays that support bi-directional parallel read. We evaluate 8-bit training of ImageNet on ResNet-18, showing that the 7T-based design can achieve 3.38× higher energy efficiency (6.02 TOPS/W), 4.34× higher frame rate (4,020 fps), and only 50 percent of the chip size compared to the baseline architecture with a conventional 6T SRAM array that supports row-by-row read only. Even better performance is obtained with the 8T-based architecture, which can reach 10.79 TOPS/W and 48,335 fps with 74 percent of the baseline chip area.
1 INTRODUCTION
RECENTLY, DNNs have achieved remarkable improvement for a wide range of intelligent applications, from image classification to speech recognition and autonomous vehicles. The main elements of DNNs are convolutional layers and fully connected layers. To achieve incremental accuracy improvement, state-of-the-art DNNs tend to increase the depth and size of the neural network aggressively, which requires a large amount of computational resources and memory storage for high-precision multiply-and-accumulate (MAC) operations. For example, ResNet-50 [1] can achieve 70 percent accuracy with 25.6M parameters. Although graphics processing units (GPUs) are the most popular hardware for DNN training in the cloud, there have been many efforts from academia and industry on the design of application-specific integrated circuit (ASIC) accelerators for inference or even training on-chip. However, the memory wall problem remains in conventional CMOS ASIC accelerators such as the TPU [2], where the parameters (i.e., weights and activations) are stored in a global buffer and the computation is still performed at digital MAC arrays. Although parallel computation with optimized dataflow is realized across multiple processing elements (PEs), the weights and intermediate data still require inefficient on-chip or off-chip memory access. This drawback is exacerbated for DNN training due to frequent back-and-forth data movement. To alleviate the memory access bottleneck, compute-in-memory (CIM) is a promising solution for DNN hardware acceleration. Weight movement can be eliminated by in-memory computing. CIM can also improve the parallelism within the memory array by activating multiple rows and using the analog readout to conduct multiplication and current summation. However, most of the CIM architectures proposed so far [3], [4], [5] support DNN inference only. Some efforts have been made to accelerate training in CIM [6], [7], [8] based on resistive random access memory (RRAM); however, the relatively large write latency/energy, and the asymmetry and nonlinearity in the conductance tuning of RRAM, prevent it from being an ideal candidate for extensive weight updates [9]. In addition, for the backpropagation (BP) process in DNN training, the CIM array needs to perform convolution with transposed weight matrices. If a regular memory array with row-wise input and column-wise output is used, parallel analog read-out cannot be achieved when processing the transposed weight matrices. Instead, sequential column-by-column read-out is required during BP, which significantly decreases throughput and energy efficiency. On the other hand, network compression is also a promising approach to reduce energy and area cost of the
$$\frac{\partial L}{\partial Y_n} = \frac{\partial L}{\partial Y_{n+1}}\,\frac{\partial Y_{n+1}}{\partial Y_n} = \frac{\partial L}{\partial Y_{n+1}}\,W_{n+1}^{T} \qquad (2)$$

$$\Delta W_n = \frac{\partial L}{\partial W_n} = \frac{\partial L}{\partial Y_n}\,\frac{\partial Y_n}{\partial W_n} = Y_{n-1}\,\frac{\partial L}{\partial Y_n} \qquad (3)$$

$$W_n^{t} = W_n^{t-1} - LR \cdot \frac{\partial L}{\partial W_n} \qquad (4)$$
As Eq. (2) suggests, the critical challenge in implementing on-chip training is that the training hardware needs to support read-out of the transposed weight matrix. Another issue is how to deal with the massive intermediate data generated during the entire training process.
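To make the data dependencies behind Eqs. (2)-(4) concrete, the following is a minimal NumPy sketch of one fully connected layer's backward pass and update. The function name, shapes, and layout conventions are our own illustration, not the paper's hardware mapping; the point is that Eq. (2) is the only step that reads the weight matrix in the transposed direction.

```python
import numpy as np

def fc_backward_step(W_n, Y_prev, dL_dY_n, lr=1.0 / 64):
    """One layer's backward pass and SGD update for Y_n = Y_prev @ W_n.

    W_n:      (in_dim, out_dim) weights of layer n
    Y_prev:   (batch, in_dim)   activations Y_{n-1} saved from the forward pass
    dL_dY_n:  (batch, out_dim)  error at the output of layer n
    """
    # Eq. (2): the error propagated to the previous layer uses the transposed
    # weight matrix, which is why the CIM array must be readable along both
    # columns (FF) and rows (BP).
    dL_dY_prev = dL_dY_n @ W_n.T
    # Eq. (3): the weight gradient is the product of the stored input
    # activations and the current-layer error.
    dW_n = Y_prev.T @ dL_dY_n
    # Eq. (4): SGD update; a power-of-two learning rate lets hardware realize
    # this multiplication with a simple bit shift.
    W_n = W_n - lr * dW_n
    return dL_dY_prev, W_n
```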
Fig. 6. Weight mapping for the BP process (Eq. (2)) in the PE that implements the a0 element of the 3×3 filters.

a column length of 128), then sum up the partial sums from different PEs through an adder tree [24]. Although both schemes can be used for the transpose CIM architecture, the first one would make the forward and backward operations asymmetric, because the column output in FF is directly the entire partial sum, while the row output in BP is only part of the partial sum. To keep the balance between FF and BP, we choose the second mapping scheme (as shown in Fig. 5) in our transpose CIM architecture. First, weights are pre-written into the 7T/8T SRAM cells through the write-wordlines (WWLs) in the regular write mode. Then, for the in-memory MAC operation, activations are fed into the SRAM cells through the read-wordlines (RWLs). One-bit multiplication is implemented by a NAND gate; thus, in computing mode, the read-out current from the read-bitline (RBL) represents the product of the multiplication. The summed current along one column/row, namely the partial sum, represents the final value of the MAC operation. It is called a 'partial sum' because this value is only the summed dot product from one kernel of the filter. To get the final output feature map, partial sums from different kernels need to be summed again. Fully connected layers are treated as a special case of convolution layers with 1×1 filters, so the same mapping scheme can be applied.
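As a concrete picture of why a column MAC yields only a per-kernel partial sum, here is a small Python sketch (shapes, names, and the stride-1/no-padding assumption are ours, not the paper's exact PE configuration) of a convolution decomposed into per-kernel dot products that are then summed again across kernels, mirroring the adder-tree step described above.

```python
import numpy as np

def conv_via_partial_sums(x, filters):
    """x: (C, H, W) input; filters: (M, C, K, K). Stride 1, no padding."""
    M, C, K, _ = filters.shape
    E, F = x.shape[1] - K + 1, x.shape[2] - K + 1   # output feature map size
    out = np.zeros((M, E, F))
    for m in range(M):
        for i in range(E):
            for j in range(F):
                window = x[:, i:i+K, j:j+K]          # one sliding window
                # each kernel (one input channel of filter m) yields one partial sum
                partial_sums = [np.sum(window[c] * filters[m, c]) for c in range(C)]
                out[m, i, j] = sum(partial_sums)      # summed again across kernels
    return out
```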
4.2 Backpropagation Process
Fig. 6 shows the details of the error calculation in Eq. (2) for the first channel group (the a0 element) of the filters. The backward pass of a convolution is also a convolution, but with spatially flipped filters. In the FF, the products of an input sliding window with the same filter across all the channels are summed up to generate one output, which means all dot products in the same column of the PE are summed up. In the BP, however, the products of an input sliding window with the same channel across different filters are summed up, which means all dot products in the same row need to be summed up. Essentially, we process the transposed version of the weight matrix in the BP. As shown in Fig. 6, $W_{n+1}$ holds the weights for FF while $W_{n+1}^T$ represents the transposed weights for BP, and they are mapped to the same memory array. In BP, the input vector (i.e., the error of the next layer, $\partial L/\partial Y_{n+1}$) is applied to the columns in parallel, and the output vector (i.e., the error of the current layer, $\partial L/\partial Y_n$) is obtained from the rows in parallel. With such a transpose architecture, FF and error calculation can be performed within the same PE. No additional memory access is needed for the error calculation, which means an improvement in throughput and energy efficiency.
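The essence of the transpose read can be stated in two lines of linear algebra; below is a minimal sketch (array sizes are illustrative) showing that FF and error calculation read the same stored block, only in different directions, so no transposed copy of the weights is ever written.

```python
import numpy as np

W = np.random.randn(128, 64)     # one stored weight block (rows: inputs, cols: outputs)

x = np.random.randn(128)         # FF: activations drive the rows,
y = x @ W                        #     columns are summed in parallel

err_next = np.random.randn(64)   # BP: errors drive the columns,
err_curr = err_next @ W.T        #     rows are summed in parallel (Eq. (2))
```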
4.3 Weight Gradient Calculation
For the calculation of the weight gradient matrix $\Delta W_n$ in Eq. (3), we propose a CIM approach that uses additional 6T non-transpose CIM SRAM arrays to perform the outer-product multiplication between the error matrix $\partial L/\partial Y_n$ and the related input activation matrix $Y_{n-1}$. The mapping method for the $\Delta W_n$ calculation is shown in Fig. 7. $\partial L/\partial Y_n$, which is calculated in the previous step by Eq. (2), is first written into the 6T CIM SRAM array as the weights, and then $Y_{n-1}$ is loaded (from off-chip DRAM to the on-chip buffer) as the activations. Each plane of $\partial L/\partial Y_n$ is stretched into one long column whose length equals EF; the number of $\partial L/\partial Y_n$ channels is M, so there are M columns in total. Thus, $\partial L/\partial Y_n$ is mapped to a large weight matrix whose height and width equal EF and M, respectively. The sliding windows on each plane of the input $Y_{n-1}$ are also unrolled into a group of columns. These activation columns, whose length also equals the height of the weight matrix, are fed into the array cycle by cycle, thereby performing bitwise multiplication and accumulation. Partial sums from the same column over multiple cycles generate all the weight gradients of one filter, while partial sums from different columns form the entire gradient matrix $\Delta W_n$. In the normal batch training mode, the $\Delta W_n$ of each image is sent to off-chip DRAM for storage, and at the end of each batch, the weight gradients are loaded back and accumulated on-chip. The averaged $\Delta W_n$ is used as the input of the weight update that is performed in the peripheral circuit of the array, as described in the next sub-section.
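The sketch below restates this mapping in plain Python (stride-1, no-padding convolution; variable names and shapes are our assumptions): each error plane becomes one length-EF column, and each cycle feeds one unrolled activation window so that the column MACs directly produce the gradient entries of all M filters for that kernel position.

```python
import numpy as np

def conv_weight_gradient(y_prev, dL_dY):
    """y_prev: (C, H, W) stored activations Y_{n-1}; dL_dY: (M, E, F) errors.
    Returns dW of shape (M, C, K, K), i.e., Eq. (3) for a stride-1 convolution."""
    C, H, W = y_prev.shape
    M, E, F = dL_dY.shape
    K = H - E + 1
    err_cols = dL_dY.reshape(M, E * F).T          # EF x M "weight" matrix in the 6T array
    dW = np.zeros((M, C, K, K))
    for c in range(C):
        for ki in range(K):
            for kj in range(K):
                # one cycle: an unrolled activation column of length EF
                act_col = y_prev[c, ki:ki+E, kj:kj+F].reshape(E * F)
                # column MACs against all M stored error columns at once
                dW[:, c, ki, kj] = act_col @ err_cols
    return dW
```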
4.4 Weight Update
Fig. 8 shows the structure of the weight update module. The 6T SRAM row is disabled during FF and BP. When the accumulated weight gradients are ready after one batch, they are fed into a shift register to realize the multiplication with the learning rate by shifting (Eq. (4)), in a read-modify-write scheme that is done row by row. First, one row of the $\Delta W_n$
TABLE 1. 7T CIMAT Parameters

TABLE 2. 8T CIMAT Parameters

example, the sub-array total energy is 25.75 pJ/op, which is the total energy for one single sub-array to perform one vector-matrix multiplication, and the L1 buffer energy is 0.006 pJ/bit, which is the average energy to write one bit of data. The estimated energy cost of off-chip DRAM access is 4.2 pJ/bit from prior work [27], assuming 3D high-bandwidth memory is used. As described in Section 6.1, the multi-bit activations are sent to the weight arrays from LSB to MSB over eight cycles, and the outputs are accumulated with shift-and-add. The MAC value of the LSB after the ADC is 4-bit and is first stored in a register. Then, the MAC value of the MSB is shifted and added to the stored value in the register. The adder is designed to be 11-bit wide, which is enough to keep the carry-bit information for each shift-and-add operation of the 8-bit inputs.
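For illustration, a minimal Python sketch of this LSB-to-MSB bit-serial scheme is given below (bit widths, array sizes, and the digital stand-in for the analog column sum are our assumptions, not the exact ADC/adder design): each cycle produces one bit-plane's column MAC, which is shifted by its bit position and accumulated in a wider register.

```python
import numpy as np

def bit_serial_mac(acts, col_weights, n_bits=8):
    """acts: unsigned n_bits-bit activations; col_weights: 1-bit weights of one column."""
    acc = 0
    for b in range(n_bits):                                 # cycle 0 feeds the LSB
        bit_plane = (acts >> b) & 1                         # current activation bit of every row
        partial_sum = int(np.sum(bit_plane * col_weights))  # digitized column MAC
        acc += partial_sum << b                             # shift-and-add into a wide register
    return acc

# sanity check against the full-precision dot product
a = np.random.randint(0, 256, size=128)
w = np.random.randint(0, 2, size=128)
assert bit_serial_mac(a, w) == int(np.dot(a, w))
```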
The intermediate data generated during the training process are massive and hard to store completely in the on-chip SRAM buffer. For the 7T SRAM cell based CIMAT design, the global buffer size is 8 MB. To store the massive intermediate data, in FF the activation outputs of each layer $Y_n$ are sent to off-chip DRAM for reuse in the weight gradient calculation. The calculated errors of each layer for the batch input also need to be stored off-chip for the gradient calculation. After the weight gradient calculation for each image, $\Delta W_n$ is stored in off-chip DRAM for the weight update. Without including the energy cost of off-chip data transfer, the energy efficiency could reach 20 TOPS/W. Considering DRAM access, the energy efficiency decreases to 6 TOPS/W, as shown in Table 3. To further improve energy efficiency and throughput, we propose the 8T SRAM cell based CIMAT design to implement inter-/intra-pipelining. The circuit-level parameters of the 8T transpose SRAM based architecture are shown in Table 2. The 8T memory array is slightly larger than the 7T array due to the area overhead of the additional transistors. For the 8T-based architecture, since the FF calculation and the error calculation can be performed simultaneously, the L1, L2 and output buffers in each tile are enlarged to support concurrent bi-directional computation. Moreover, the global buffer size also needs to be enlarged to support the inter-pipeline among FF, error calculation, and gradient calculation. As shown in Table 2, the global buffer of the 8T design is 20 MB compared to 8 MB for the 7T design. Such a large buffer size is feasible, as the TPU uses a 24 MB on-chip buffer [2]. The other hardware of the 8T-based chip remains similar to the 7T-based chip. In total, we need 357 tiles to store all the 8-bit weights of the ResNet-18 network. Although the 8T-based pipeline can reduce DRAM access during feedforward and error calculation, DRAM access during the weight gradient calculation still cannot be eliminated. This observation is confirmed by our evaluation, as shown in Table 3. Compared to 6 TOPS/W for the 7T design, the 8T design achieves about 10 TOPS/W. However, the total energy efficiency is still limited by the massive off-chip memory access, whereas 55 TOPS/W can be reached without considering DRAM access.
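As a back-of-envelope consistency check of these figures (our own arithmetic, using only the TOPS/W numbers quoted above), converting efficiency to energy per operation shows that off-chip access dominates the total energy in both designs:

```python
# with-DRAM numbers: 6 TOPS/W (7T, Table 3) and 10.79 TOPS/W (8T, abstract)
for label, tops_w_chip, tops_w_total in [("7T", 20, 6), ("8T", 55, 10.79)]:
    e_chip = 1e15 / (tops_w_chip * 1e12)    # fJ per operation, on-chip only
    e_total = 1e15 / (tops_w_total * 1e12)  # fJ per operation, including DRAM
    dram_share = (e_total - e_chip) / e_total
    print(f"{label}: {e_chip:.0f} fJ/op on-chip, {e_total:.0f} fJ/op total, "
          f"DRAM ~{dram_share:.0%} of total energy")
# 7T: 50 fJ/op vs ~167 fJ/op total -> DRAM accounts for roughly 70% of the energy
# 8T: ~18 fJ/op vs ~93 fJ/op total -> DRAM accounts for roughly 80% of the energy
```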
6.3 Benchmark Evaluation
We compare the proposed CIMAT architectures with a baseline design that uses a regular 6T SRAM array with row-by-row read. As described in Section 6.1, the baseline architecture is a
TABLE 3. Benchmark Results
data into the analog domain. FloatPIM can support both floating-point and fixed-point precision. To overcome the internal data movement issue, the authors design a switch to enable in-parallel data transfer between neighboring blocks. The evaluation results show that FloatPIM can achieve on average 6.3× and 21.6× higher speedup and energy efficiency, respectively, compared to PipeLayer. Fujiki et al. [31] proposed Duality Cache, a single-instruction multiple-thread (SIMT) architecture that enables in-situ floating-point and transcendental functions to improve data-parallel acceleration. Their design improves performance by 3.6× and 4.0× for GPU benchmarks and OpenACC benchmarks, respectively.

8 CONCLUSION
We propose an SRAM-based compute-in-memory training architecture, namely CIMAT, which maximizes hardware reuse with transpose arrays based on novel 7T/8T bit-cell designs. A new CIM solution for error calculation is proposed with low hardware overhead. We focus on the ResNet-18 implementation in this work, but our proposed methodologies can be applied to other DNN models.
The experimental results show that the 7T SRAM based CIMAT can achieve 3.38× higher energy efficiency (6.02 TOPS/W), 4.34× higher frame rate (4,020 fps) and only 50 percent of the chip size (81.80 mm²) compared to the baseline architecture with a conventional 6T SRAM array that supports row-by-row read only. With the more advanced 8T bit cell and the optimized pipeline design, the 8T SRAM based CIMAT further achieves higher energy efficiency (10.79 TOPS/W) and more than 10× higher throughput (48,335 fps) with tolerable area overhead (121.51 mm²) compared to the 7T CIMAT. Our results reveal that CIM is a promising solution for implementing on-chip DNN training, as it can reduce off-chip data traffic significantly. The limited on-chip buffer will be a constraint for the pipelined implementation of deeper neural networks; possible replacement of the SRAM buffer with a larger but slower RRAM buffer is worthy of future exploration. Another limitation of the CIM approach is that extremely large DNN models are hard to store fully on-chip with today's silicon technology. On the one hand, the CIM approach is more attractive for edge devices due to its much improved energy efficiency, and from the algorithm's perspective there are many efforts on developing small networks for edge AI applications; advanced algorithmic methods like transfer learning also help reduce the training load. On the other hand, from the hardware perspective, a 5 nm process with 512 Mb (64 MB) of SRAM cache [32] has recently become available from the foundry, which could increase the capacity of CIM solutions for larger networks in the near future, considering the possible scaling to the 3 nm node.

ACKNOWLEDGMENTS
This work was supported in part by the Samsung GRO program.

REFERENCES
[1] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[2] N. P. Jouppi et al., "In-datacenter performance analysis of a tensor processing unit," in Proc. ACM/IEEE Int. Symp. Comput. Archit., 2017, pp. 1–12.
[3] R. Liu et al., "Parallelizing SRAM arrays with customized bit-cell for binary neural networks," in Proc. IEEE/ACM Design Autom. Conf., 2018, Art. no. 21.
[4] P. Chi et al., "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in Proc. ACM/IEEE Int. Symp. Comput. Archit., 2016, pp. 27–39.
[5] A. Shafiee et al., "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in Proc. ACM/IEEE Int. Symp. Comput. Archit., 2016, pp. 14–26.
[6] B. Li, L. Song, F. Chen, X. Qian, Y. Chen, and H. H. Li, "ReRAM-based accelerator for deep learning," in Proc. ACM/IEEE Des. Autom. Test Europe Conf., 2018, pp. 815–820.
[7] M. Chen et al., "TIME: A training-in-memory architecture for memristor-based deep neural networks," in Proc. ACM/IEEE Des. Autom. Conf., 2017, Art. no. 26.
[8] L. Song, X. Qian, H. Li, and Y. Chen, "PipeLayer: A pipelined ReRAM-based accelerator for deep learning," in Proc. IEEE Int. Symp. High Perform. Comput. Archit., 2017, pp. 541–552.
[9] X. Sun and S. Yu, "Impact of non-ideal characteristics of resistive synaptic devices on implementing convolutional neural networks," IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 9, no. 3, pp. 570–579, Sep. 2019.
[10] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks," in Proc. Int. Conf. Neural Inf. Process. Syst., 2016, pp. 4107–4115.
[11] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 525–542.
[12] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, "DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients," 2016, arXiv:1606.06160.
[13] S. Wu, G. Li, F. Chen, and L. Shi, "Training and inference with integers in deep neural networks," in Proc. Int. Conf. Learn. Representations, 2018.
[14] Z. Jiang, S. Yin, M. Seok, and J. S. Seo, "XNOR-SRAM: In-memory computing SRAM macro for binary/ternary deep neural networks," in Proc. IEEE Symp. VLSI Circuits, 2018, pp. 173–174.
[15] W. S. Khwa et al., "A 65nm 4Kb algorithm-dependent computing-in-memory SRAM unit-macro with 2.3ns and 55.8 TOPS/W fully parallel product-sum operation for binary DNN edge processors," in Proc. IEEE Int. Solid-State Circuits Conf., 2018, pp. 496–498.
[16] J. Zhang, Z. Wang, and N. Verma, "A machine-learning classifier implemented in a standard 6T SRAM array," in Proc. IEEE Symp. VLSI Circuits, 2016, pp. 1–2.
[17] P. Y. Chen, X. Peng, and S. Yu, "NeuroSim: A circuit-level macro model for benchmarking neuro-inspired architectures in online learning," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 37, no. 12, pp. 3067–3080, Dec. 2018.
[18] H. Jiang, X. Peng, S. Huang, and S. Yu, "CIMAT: A transpose SRAM-based compute-in-memory architecture for deep neural network on-chip training," in Proc. ACM Int. Symp. Memory Syst., 2019, pp. 490–496.
[19] X. Sun, S. Yin, X. Peng, R. Liu, J. S. Seo, and S. Yu, "XNOR-RRAM: A scalable and parallel resistive synaptic architecture for binary neural networks," in Proc. ACM/IEEE Des. Autom. Test Europe Conf., 2018, pp. 1423–1428.
[20] X. Si et al., "A twin-8T SRAM computation-in-memory macro for multiple-bit CNN-based machine learning," in Proc. IEEE Int. Solid-State Circuits Conf., 2019, pp. 396–398.
[21] K. Bong, S. Choi, C. Kim, S. Kang, Y. Kim, and H. J. Yoo, "A 0.62 mW ultra-low-power convolutional-neural-network face-recognition processor and a CIS integrated with always-on Haar-like face detector," in Proc. IEEE Int. Solid-State Circuits Conf., 2017, pp. 248–249.
[22] J. Wang et al., "A compute SRAM with bit-serial integer/floating-point operations for programmable in-memory vector acceleration," in Proc. IEEE Int. Solid-State Circuits Conf., 2019, pp. 224–226.
[23] T. Gokmen and Y. Vlasov, "Training deep convolutional neural networks with resistive cross-point devices," Front. Neurosci., vol. 11, 2017, Art. no. 538.
[24] X. Peng, R. Liu, and S. Yu, "Optimizing weight mapping and data flow for convolutional neural networks on RRAM based processing-in-memory architecture," in Proc. IEEE Int. Symp. Circuits Syst., 2019, pp. 1–5.
[25] [Online]. Available: https://github.com/neurosim/DNN_NeuroSim_V1.0
[26] R. Banner et al., "Scalable methods for 8-bit training of neural networks," in Proc. 32nd Int. Conf. Neural Inf. Process. Syst., 2018, pp. 5151–5159.
[27] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, "TETRIS: Scalable and efficient neural network acceleration with 3D memory," in Proc. ACM Int. Conf. Archit. Support Program. Lang. Operating Syst., 2017, pp. 751–764.
[28] C. Eckert et al., "Neural cache: Bit-serial in-cache acceleration of deep neural networks," in Proc. ACM/IEEE Int. Symp. Comput. Archit., 2018, pp. 383–396.
[29] P. Teich. [Online]. Available: https://www.nextplatform.com/2018/05/10/tearing-apart-googles-tpu-3-0-ai-coprocessor/
[30] M. Imani, S. Gupta, Y. Kim, and T. Rosing, "FloatPIM: In-memory acceleration of deep neural network training with high precision," in Proc. ACM/IEEE Int. Symp. Comput. Archit., 2019, pp. 802–815.
[31] D. Fujiki, S. Mahlke, and R. Das, "Duality cache for data parallel acceleration," in Proc. ACM/IEEE Int. Symp. Comput. Archit., 2019, pp. 397–410.
[32] G. Yeap et al., "5nm CMOS production technology platform featuring full-fledged EUV, and high mobility channel FinFETs with densest 0.021 µm² SRAM cells for mobile SoC and high performance computing applications," in Proc. IEEE Int. Electron Devices Meeting, 2019, pp. 36.7.1–36.7.4.

Hongwu Jiang received the BS degree from the Dalian University of Technology in 2012 and the MS degree in electrical engineering from Arizona State University in 2014. He is currently working toward the PhD degree in electrical and computer engineering at the Georgia Institute of Technology in Atlanta, Georgia. His research interests include SRAM-/eNVM-based hardware architecture and accelerator design for deep learning.

Xiaochen Peng received the BS degree in automation from the Hefei University of Technology in 2014 and the MS degree in electrical engineering from Arizona State University in 2016. She is currently working toward the PhD degree in electrical and computer engineering at the Georgia Institute of Technology in Atlanta, Georgia. Her research interests include the development of device-to-system benchmarking frameworks for machine learning accelerators and the design of emerging-device-based hardware implementations for neural networks.

Shanshi Huang received the BS degree in communication engineering from the Beijing Institute of Technology in 2012 and the MS degree in electrical engineering from Arizona State University in 2014. She is currently working toward the PhD degree in electrical and computer engineering at the Georgia Institute of Technology in Atlanta, Georgia. Her current research interests include deep learning algorithm and hardware co-design, and deep learning security.

Shimeng Yu (Senior Member, IEEE) received the BS degree in microelectronics from Peking University, Beijing, China, in 2009, and the MS and PhD degrees in electrical engineering from Stanford University, Stanford, California, in 2011 and 2013, respectively. He is currently an associate professor of electrical and computer engineering at the Georgia Institute of Technology in Atlanta, Georgia. From 2013 to 2018, he was an assistant professor of electrical and computer engineering at Arizona State University, Tempe, Arizona. His research interests are nanoelectronic devices and circuits for energy-efficient computing systems. His expertise is on emerging non-volatile memories (e.g., RRAM, ferroelectrics) for applications such as machine/deep learning accelerators, neuromorphic computing, monolithic 3D integration, and hardware security. He was a recipient of the NSF Faculty Early CAREER Award in 2016, the IEEE Electron Devices Society (EDS) Early Career Award in 2017, the ACM Special Interest Group on Design Automation (SIGDA) Outstanding New Faculty Award in 2018, and the Semiconductor Research Corporation (SRC) Young Faculty Award, among others.