Data and Hardware Efficient Design for Convolutional Neural Network
TABLE II
Data Reuse Policy
TABLE III
Data Amount per Batch for Three Different Data Reuse Strategies

Fig. 10. Average data access of the convolutional layer for different data reuse strategies.
TABLE IV
Efficiency Ratio of Data Reuse for Single-Batch Convolutional Layers of the Popular Networks

partial sum) to save extra write-out and read-in accesses. Thus, we choose this strategy in this paper. Though this strategy has been used in other works, e.g. [33], [34], our analytical equation helps find the best design point under the resource constraints through program search.

Table IV shows the data reuse efficiency compared to the non-reuse case (three data accesses per MAC (input, weight, and partial sum) versus the data amount of output first reuse), which is around 300× to 600× depending on the network structure.
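As a rough check of that range, the following sketch uses a simplified reuse model of our own (not the paper's Table III equations): with ideal output first reuse, each input pixel, weight, and output pixel touches external memory exactly once, while the non-reuse case needs three accesses per MAC. The layer shape is AlexNet's conv3.

```python
# Simplified sanity check of the reuse-efficiency range reported in Table IV.
# Assumption (ours): ideal output first reuse touches every input, weight,
# and output exactly once in external memory; no reuse needs 3 accesses/MAC.

def reuse_ratio(out_maps, in_maps, out_h, out_w, k):
    macs = out_maps * in_maps * out_h * out_w * k * k
    no_reuse = 3 * macs                      # input + weight + partial sum per MAC
    inputs = in_maps * out_h * out_w         # ignoring halo/padding for brevity
    weights = out_maps * in_maps * k * k
    outputs = out_maps * out_h * out_w
    return no_reuse / (inputs + weights + outputs)

# AlexNet conv3: 256 input maps -> 384 output maps, 13x13 outputs, 3x3 kernels
print(f"{reuse_ratio(384, 256, 13, 13, 3):.0f}x")  # ~450x, inside the 300x-600x range
```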
V. SYSTEM ARCHITECTURE

Fig. 11 shows the whole architecture of our CNN accelerator, which includes three data buffers with address generators for input, weight, and output, respectively, a configurable processing engine, and a partial result buffer. This CNN accelerator realizes an end-to-end implementation that includes the convolutional layer, pooling layer, fully connected layer, and ReLU layer. Moreover, the convolutional layer, pooling layer, and fully connected layer share the same processing engine to exploit hardware efficiency.

Fig. 12. Data flow of the CNN accelerator.

Fig. 12 shows the data flow of the CNN accelerator, which has two operating modes: the convolutional mode (convolutional layer, pooling layer, and ReLU layer) and the fully connected mode (fully connected layer, ReLU layer, and 1x1 convolutional layer). Here, the 1x1 convolutional layer is regarded as a fully connected layer with many batches. For inference of a CNN, the accelerator operates layer by layer. It is run-time reconfigured according to the layer structure through configuration data. The output of intermediate layers is stored to external memory and read back as the next layer's input.

For the proposed accelerator, we adopt the output first strategy to design the data flow of the convolutional layer. In this design, we set Ti = 1 to minimize the input buffer size, because the overall data amount per batch is independent of Ti in the output first strategy. That is, we compute just one input map at any given time.

Fig. 13 shows the data flow of the convolutional layer, which includes the initial stage for the first input map convolution, the final stage for the last one, and the intermediate stage for the others. In the initial stage, we set the initial partial sums to the bias terms. Then, in the intermediate stage, the partial sums are stored in the partial result buffer, except when no pooling layer follows. In that exceptional case, we save the final convolutional outputs into the output data buffer, and they are stored to the external memory afterwards. The same exception applies to the final stage. In addition, the final data path depends on whether a ReLU layer exists.

The hardware of the convolutional layer can execute the pooling layer as well if the operation is average or maximum.
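To make the three-stage flow concrete, here is a minimal behavioral sketch of our own (an illustration, not the RTL): with Ti = 1, one input map is visited per pass, the bias initializes the partial sums, and ReLU is optionally applied on the final write-out (pooling is omitted for brevity).

```python
# Behavioral sketch (ours) of the Ti = 1 output first convolution flow:
# one input map is processed at a time, and the partial sums of all output
# maps stay in the partial result buffer between passes.
import numpy as np

def conv_layer(inputs, weights, bias, stride=1, relu=True):
    """inputs: (C, H, W), weights: (N, C, k, k), bias: (N,)."""
    C, H, W = inputs.shape
    N, _, k, _ = weights.shape
    oh, ow = (H - k) // stride + 1, (W - k) // stride + 1
    # Initial stage: partial sums start as the bias terms.
    psum = np.broadcast_to(bias[:, None, None], (N, oh, ow)).copy()
    for c in range(C):                      # Ti = 1: one input map per pass
        for y in range(oh):
            for x in range(ow):
                win = inputs[c, y*stride:y*stride+k, x*stride:x*stride+k]
                psum[:, y, x] += np.tensordot(weights[:, c], win,
                                              axes=([1, 2], [0, 1]))
        # Intermediate stage: psum remains in the partial result buffer.
    out = np.maximum(psum, 0) if relu else psum  # final stage: optional ReLU
    return out                                   # goes to the output data buffer
```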
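A quick numerical check of the earlier claim that a 1x1 convolutional layer is a fully connected layer with many batches: each pixel position plays the role of one batch of the FC computation. The shapes below are illustrative.

```python
# A 1x1 convolution over a (C, H, W) map equals a fully connected layer
# (C inputs -> N outputs) applied to each of the H*W pixel positions,
# i.e., an FC layer with H*W "batches".
import numpy as np

C, N, H, W = 256, 384, 13, 13
x = np.random.rand(C, H, W)
w = np.random.rand(N, C)

conv1x1 = np.einsum('nc,chw->nhw', w, x)                  # 1x1 convolution
fc_batched = (w @ x.reshape(C, H * W)).reshape(N, H, W)   # FC with H*W batches
assert np.allclose(conv1x1, fc_batched)
```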
Fig. 18. Input data buffer effect on (a) hardware utilization, and (b) average data access.
Fig. 19. Output data buffer effect on (a) hardware utilization, and (b) average data access.
Fig. 20. Weight data buffer effect on (a) hardware utilization, and (b) average data access.
Fig. 22. Weight bandwidth effect on the fully connected layer.

The optimal input and output buffer sizes are determined by the convolutional layers, because the data per tile is always one for the fully connected layer, so the fully connected layer needs only small input and output data buffers.

The weight buffer size of the convolutional layer is just the number of hardware MACs (denoted as M∗). For the fully connected layer, the weight buffer size, Ti·M∗, cannot improve the hardware utilization beyond an upper bound set by the limited weight bandwidth. In short, this buffer size is relatively insensitive once it exceeds a certain level. Despite this, the weight buffer size matters more for the fully connected layers than for the convolutional layers. Thus, we use the fully connected layers to decide the optimized weight buffer size.
For the weight bandwidth problem, we assume large scale hardware with 432 MACs and small scale hardware with 54 MACs, with a high bandwidth case of 9 weight loadings per cycle and a low bandwidth case of 1 weight loading per cycle for the analysis in Fig. 21. Fig. 21 shows that we need higher weight bandwidth and/or more batches for larger scale convolutional layer hardware. For the fully connected layer, on the other hand, this problem becomes worse, and it exists even in the smaller scale hardware. In the analysis of Fig. 22, we assume 108 MACs, a high bandwidth case of 27 weight loadings per cycle, and a low bandwidth case of 1 weight loading per cycle. Fig. 22 shows that the fully connected layers need much higher weight bandwidth and/or larger batches.

C. Generator

Based on the above basic architecture and analysis, we use a generator that accepts input configurations and design constraints and generates the optimized design and its parameters. These parameters include the static Verilog parameters and dynamic configuration data.
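The qualitative conclusion of Figs. 21 and 22 can be reproduced with a first-order utilization model of our own (not the paper's exact analysis): a fully connected layer reuses each loaded weight only across the batch, so the sustained MAC rate is capped at bw × batch.

```python
# First-order model (our assumption): an FC layer reuses each loaded weight
# `batch` times, so at most bw * batch MACs can be fed per cycle, and the
# utilization of a MAC array of size `macs` saturates at 1.

def fc_utilization(macs, bw, batch):
    """macs: MAC array size; bw: weights loaded per cycle; batch: batch size."""
    return min(1.0, bw * batch / macs)

# The Fig. 22 setting: 108 MACs, high bandwidth 27, low bandwidth 1.
for bw in (27, 1):
    for batch in (1, 4, 16, 108):
        print(f"bw={bw:2d} batch={batch:3d} -> {fc_utilization(108, bw, batch):.2f}")
# With bw = 1, full utilization needs batch = 108: FC layers demand much
# higher weight bandwidth and/or larger batches, matching Fig. 22.
```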
TABLE VI
The Implementation Results of the Batch-4 AlexNet Case

TABLE VII
Comparisons on the Convolutional Layers of the Batch-4 AlexNet Case

Fig. 23. The flow of the generator.
Fig. 24. An example for the design space exploration of AlexNet C1.

TABLE V
The Configurations of the Example Implementation
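For illustration, here is a minimal sketch of the kind of constraint-driven search that Figs. 23 and 24 depict, assuming a simplified cost model (external data access per frame) and hypothetical MAC and buffer budgets; the paper's generator uses its own analytical equations.

```python
# Minimal design-space-exploration sketch in the spirit of Figs. 23-24.
# Simplifications are ours: Ti = 1, a coarse per-tile access model, and
# hypothetical budgets. It enumerates output tile sizes under MAC and
# buffer constraints and keeps the lowest external data access.
from itertools import product

def explore(layer, mac_budget, buffer_budget):
    C, N, H, W, k = layer["C"], layer["N"], layer["H"], layer["W"], layer["k"]
    best = None
    for tr, tc, to in product(range(1, H + 1), range(1, W + 1), range(1, N + 1)):
        if tr * tc * to > mac_budget:            # processing-engine constraint
            continue
        # Buffer words: input patch (Ti = 1) + weights + partial sums.
        buffers = (tr + k - 1) * (tc + k - 1) + to * k * k + tr * tc * to
        if buffers > buffer_budget:              # on-chip buffer constraint
            continue
        tiles = -(-H // tr) * -(-W // tc) * -(-N // to)   # ceil divisions
        access = tiles * ((tr + k - 1) * (tc + k - 1) * C + to * C * k * k) \
                 + N * H * W                     # outputs written once
        if best is None or access < best[0]:
            best = (access, (tr, tc, to))
    return best

# AlexNet C1-like shape with hypothetical budgets, for illustration only.
print(explore({"C": 3, "N": 96, "H": 55, "W": 55, "k": 11},
              mac_budget=216, buffer_budget=4096))
```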
TABLE VIII
Comparisons Between CNN Accelerators
Table VII shows comparisons on the convolutional layers of the batch-4 AlexNet case. Our design has much higher throughput and lower area cost than Eyeriss [33]. Thus, the area efficiency of our design is higher even after technology scaling. This is because our connections are more regular, which lowers the hardware cost. For the bandwidth comparison, we use the uncompressed bandwidth of [33], since their compression method can be adopted here as well. Under this condition, our design needs a smaller internal buffer to obtain almost the same data access per frame. In addition, our design also has higher area and bandwidth efficiency than [33]. Thus, our data reuse strategy indeed relieves the data bandwidth.

Table VIII shows the comparisons between different CNN accelerators. In this table, we list only designs implemented in ASIC processes instead of FPGA chips, due to space limitations. For a fair comparison, we have scaled all throughput to the same process node. Here, the area cost of [25] is the smallest because it uses fixed connections and implements only the convolutional layers. Thus, it has the highest area efficiency, but with limited applicability. Apart from that design, the area efficiency of our design is the highest among these works. The reason for this high efficiency is the proposed architecture design with high hardware utilization and the optimal design generator under design constraints.

VII. CONCLUSION

In this paper, we propose a run-time reconfigurable CNN architecture that can handle non-unit stride kernels and process different kernel types efficiently to overcome the large computational complexity and data amount of the CNN algorithm. The overall design exploits layer specific characteristics for the buffer design and data bandwidth. The hardware design adopts a tile based model and the output first strategy to reuse data of the convolutional layers, with 300× ∼ 600× better efficiency than the non-reused case. Designs for different networks can be generated by a CNN generator for a target network, optimal under the given hardware resources and data bandwidth. An example design for AlexNet in a TSMC 40nm process consumes a 1.783M gate count for 216 MACs and a 142.64 KB internal buffer, and achieves 61.6 fps at a 454 MHz clock frequency, which shows better area and bandwidth efficiency compared to other state-of-the-art works.
REFERENCES

[1] W. Qadeer, R. Hameed, O. Shacham, P. Venkatesan, C. Kozyrakis, and M. A. Horowitz, “Convolution engine: Balancing efficiency & flexibility in specialized computing,” in Proc. ACM Int. Symp. Comput. Archit., 2013, pp. 24–35.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. NIPS, 2012, pp. 1097–1105.
[3] C. Szegedy et al., “Going deeper with convolutions,” in Proc. IEEE CVPR, Jun. 2015, pp. 1–9.
[4] K. Simonyan and A. Zisserman. (2014). “Very deep convolutional networks for large-scale image recognition.” [Online]. Available: https://arxiv.org/abs/1409.1556
[5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE CVPR, Jun. 2016, pp. 770–778.
[6] S. Ji, W. Xu, M. Yang, and K. Yu, “3D convolutional neural networks for human action recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 221–231, Jan. 2013.
[7] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. (2014). “OverFeat: Integrated recognition, localization and detection using convolutional networks.” [Online]. Available: https://arxiv.org/abs/1312.6229
[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. IEEE CVPR, Jun. 2014, pp. 580–587.
[9] T. He, W. Huang, Y. Qiao, and J. Yao, “Text-attentional convolutional neural network for scene text detection,” IEEE Trans. Image Process., vol. 25, no. 6, pp. 2529–2541, Jun. 2016.
[10] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, “A convolutional neural network cascade for face detection,” in Proc. IEEE CVPR, Jun. 2015, pp. 5325–5334.
[11] D. Tomè, F. Monti, L. Baroffio, L. Bondi, M. Tagliasacchi, and S. Tubaro, “Deep convolutional neural networks for pedestrian detection,” Signal Process., Image Commun., vol. 47, pp. 482–489, Sep. 2016.
[12] Z. Du et al., “ShiDianNao: Shifting vision processing closer to the sensor,” in Proc. ISCA, Jun. 2015, pp. 92–104.
[13] Y. Chen et al., “DaDianNao: A machine-learning supercomputer,” in Proc. MICRO, 2014, pp. 609–622.
[14] Y. Chen, X. Yang, B. Zhong, S. Pan, D. Chen, and H. Zhang, “CNNTracker: Online discriminative object tracking via deep convolutional neural network,” Appl. Soft Comput., vol. 38, pp. 1088–1098, Jan. 2016.
[15] H. Li, Y. Li, and F. Porikli, “Robust online visual tracking with a single convolutional neural network,” in Proc. ACCV, 2014, pp. 194–209.
[16] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and Y. LeCun, “NeuFlow: A runtime reconfigurable dataflow processor for vision,” in Proc. IEEE CVPRW, Jun. 2011, pp. 109–116.
[17] L. Cavigelli, M. Magno, and L. Benini, “Accelerating real-time embedded scene labeling with convolutional networks,” in Proc. DAC, 2015, p. 108.
[18] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using places database,” in Proc. NIPS, 2014, pp. 487–495.
[19] P.-H. Pham, D. Jelaca, C. Farabet, B. Martini, Y. LeCun, and E. Culurciello, “NeuFlow: Dataflow vision processing system-on-a-chip,” in Proc. MWSCAS, 2012, pp. 1044–1047.
[20] F. Conti and L. Benini, “A ultra-low-energy convolution engine for fast brain-inspired vision in multicore clusters,” in Proc. IEEE Des. Autom. Test Eur. Conf., Mar. 2015, pp. 683–688.
[21] W. Yang, W. Ouyang, H. Li, and X. Wang, “End-to-end learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation,” in Proc. IEEE CVPR, Jun. 2016, pp. 3073–3082.
[22] T. Pfister, J. Charles, and A. Zisserman, “Flowing convnets for human pose estimation in videos,” in Proc. IEEE ICCV, Dec. 2015, pp. 1913–1921.
[23] D. Weimer, B. Scholz-Reiter, and M. Shpitalni, “Design of deep convolutional neural network architectures for automated feature extraction in industrial inspection,” CIRP Ann.-Manuf. Technol., vol. 65, no. 1, pp. 417–420, 2016.
[24] X. Bian, S. N. Lim, and N. Zhou, “Multiscale fully convolutional network with application to industrial inspection,” in Proc. WACV, 2016, pp. 1–8.
[25] L. Cavigelli and L. Benini. (Dec. 2015). “Origami: A 803 GOp/s/W convolutional network accelerator.” [Online]. Available: https://arxiv.org/abs/1512.04295
[26] S. Han et al., “EIE: Efficient inference engine on compressed deep neural network,” in Proc. ISCA, 2016, pp. 243–254.
[27] V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello, “A 240 G-ops/s mobile coprocessor for deep neural networks,” in Proc. IEEE CVPRW, Jun. 2014, pp. 682–687.
[28] M. Peemen, A. A. Setio, B. Mesman, and H. Corporaal, “Memory-centric accelerator design for convolutional neural networks,” in Proc. ICCD, 2013, pp. 13–19.
[29] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing FPGA-based accelerator design for deep convolutional neural networks,” in Proc. ISFPGA, 2015, pp. 161–170.
[30] J. Qiu et al., “Going deeper with embedded FPGA platform for convolutional neural network,” in Proc. ISFPGA, 2016, pp. 26–35.
[31] L. Cavigelli, D. Gschwend, C. Mayer, S. Willi, B. Muheim, and L. Benini, “Origami: A convolutional network accelerator,” in Proc. GLSVLSI, 2015, pp. 199–204.
[32] J. Sim, J.-S. Park, M. Kim, D. Bae, Y. Choi, and L.-S. Kim, “A 1.42TOPS/W deep convolutional neural network recognition processor for intelligent IoE systems,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Jan./Feb. 2016, pp. 264–265.
[33] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Jan./Feb. 2016, pp. 262–263.
[34] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, “Envision: A 0.26-to-10TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28nm FDSOI,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2017, pp. 246–247.
[35] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Proc. ECCV, 2014, pp. 818–833.
[36] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in Proc. AISTATS, 2011, p. 275.
[37] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proc. ICML, 2010, pp. 807–814.
[38] C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun, “CNP: An FPGA-based processor for convolutional networks,” in Proc. FPL, 2009, pp. 32–37.
[39] T. Chen et al., “DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” in Proc. ASPLOS, 2014, pp. 269–284.
[40] D. Liu et al., “PuDianNao: A polyvalent machine learning accelerator,” in Proc. ASPLOS, 2015, pp. 369–381.
[41] T. Chen et al., “A high-throughput neural network accelerator,” IEEE Micro, vol. 35, no. 3, pp. 24–32, May 2015.
[42] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks,” in Proc. ISCA, 2016, pp. 367–379.

Yue-Jin Lin received the M.S. degree in electronics engineering from National Chiao Tung University, Hsinchu, Taiwan, in 2016. He is currently with Novatek, Hsinchu. His current research interests are image processing, deep learning, and digital integrated circuits.

Tian Sheuan Chang (S’93–M’06–SM’07) received the B.S., M.S., and Ph.D. degrees in electronic engineering from National Chiao Tung University (NCTU), Hsinchu, Taiwan, in 1993, 1995, and 1999, respectively. From 2000 to 2004, he was a Deputy Manager with Global Unichip Corporation, Hsinchu. In 2004, he joined the Department of Electronics Engineering, NCTU, where he is currently a Professor. In 2009, he was a Visiting Scholar with Imec, Belgium. His current research interests include system-on-a-chip design, VLSI signal processing, and computer architecture. Dr. Chang received the Excellent Young Electrical Engineer Award from the Chinese Institute of Electrical Engineering in 2007, and the Outstanding Young Scholar Award from the Taiwan IC Design Society in 2010. He has been actively involved in many international conferences as an organizing committee or technical program committee member.