2020_01_15_vivienne_sze_efficient_computing
In collaboration with Luca Carlone, Yu-Hsin Chen, Joel Emer, Sertac Karaman, Tushar Krishna,
Thomas Heldt, Trevor Henderson, Hsin-Yu Lai, Peter Li, Fangchang Ma, James Noraky, Gladynel
Saavedra Peña, Charlie Sodini, Amr Suleiman, Nellie Wu, Diana Wofk, Tien-Ju Yang, Zhengdong Zhang
Slides available at
https://tinyurl.com/SzeMITDL2020
[Figure: compute used in AI training vs. year, showing exponential growth and the hardware slowdown. Source: OpenAI (https://openai.com/blog/ai-and-compute/), Feb 2018]
• Geometric Understanding
• Semantic Segmentation
State-of-the-art approaches use Deep Neural Networks, which require up to several hundred million operations and weights to compute: >100x more complex than video compression.
Example of image classification. Input: image. Output: "Volvo XC90"
• Convolutional Layer
– Feed forward, sparsely-connected w/ weight sharing
– Convolutional Neural Network (CNN)
– Typically used for images
• Attention Layer/Mechanism
– Attention (matrix multiply) + feed forward, fully connected
– Transformer [Vaswani, NeurIPS 2017]
[Diagrams: input layer → hidden layer → output layer for each layer type]
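As a rough illustration of why sparse connectivity with weight sharing matters, the parameter counts of a convolutional layer and of a fully connected layer over the same activations can be compared (a sketch with illustrative shapes, not numbers from the slides):

```python
# Contrast parameter counts: convolution (weight sharing) vs. fully connected.
# R/S = filter height/width, C/M = input/output channels, H/W and E/F = fmap sizes.
def conv_params(C, M, R, S):
    return M * C * R * S                  # one small filter reused at every position

def fc_params(C, H, W, M, E, F):
    return (C * H * W) * (M * E * F)      # every input connects to every output

# A 3x3 conv, 64 -> 64 channels, on a 56x56 fmap (output also 56x56):
print(conv_params(64, 64, 3, 3))          # 36864 weights
print(fc_params(64, 56, 56, 64, 56, 56))  # ~40 billion weights
```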
[Figure series: building up a high-dimensional convolution. Element-wise multiplication and partial sum (psum) accumulation of a filter (S wide, R tall) against the input fmap (W wide, H tall), with sliding window processing producing the output fmap (F wide, E tall); summed over many input channels (C); repeated for many output channels (M); and over an image batch of size 1 to 256 (N).]
Vivienne Sze (@eems_mit)
Define Shape for Each Layer
Shape parameters of the input fmaps, filters, and output fmaps:
– H: height of input fmap (activations)
– W: width of input fmap (activations)
– C: number of 2-D input fmaps / filters (channels)
– R: height of 2-D filter (weights)
– S: width of 2-D filter (weights)
– M: number of 2-D output fmaps (channels)
– E: height of output fmap (activations)
– F: width of output fmap (activations)
– N: number of input fmaps / output fmaps (batch size)
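In this notation, a direct convolution is a deep loop nest over the shape parameters. A minimal pure-Python sketch (stride 1, no padding; names chosen to match the parameters above):

```python
def conv_layer(ifmaps, filters, stride=1):
    """Direct convolution in the shape notation above (pure-Python sketch).

    ifmaps:  nested list [N][C][H][W] of input activations
    filters: nested list [M][C][R][S] of weights
    returns: nested list [N][M][E][F] of output activations
    """
    N, C = len(ifmaps), len(ifmaps[0])
    H, W = len(ifmaps[0][0]), len(ifmaps[0][0][0])
    M, R, S = len(filters), len(filters[0][0]), len(filters[0][0][0])
    E = (H - R) // stride + 1               # output fmap height
    F = (W - S) // stride + 1               # output fmap width
    out = [[[[0.0] * F for _ in range(E)] for _ in range(M)] for _ in range(N)]
    for n in range(N):                      # batch
        for m in range(M):                  # output channels
            for e in range(E):              # output rows
                for f in range(F):          # output cols
                    psum = 0.0              # partial sum (psum)
                    for c in range(C):      # input channels
                        for r in range(R):
                            for s in range(S):
                                psum += (ifmaps[n][c][e * stride + r][f * stride + s]
                                         * filters[m][c][r][s])
                    out[n][m][e][f] = psum
    return out

# N=1 input with C=2 channels of 4x4; M=3 filters of shape 2x3x3
ifmaps = [[[[1.0] * 4 for _ in range(4)] for _ in range(2)]]
filters = [[[[1.0] * 3 for _ in range(3)] for _ in range(2)] for _ in range(3)]
out = conv_layer(ifmaps, filters)
# output shape is N=1 x M=3 x E=2 x F=2; each value is C*R*S = 18
```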
[Table: example per-block layer shapes (kernel sizes 1x1, 3x3, 5x5 with varying channel counts) from an efficient network; shapes vary widely across the layers of a network.]
[Figures: dataflow taxonomy on a spatial array of PEs. Filters and input fmaps are shared across PEs; one diagram holds weights (W0 to W7) in the PEs, one holds psums (P0 to P7) in the PEs, and one holds activations (I0 to I7) in the PEs.]
• Maximize row convolutional reuse in RF
– Keep a filter row and fmap sliding window in RF
(Example: PE 1 computes filter Row 1 * fmap Row 1)
[Chen, ISCA 2016] Selected for Micro Top Picks

Row Stationary Dataflow
PE grid (each column of PEs produces one output row):
PE 1: Row 1 * Row 1    PE 4: Row 1 * Row 2    PE 7: Row 1 * Row 3
PE 2: Row 2 * Row 2    PE 5: Row 2 * Row 3    PE 8: Row 2 * Row 4
PE 3: Row 3 * Row 3    PE 6: Row 3 * Row 4    PE 9: Row 3 * Row 5
The psums from the PEs in a column are accumulated into output rows 1, 2, and 3.
Optimize for overall energy efficiency instead of for only a certain data type.
[Chen, ISCA 2016] Selected for Micro Top Picks
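The 3x3 PE grid above can be modeled in software: each PE performs a 1-D convolution of one filter row with one fmap row, and the psums from a column of PEs are summed into one output row (a simplified model, assuming stride 1 and no padding):

```python
def conv1d(filt_row, fmap_row):
    """One PE: 1-D convolution of a filter row with an fmap row (stride 1)."""
    S, W = len(filt_row), len(fmap_row)
    return [sum(filt_row[s] * fmap_row[f + s] for s in range(S))
            for f in range(W - S + 1)]

def row_stationary_2d(filt, fmap):
    """2-D convolution as in the PE grid: the PE at (filter row r, output row e)
    computes filt row r * fmap row r+e; psums from the R PEs in a column are
    summed element-wise into output row e."""
    R, H = len(filt), len(fmap)
    E = H - R + 1
    out = []
    for e in range(E):                                    # one PE column per output row
        psums = [conv1d(filt[r], fmap[r + e]) for r in range(R)]
        out.append([sum(col) for col in zip(*psums)])     # accumulate psums
    return out

# 3x3 filter of ones over a 5x5 fmap of ones -> 3x3 output, every value 9
filt = [[1.0] * 3 for _ in range(3)]
fmap = [[1.0] * 5 for _ in range(5)]
print(row_stationary_2d(filt, fmap))
```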
Dataflow Comparison: CONV Layers
[Chart: normalized energy/MAC (0 to 2) for CNN dataflows WS, OSA, OSB, OSC, NLR, and RS, broken down into psums, weights, and pixels; RS has the lowest overall energy.]
Exploit sparsity:
• Gating the PE when the activation is zero (== 0 disables the buffer read and datapath) enables a 45% power reduction.
• [Chart: DRAM access (MB) for AlexNet CONV layers 1 to 5, RLE-compressed fmaps + weights vs. uncompressed; reductions of 1.7x, 1.8x, and 1.9x on the labeled layers.]
[Chen, ISSCC 2016]
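The compression idea can be sketched as run-length encoding of zero runs in the activations (a simplified model; the chip's actual RLE bitstream format differs):

```python
def rle_encode(acts):
    """Encode a list of activations as (zero_run, value) pairs.
    Effective when activations are sparse, e.g. after ReLU."""
    out, run = [], 0
    for a in acts:
        if a == 0:
            run += 1                 # extend the current zero run
        else:
            out.append((run, a))     # emit run length + nonzero value
            run = 0
    if run:
        out.append((run, None))      # trailing zeros, no value
    return out

def rle_decode(pairs):
    acts = []
    for run, val in pairs:
        acts.extend([0] * run)
        if val is not None:
            acts.append(val)
    return acts

data = [0, 0, 5, 0, 0, 0, 7, 1, 0]
assert rle_decode(rle_encode(data)) == data
```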
36 Eyeriss: Deep Neural Network Accelerator
4mm
Spatial
PE Array
On-chip Buffer
4mm
[Chen, ISSCC 2016]
Exploits data reuse for 100x reduction in memory accesses from global
buffer and 1400x reduction in memory accesses from off-chip DRAM
Overall >10x energy reduction compared to a mobile GPU (Nvidia TK1)
Results for AlexNet
Eyeriss Project Website: http://eyeriss.mit.edu
Vivienne Sze ( @eems_mit) [Joint work with Joel Emer]
Features: Energy vs. Accuracy
[Plot: energy/pixel (nJ, log scale from 0.1 to 10000) vs. accuracy (average precision on the VOC 2007 dataset). HOG [Suleiman, VLSI 2016] sits near 1 nJ/pixel, close to video compression; AlexNet and VGG16 [Chen, ISSCC 2016] reach higher accuracy at orders of magnitude more energy. All measured in 65nm; feature extraction only (does not include data, classification energy, augmentation and ensemble, etc.). Accuracy measured with DPM v5 [Girshick, 2012] for HOG and Fast R-CNN [Girshick, CVPR 2015] for the DNNs.]
Network simplification:
• Pruning: remove synapses and neurons from the network
• Compact architectures: decompose R x S x C filters into smaller (e.g., 1x1) filters. Examples: SqueezeNet, MobileNet
Energy cost of data movement: to run a MAC, data may be fetched from off-chip DRAM, the on-chip global buffer, a neighboring PE, or the PE's local register file. The energy of a weight access depends on the memory hierarchy and dataflow.
41 Energy-Evaluation Methodology
…
Optimization # acc. at mem. level n Edata
Weights
Energy Consumption 22%
of GoogLeNet
…
Energy 10.5
…
…
Energy 41 … 46
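The methodology amounts to a weighted sum: accesses at each memory level times an energy cost per access. A sketch, with placeholder relative costs chosen only to reflect the DRAM >> buffer >> RF ordering (not measured numbers):

```python
# Illustrative relative energy per access, normalized so that one MAC = 1.
# Placeholder values: real costs come from circuit-level characterization.
ENERGY_PER_ACCESS = {"RF": 1, "NoC": 2, "global_buffer": 6, "DRAM": 200}

def data_energy(accesses):
    """accesses: dict mapping memory level -> number of accesses at that level."""
    return sum(n * ENERGY_PER_ACCESS[lvl] for lvl, n in accesses.items())

def total_energy(num_macs, weight_acc, act_acc, psum_acc, e_mac=1):
    """Computation energy plus data-movement energy for each data type."""
    return (num_macs * e_mac
            + data_energy(weight_acc)
            + data_energy(act_acc)
            + data_energy(psum_acc))

# 100 MACs; each weight fetched once from DRAM, activations/psums kept in the RF
print(total_energy(100, {"DRAM": 1}, {"RF": 10}, {"RF": 5}))  # 315
```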
NetAdapt: platform-aware network adaptation. Generate network proposals (A, B, C, D, ... Z), measure each directly on the target platform, and iterate to produce the adapted network.
Code available at http://netadapt.mit.edu [Yang, ECCV 2018]
In collaboration with Google's Mobile Vision Team
46 Simplified Example of One Iteration
3. Maximize
1. Input 2. Meet Budget 4. Output
Accuracy
Layer 1
Selected Selected
…
…
Layer 4
Selected
+0.3% accuracy
1.7x faster
+0.3% accuracy
1.6x faster
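The iteration above can be sketched as a budget-driven greedy loop. The `latency` and `accuracy` callables below are toy stand-ins; the real NetAdapt uses empirical on-device measurements and short fine-tuning instead:

```python
def netadapt_step(layers, budget, latency, accuracy):
    """One NetAdapt iteration: for each layer, generate a proposal that
    simplifies that layer until the latency budget is met, then keep the
    proposal with the highest accuracy. `layers` is a list of filter counts."""
    best = None
    for i in range(len(layers)):
        proposal = list(layers)
        while latency(proposal) > budget and proposal[i] > 1:
            proposal[i] -= 1                 # prune filters in layer i only
        if latency(proposal) > budget:
            continue                         # this layer alone cannot meet the budget
        acc = accuracy(proposal)
        if best is None or acc > best[1]:
            best = (proposal, acc)
    return best

# Toy stand-ins: latency = total filter count; accuracy favors balanced layers.
print(netadapt_step([8, 8, 8], budget=20, latency=sum, accuracy=min))
# → ([4, 8, 8], 4)
```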
Fast depth estimation: the network has a reduction (encoder) stage, similar to classification, followed by an expansion (decoder) stage; the simplified network achieves a >10x reduction and runs at ~40 fps on an iPhone.
Decompose the convolutional layer into a depth-wise layer (one R x S spatial filter per channel) and a point-wise layer (1x1 convolutions that mix the C channels).
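The MAC savings of this decomposition can be worked out directly from the shape parameters (illustrative shapes, not from the slides):

```python
def standard_conv_macs(C, M, R, S, E, F):
    """MACs for a standard convolutional layer."""
    return C * M * R * S * E * F

def depthwise_separable_macs(C, M, R, S, E, F):
    """MACs for a depth-wise layer plus a point-wise (1x1) layer."""
    depthwise = C * R * S * E * F    # one R x S filter per input channel
    pointwise = C * M * E * F        # 1x1 convolutions mix the channels
    return depthwise + pointwise

# 3x3 filters, C = M = 64 channels, 56x56 output fmap:
std = standard_conv_macs(64, 64, 3, 3, 56, 56)
sep = depthwise_separable_macs(64, 64, 3, 3, 56, 56)
print(round(std / sep, 2))  # → 7.89, i.e. roughly 8x fewer MACs
```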
Reduce Precision: replace 32-bit floating-point values (e.g., 10100101000000000101000000000100) with lower bit-width representations.
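A minimal sketch of reduced precision: uniform linear quantization of floats to 8-bit signed integers with a shared scale factor (one possible scheme among many; not the specific method from the slides):

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers using one shared scale factor."""
    qmax = 2 ** (num_bits - 1) - 1                 # e.g. 127 for 8 bits
    scale = max(abs(v) for v in values) / qmax or 1.0
    q = [round(v / scale) for v in values]         # integers in [-qmax-1, qmax]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.52, -1.3, 0.07, 0.9]
q, scale = quantize(weights)
approx = dequantize(q, scale)
# rounding error is bounded by half a quantization step (scale / 2)
assert all(abs(a - w) <= scale / 2 for a, w in zip(approx, weights))
```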
[Figure: which operand to hold in the MAC array depends on the layer shape. One mapping holds weights in the MAC array and streams activations from memory (weight reuse); the other holds activations and streams weights (activation reuse); the number of feature-map input channels or the batch size determines the better choice.]
[Figure: hierarchical architecture: clusters of PEs and global buffers connected by routers, with (a) a mesh network between clusters and (b) an all-to-all network within each cluster.]
• Different layers
– CONV, FC, depth wise, etc.
• Wide range of sparsity
– Dense and sparse
• Scalable architecture
Speed-up over Eyeriss v1 scales with the number of PEs:

# of PEs     256      1024     16384
AlexNet      17.9x    71.5x    1086.7x
GoogLeNet    10.4x    37.8x    448.8x
MobileNet    15.7x    57.9x    873.0x
Super-resolution (SR): transmit low resolution for lower bandwidth while screens are getting larger; decode, then apply the SR algorithm. FAST: 15x faster.
Navion Project Website: http://navion.mit.edu [Zhang et al., RSS 2017], [Suleiman et al., VLSI 2018]
[Suleiman, VLSI-C 2018] Best Student Paper Award

Where to Go Next: Planning and Mapping
Robot Exploration: decide where to go by computing Shannon Mutual Information (MI).
Loop: select candidate scan locations → compute Shannon MI and choose the best location → move to that location and scan → update the occupancy map.
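The exploration loop scores candidate locations by how much information a scan there would yield. A simplified sketch: per-cell Shannon entropy of the occupancy grid is used as a proxy for the information gain of a scan (the slides compute the exact Shannon MI; the square sensor footprint here is an assumption):

```python
from math import log2

def cell_entropy(p):
    """Shannon entropy (bits) of one occupancy-grid cell with P(occupied) = p."""
    if p in (0.0, 1.0):
        return 0.0                         # fully known cell: nothing to learn
    return -p * log2(p) - (1 - p) * log2(1 - p)

def score_location(grid, loc, sensor_range=2):
    """Sum the entropy of the cells a scan from `loc` would observe.
    Unknown cells (p near 0.5) contribute most; known cells contribute ~0."""
    (x, y), total = loc, 0.0
    for i in range(max(0, x - sensor_range), min(len(grid), x + sensor_range + 1)):
        for j in range(max(0, y - sensor_range), min(len(grid[0]), y + sensor_range + 1)):
            total += cell_entropy(grid[i][j])
    return total

def choose_best(grid, candidates):
    """Pick the candidate scan location with the highest expected information."""
    return max(candidates, key=lambda loc: score_location(grid, loc))

# Left half of the map is known free (p=0), right half is unknown (p=0.5):
grid = [[0.0] * 3 + [0.5] * 3 for _ in range(6)]
print(choose_best(grid, [(1, 1), (1, 4)]))  # → (1, 4), toward the unknown region
```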
[Figure: multi-core hardware for MI computation: cores 1 to N read the occupancy map in parallel from memory banks (Bank 0 to Bank 7); the access pattern is staggered across banks so the cores avoid bank conflicts.]
[Figure: eye-movement measurement: Phantom v25-11 high-speed camera and SR EYELINK 1000 PLUS eye tracker; Reulen et al., Med. & Biol. Eng. & Comp., 1988; histogram (count) of an eye-movement feature.]
Overview Paper: V. Sze, Y.-H. Chen, T.-J. Yang, J. Emer, "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," Proceedings of the IEEE, Dec. 2017
For updates, join the EEMS mailing list. More info: Tutorial on DNN Architectures, http://eyeriss.mit.edu/tutorial.html

Additional Resources
YouTube Channel: EEMS Group – PI: Vivienne Sze
Collaborators: Joel Emer, Sertac Karaman, Thomas Heldt
Research conducted in the MIT Energy-Efficient Multimedia Systems Group would not be possible without the support of the following organizations: