2020_01_15_vivienne_sze_efficient_computing

The document discusses the challenges and advancements in efficient computing for deep learning, AI, and robotics, highlighting the exponential increase in compute demands from models like AlexNet to AlphaGo Zero. It emphasizes the need for specialized hardware due to the limitations of existing processors and the inefficiencies in data movement and memory access. The presentation also covers the architecture of deep neural networks, their operational complexities, and the importance of optimizing hardware to improve performance and energy efficiency.


1 Efficient Computing for Deep Learning, AI and Robotics
Vivienne Sze ( @eems_mit)
Massachusetts Institute of Technology

In collaboration with Luca Carlone, Yu-Hsin Chen, Joel Emer, Sertac Karaman, Tushar Krishna,
Thomas Heldt, Trevor Henderson, Hsin-Yu Lai, Peter Li, Fangchang Ma, James Noraky, Gladynel
Saavedra Peña, Charlie Sodini, Amr Suleiman, Nellie Wu, Diana Wofk, Tien-Ju Yang, Zhengdong Zhang

Slides available at
https://tinyurl.com/SzeMITDL2020

Vivienne Sze ( @eems_mit)


2 Compute Demands for Deep Neural Networks
AlexNet to AlphaGo Zero: A 300,000x Increase in Compute
[Chart: Petaflop/s-days (exponential) vs. Year]
Source: Open AI (https://openai.com/blog/ai-and-compute/)

Vivienne Sze ( @eems_mit)


3 Compute Demands for Deep Neural Networks

[Strubell, ACL 2019]

Vivienne Sze ( @eems_mit)


4 Processing at “Edge” instead of the “Cloud”

Communication Privacy Latency

Vivienne Sze ( @eems_mit)


5 Computing Challenge for Self-Driving Cars

(Feb 2018)

• Cameras and radar generate ~6 gigabytes of data every 30 seconds.
• Self-driving car prototypes use approximately 2,500 Watts of computing power.
• Generates wasted heat and some prototypes need water-cooling!

Vivienne Sze ( @eems_mit)


6 Existing Processors Consume Too Much Power

< 1 Watt > 10 Watts

Vivienne Sze ( @eems_mit)


7 Transistors are NOT Getting More Efficient
Slowdown of Moore’s Law and Dennard Scaling
General purpose microprocessors not getting faster or more efficient

• Need specialized hardware for significant improvement in speed and energy efficiency
• Redesign computing hardware from the ground up!

Vivienne Sze ( @eems_mit)


8 Popularity of Specialized Hardware for DNNs

Big Bets On A.I. Open a New Frontier for Chips Start-Ups, Too. (January 14, 2018)

“Today, at least 45 start-ups are working on chips that can power tasks like speech and self-driving cars, and at least five of them have raised more than $100 million from investors. Venture capitalists invested more than $1.5 billion in chip start-ups last year, nearly doubling the investments made two years ago, according to the research firm CB Insights.”

Vivienne Sze ( @eems_mit)


9 Power Dominated by Data Movement
Operation              Energy (pJ)   Area (µm2)
8b Add                 0.03          36
16b Add                0.05          67
32b Add                0.1           137
16b FP Add             0.4           1360
32b FP Add             0.9           4184
8b Mult                0.2           282
32b Mult               3.1           3495
16b FP Mult            1.1           1640
32b FP Mult            3.7           7700
32b SRAM Read (8KB)    5             N/A
32b DRAM Read          640           N/A
(Relative energy cost spans roughly four orders of magnitude; relative area cost spans roughly three.)

Memory access consumes orders of magnitude more energy than compute
Vivienne Sze ( @eems_mit) [Horowitz, ISSCC 2014]
10 Autonomous Navigation Uses a Lot of Data
• Semantic Understanding
  - High frame rate
  - Large resolutions
  - Data expansion (2 million pixels → 10x-100x more pixels)

• Geometric Understanding
  - Growing map size

Vivienne Sze ( @eems_mit) [Pire, RAS 2017]


11 Understanding the Environment
Depth Estimation

Semantic Segmentation
State-of-the-art approaches use Deep Neural Networks, which require up to several hundred million operations and weights to compute!
>100x more complex than video compression

Vivienne Sze ( @eems_mit)


12 Deep Neural Networks
Deep Neural Networks (DNNs) have become a cornerstone of AI
Computer Vision Speech Recognition

Game Play Medical

Vivienne Sze ( @eems_mit)


13 What Are Deep Neural Networks?

Low Level Features High Level Features

Input: Output:
Image “Volvo XC90”

Modified Image Source: [Lee, CACM 2011]

Vivienne Sze ( @eems_mit)


14 Weighted Sum
Nonlinear activation function applied to a weighted sum:

Y_j = Activation( Σ_{i=1}^{3} W_ij × X_i )

Example activation functions:
  Sigmoid: y = 1/(1+e^-x)
  Rectified Linear Unit (ReLU): y = max(0,x)

[Network diagram: Input Layer (X1, X2, X3) → Hidden Layer → Output Layer (Y1 … Y4), with weights W11 … W34]
Image source: Caffe tutorial

Key operation is multiply and accumulate (MAC)


Accounts for > 90% of computation
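A minimal sketch (plain Python, names chosen here for illustration) of the weighted-sum-plus-activation computed by each neuron, showing that the inner work is a chain of MACs followed by a nonlinearity:

```python
import math

def relu(x):
    # Rectified Linear Unit: y = max(0, x)
    return max(0.0, x)

def sigmoid(x):
    # Sigmoid: y = 1 / (1 + e^-x)
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(weights, inputs, activation=relu):
    # Weighted sum: the multiply-accumulate (MAC) loop dominates the work
    acc = 0.0
    for w, x in zip(weights, inputs):
        acc += w * x          # one MAC
    return activation(acc)

# Example: Y_j = Activation(W_1j*X1 + W_2j*X2 + W_3j*X3)
print(neuron_output([0.5, -0.2, 0.1], [1.0, 2.0, 3.0]))  # 0.4
```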

Vivienne Sze ( @eems_mit)


15 Popular Types of Layers in DNNs
• Fully Connected Layer
  – Feed forward, fully connected
  – Multilayer Perceptron (MLP)

• Convolutional Layer
  – Feed forward, sparsely-connected w/ weight sharing
  – Convolutional Neural Network (CNN)
  – Typically used for images

• Recurrent Layer
  – Feedback
  – Recurrent Neural Network (RNN)
  – Typically used for sequential data (e.g., speech, language)

• Attention Layer/Mechanism
  – Attention (matrix multiply) + feed forward, fully connected
  – Transformer [Vaswani, NeurIPS 2017]

Vivienne Sze ( @eems_mit)


16 High-Dimensional Convolution in CNN
[Figure: a plane of input activations, a.k.a. input feature map (fmap), of size H×W, and a filter (weights) of size R×S]

Vivienne Sze ( @eems_mit)


17 High-Dimensional Convolution in CNN

[Figure: each output activation of the output fmap is produced by element-wise multiplication of the R×S filter (weights) with a region of the H×W input fmap, followed by partial sum (psum) accumulation]

Vivienne Sze ( @eems_mit)


18 High-Dimensional Convolution in CNN

[Figure: the filter slides across the input fmap (sliding window processing) to produce the E×F output fmap]

Vivienne Sze ( @eems_mit)


19 High-Dimensional Convolution in CNN
[Figure: many input channels (C) — both the filter (R×S×C) and the input fmap (H×W×C) have C channels]

AlexNet: 3 – 192 Channels (C)


Vivienne Sze ( @eems_mit)
20 High-Dimensional Convolution in CNN

[Figure: many filters (M), each R×S×C, applied to the input fmap produce an output fmap with many output channels (M)]

AlexNet: 96 – 384 Filters (M)


Vivienne Sze ( @eems_mit)
21 High-Dimensional Convolution in CNN
[Figure: many input fmaps (N) processed as a batch produce many output fmaps (N)]

Image batch size: 1 – 256 (N)

Vivienne Sze ( @eems_mit)
22 Define Shape for Each Layer

[Figure: filters (M of them, each R×S×C), input fmaps (N of them, each H×W×C), output fmaps (N of them, each E×F×M)]

H – Height of input fmap (activations)
W – Width of input fmap (activations)
C – Number of 2-D input fmaps / filters (channels)
R – Height of 2-D filter (weights)
S – Width of 2-D filter (weights)
M – Number of 2-D output fmaps (channels)
E – Height of output fmap (activations)
F – Width of output fmap (activations)
N – Number of input fmaps / output fmaps (batch size)

Shape varies across layers
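A minimal sketch (plain Python with NumPy, names chosen here for illustration) of the CONV layer as a nest of loops over the shape parameters above; it assumes stride 1 and no padding, so E = H − R + 1 and F = W − S + 1:

```python
import numpy as np

def conv_layer(ifmaps, filters):
    """Naive CONV layer. ifmaps: (N, C, H, W), filters: (M, C, R, S)."""
    N, C, H, W = ifmaps.shape
    M, _, R, S = filters.shape
    E, F = H - R + 1, W - S + 1          # output size, stride 1, no padding
    ofmaps = np.zeros((N, M, E, F))
    for n in range(N):                   # batch
        for m in range(M):               # output channels (filters)
            for e in range(E):           # output height
                for f in range(F):       # output width
                    for c in range(C):   # input channels
                        for r in range(R):      # filter height
                            for s in range(S):  # filter width
                                ofmaps[n, m, e, f] += (
                                    ifmaps[n, c, e + r, f + s] * filters[m, c, r, s]
                                )  # one MAC
    return ofmaps

# Tiny example: N=1, C=3, H=W=8, M=2, R=S=3 -> E=F=6
out = conv_layer(np.random.rand(1, 3, 8, 8), np.random.rand(2, 3, 3, 3))
print(out.shape)  # (1, 2, 6, 6)
```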

Vivienne Sze ( @eems_mit)


23 Layers with Varying Shapes
MobileNetV3-Large Convolutional Layer Configurations

Block Filter Size (RxS) # Filters (M) # Channels (C)


1 3x3 16 3


3 1x1 64 16
3 3x3 64 1
3 1x1 24 64


6 1x1 120 40
6 5x5 120 1
6 1x1 40 120

[Howard, ICCV 2019]


Vivienne Sze ( @eems_mit)
24 Popular DNN Models
Metrics LeNet-5 AlexNet VGG-16 GoogLeNet ResNet-50 EfficientNet-B4
(v1)
Top-5 error n/a 16.4 7.4 6.7 5.3 3.7*
(ImageNet)
Input Size 28x28 227x227 224x224 224x224 224x224 380x380
# of CONV Layers 2 5 16 21 (depth) 49 96
# of Weights 2.6k 2.3M 14.7M 6.0M 23.5M 14M
# of MACs 283k 666M 15.3G 1.43G 3.86G 4.4G
# of FC layers 2 3 3 1 1 65**
# of Weights 58k 58.6M 124M 1M 2M 4.9M
# of MACs 58k 58.6M 124M 1M 2M 4.9M
Total Weights 60k 61M 138M 7M 25.5M 19M
Total MACs 341k 724M 15.5G 1.43G 3.9G 4.4G
Reference Lecun, Krizhevsky, Simonyan, Szegedy, He, Tan,
PIEEE 1998 NeurIPS 2012 ICLR 2015 CVPR 2015 CVPR 2016 ICML 2019

DNN models getting larger and deeper


* Does not include multi-crop and ensemble
** Increase in FC layers due to squeeze-and-excitation layers (much smaller than FC layers for classification)

Vivienne Sze ( @eems_mit)


25 Efficient Hardware Acceleration for Deep Neural Networks

Vivienne Sze ( @eems_mit)


26 Properties We Can Leverage
• Operations exhibit high parallelism
  → high throughput possible

• Memory access is the bottleneck
  Memory Read (filter weight, image pixel, partial sum from DRAM) → MAC* → Memory Write (updated partial sum to DRAM)
  A DRAM access costs ~200x the energy of a MAC (1x)
  * multiply-and-accumulate

Worst Case: all memory R/W are DRAM accesses

• Example: AlexNet has 724M MACs
  → 2896M DRAM accesses required
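A quick back-of-the-envelope check of that worst case (a sketch; the four accesses per MAC follow from the read/write pattern shown above):

```python
macs = 724e6                      # AlexNet MACs (CONV + FC)
accesses_per_mac = 4              # filter weight + pixel + psum read + psum write
dram_accesses = macs * accesses_per_mac
print(f"{dram_accesses / 1e6:.0f}M DRAM accesses")   # ~2896M
```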
Vivienne Sze ( @eems_mit)
27 Properties We Can Leverage
• Operations exhibit high parallelism
  → high throughput possible

• Input data reuse opportunities (up to 500x)
  – Convolutional Reuse (Activations, Weights): CONV layers only (sliding window)
  – Fmap Reuse (Activations): CONV and FC layers
  – Filter Reuse (Weights): CONV and FC layers (batch size > 1)
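A small sketch (illustrative shape values, plain Python) of how those reuse factors arise from the layer shape — each activation can feed R×S positions of M filters, and each weight can be reused across the E×F output positions of N inputs:

```python
# Example CONV layer shape (illustrative values, not from a specific network)
R, S, M = 3, 3, 64      # filter height/width, number of filters
E, F, N = 56, 56, 4     # output height/width, batch size

activation_reuse = R * S * M   # convolutional reuse (R*S) x fmap reuse across M filters
weight_reuse     = E * F * N   # convolutional reuse (E*F) x filter reuse across the batch
print(activation_reuse, weight_reuse)   # 576 12544
```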

Vivienne Sze ( @eems_mit)


28 Exploit Data Reuse at Low-Cost Memories
Specialized hardware with small (< 1 kB) low-cost memories near compute: DRAM → Global Buffer → PE array (inter-PE NoC) → Register File (RF) → ALU, where data is fetched to run a MAC.

Normalized Energy Cost*
ALU                               1× (Reference)
RF (0.5 – 1.0 kB) → ALU           1×
PE (NoC: 200 – 1000 PEs) → ALU    2×
Buffer (100 – 500 kB) → ALU       6×
DRAM → ALU                        200×
* measured from a commercial 65nm process

Farther and larger memories consume more power

Vivienne Sze ( @eems_mit)
29 Weight Stationary (WS)
[Figure: Global Buffer supplies activations and psums; weights W0 … W7 stay stationary in each PE]

• Minimize weight read energy consumption
  − maximize convolutional and filter reuse of weights

• Broadcast activations and accumulate partial sums spatially across the PE array

• Examples: TPU [Jouppi, ISCA 2017], NVDLA
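A loop-nest sketch (plain Python, simplified to a 1-D single-channel convolution, names invented here) of the weight-stationary idea: each weight is loaded once into a PE's register and reused across all output positions before the next weight is fetched:

```python
def weight_stationary_1d(weights, inputs):
    """1-D convolution with a weight-stationary loop order (stride 1)."""
    S, W = len(weights), len(inputs)
    F = W - S + 1
    psums = [0.0] * F
    for s in range(S):            # outer loop over weights
        w = weights[s]            # weight stays "stationary" (fetched once)
        for f in range(F):        # reuse w across every output position
            psums[f] += w * inputs[f + s]   # activations broadcast, psums accumulated
    return psums

print(weight_stationary_1d([1, 0, -1], [3, 5, 2, 7, 1]))  # [1.0, -2.0, 1.0]
```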


Vivienne Sze ( @eems_mit) [Chen, ISCA 2016]
30 Output Stationary (OS)
[Figure: Global Buffer supplies activations and weights; partial sums P0 … P7 stay stationary in each PE]

• Minimize partial sum R/W energy consumption
  − maximize local accumulation

• Broadcast/multicast filter weights and reuse activations spatially across the PE array

• Examples: [Moons, VLSI 2016], [Thinker, VLSI 2017]


Vivienne Sze ( @eems_mit) [Chen, ISCA 2016]
31 Input Stationary (IS)
[Figure: Global Buffer supplies weights and psums; input activations I0 … I7 stay stationary in each PE]

• Minimize activation read energy consumption
  − maximize convolutional and fmap reuse of activations

• Unicast weights and accumulate partial sums spatially across the PE array

• Example: [SCNN, ISCA 2017]

Vivienne Sze ( @eems_mit) [Chen, ISCA 2016]


32 Row Stationary Dataflow
[Figure: PE 1 holds Row 1 of the filter and computes Row 1 of the filter * Row 1 of the fmap]

• Maximize row convolutional reuse in RF
  − Keep a filter row and fmap sliding window in RF
• Maximize row psum accumulation in RF

Vivienne Sze ( @eems_mit) [Chen, ISCA 2016] Selected for Micro Top Picks
33 Row Stationary Dataflow
Row 1 Row 2 Row 3

PE 1 PE 4 PE 7
Row 1 * Row 1 Row 1 * Row 2 Row 1 * Row 3

PE 2 PE 5 PE 8
Row 2 * Row 2 Row 2 * Row 3 Row 2 * Row 4

PE 3 PE 6 PE 9
Row 3 * Row 3 Row 3 * Row 4 Row 3 * Row 5

Optimize for overall energy efficiency instead of for only a certain data type

Vivienne Sze ( @eems_mit) [Chen, ISCA 2016] Selected for Micro Top Picks
34 Dataflow Comparison: CONV Layers
[Chart: Normalized Energy/MAC (broken down into psums, weights, and pixels) for CNN dataflows WS, OSA, OSB, OSC, NLR, and RS]

RS optimizes for the best overall energy efficiency


Vivienne Sze ( @eems_mit) [Chen, ISCA 2016]
35 Exploit Sparsity
Method 1. Skip memory access and computation
Zero Data Skipping: gate the register file and datapath (no R/W, no switching) when the data is zero → 45% power reduction

Method 2. Compress data to reduce storage and data movement


[Chart: DRAM access (MB) for AlexNet CONV layers 1–5, uncompressed vs. RLE-compressed fmaps + weights; compression reduces DRAM access by 1.2× – 1.9× per layer]
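A minimal sketch (plain Python, illustrative only, not Eyeriss's exact encoding) of run-length encoding for sparse activations — runs of zeros are stored as counts so that zero values need not be moved or stored explicitly:

```python
def rle_encode(values):
    """Encode a sequence as (zero_run_length, nonzero_value) pairs."""
    encoded, run = [], 0
    for v in values:
        if v == 0:
            run += 1
        else:
            encoded.append((run, v))
            run = 0
    if run:
        encoded.append((run, 0))       # trailing zeros
    return encoded

acts = [0, 0, 3, 0, 0, 0, 7, 1, 0, 0]
print(rle_encode(acts))   # [(2, 3), (3, 7), (0, 1), (2, 0)]
```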
Vivienne Sze ( @eems_mit) [Chen, ISSCC 2016]
36 Eyeriss: Deep Neural Network Accelerator
[Die photo: 4mm × 4mm chip with a spatial PE array and on-chip buffer]
[Chen, ISSCC 2016]
Exploits data reuse for 100x reduction in memory accesses from global
buffer and 1400x reduction in memory accesses from off-chip DRAM
Overall >10x energy reduction compared to a mobile GPU (Nvidia TK1)
Results for AlexNet
Eyeriss Project Website: http://eyeriss.mit.edu
Vivienne Sze ( @eems_mit) [Joint work with Joel Emer]
37 Features: Energy vs. Accuracy
[Chart: Energy/Pixel (nJ, exponential scale 0.1–10000) vs. Accuracy (Average Precision on the VOC 2007 dataset), measured in 65nm*: HOG1 sits near the energy level of video compression, while AlexNet2 and VGG162 are orders of magnitude higher]

1 [Suleiman, VLSI 2016]   2 [Chen, ISSCC 2016]
1. DPM v5 [Girshick, 2012]   2. Fast R-CNN [Girshick, CVPR 2015]
* Only feature extraction. Does not include data, classification energy, augmentation and ensemble, etc.

Vivienne Sze ( @eems_mit) [Suleiman*, Chen*, ISCAS 2017]


38 Energy-Efficient Processing of DNNs
A significant amount of algorithm and hardware research
on energy-efficient processing of DNNs

V. Sze, Y.-H. Chen, T.-J. Yang, J. Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” Proceedings of the IEEE, Dec. 2017

Book Coming Spring 2020!
http://eyeriss.mit.edu/tutorial.html

We identified various limitations to existing approaches

Vivienne Sze ( @eems_mit)


39 Design of Efficient DNN Algorithms
• Popular efficient DNN algorithm approaches
Network Pruning: remove synapses and neurons (before pruning → after pruning)

Efficient Network Architectures: e.g., decompose an R×S×C filter into an R×S×1 depth-wise filter and 1×1×C point-wise filters (Examples: SqueezeNet, MobileNet)

... also reduced precision

• Focus on reducing number of MACs and weights


• Does it translate to energy savings and reduced latency?

Vivienne Sze ( @eems_mit) [Chen*, Yang*, SysML 2018]


40 Data Movement is Expensive

[Figure: DRAM → Global Buffer → PE array → RF → ALU; fetch data to run a MAC here]

Normalized Energy Cost*
ALU                               1× (Reference)
RF (0.5 – 1.0 kB) → ALU           1×
PE (NoC: 200 – 1000 PEs) → ALU    2×
Buffer (100 – 500 kB) → ALU       6×
DRAM → ALU                        200×
* measured from a commercial 65nm process

Energy of weight depends on memory hierarchy and dataflow

Vivienne Sze ( @eems_mit)
41 Energy-Evaluation Methodology

Inputs: DNN Shape Configuration (# of channels, # of filters, etc.); Hardware Energy Costs of each MAC and Memory Access; DNN Weights and Input Data (e.g., [0.3, 0, -0.4, 0.7, 0, 0, 0.1, …])

Memory Access Optimization → # of accesses at each memory level (L1, L2, …, Ln) → Edata
# of MACs Calculation → # of MACs → Ecomp
Output: DNN Energy Consumption = Edata + Ecomp

Tool available at: https://energyestimation.mit.edu/
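A simplified sketch of this bookkeeping (plain Python; the per-access costs reuse the normalized values from the memory-hierarchy slide, and the access counts are made-up inputs, not output of the actual tool):

```python
# Normalized energy per access, relative to one MAC (65nm numbers from the earlier slide)
COST = {"RF": 1.0, "NoC": 2.0, "Buffer": 6.0, "DRAM": 200.0}
MAC_COST = 1.0

def dnn_energy(num_macs, accesses):
    """accesses: dict mapping memory level -> number of accesses."""
    e_comp = num_macs * MAC_COST
    e_data = sum(COST[level] * n for level, n in accesses.items())
    return e_comp + e_data

# Made-up per-layer counts purely for illustration
energy = dnn_energy(
    num_macs=1_000_000,
    accesses={"RF": 3_000_000, "Buffer": 200_000, "DRAM": 50_000},
)
print(f"{energy:.2e} (normalized MAC-energy units)")
```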

Vivienne Sze ( @eems_mit) [Yang, CVPR 2017]


42 Key Observations

• Number of weights alone is not a good metric for energy


• All data types should be considered
Energy consumption breakdown of GoogLeNet: Computation 10%, Input Feature Map 25%, Weights 22%, Output Feature Map 43%

Vivienne Sze ( @eems_mit) [Yang, CVPR 2017]


43 Energy-Aware Pruning
Directly target energy and incorporate it into the optimization of DNNs to provide greater energy savings

• Sort layers based on energy and prune the layers that consume the most energy first
• EAP reduces AlexNet energy by 3.7x and outperforms the previous work that uses magnitude-based pruning by 1.7x

[Chart: Normalized Energy for AlexNet (×10⁹): Original vs. Magnitude-Based Pruning (DC, 2.1x reduction) vs. Energy-Aware Pruning (3.7x reduction)]

Pruned models available at http://eyeriss.mit.edu/energy.html
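A schematic sketch of the layer-ordering idea (plain Python; layer_energy, prune_layer, and accuracy_ok are hypothetical stand-ins for the energy-estimation tool and real pruning/fine-tuning steps, so this only illustrates the "prune the most energy-hungry layer first" policy):

```python
def energy_aware_prune(layers, layer_energy, prune_layer, accuracy_ok):
    """Prune layers in order of decreasing energy while accuracy stays acceptable.

    layer_energy(layer)  -> estimated energy of the layer (e.g., from the tool above)
    prune_layer(layer)   -> a pruned copy of the layer
    accuracy_ok(layers)  -> True if the pruned network still meets the accuracy target
    """
    # Sort layer indices so the most energy-hungry layer is pruned first
    order = sorted(range(len(layers)),
                   key=lambda i: layer_energy(layers[i]), reverse=True)
    for i in order:
        candidate = list(layers)
        candidate[i] = prune_layer(layers[i])
        if accuracy_ok(candidate):      # keep the pruning step only if accuracy holds
            layers = candidate
    return layers
```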

Vivienne Sze ( @eems_mit) [Yang, CVPR 2017]


44 # of Operations vs. Latency
• # of operations (MACs) does not approximate latency well

Source: Google (https://ai.googleblog.com/2018/04/introducing-cvpr-2018-on-device-visual.html)

Vivienne Sze ( @eems_mit)


45 NetAdapt: Platform-Aware DNN Adaptation
• Automatically adapt DNN to a mobile platform to reach a
target latency or energy budget
• Use empirical measurements to guide optimization (avoid
modeling of tool chain or platform architecture)
[Diagram: a Pretrained Network and a Budget (e.g., Latency 3.8, Energy 10.5) are fed to NetAdapt, which generates network proposals A … Z, takes empirical measurements of each on the target Platform (e.g., Latency 15.6 … 14.3, Energy 41 … 46), and outputs the Adapted Network]
Code available at http://netadapt.mit.edu [Yang, ECCV 2018]
Vivienne Sze ( @eems_mit) In collaboration with Google’s Mobile Vision Team
46 Simplified Example of One Iteration
1. Input: the network from the previous iteration (latency: 100ms) and the new budget (80ms)
2. Meet Budget: simplify one layer at a time (e.g., Layer 1 … Layer 4) until each candidate meets the 80ms budget (100ms → 90ms → 80ms)
3. Maximize Accuracy: among the candidates, select the one with the highest accuracy (e.g., 60% vs. 40%)
4. Output: the selected network for the next iteration (latency: 80ms; next budget: 60ms)
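A high-level sketch of one such iteration (plain Python; simplify_layer, measure_latency, and finetune_and_eval are hypothetical placeholders for the actual NetAdapt steps, so this shows only the control flow, not the published algorithm's details):

```python
def netadapt_iteration(network, budget_ms, simplify_layer, measure_latency, finetune_and_eval):
    """Generate one proposal per layer that meets the budget, keep the most accurate one."""
    proposals = []
    for layer_idx in range(len(network.layers)):
        candidate = network.copy()
        # Step 2: keep simplifying this layer until the empirical latency meets the budget
        while measure_latency(candidate) > budget_ms:
            candidate = simplify_layer(candidate, layer_idx)
        proposals.append(candidate)
    # Step 3: pick the proposal with the highest (short-term fine-tuned) accuracy
    return max(proposals, key=finetune_and_eval)
```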

Vivienne Sze ( @eems_mit) [Yang, ECCV 2018]


47 Improved Latency vs. Accuracy Tradeoff
• NetAdapt boosts the real inference speed of MobileNet
by up to 1.7x with higher accuracy

+0.3% accuracy
1.7x faster

+0.3% accuracy
1.6x faster

*Tested on the ImageNet dataset and a Google Pixel 1 CPU


Reference:
MobileNet: Howard et al, “Mobilenets: Efficient convolutional neural networks for mobile vision applications”, arXiv 2017
MorphNet: Gordon et al., “Morphnet: Fast & simple resource-constrained structure learning of deep networks”, CVPR 2018

Vivienne Sze ( @eems_mit) [Yang, ECCV 2018]


48 FastDepth: Fast Monocular Depth Estimation
Depth estimation from a single RGB image is desirable due to the relatively low cost and size of monocular cameras.

[Figure: RGB input → depth prediction]

Auto-encoder DNN architecture (dense output): Reduction (similar to classification) → Expansion

Vivienne Sze ( @eems_mit) [Joint work with Sertac Karaman]


49 FastDepth: Fast Monocular Depth Estimation
Apply NetAdapt, compact network design, and depth-wise decomposition to the decoder layers to enable depth estimation at high frame rates on an embedded platform while still maintaining accuracy

[Chart: > 10x speedup; ~40fps on an iPhone]

Configuration: Batch size of one (32-bit float)

Models available at http://fastdepth.mit.edu


Vivienne Sze ( @eems_mit) [Wofk*, Ma*, ICRA 2019]
50 Many Efficient DNN Design Approaches
Network Pruning: remove synapses and neurons (before pruning → after pruning)

Efficient Network Architectures: channel groups (G); replace a convolutional layer with depth-wise and point-wise layers

Reduce Precision: 32-bit float → 8-bit fixed → binary

No guarantee that the DNN algorithm designer will use a given approach.
Need flexible hardware!

Vivienne Sze ( @eems_mit) [Chen*, Yang*, SysML 2018]


51 Existing DNN Architectures
• Specialized DNN hardware often relies on certain properties of the DNN in order to achieve high energy efficiency

• Example: Reduce memory access by amortizing reads across the MAC array

[Figure: the MAC array reuses each weight fetched from the weight memory across multiple activations (weight reuse), and each activation fetched from the activation memory across multiple weights (activation reuse)]

Vivienne Sze ( @eems_mit)


52 Limitation of Existing DNN Architectures
• Example: Reuse and array utilization depends on # of channels,
feature map/batch size
– Not efficient across all network architectures (e.g., compact DNNs)

[Figure: MAC array with one dimension mapped to the number of filters (output channels) and the other to the number of input channels or the feature map/batch size, using spatial or temporal accumulation]

Vivienne Sze ( @eems_mit)


53 Limitation of Existing DNN Architectures
• Example: Reuse and array utilization depends on # of channels,
feature map/batch size
– Not efficient across all network architectures (e.g., compact DNNs)

Example mapping for a depth-wise layer (filter of size R×S with C = 1)

[Figure: MAC array with one dimension mapped to the number of filters (output channels) and the other to the number of input channels or the feature map/batch size, using spatial or temporal accumulation]

Vivienne Sze ( @eems_mit)


54 Limitation of Existing DNN Architectures
• Example: Reuse and array utilization depends on # of channels,
feature map/batch size
– Not efficient across all network architectures (e.g., compact DNNs)
– Less efficient as array scales up in size
– Can be challenging to exploit sparsity

[Figure: MAC array with one dimension mapped to the number of filters (output channels) and the other to the number of input channels or the feature map/batch size, using spatial or temporal accumulation]
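A back-of-the-envelope sketch of why such arrays are underutilized on compact layers (the array size and layer shapes below are illustrative assumptions, not a specific accelerator):

```python
array_rows, array_cols = 16, 16   # assumed MAC array: rows <- input channels, cols <- filters

def utilization(C, M):
    """Fraction of the array occupied by a layer with C input channels and M filters."""
    used = min(C, array_rows) * min(M, array_cols)
    return used / (array_rows * array_cols)

print(utilization(C=64, M=128))   # standard CONV layer: 1.0 (fully utilized)
print(utilization(C=1, M=32))     # depth-wise layer (C = 1): 0.0625 (1/16 of the array)
```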

Vivienne Sze ( @eems_mit)


55 Need Flexible Dataflow
• Use flexible dataflow (Row Stationary) to exploit reuse in any
dimension of DNN to increase energy efficiency and array
utilization

Example: Depth-wise layer


Vivienne Sze ( @eems_mit)
56 Need Flexible NoC for Varying Reuse
• When reuse is available, need multicast to exploit spatial data reuse for energy efficiency and high array utilization
• When reuse is not available, need unicast for high bandwidth (e.g., weights for FC layers, and weights & activations more generally) and for high PE utilization
• An all-to-all network satisfies both but is too expensive and not scalable

[Figure: NoC options from High Bandwidth, Low Spatial Reuse to Low Bandwidth, High Spatial Reuse: Unicast Networks → 1D Systolic Networks → 1D Multicast Networks → Broadcast Network]

Vivienne Sze ( @eems_mit) [Chen, JETCAS 2019]


57 Hierarchical Mesh
[Figure: hierarchical mesh network — GLB clusters, router clusters, and PE clusters; a mesh network connects clusters, and an all-to-all network connects PEs within a cluster. The network can be configured for (b) High Bandwidth, (c) High Reuse, (d) Grouped Multicast, and (e) Interleaved Multicast]


Vivienne Sze ( @eems_mit) [Chen, JETCAS 2019]
58 Eyeriss v2: Balancing Flexibility and Efficiency

Efficiently supports
• Wide range of filter shapes – Large and Compact
• Different Layers – CONV, FC, depth-wise, etc.
• Wide range of sparsity – Dense and Sparse
• Scalable architecture

Speed up over Eyeriss v1 scales with number of PEs:
# of PEs      256     1024    16384
AlexNet       17.9x   71.5x   1086.7x
GoogLeNet     10.4x   37.8x   448.8x
MobileNet     15.7x   57.9x   873.0x

Over an order of magnitude faster and more energy efficient than Eyeriss v1
[Chen, JETCAS 2019]

Vivienne Sze ( @eems_mit) [Joint work with Joel Emer]


59 Looking Beyond the DNN Accelerator for Acceleration

Vivienne Sze ( @eems_mit)


60 Super-Resolution on Mobile Devices
[Figure: transmit low-resolution video for streaming, play back at high resolution]
Transmit low resolution for lower bandwidth; screens are getting larger

Use super-resolution to improve the viewing experience of lower-resolution content (reduce communication bandwidth)

Vivienne Sze ( @eems_mit)


61 FAST: A Framework to Accelerate SuperRes

[Figure: SR algorithm alone vs. SR algorithm + FAST on compressed video → 15x faster, real-time]

A framework that accelerates any SR algorithm by up to 15x when running on compressed videos

Vivienne Sze ( @eems_mit) [Zhang, CVPRW 2017]


62 Free Information in Compressed Videos

[Figure: decoding a compressed video yields not just pixels (video as a stack of pixels) but also block structure and motion compensation information (the representation in the compressed video)]

This representation can help accelerate super-resolution

Vivienne Sze ( @eems_mit) [Zhang, CVPRW 2017]


63 Transfer is Lightweight
[Figure: run SR on a subset of low-res frames and transfer the high-res result to the remaining frames]
Transfer allows SR to run on only a subset of frames

The transfer involves fractional interpolation, bicubic interpolation, and skip flags; its complexity is comparable to bicubic interpolation.

Transfer N frames, accelerate by N
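A schematic sketch of that scheduling idea (plain Python; run_sr and transfer are hypothetical placeholders for the SR network and the motion-compensated transfer from [Zhang, CVPRW 2017], so this only illustrates the "SR every N-th frame" accounting):

```python
def fast_pipeline(frames, N, run_sr, transfer):
    """Run SR on every N-th frame; transfer the result to the frames in between."""
    high_res = []
    anchor_hr = None
    for i, frame in enumerate(frames):
        if i % N == 0:
            anchor_hr = run_sr(frame)               # expensive: SR network
        else:
            anchor_hr = transfer(anchor_hr, frame)  # cheap: ~bicubic-interpolation cost
        high_res.append(anchor_hr)
    return high_res   # SR runs on len(frames)/N frames -> up to ~N x speedup
```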
Vivienne Sze ( @eems_mit) [Zhang, CVPRW 2017]
64 Evaluation: Accelerating SRCNN

Vivienne Sze ( @eems_mit) [Zhang, CVPRW 2017]


65 Visual Evaluation

[Visual comparison: SRCNN vs. FAST + SRCNN vs. bicubic interpolation]

Look beyond the DNN accelerator for opportunities to accelerate


DNN processing (e.g., structure of data and temporal correlation)
Code released at www.rle.mit.edu/eems/fast

Vivienne Sze ( @eems_mit) [Zhang, CVPRW 2017]


66 Beyond Deep Neural Networks

Vivienne Sze ( @eems_mit)


67 Visual-Inertial Localization
Determines location/orientation of robot from images and an IMU (Inertial Measurement Unit)

[Figure: image sequence + IMU → Visual-Inertial Odometry (VIO)* → Localization]

* Subset of a SLAM (Simultaneous Localization And Mapping) algorithm, which additionally performs Mapping

Vivienne Sze ( @eems_mit)


68 Localization at Under 25 mW
Navion: first chip that performs complete Visual-Inertial Odometry
• Front-End for camera (feature detection, tracking, and outlier elimination)
• Front-End for IMU (pre-integration of accelerometer and gyroscope data)
• Back-End optimization of pose graph

Consumes 684× and 1582× less energy than mobile and desktop CPUs, respectively

Navion Project Website: http://navion.mit.edu [Zhang et al., RSS 2017], [Suleiman et al., VLSI 2018]

Vivienne Sze ( @eems_mit) [Joint work with Sertac Karaman]


69 Key Methods to Reduce Data Size
Navion: Fully integrated system – no off-chip processing or storage
[Block diagram: Vision Frontend (VFE) with line buffers for the previous/current and left/right frames, feature detection (FD), undistort & rectify (UR), feature tracking (FT), and sparse stereo (SS); IMU Frontend (IFE) with pre-integration of IMU data; Backend (BE) with graph build, linearization, linear solver (Cholesky and back substitution), marginalization, retract, and Rodrigues operations, plus RANSAC and point-cloud blocks, using a mix of fixed-point and floating-point arithmetic]

• Apply low-cost compression to the frame data
• Exploit sparsity in the graph and linear solver

Use compression and exploit sparsity to reduce memory down to 854kB

Vivienne Sze ( @eems_mit) [Suleiman, VLSI-C 2018] Best Student Paper Award
70 Where to Go Next: Planning and Mapping
Robot Exploration: Decide where to go by computing Shannon Mutual Information

Pipeline: Select candidate scan locations → Compute Shannon MI and choose best location → Move to location and scan → Update occupancy map

[Figures: occupancy map (“Where to scan?”), mutual information map / MI surface, updated occupancy map with planned path; exploration with a mini race car using motion capture for localization]

Vivienne Sze ( @eems_mit) [Zhang, ICRA 2019]


71 Challenge is Data Delivery to All Cores
Process multiple beams in parallel (Core 1 … Core N)

Data delivery from memory is limited: a small number of read ports (e.g., Read Port 1, Read Port 2) must serve all N cores
Vivienne Sze ( @eems_mit)
72 Specialized Memory Architecture
Break up map into separate memory banks and novel storage pattern to
minimize read conflicts when processing different beams in parallel.
[Figure: memory access pattern of a beam vs. the diagonal banking pattern — map cells are assigned to Banks 0–7 along diagonals in (X, Y) so that cells read by different beams in parallel fall into different banks]

[Li, RSS 2019]


Compute the mutual information for an entire map of 20m x 20m at 0.1m resolution in under a second → a 100x speed up versus CPU for 1/10th of the power.
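A minimal sketch of one way such a diagonal banking function can look (plain Python; the modulo-based mapping and bank count here are illustrative assumptions, not the exact pattern from [Li, RSS 2019]):

```python
NUM_BANKS = 8

def bank_of(x, y):
    # Diagonal banking: neighboring cells along a row, column, or diagonal sweep
    # land in different banks, so parallel beam reads rarely collide on one bank.
    return (x + y) % NUM_BANKS

# Cells along one row map to distinct banks: 0..7, then wrap
print([bank_of(x, 0) for x in range(10)])   # [0, 1, 2, 3, 4, 5, 6, 7, 0, 1]
```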

Vivienne Sze ( @eems_mit) [Joint work with Sertac Karaman]


73 Monitoring Neurodegenerative Disorders
Dementia affects 50 million people worldwide today
(75 million in 10 years) [World Alzheimer’s Report]

Mini-Mental State Examination (MMSE):
Q1. What is the year? Season? Date?
Q2. Where are you now? State? Floor?
Q3. Could you count backward from 100 by sevens? (93, 86, …)

Clock-drawing test [Agrell et al., Age and Ageing, 1998]

• Neuropsychological assessments are time consuming and require a trained specialist
• Repeat medical assessments are sparse, mostly qualitative, and suffer from high retest variability
Vivienne Sze ( @eems_mit) [Joint work with Thomas Heldt and Charlie Sodini]
74 Use Eye Movements for Quantitative Evaluation
Eye movements can be used to quantitatively evaluate severity,
progression or regression of neurodegenerative diseases
[Examples of clinical setups: high-speed camera (Phantom v25-11), substantial head support (SR EYELINK 1000 PLUS), IR illumination (Reulen et al., Med. & Biol. Eng. & Comp, 1988)]

Clinical measurements of saccade latency are done in constrained environments that rely on specialized, costly equipment.
Vivienne Sze ( @eems_mit)
75 Measure Eye Movements Using Phone
[Figure: smartphone video of eye movements → eye movement feature extraction; histogram of reaction time (milliseconds) measured with an iPhone 6 (< $1k) vs. a Phantom camera ($100k)]

Develop algorithm to measure eye movement using a consumer-grade camera rather than a high-cost research-grade camera.

Enable low-cost in-home longitudinal measurements.
Vivienne Sze ( @eems_mit) [Saavedra Peña, EMBC 2018] [Lai, ICIP 2018]
76 Looking For Volunteers for Eye Reaction Time

If you are near or on MIT Campus and interested in volunteering your eye movements for this study, please contact us at [email protected]

Vivienne Sze ( @eems_mit)


77 Low Power 3D Time of Flight Imaging
• Pulsed Time of Flight: Measure distance using round trip time of laser light for each image pixel
  – Illumination + Imager Power: 2.5 – 20 W for range from 1 – 8 m

• Use computer vision techniques and passive images to estimate changes in depth without turning on the laser
  – CMOS Imaging Sensor Power: < 350 mW

Estimated Depth Maps: real-time performance on an embedded processor, VGA @ 30 fps on a Cortex-A7 (< 0.5W active power)
Vivienne Sze ( @eems_mit) [Noraky, ICIP 2017]
78 Results of Low Power Depth ToF Imaging

[Figure: RGB image, ground-truth depth map, estimated depth map]

Mean Relative Error: 0.7%
Duty Cycle (on-time of laser): 11%

Vivienne Sze ( @eems_mit) [Noraky, ICIP 2017]


79 Summary

• Efficient computing extends the reach of AI beyond the cloud by reducing communication requirements, enabling privacy, and providing low latency, so that AI can be used in a wide range of applications, from robotics to health care.

• Cross-layer design with specialized hardware enables energy-efficient AI, and will be critical to the progress of AI over the next decade.

Today’s slides available at


https://tinyurl.com/SzeMITDL2020
Vivienne Sze ( @eems_mit)
80 Additional Resources

Overview Paper
V. Sze, Y.-H. Chen, T-J. Yang, J. Emer,
“Efficient Processing of Deep Neural
Networks: A Tutorial and Survey,”
Proceedings of the IEEE, Dec. 2017

Book Coming Spring 2020!

For updates: EEMS Mailing List
More info about Tutorial on DNN Architectures: http://eyeriss.mit.edu/tutorial.html
Vivienne Sze ( @eems_mit)
81 Additional Resources

MIT Professional Education Course on “Designing Efficient Deep Learning Systems”
http://shortprograms.mit.edu/dls
Next Offering: July 20-21, 2020 on MIT Campus
Vivienne Sze ( @eems_mit)
82 Additional Resources
Talks and Tutorial Available Online
https://www.rle.mit.edu/eems/publications/tutorials/

YouTube Channel
EEMS Group – PI: Vivienne Sze

Vivienne Sze ( @eems_mit)


83 Acknowledgements

Joel Emer

Sertac Karaman

Thomas Heldt
Research conducted in the MIT Energy-Efficient Multimedia Systems Group would not
be possible without the support of the following organizations:

Vivienne Sze ( @eems_mit) Mailing List: http://mailman.mit.edu/mailman/listinfo/eems-news


84 References

• Energy-Efficient Hardware for Deep Neural Networks


– Project website: http://eyeriss.mit.edu
– Y.-H. Chen, T. Krishna, J. Emer, V. Sze, “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep
Convolutional Neural Networks,” IEEE Journal of Solid State Circuits (JSSC), ISSCC Special Issue, Vol. 52,
No. 1, pp. 127-138, January 2017.
– Y.-H. Chen, J. Emer, V. Sze, “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for
Convolutional Neural Networks,” International Symposium on Computer Architecture (ISCA), pp. 367-
379, June 2016.
– Y.-H. Chen, T.-J. Yang, J. Emer, V. Sze, “Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural
Networks on Mobile Devices,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems
(JETCAS), June 2019.
– Eyexam: https://arxiv.org/abs/1807.07928

• Limitations of Existing Efficient DNN Approaches


– Y.-H. Chen*, T.-J. Yang*, J. Emer, V. Sze, “Understanding the Limitations of Existing Energy-Efficient
Design Approaches for Deep Neural Networks,” SysML Conference, February 2018.
– V. Sze, Y.-H. Chen, T.-J. Yang, J. Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and
Survey,” Proceedings of the IEEE, vol. 105, no. 12, pp. 2295-2329, December 2017.
– Hardware Architecture for Deep Neural Networks: http://eyeriss.mit.edu/tutorial.html

Vivienne Sze ( @eems_mit)


85 References

• Co-Design of Algorithms and Hardware for Deep Neural Networks


– T.-J. Yang, Y.-H. Chen, V. Sze, “Designing Energy-Efficient Convolutional Neural Networks using Energy-
Aware Pruning,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
– Energy estimation tool: http://eyeriss.mit.edu/energy.html
– T.-J. Yang, A. Howard, B. Chen, X. Zhang, A. Go, V. Sze, H. Adam, “NetAdapt: Platform-Aware Neural
Network Adaptation for Mobile Applications,” European Conference on Computer Vision (ECCV), 2018.
– D. Wofk*, F. Ma*, T.-J. Yang, S. Karaman, V. Sze, “FastDepth: Fast Monocular Depth Estimation on
Embedded Systems,” IEEE International Conference on Robotics and Automation (ICRA), May 2019.
http://fastdepth.mit.edu/

• Energy-Efficient Visual Inertial Localization


– Project website: http://navion.mit.edu
– A. Suleiman, Z. Zhang, L. Carlone, S. Karaman, V. Sze, “Navion: A Fully Integrated Energy-Efficient
Visual-Inertial Odometry Accelerator for Autonomous Navigation of Nano Drones,” IEEE Symposium on
VLSI Circuits (VLSI-Circuits), June 2018.
– Z. Zhang*, A. Suleiman*, L. Carlone, V. Sze, S. Karaman, “Visual-Inertial Odometry on Chip: An
Algorithm-and-Hardware Co-design Approach,” Robotics: Science and Systems (RSS), July 2017.
– A. Suleiman, Z. Zhang, L. Carlone, S. Karaman, V. Sze, “Navion: A 2mW Fully Integrated Real-Time
Visual-Inertial Odometry Accelerator for Autonomous Navigation of Nano Drones,” IEEE Journal of Solid
State Circuits (JSSC), VLSI Symposia Special Issue, Vol. 54, No. 4, pp. 1106-1119, April 2019.

Vivienne Sze ( @eems_mit)


86 References
• Fast Shannon Mutual Information for Robot Exploration
– Z. Zhang, T. Henderson, V. Sze, S. Karaman, “FSMI: Fast computation of Shannon Mutual Information for
information-theoretic mapping,” IEEE International Conference on Robotics and Automation (ICRA), May 2019.
– P. Li*, Z. Zhang*, S. Karaman, V. Sze, “High-throughput Computation of Shannon Mutual Information on Chip,”
Robotics: Science and Systems (RSS), June 2019.

• Low Power Time of Flight Imaging


– J. Noraky, V. Sze, “Low Power Depth Estimation of Rigid Objects for Time-of-Flight Imaging,” IEEE Transactions
on Circuits and Systems for Video Technology (TCSVT), 2019.
– J. Noraky, V. Sze, “Depth Estimation of Non-Rigid Objects For Time-Of-Flight Imaging,” IEEE International
Conference on Image Processing (ICIP), October 2018.
– J. Noraky, V. Sze, “Low Power Depth Estimation for Time-of-Flight Imaging,” IEEE International Conference on
Image Processing (ICIP), September 2017.

• Monitoring Neurodegenerative Disorders Using a Phone


– H.-Y. Lai, G. Saavedra Peña, C. Sodini, T. Heldt, V. Sze, “Enabling Saccade Latency Measurements with
Consumer-Grade Cameras,” IEEE International Conference on Image Processing (ICIP), October 2018.
– G. Saavedra Peña, H.-Y. Lai, V. Sze, T. Heldt, “Determination of saccade latency distributions using video
recordings from consumer-grade devices,” IEEE International Engineering in Medicine and Biology Conference
(EMBC), 2018.
– H.-Y. Lai, G. Saavedra Peña, C. Sodini, V. Sze, T. Heldt, “Measuring Saccade Latency Using Smartphone
Cameras,” IEEE Journal of Biomedical and Health Informatics (JBHI), March 2020.

Vivienne Sze ( @eems_mit)
