2020_01_15_vivienne_sze_efficient_computing

The document discusses the challenges and advancements in efficient computing for deep learning, AI, and robotics, highlighting the exponential increase in compute demands from models like AlexNet to AlphaGo Zero. It emphasizes the need for specialized hardware due to the limitations of existing processors and the inefficiencies in data movement and memory access. The presentation also covers the architecture of deep neural networks, their operational complexities, and the importance of optimizing hardware to improve performance and energy efficiency.


1 Efficient Computing for Deep Learning, AI and Robotics
Vivienne Sze ( @eems_mit)
Massachusetts Institute of Technology

In collaboration with Luca Carlone, Yu-Hsin Chen, Joel Emer, Sertac Karaman, Tushar Krishna,
Thomas Heldt, Trevor Henderson, Hsin-Yu Lai, Peter Li, Fangchang Ma, James Noraky, Gladynel
Saavedra Peña, Charlie Sodini, Amr Suleiman, Nellie Wu, Diana Wofk, Tien-Ju Yang, Zhengdong Zhang

Slides available at
https://tinyurl.com/SzeMITDL2020

Vivienne Sze ( @eems_mit)


2 Compute Demands for Deep Neural Networks
AlexNet to AlphaGo Zero: A 300,000x Increase in Compute
[Chart: Petaflop/s-days (exponential) vs. Year]
Source: Open AI (https://openai.com/blog/ai-and-compute/)

Vivienne Sze ( @eems_mit)


3 Compute Demands for Deep Neural Networks

[Strubell, ACL 2019]

Vivienne Sze ( @eems_mit)


4 Processing at “Edge” instead of the “Cloud”

Communication Privacy Latency

Vivienne Sze ( @eems_mit)


5 Computing Challenge for Self-Driving Cars

(Feb 2018)

• Cameras and radar generate ~6 gigabytes of data every 30 seconds.
• Self-driving car prototypes use approximately 2,500 Watts of computing power.
• Generates wasted heat and some prototypes need water-cooling!

Vivienne Sze ( @eems_mit)


6 Existing Processors Consume Too Much Power

< 1 Watt > 10 Watts

Vivienne Sze ( @eems_mit)


7 Transistors are NOT Getting More Efficient
Slowdown of Moore’s Law and Dennard Scaling
General purpose microprocessors not getting faster or more efficient

• Need specialized hardware for significant improvement in speed and energy efficiency
• Redesign computing hardware from the ground up!

Vivienne Sze ( @eems_mit)


8 Popularity of Specialized Hardware for DNNs

Big Bets On A.I. Open a New Frontier for Chips Start-Ups, Too. (January 14, 2018)

“Today, at least 45 start-ups are working on chips that can power tasks like speech and self-driving cars, and at least five of them have raised more than $100 million from investors. Venture capitalists invested more than $1.5 billion in chip start-ups last year, nearly doubling the investments made two years ago, according to the research firm CB Insights.”

Vivienne Sze ( @eems_mit)


9 Power Dominated by Data Movement
Operation              Energy (pJ)   Area (µm2)
8b Add                 0.03          36
16b Add                0.05          67
32b Add                0.1           137
16b FP Add             0.4           1360
32b FP Add             0.9           4184
8b Mult                0.2           282
32b Mult               3.1           3495
16b FP Mult            1.1           1640
32b FP Mult            3.7           7700
32b SRAM Read (8KB)    5             N/A
32b DRAM Read          640           N/A
(Relative energy cost spans roughly four orders of magnitude; relative area cost spans roughly three.)

Memory access consumes orders of magnitude more energy than compute
Vivienne Sze ( @eems_mit) [Horowitz, ISSCC 2014]
10 Autonomous Navigation Uses a Lot of Data
• Semantic Understanding
  - High frame rate
  - Large resolutions
  - Data expansion (2 million pixels → 10x-100x more pixels)

• Geometric Understanding
  - Growing map size

Vivienne Sze ( @eems_mit) [Pire, RAS 2017]


11 Understanding the Environment
Depth Estimation

Semantic Segmentation
State-of-the-art approaches use Deep Neural Networks, which require up to several hundred million operations and weights to compute!
>100x more complex than video compression

Vivienne Sze ( @eems_mit)


12 Deep Neural Networks
Deep Neural Networks (DNNs) have become a cornerstone of AI
Computer Vision Speech Recognition

Game Play Medical

Vivienne Sze ( @eems_mit)


13 What Are Deep Neural Networks?

Low Level Features High Level Features

Input: Output:
Image “Volvo XC90”

Modified Image Source: [Lee, CACM 2011]

Vivienne Sze ( @eems_mit)


14 Weighted Sum
Nonlinear activation function applied to a weighted sum:

Y_j = Activation( Σ_{i=1}^{3} W_ij × X_i )

Example activation functions:
  Sigmoid: y = 1/(1+e^-x)
  Rectified Linear Unit (ReLU): y = max(0,x)

[Network diagram: Input Layer (X1, X2, X3) → Hidden Layer → Output Layer (Y1 … Y4), with weights W11 … W34]
Image source: Caffe tutorial

Key operation is multiply and accumulate (MAC)


Accounts for > 90% of computation
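A minimal sketch (plain Python, names chosen here for illustration) of the weighted-sum-plus-activation computed by each neuron, showing that the inner work is a chain of MACs followed by a nonlinearity:

```python
import math

def relu(x):
    # Rectified Linear Unit: y = max(0, x)
    return max(0.0, x)

def sigmoid(x):
    # Sigmoid: y = 1 / (1 + e^-x)
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(weights, inputs, activation=relu):
    # Weighted sum: the multiply-accumulate (MAC) loop dominates the work
    acc = 0.0
    for w, x in zip(weights, inputs):
        acc += w * x          # one MAC
    return activation(acc)

# Example: Y_j = Activation(W_1j*X1 + W_2j*X2 + W_3j*X3)
print(neuron_output([0.5, -0.2, 0.1], [1.0, 2.0, 3.0]))  # 0.4
```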

Vivienne Sze ( @eems_mit)


15 Popular Types of Layers in DNNs
• Fully Connected Layer
  – Feed forward, fully connected
  – Multilayer Perceptron (MLP)

• Convolutional Layer
  – Feed forward, sparsely-connected w/ weight sharing
  – Convolutional Neural Network (CNN)
  – Typically used for images

• Recurrent Layer
  – Feedback
  – Recurrent Neural Network (RNN)
  – Typically used for sequential data (e.g., speech, language)

• Attention Layer/Mechanism
  – Attention (matrix multiply) + feed forward, fully connected
  – Transformer [Vaswani, NeurIPS 2017]

Vivienne Sze ( @eems_mit)


16 High-Dimensional Convolution in CNN
[Figure: a plane of input activations, a.k.a. input feature map (fmap), of size H×W, and a filter (weights) of size R×S]

Vivienne Sze ( @eems_mit)


17 High-Dimensional Convolution in CNN

[Figure: each output activation of the output fmap is produced by element-wise multiplication of the R×S filter (weights) with a region of the H×W input fmap, followed by partial sum (psum) accumulation]

Vivienne Sze ( @eems_mit)


18 High-Dimensional Convolution in CNN

[Figure: the filter slides across the input fmap (sliding window processing) to produce the E×F output fmap]

Vivienne Sze ( @eems_mit)


19 High-Dimensional Convolution in CNN
[Figure: many input channels (C) — both the filter (R×S×C) and the input fmap (H×W×C) have C channels]

AlexNet: 3 – 192 Channels (C)


Vivienne Sze ( @eems_mit)
20 High-Dimensional Convolution in CNN

[Figure: many filters (M), each R×S×C, applied to the input fmap produce an output fmap with many output channels (M)]

AlexNet: 96 – 384 Filters (M)


Vivienne Sze ( @eems_mit)
21 High-Dimensional Convolution in CNN
[Figure: many input fmaps (N) processed as a batch produce many output fmaps (N)]

Image batch size: 1 – 256 (N)

Vivienne Sze ( @eems_mit)
22 Define Shape for Each Layer

[Figure: filters (M of them, each R×S×C), input fmaps (N of them, each H×W×C), output fmaps (N of them, each E×F×M)]

H – Height of input fmap (activations)
W – Width of input fmap (activations)
C – Number of 2-D input fmaps / filters (channels)
R – Height of 2-D filter (weights)
S – Width of 2-D filter (weights)
M – Number of 2-D output fmaps (channels)
E – Height of output fmap (activations)
F – Width of output fmap (activations)
N – Number of input fmaps / output fmaps (batch size)

Shape varies across layers
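A minimal sketch (plain Python with NumPy, names chosen here for illustration) of the CONV layer as a nest of loops over the shape parameters above; it assumes stride 1 and no padding, so E = H − R + 1 and F = W − S + 1:

```python
import numpy as np

def conv_layer(ifmaps, filters):
    """Naive CONV layer. ifmaps: (N, C, H, W), filters: (M, C, R, S)."""
    N, C, H, W = ifmaps.shape
    M, _, R, S = filters.shape
    E, F = H - R + 1, W - S + 1          # output size, stride 1, no padding
    ofmaps = np.zeros((N, M, E, F))
    for n in range(N):                   # batch
        for m in range(M):               # output channels (filters)
            for e in range(E):           # output height
                for f in range(F):       # output width
                    for c in range(C):   # input channels
                        for r in range(R):      # filter height
                            for s in range(S):  # filter width
                                ofmaps[n, m, e, f] += (
                                    ifmaps[n, c, e + r, f + s] * filters[m, c, r, s]
                                )  # one MAC
    return ofmaps

# Tiny example: N=1, C=3, H=W=8, M=2, R=S=3 -> E=F=6
out = conv_layer(np.random.rand(1, 3, 8, 8), np.random.rand(2, 3, 3, 3))
print(out.shape)  # (1, 2, 6, 6)
```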

Vivienne Sze ( @eems_mit)


23 Layers with Varying Shapes
MobileNetV3-Large Convolutional Layer Configurations

Block Filter Size (RxS) # Filters (M) # Channels (C)


1 3x3 16 3


3 1x1 64 16
3 3x3 64 1
3 1x1 24 64


6 1x1 120 40
6 5x5 120 1
6 1x1 40 120

[Howard, ICCV 2019]


Vivienne Sze ( @eems_mit)
24 Popular DNN Models
Metrics LeNet-5 AlexNet VGG-16 GoogLeNet ResNet-50 EfficientNet-B4
(v1)
Top-5 error n/a 16.4 7.4 6.7 5.3 3.7*
(ImageNet)
Input Size 28x28 227x227 224x224 224x224 224x224 380x380
# of CONV Layers 2 5 16 21 (depth) 49 96
# of Weights 2.6k 2.3M 14.7M 6.0M 23.5M 14M
# of MACs 283k 666M 15.3G 1.43G 3.86G 4.4G
# of FC layers 2 3 3 1 1 65**
# of Weights 58k 58.6M 124M 1M 2M 4.9M
# of MACs 58k 58.6M 124M 1M 2M 4.9M
Total Weights 60k 61M 138M 7M 25.5M 19M
Total MACs 341k 724M 15.5G 1.43G 3.9G 4.4G
Reference Lecun, Krizhevsky, Simonyan, Szegedy, He, Tan,
PIEEE 1998 NeurIPS 2012 ICLR 2015 CVPR 2015 CVPR 2016 ICML 2019

DNN models getting larger and deeper


* Does not include multi-crop and ensemble
** Increase in FC layers due to squeeze-and-excitation layers (much smaller than FC layers for classification)

Vivienne Sze ( @eems_mit)


25 Efficient Hardware Acceleration for Deep Neural Networks

Vivienne Sze ( @eems_mit)


26 Properties We Can Leverage
• Operations exhibit high parallelism
  → high throughput possible

• Memory access is the bottleneck
  Memory Read (filter weight, image pixel, partial sum from DRAM) → MAC* → Memory Write (updated partial sum to DRAM)
  A DRAM access costs ~200x the energy of a MAC (1x)
  * multiply-and-accumulate

Worst Case: all memory R/W are DRAM accesses

• Example: AlexNet has 724M MACs
  → 2896M DRAM accesses required
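A quick back-of-the-envelope check of that worst case (a sketch; the four accesses per MAC follow from the read/write pattern shown above):

```python
macs = 724e6                      # AlexNet MACs (CONV + FC)
accesses_per_mac = 4              # filter weight + pixel + psum read + psum write
dram_accesses = macs * accesses_per_mac
print(f"{dram_accesses / 1e6:.0f}M DRAM accesses")   # ~2896M
```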
Vivienne Sze ( @eems_mit)
27 Properties We Can Leverage
• Operations exhibit high parallelism
  → high throughput possible

• Input data reuse opportunities (up to 500x)
  – Convolutional Reuse (Activations, Weights): CONV layers only (sliding window)
  – Fmap Reuse (Activations): CONV and FC layers
  – Filter Reuse (Weights): CONV and FC layers (batch size > 1)
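A small sketch (illustrative shape values, plain Python) of how those reuse factors arise from the layer shape — each activation can feed R×S positions of M filters, and each weight can be reused across the E×F output positions of N inputs:

```python
# Example CONV layer shape (illustrative values, not from a specific network)
R, S, M = 3, 3, 64      # filter height/width, number of filters
E, F, N = 56, 56, 4     # output height/width, batch size

activation_reuse = R * S * M   # convolutional reuse (R*S) x fmap reuse across M filters
weight_reuse     = E * F * N   # convolutional reuse (E*F) x filter reuse across the batch
print(activation_reuse, weight_reuse)   # 576 12544
```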

Vivienne Sze ( @eems_mit)


28 Exploit Data Reuse at Low-Cost Memories
Specialized hardware with small (< 1 kB) low-cost memories near compute: DRAM → Global Buffer → PE array (inter-PE NoC) → Register File (RF) → ALU, where data is fetched to run a MAC.

Normalized Energy Cost*
ALU                               1× (Reference)
RF (0.5 – 1.0 kB) → ALU           1×
PE (NoC: 200 – 1000 PEs) → ALU    2×
Buffer (100 – 500 kB) → ALU       6×
DRAM → ALU                        200×
* measured from a commercial 65nm process

Farther and larger memories consume more power

Vivienne Sze ( @eems_mit)
29 Weight Stationary (WS)
[Figure: Global Buffer supplies activations and psums; weights W0 … W7 stay stationary in each PE]

• Minimize weight read energy consumption
  − maximize convolutional and filter reuse of weights

• Broadcast activations and accumulate partial sums spatially across the PE array

• Examples: TPU [Jouppi, ISCA 2017], NVDLA
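A loop-nest sketch (plain Python, simplified to a 1-D single-channel convolution, names invented here) of the weight-stationary idea: each weight is loaded once into a PE's register and reused across all output positions before the next weight is fetched:

```python
def weight_stationary_1d(weights, inputs):
    """1-D convolution with a weight-stationary loop order (stride 1)."""
    S, W = len(weights), len(inputs)
    F = W - S + 1
    psums = [0.0] * F
    for s in range(S):            # outer loop over weights
        w = weights[s]            # weight stays "stationary" (fetched once)
        for f in range(F):        # reuse w across every output position
            psums[f] += w * inputs[f + s]   # activations broadcast, psums accumulated
    return psums

print(weight_stationary_1d([1, 0, -1], [3, 5, 2, 7, 1]))  # [1.0, -2.0, 1.0]
```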


Vivienne Sze ( @eems_mit) [Chen, ISCA 2016]
30 Output Stationary (OS)
[Figure: Global Buffer supplies activations and weights; partial sums P0 … P7 stay stationary in each PE]

• Minimize partial sum R/W energy consumption
  − maximize local accumulation

• Broadcast/multicast filter weights and reuse activations spatially across the PE array

• Examples: [Moons, VLSI 2016], [Thinker, VLSI 2017]


Vivienne Sze ( @eems_mit) [Chen, ISCA 2016]
31 Input Stationary (IS)
[Figure: Global Buffer supplies weights and psums; input activations I0 … I7 stay stationary in each PE]

• Minimize activation read energy consumption
  − maximize convolutional and fmap reuse of activations

• Unicast weights and accumulate partial sums spatially across the PE array

• Example: [SCNN, ISCA 2017]

Vivienne Sze ( @eems_mit) [Chen, ISCA 2016]


32 Row Stationary Dataflow
[Figure: PE 1 holds Row 1 of the filter and computes Row 1 of the filter * Row 1 of the fmap]

• Maximize row convolutional reuse in RF
  − Keep a filter row and fmap sliding window in RF
• Maximize row psum accumulation in RF

Vivienne Sze ( @eems_mit) [Chen, ISCA 2016] Selected for Micro Top Picks
33 Row Stationary Dataflow
Row 1 Row 2 Row 3

PE 1 PE 4 PE 7
Row 1 * Row 1 Row 1 * Row 2 Row 1 * Row 3

PE 2 PE 5 PE 8
Row 2 * Row 2 Row 2 * Row 3 Row 2 * Row 4

PE 3 PE 6 PE 9
Row 3 * Row 3 Row 3 * Row 4 Row 3 * Row 5

Optimize for overall energy efficiency instead of for only a certain data type

Vivienne Sze ( @eems_mit) [Chen, ISCA 2016] Selected for Micro Top Picks
34 Dataflow Comparison: CONV Layers
[Chart: Normalized Energy/MAC (broken down into psums, weights, and pixels) for CNN dataflows WS, OSA, OSB, OSC, NLR, and RS]

RS optimizes for the best overall energy efficiency


Vivienne Sze ( @eems_mit) [Chen, ISCA 2016]
35 Exploit Sparsity
Method 1. Skip memory access and computation
Zero Data Skipping: gate the register file and datapath (no R/W, no switching) when the data is zero → 45% power reduction

Method 2. Compress data to reduce storage and data movement


[Chart: DRAM access (MB) for AlexNet CONV layers 1–5, uncompressed vs. RLE-compressed fmaps + weights; compression reduces DRAM access by 1.2× – 1.9× per layer]
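A minimal sketch (plain Python, illustrative only, not Eyeriss's exact encoding) of run-length encoding for sparse activations — runs of zeros are stored as counts so that zero values need not be moved or stored explicitly:

```python
def rle_encode(values):
    """Encode a sequence as (zero_run_length, nonzero_value) pairs."""
    encoded, run = [], 0
    for v in values:
        if v == 0:
            run += 1
        else:
            encoded.append((run, v))
            run = 0
    if run:
        encoded.append((run, 0))       # trailing zeros
    return encoded

acts = [0, 0, 3, 0, 0, 0, 7, 1, 0, 0]
print(rle_encode(acts))   # [(2, 3), (3, 7), (0, 1), (2, 0)]
```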
Vivienne Sze ( @eems_mit) [Chen, ISSCC 2016]
36 Eyeriss: Deep Neural Network Accelerator
[Die photo: 4mm × 4mm chip with a spatial PE array and on-chip buffer]
[Chen, ISSCC 2016]
Exploits data reuse for 100x reduction in memory accesses from global
buffer and 1400x reduction in memory accesses from off-chip DRAM
Overall >10x energy reduction compared to a mobile GPU (Nvidia TK1)
Results for AlexNet
Eyeriss Project Website: http://eyeriss.mit.edu
Vivienne Sze ( @eems_mit) [Joint work with Joel Emer]
37 Features: Energy vs. Accuracy
[Chart: Energy/Pixel (nJ, exponential scale 0.1–10000) vs. Accuracy (Average Precision on the VOC 2007 dataset), measured in 65nm*: HOG1 sits near the energy level of video compression, while AlexNet2 and VGG162 are orders of magnitude higher]

1 [Suleiman, VLSI 2016]   2 [Chen, ISSCC 2016]
1. DPM v5 [Girshick, 2012]   2. Fast R-CNN [Girshick, CVPR 2015]
* Only feature extraction. Does not include data, classification energy, augmentation and ensemble, etc.

Vivienne Sze ( @eems_mit) [Suleiman*, Chen*, ISCAS 2017]


38 Energy-Efficient Processing of DNNs
A significant amount of algorithm and hardware research
on energy-efficient processing of DNNs

V. Sze, Y.-H. Chen, T.-J. Yang, J. Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” Proceedings of the IEEE, Dec. 2017

Book Coming Spring 2020!
http://eyeriss.mit.edu/tutorial.html

We identified various limitations to existing approaches

Vivienne Sze ( @eems_mit)


39 Design of Efficient DNN Algorithms
• Popular efficient DNN algorithm approaches
Network Pruning: remove synapses and neurons (before pruning → after pruning)

Efficient Network Architectures: e.g., decompose an R×S×C filter into an R×S×1 depth-wise filter and 1×1×C point-wise filters (Examples: SqueezeNet, MobileNet)

... also reduced precision

• Focus on reducing number of MACs and weights


• Does it translate to energy savings and reduced latency?

Vivienne Sze ( @eems_mit) [Chen*, Yang*, SysML 2018]


40 Data Movement is Expensive

[Figure: DRAM → Global Buffer → PE array → RF → ALU; fetch data to run a MAC here]

Normalized Energy Cost*
ALU                               1× (Reference)
RF (0.5 – 1.0 kB) → ALU           1×
PE (NoC: 200 – 1000 PEs) → ALU    2×
Buffer (100 – 500 kB) → ALU       6×
DRAM → ALU                        200×
* measured from a commercial 65nm process

Energy of weight depends on memory hierarchy and dataflow

Vivienne Sze ( @eems_mit)
41 Energy-Evaluation Methodology

Inputs: DNN Shape Configuration (# of channels, # of filters, etc.); Hardware Energy Costs of each MAC and Memory Access; DNN Weights and Input Data (e.g., [0.3, 0, -0.4, 0.7, 0, 0, 0.1, …])

Memory Access Optimization → # of accesses at each memory level (L1, L2, …, Ln) → Edata
# of MACs Calculation → # of MACs → Ecomp
Output: DNN Energy Consumption = Edata + Ecomp

Tool available at: https://energyestimation.mit.edu/
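A simplified sketch of this bookkeeping (plain Python; the per-access costs reuse the normalized values from the memory-hierarchy slide, and the access counts are made-up inputs, not output of the actual tool):

```python
# Normalized energy per access, relative to one MAC (65nm numbers from the earlier slide)
COST = {"RF": 1.0, "NoC": 2.0, "Buffer": 6.0, "DRAM": 200.0}
MAC_COST = 1.0

def dnn_energy(num_macs, accesses):
    """accesses: dict mapping memory level -> number of accesses."""
    e_comp = num_macs * MAC_COST
    e_data = sum(COST[level] * n for level, n in accesses.items())
    return e_comp + e_data

# Made-up per-layer counts purely for illustration
energy = dnn_energy(
    num_macs=1_000_000,
    accesses={"RF": 3_000_000, "Buffer": 200_000, "DRAM": 50_000},
)
print(f"{energy:.2e} (normalized MAC-energy units)")
```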

Vivienne Sze ( @eems_mit) [Yang, CVPR 2017]


42 Key Observations

• Number of weights alone is not a good metric for energy


• All data types should be considered
Energy consumption breakdown of GoogLeNet: Computation 10%, Input Feature Map 25%, Weights 22%, Output Feature Map 43%

Vivienne Sze ( @eems_mit) [Yang, CVPR 2017]


43 Energy-Aware Pruning
Directly target energy and incorporate it into the optimization of DNNs to provide greater energy savings

• Sort layers based on energy and prune the layers that consume the most energy first
• EAP reduces AlexNet energy by 3.7x and outperforms the previous work that uses magnitude-based pruning by 1.7x

[Chart: Normalized Energy for AlexNet (×10⁹): Original vs. Magnitude-Based Pruning (DC, 2.1x reduction) vs. Energy-Aware Pruning (3.7x reduction)]

Pruned models available at http://eyeriss.mit.edu/energy.html
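A schematic sketch of the layer-ordering idea (plain Python; layer_energy, prune_layer, and accuracy_ok are hypothetical stand-ins for the energy-estimation tool and real pruning/fine-tuning steps, so this only illustrates the "prune the most energy-hungry layer first" policy):

```python
def energy_aware_prune(layers, layer_energy, prune_layer, accuracy_ok):
    """Prune layers in order of decreasing energy while accuracy stays acceptable.

    layer_energy(layer)  -> estimated energy of the layer (e.g., from the tool above)
    prune_layer(layer)   -> a pruned copy of the layer
    accuracy_ok(layers)  -> True if the pruned network still meets the accuracy target
    """
    # Sort layer indices so the most energy-hungry layer is pruned first
    order = sorted(range(len(layers)),
                   key=lambda i: layer_energy(layers[i]), reverse=True)
    for i in order:
        candidate = list(layers)
        candidate[i] = prune_layer(layers[i])
        if accuracy_ok(candidate):      # keep the pruning step only if accuracy holds
            layers = candidate
    return layers
```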

Vivienne Sze ( @eems_mit) [Yang, CVPR 2017]


44 # of Operations vs. Latency
• # of operations (MACs) does not approximate latency well

Source: Google (https://ai.googleblog.com/2018/04/introducing-cvpr-2018-on-device-visual.html)

Vivienne Sze ( @eems_mit)


45 NetAdapt: Platform-Aware DNN Adaptation
• Automatically adapt DNN to a mobile platform to reach a
target latency or energy budget
• Use empirical measurements to guide optimization (avoid
modeling of tool chain or platform architecture)
[Diagram: a Pretrained Network and a Budget (e.g., Latency 3.8, Energy 10.5) are fed to NetAdapt, which generates network proposals A … Z, takes empirical measurements of each on the target Platform (e.g., Latency 15.6 … 14.3, Energy 41 … 46), and outputs the Adapted Network]
Code available at http://netadapt.mit.edu [Yang, ECCV 2018]
Vivienne Sze ( @eems_mit) In collaboration with Google’s Mobile Vision Team
46 Simplified Example of One Iteration
1. Input: the network from the previous iteration (latency: 100ms) and the new budget (80ms)
2. Meet Budget: simplify one layer at a time (e.g., Layer 1 … Layer 4) until each candidate meets the 80ms budget (100ms → 90ms → 80ms)
3. Maximize Accuracy: among the candidates, select the one with the highest accuracy (e.g., 60% vs. 40%)
4. Output: the selected network for the next iteration (latency: 80ms; next budget: 60ms)
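A high-level sketch of one such iteration (plain Python; simplify_layer, measure_latency, and finetune_and_eval are hypothetical placeholders for the actual NetAdapt steps, so this shows only the control flow, not the published algorithm's details):

```python
def netadapt_iteration(network, budget_ms, simplify_layer, measure_latency, finetune_and_eval):
    """Generate one proposal per layer that meets the budget, keep the most accurate one."""
    proposals = []
    for layer_idx in range(len(network.layers)):
        candidate = network.copy()
        # Step 2: keep simplifying this layer until the empirical latency meets the budget
        while measure_latency(candidate) > budget_ms:
            candidate = simplify_layer(candidate, layer_idx)
        proposals.append(candidate)
    # Step 3: pick the proposal with the highest (short-term fine-tuned) accuracy
    return max(proposals, key=finetune_and_eval)
```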

Vivienne Sze ( @eems_mit) [Yang, ECCV 2018]


47 Improved Latency vs. Accuracy Tradeoff
• NetAdapt boosts the real inference speed of MobileNet
by up to 1.7x with higher accuracy

+0.3% accuracy
1.7x faster

+0.3% accuracy
1.6x faster

*Tested on the ImageNet dataset and a Google Pixel 1 CPU


Reference:
MobileNet: Howard et al, “Mobilenets: Efficient convolutional neural networks for mobile vision applications”, arXiv 2017
MorphNet: Gordon et al., “Morphnet: Fast & simple resource-constrained structure learning of deep networks”, CVPR 2018

Vivienne Sze ( @eems_mit) [Yang, ECCV 2018]


48 FastDepth: Fast Monocular Depth Estimation
Depth estimation from a single RGB image is desirable due to the relatively low cost and size of monocular cameras.

[Figure: RGB input → depth prediction]

Auto-encoder DNN architecture (dense output): Reduction (similar to classification) → Expansion

Vivienne Sze ( @eems_mit) [Joint work with Sertac Karaman]


49 FastDepth: Fast Monocular Depth Estimation
Apply NetAdapt, compact network design, and depth-wise decomposition to the decoder layers to enable depth estimation at high frame rates on an embedded platform while still maintaining accuracy

[Chart: > 10x speedup; ~40fps on an iPhone]

Configuration: Batch size of one (32-bit float)

Models available at http://fastdepth.mit.edu


Vivienne Sze ( @eems_mit) [Wofk*, Ma*, ICRA 2019]
50 Many Efficient DNN Design Approaches
Network Pruning: remove synapses and neurons (before pruning → after pruning)

Efficient Network Architectures: channel groups (G); replace a convolutional layer with depth-wise and point-wise layers

Reduce Precision: 32-bit float → 8-bit fixed → binary

No guarantee that the DNN algorithm designer will use a given approach.
Need flexible hardware!

Vivienne Sze ( @eems_mit) [Chen*, Yang*, SysML 2018]


51 Existing DNN Architectures
• Specialized DNN hardware often relies on certain properties of the DNN in order to achieve high energy efficiency

• Example: Reduce memory access by amortizing reads across the MAC array

[Figure: the MAC array reuses each weight fetched from the weight memory across multiple activations (weight reuse), and each activation fetched from the activation memory across multiple weights (activation reuse)]

Vivienne Sze ( @eems_mit)


52 Limitation of Existing DNN Architectures
• Example: Reuse and array utilization depends on # of channels,
feature map/batch size
– Not efficient across all network architectures (e.g., compact DNNs)

[Figure: MAC array with one dimension mapped to the number of filters (output channels) and the other to the number of input channels or the feature map/batch size, using spatial or temporal accumulation]

Vivienne Sze ( @eems_mit)


53 Limitation of Existing DNN Architectures
• Example: Reuse and array utilization depends on # of channels,
feature map/batch size
– Not efficient across all network architectures (e.g., compact DNNs)

Example mapping for a depth-wise layer (filter of size R×S with C = 1)

[Figure: MAC array with one dimension mapped to the number of filters (output channels) and the other to the number of input channels or the feature map/batch size, using spatial or temporal accumulation]

Vivienne Sze ( @eems_mit)


54 Limitation of Existing DNN Architectures
• Example: Reuse and array utilization depends on # of channels,
feature map/batch size
– Not efficient across all network architectures (e.g., compact DNNs)
– Less efficient as array scales up in size
– Can be challenging to exploit sparsity

[Figure: MAC array with one dimension mapped to the number of filters (output channels) and the other to the number of input channels or the feature map/batch size, using spatial or temporal accumulation]
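A back-of-the-envelope sketch of why such arrays are underutilized on compact layers (the array size and layer shapes below are illustrative assumptions, not a specific accelerator):

```python
array_rows, array_cols = 16, 16   # assumed MAC array: rows <- input channels, cols <- filters

def utilization(C, M):
    """Fraction of the array occupied by a layer with C input channels and M filters."""
    used = min(C, array_rows) * min(M, array_cols)
    return used / (array_rows * array_cols)

print(utilization(C=64, M=128))   # standard CONV layer: 1.0 (fully utilized)
print(utilization(C=1, M=32))     # depth-wise layer (C = 1): 0.0625 (1/16 of the array)
```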

Vivienne Sze ( @eems_mit)


55 Need Flexible Dataflow
• Use flexible dataflow (Row Stationary) to exploit reuse in any
dimension of DNN to increase energy efficiency and array
utilization

Example: Depth-wise layer


Vivienne Sze ( @eems_mit)
56 Need Flexible NoC for Varying Reuse
• When reuse is available, need multicast to exploit spatial data reuse for energy efficiency and high array utilization
• When reuse is not available, need unicast for high bandwidth (e.g., weights for FC layers, and weights & activations more generally) and for high PE utilization
• An all-to-all network satisfies both but is too expensive and not scalable

[Figure: NoC options from High Bandwidth, Low Spatial Reuse to Low Bandwidth, High Spatial Reuse: Unicast Networks → 1D Systolic Networks → 1D Multicast Networks → Broadcast Network]

Vivienne Sze ( @eems_mit) [Chen, JETCAS 2019]


57 Hierarchical Mesh
[Figure: hierarchical mesh network — GLB clusters, router clusters, and PE clusters; a mesh network connects clusters, and an all-to-all network connects PEs within a cluster. The network can be configured for (b) High Bandwidth, (c) High Reuse, (d) Grouped Multicast, and (e) Interleaved Multicast]


Vivienne Sze ( @eems_mit) [Chen, JETCAS 2019]
58 Eyeriss v2: Balancing Flexibility and Efficiency

Efficiently supports
• Wide range of filter shapes – Large and Compact
• Different Layers – CONV, FC, depth-wise, etc.
• Wide range of sparsity – Dense and Sparse
• Scalable architecture

Speed up over Eyeriss v1 scales with number of PEs:
# of PEs      256     1024    16384
AlexNet       17.9x   71.5x   1086.7x
GoogLeNet     10.4x   37.8x   448.8x
MobileNet     15.7x   57.9x   873.0x

Over an order of magnitude faster and more energy efficient than Eyeriss v1
[Chen, JETCAS 2019]

Vivienne Sze ( @eems_mit) [Joint work with Joel Emer]


59 Looking Beyond the DNN Accelerator for Acceleration

Vivienne Sze ( @eems_mit)


60 Super-Resolution on Mobile Devices
[Figure: transmit low-resolution video for streaming, play back at high resolution]
Transmit low resolution for lower bandwidth; screens are getting larger

Use super-resolution to improve the viewing experience of lower-resolution content (reduce communication bandwidth)

Vivienne Sze ( @eems_mit)


61 FAST: A Framework to Accelerate SuperRes

[Figure: SR algorithm alone vs. SR algorithm + FAST on compressed video → 15x faster, real-time]

A framework that accelerates any SR algorithm by up to 15x when running on compressed videos

Vivienne Sze ( @eems_mit) [Zhang, CVPRW 2017]


62 Free Information in Compressed Videos

[Figure: decoding a compressed video yields not just pixels (video as a stack of pixels) but also block structure and motion compensation information (the representation in the compressed video)]

This representation can help accelerate super-resolution

Vivienne Sze ( @eems_mit) [Zhang, CVPRW 2017]


63 Transfer is Lightweight
[Figure: run SR on a subset of low-res frames and transfer the high-res result to the remaining frames]
Transfer allows SR to run on only a subset of frames

The transfer involves fractional interpolation, bicubic interpolation, and skip flags; its complexity is comparable to bicubic interpolation.

Transfer N frames, accelerate by N
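A schematic sketch of that scheduling idea (plain Python; run_sr and transfer are hypothetical placeholders for the SR network and the motion-compensated transfer from [Zhang, CVPRW 2017], so this only illustrates the "SR every N-th frame" accounting):

```python
def fast_pipeline(frames, N, run_sr, transfer):
    """Run SR on every N-th frame; transfer the result to the frames in between."""
    high_res = []
    anchor_hr = None
    for i, frame in enumerate(frames):
        if i % N == 0:
            anchor_hr = run_sr(frame)               # expensive: SR network
        else:
            anchor_hr = transfer(anchor_hr, frame)  # cheap: ~bicubic-interpolation cost
        high_res.append(anchor_hr)
    return high_res   # SR runs on len(frames)/N frames -> up to ~N x speedup
```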
Vivienne Sze ( @eems_mit) [Zhang, CVPRW 2017]
64 Evaluation: Accelerating SRCNN

Vivienne Sze ( @eems_mit) [Zhang, CVPRW 2017]


65 Visual Evaluation

[Visual comparison: SRCNN vs. FAST + SRCNN vs. bicubic interpolation]

Look beyond the DNN accelerator for opportunities to accelerate


DNN processing (e.g., structure of data and temporal correlation)
Code released at www.rle.mit.edu/eems/fast

Vivienne Sze ( @eems_mit) [Zhang, CVPRW 2017]


66 Beyond Deep Neural Networks

Vivienne Sze ( @eems_mit)


67 Visual-Inertial Localization
Determines location/orientation of robot from images and an IMU (Inertial Measurement Unit)

[Figure: image sequence + IMU → Visual-Inertial Odometry (VIO)* → Localization]

* Subset of a SLAM (Simultaneous Localization And Mapping) algorithm, which additionally performs Mapping

Vivienne Sze ( @eems_mit)


68 Localization at Under 25 mW
Navion: first chip that performs complete Visual-Inertial Odometry
• Front-End for camera (feature detection, tracking, and outlier elimination)
• Front-End for IMU (pre-integration of accelerometer and gyroscope data)
• Back-End optimization of pose graph

Consumes 684× and 1582× less energy than mobile and desktop CPUs, respectively

Navion Project Website: http://navion.mit.edu [Zhang et al., RSS 2017], [Suleiman et al., VLSI 2018]

Vivienne Sze ( @eems_mit) [Joint work with Sertac Karaman]


69 Key Methods to Reduce Data Size
Navion: Fully integrated system – no off-chip processing or storage
[Block diagram: Vision Frontend (VFE) with line buffers for the previous/current and left/right frames, feature detection (FD), undistort & rectify (UR), feature tracking (FT), and sparse stereo (SS); IMU Frontend (IFE) with pre-integration of IMU data; Backend (BE) with graph build, linearization, linear solver (Cholesky and back substitution), marginalization, retract, and Rodrigues operations, plus RANSAC and point-cloud blocks, using a mix of fixed-point and floating-point arithmetic]

• Apply low-cost compression to the frame data
• Exploit sparsity in the graph and linear solver

Use compression and exploit sparsity to reduce memory down to 854kB

Vivienne Sze ( @eems_mit) [Suleiman, VLSI-C 2018] Best Student Paper Award
70 Where to Go Next: Planning and Mapping
Robot Exploration: Decide where to go by computing Shannon Mutual Information

Pipeline: Select candidate scan locations → Compute Shannon MI and choose best location → Move to location and scan → Update occupancy map

[Figures: occupancy map (“Where to scan?”), mutual information map / MI surface, updated occupancy map with planned path; exploration with a mini race car using motion capture for localization]

Vivienne Sze ( @eems_mit) [Zhang, ICRA 2019]


71 Challenge is Data Delivery to All Cores
Process multiple beams in parallel (Core 1 … Core N)

Data delivery from memory is limited: a small number of read ports (e.g., Read Port 1, Read Port 2) must serve all N cores
Vivienne Sze ( @eems_mit)
72 Specialized Memory Architecture
Break up map into separate memory banks and novel storage pattern to
minimize read conflicts when processing different beams in parallel.
[Figure: memory access pattern of a beam vs. the diagonal banking pattern — map cells are assigned to Banks 0–7 along diagonals in (X, Y) so that cells read by different beams in parallel fall into different banks]

[Li, RSS 2019]


Compute the mutual information for an entire map of 20m x 20m at 0.1m resolution in under a second → a 100x speed up versus CPU for 1/10th of the power.
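A minimal sketch of one way such a diagonal banking function can look (plain Python; the modulo-based mapping and bank count here are illustrative assumptions, not the exact pattern from [Li, RSS 2019]):

```python
NUM_BANKS = 8

def bank_of(x, y):
    # Diagonal banking: neighboring cells along a row, column, or diagonal sweep
    # land in different banks, so parallel beam reads rarely collide on one bank.
    return (x + y) % NUM_BANKS

# Cells along one row map to distinct banks: 0..7, then wrap
print([bank_of(x, 0) for x in range(10)])   # [0, 1, 2, 3, 4, 5, 6, 7, 0, 1]
```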

Vivienne Sze ( @eems_mit) [Joint work with Sertac Karaman]


73 Monitoring Neurodegenerative Disorders
Dementia affects 50 million people worldwide today
(75 million in 10 years) [World Alzheimer’s Report]

Mini-Mental State Examination (MMSE):
Q1. What is the year? Season? Date?
Q2. Where are you now? State? Floor?
Q3. Could you count backward from 100 by sevens? (93, 86, …)

Clock-drawing test [Agrell et al., Age and Ageing, 1998]

• Neuropsychological assessments are time consuming and require a trained specialist
• Repeat medical assessments are sparse, mostly qualitative, and suffer from high retest variability
Vivienne Sze ( @eems_mit) [Joint work with Thomas Heldt and Charlie Sodini]
74 Use Eye Movements for Quantitative Evaluation
Eye movements can be used to quantitatively evaluate severity,
progression or regression of neurodegenerative diseases
[Examples of clinical setups: high-speed camera (Phantom v25-11), substantial head support (SR EYELINK 1000 PLUS), IR illumination (Reulen et al., Med. & Biol. Eng. & Comp, 1988)]

Clinical measurements of saccade latency are done in constrained environments that rely on specialized, costly equipment.
Vivienne Sze ( @eems_mit)
75 Measure Eye Movements Using Phone
[Figure: smartphone video of eye movements → eye movement feature extraction; histogram of reaction time (milliseconds) measured with an iPhone 6 (< $1k) vs. a Phantom camera ($100k)]

Develop algorithm to measure eye movement using a consumer-grade camera rather than a high-cost research-grade camera.

Enable low-cost in-home longitudinal measurements.
Vivienne Sze ( @eems_mit) [Saavedra Peña, EMBC 2018] [Lai, ICIP 2018]
76 Looking For Volunteers for Eye Reaction Time

If you are near or on MIT Campus and interested in volunteering your eye movements for this study, please contact us at [email protected]

Vivienne Sze ( @eems_mit)


77 Low Power 3D Time of Flight Imaging
• Pulsed Time of Flight: Measure distance using round trip time of laser light for each image pixel
  – Illumination + Imager Power: 2.5 – 20 W for range from 1 – 8 m

• Use computer vision techniques and passive images to estimate changes in depth without turning on the laser
  – CMOS Imaging Sensor Power: < 350 mW

Estimated Depth Maps: real-time performance on an embedded processor, VGA @ 30 fps on a Cortex-A7 (< 0.5W active power)
Vivienne Sze ( @eems_mit) [Noraky, ICIP 2017]
78 Results of Low Power Depth ToF Imaging

[Figure: RGB image, ground-truth depth map, estimated depth map]

Mean Relative Error: 0.7%
Duty Cycle (on-time of laser): 11%

Vivienne Sze ( @eems_mit) [Noraky, ICIP 2017]


79 Summary

• Efficient computing extends the reach of AI beyond the cloud by reducing communication requirements, enabling privacy, and providing low latency, so that AI can be used in a wide range of applications, from robotics to health care.

• Cross-layer design with specialized hardware enables energy-efficient AI, and will be critical to the progress of AI over the next decade.

Today’s slides available at


https://tinyurl.com/SzeMITDL2020
Vivienne Sze ( @eems_mit)
80 Additional Resources

Overview Paper
V. Sze, Y.-H. Chen, T-J. Yang, J. Emer,
“Efficient Processing of Deep Neural
Networks: A Tutorial and Survey,”
Proceedings of the IEEE, Dec. 2017

Book Coming Spring 2020!

For updates: EEMS Mailing List
More info about Tutorial on DNN Architectures: http://eyeriss.mit.edu/tutorial.html
Vivienne Sze ( @eems_mit)
81 Additional Resources

MIT Professional Education Course on “Designing Efficient Deep Learning Systems”
http://shortprograms.mit.edu/dls
Next Offering: July 20-21, 2020 on MIT Campus
Vivienne Sze ( @eems_mit)
82 Additional Resources
Talks and Tutorial Available Online
https://www.rle.mit.edu/eems/publications/tutorials/

YouTube Channel
EEMS Group – PI: Vivienne Sze

Vivienne Sze ( @eems_mit)


83 Acknowledgements

Joel Emer

Sertac Karaman

Thomas Heldt
Research conducted in the MIT Energy-Efficient Multimedia Systems Group would not
be possible without the support of the following organizations:

Vivienne Sze ( @eems_mit) Mailing List: http://mailman.mit.edu/mailman/listinfo/eems-news


84 References

• Energy-Efficient Hardware for Deep Neural Networks


– Project website: http://eyeriss.mit.edu
– Y.-H. Chen, T. Krishna, J. Emer, V. Sze, “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep
Convolutional Neural Networks,” IEEE Journal of Solid State Circuits (JSSC), ISSCC Special Issue, Vol. 52,
No. 1, pp. 127-138, January 2017.
– Y.-H. Chen, J. Emer, V. Sze, “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for
Convolutional Neural Networks,” International Symposium on Computer Architecture (ISCA), pp. 367-
379, June 2016.
– Y.-H. Chen, T.-J. Yang, J. Emer, V. Sze, “Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural
Networks on Mobile Devices,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems
(JETCAS), June 2019.
– Eyexam: https://arxiv.org/abs/1807.07928

• Limitations of Existing Efficient DNN Approaches


– Y.-H. Chen*, T.-J. Yang*, J. Emer, V. Sze, “Understanding the Limitations of Existing Energy-Efficient
Design Approaches for Deep Neural Networks,” SysML Conference, February 2018.
– V. Sze, Y.-H. Chen, T.-J. Yang, J. Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and
Survey,” Proceedings of the IEEE, vol. 105, no. 12, pp. 2295-2329, December 2017.
– Hardware Architecture for Deep Neural Networks: http://eyeriss.mit.edu/tutorial.html

Vivienne Sze ( @eems_mit)


85 References

• Co-Design of Algorithms and Hardware for Deep Neural Networks


– T.-J. Yang, Y.-H. Chen, V. Sze, “Designing Energy-Efficient Convolutional Neural Networks using Energy-
Aware Pruning,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
– Energy estimation tool: http://eyeriss.mit.edu/energy.html
– T.-J. Yang, A. Howard, B. Chen, X. Zhang, A. Go, V. Sze, H. Adam, “NetAdapt: Platform-Aware Neural
Network Adaptation for Mobile Applications,” European Conference on Computer Vision (ECCV), 2018.
– D. Wofk*, F. Ma*, T.-J. Yang, S. Karaman, V. Sze, “FastDepth: Fast Monocular Depth Estimation on
Embedded Systems,” IEEE International Conference on Robotics and Automation (ICRA), May 2019.
http://fastdepth.mit.edu/

• Energy-Efficient Visual Inertial Localization


– Project website: http://navion.mit.edu
– A. Suleiman, Z. Zhang, L. Carlone, S. Karaman, V. Sze, “Navion: A Fully Integrated Energy-Efficient
Visual-Inertial Odometry Accelerator for Autonomous Navigation of Nano Drones,” IEEE Symposium on
VLSI Circuits (VLSI-Circuits), June 2018.
– Z. Zhang*, A. Suleiman*, L. Carlone, V. Sze, S. Karaman, “Visual-Inertial Odometry on Chip: An
Algorithm-and-Hardware Co-design Approach,” Robotics: Science and Systems (RSS), July 2017.
– A. Suleiman, Z. Zhang, L. Carlone, S. Karaman, V. Sze, “Navion: A 2mW Fully Integrated Real-Time
Visual-Inertial Odometry Accelerator for Autonomous Navigation of Nano Drones,” IEEE Journal of Solid
State Circuits (JSSC), VLSI Symposia Special Issue, Vol. 54, No. 4, pp. 1106-1119, April 2019.

Vivienne Sze ( @eems_mit)


86 References
• Fast Shannon Mutual Information for Robot Exploration
– Z. Zhang, T. Henderson, V. Sze, S. Karaman, “FSMI: Fast computation of Shannon Mutual Information for
information-theoretic mapping,” IEEE International Conference on Robotics and Automation (ICRA), May 2019.
– P. Li*, Z. Zhang*, S. Karaman, V. Sze, “High-throughput Computation of Shannon Mutual Information on Chip,”
Robotics: Science and Systems (RSS), June 2019.

• Low Power Time of Flight Imaging


– J. Noraky, V. Sze, “Low Power Depth Estimation of Rigid Objects for Time-of-Flight Imaging,” IEEE Transactions
on Circuits and Systems for Video Technology (TCSVT), 2019.
– J. Noraky, V. Sze, “Depth Estimation of Non-Rigid Objects For Time-Of-Flight Imaging,” IEEE International
Conference on Image Processing (ICIP), October 2018.
– J. Noraky, V. Sze, “Low Power Depth Estimation for Time-of-Flight Imaging,” IEEE International Conference on
Image Processing (ICIP), September 2017.

• Monitoring Neurodegenerative Disorders Using a Phone


– H.-Y. Lai, G. Saavedra Peña, C. Sodini, T. Heldt, V. Sze, “Enabling Saccade Latency Measurements with
Consumer-Grade Cameras,” IEEE International Conference on Image Processing (ICIP), October 2018.
– G. Saavedra Peña, H.-Y. Lai, V. Sze, T. Heldt, “Determination of saccade latency distributions using video
recordings from consumer-grade devices,” IEEE International Engineering in Medicine and Biology Conference
(EMBC), 2018.
– H.-Y. Lai, G. Saavedra Peña, C. Sodini, V. Sze, T. Heldt, “Measuring Saccade Latency Using Smartphone
Cameras,” IEEE Journal of Biomedical and Health Informatics (JBHI), March 2020.

Vivienne Sze ( @eems_mit)
