
EfficientML.ai Lecture 04:
Pruning and Sparsity
Part II

Song Han
Associate Professor, MIT
Distinguished Scientist, NVIDIA
@SongHan_MIT

MIT 6.5940: TinyML and Efficient Deep Learning Computing — https://efficientml.ai
Neural Network Pruning
• Introduction to Pruning
  • What is pruning?
  • How should we formulate pruning?
• Determine the Pruning Granularity
  • In what pattern should we prune the neural network?
• Determine the Pruning Criterion
  • What synapses/neurons should we prune?
• Determine the Pruning Ratio
  • What should the target sparsity be for each layer?
• Fine-tune/Train Pruned Neural Network
  • How should we improve the performance of pruned models?

[Figure: before pruning vs. after pruning — pruning removes synapses and neurons]

Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Neural Network Pruning
Make neural networks smaller by removing synapses and neurons

[Chart: #publications per year on pruning and sparse neural networks, 1989–2019 — rising from Optimal Brain Damage (1989) through Deep Compression and EIE toward ~3,200 per year by 2019]

2:4 sparsity in the A100 GPU: 2× peak performance, 1.5× measured BERT speedup

Source: https://github.com/mit-han-lab/pruning-sparsity-publications
Neural Network Pruning
• In general, we can formulate pruning as follows:

    W_P = argmin_{W_P} L(x; W_P)   subject to   ‖W_P‖_0 ≤ N

• L represents the objective function for neural network training;
• x is the input, W is the original weights, and W_P is the pruned weights;
• ‖W_P‖_0 counts the #nonzeros in W_P, and N is the target #nonzeros.

[Figure: training the dense model solves argmin_W L(x; W); pruning solves argmin_{W_P} L(x; W_P) s.t. ‖W_P‖_0 ≤ N]
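As a concrete illustration, here is a minimal PyTorch-style sketch (the helper name is ours, not from the slides) that enforces ‖W_P‖_0 ≤ N by keeping the N largest-magnitude weights. Magnitude is one common selection heuristic (see the criterion slides below); the formulation itself does not prescribe it:

    import torch

    def magnitude_prune(W: torch.Tensor, N: int) -> torch.Tensor:
        """Keep the N largest-magnitude entries of W; zero out the rest.

        Enforces ||W_P||_0 <= N (ties at the threshold may keep a few extra).
        """
        flat = W.abs().flatten()
        if N >= flat.numel():
            return W.clone()
        # The N-th largest magnitude serves as the pruning threshold.
        threshold = torch.topk(flat, N, largest=True).values.min()
        return W * (W.abs() >= threshold).to(W.dtype)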
Pruning at Different Granularities
The case of convolutional layers (k_w = 3, k_h = 3, c_i = 2, c_o = 3)

• Some of the commonly used pruning granularities, from irregular to regular:
  Fine-grained Pruning → Pattern-based Pruning (like Tetris :)) → Vector-level Pruning → Kernel-level Pruning → Channel-level Pruning

[Figure: preserved vs. pruned weights under each granularity]

Exploring the Granularity of Sparsity in Convolutional Neural Networks [Mao et al., CVPR-W 2017]
Selection of Synapses to Prune
• When removing parameters from a neural network model,
  • the less important the removed parameters are,
  • the better the performance of the pruned neural network will be.
• Magnitude-based pruning considers weights with larger absolute values to be more important than other weights.
• For element-wise pruning:
    Importance = |W|
• Example (element-wise L1-norm, pruning 50% of the weights):

    Weight        Importance      Pruned Weight
    [ 3  -2 ]     [ 3  2 ]        [ 3   0 ]
    [ 1  -5 ]     [ 1  5 ]        [ 0  -5 ]
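The slide's example, reproduced as a short PyTorch sketch (the 50% ratio is taken from the example above):

    import torch

    W = torch.tensor([[3., -2.],
                      [1., -5.]])
    importance = W.abs()                  # element-wise importance = |W|
    k = W.numel() // 2                    # prune 50%: keep 2 of 4 weights
    threshold = importance.flatten().topk(k).values.min()
    pruned = torch.where(importance >= threshold, W, torch.zeros_like(W))
    print(pruned)                         # tensor([[ 3.,  0.], [ 0., -5.]])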
Selection of Neurons to Prune
• When removing neurons from a neural network model,
  • the less useful the removed neurons are,
  • the better the performance of the pruned neural network will be.
• Recall: neuron pruning is coarse-grained pruning.

[Figure: neuron pruning in a linear layer removes a row of the weight matrix; channel pruning in a convolution layer removes the corresponding filters]
Neural Network Pruning
• Introduction to Pruning
  • What is pruning?
  • How should we formulate pruning?
• Determine the Pruning Granularity
  • In what pattern should we prune the neural network?
• Determine the Pruning Criterion
  • What synapses/neurons should we prune?
• Determine the Pruning Ratio ← this section
  • What should the target sparsity be for each layer? (prune 30%? 50%? 70%?)
• Fine-tune/Train Pruned Neural Network
  • How should we improve the performance of pruned models?

Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Section 1: Pruning Ratio
How should we find per-layer pruning ratios?
Recap
Non-uniform pruning is better than uniform shrinking

[Plot: ImageNet accuracy (%) vs. latency (ms) — channel pruning with AMC dominates uniform scaling (Uniform Shrink < Channel Prune)]

Question: how should we find the ratios for each layer?

AMC: AutoML for Model Compression and Acceleration on Mobile Devices [He et al., ECCV 2018]
Finding Pruning Ratios
Analyze the sensitivity of each layer
• We need different pruning ratios for each layer since different layers have different sensitivity
  • Some layers are more sensitive (e.g., the first layer)
  • Some layers are more redundant
• We can perform sensitivity analysis to determine the per-layer pruning ratio
Finding Pruning Ratios
Analyze the sensitivity of each layer
• The process of sensitivity analysis (*VGG-11 on the CIFAR-10 dataset); a code sketch follows this list:
  • Pick a layer L_i in the model
  • Prune the layer L_i with pruning ratio r_i ∈ {0, 0.1, 0.2, ..., 0.9} (or other strides)
  • Observe the accuracy degradation ΔAcc_r for each pruning ratio (the higher the pruning rate, the more accuracy loss)
  • Repeat the process for all layers
  • Pick a degradation threshold T such that the overall pruning rate is as desired; each layer's curve crosses T at that layer's pruning rate
• Some layers are less sensitive to pruning; others are more sensitive (their accuracy drops at lower pruning rates)

[Plot: accuracy (%) vs. pruning rate (percentage of weights pruned away, 10%–90%) for layers L0–L5, with the threshold T drawn as a horizontal line; its intersections with the curves give the per-layer pruning rates]
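A sketch of this scan in PyTorch (the evaluate helper is assumed; magnitude_prune is the sketch from the formulation slide):

    import copy

    def sensitivity_scan(model, evaluate,
                         ratios=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
        """Accuracy vs. pruning ratio, pruning one layer at a time."""
        curves = {}
        for name, _ in model.named_parameters():
            if "weight" not in name:           # skip biases / norm params
                continue
            accs = []
            for r in ratios:
                pruned = copy.deepcopy(model)  # prune only this layer, fresh copy
                W = dict(pruned.named_parameters())[name]
                n_keep = int(W.numel() * (1 - r))
                W.data = magnitude_prune(W.data, n_keep)
                accs.append(evaluate(pruned))  # ΔAcc_r = dense accuracy - accs[-1]
            curves[name] = accs
        return curves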
Finding Pruning Ratios
Analyze the sensitivity of each layer
• Is this optimal?
  • Maybe not: we do not consider the interaction between layers.
• Can we go beyond the heuristics?
  • Yes!

[Plot: the same sensitivity curves with threshold T and the resulting per-layer pruning rates]
Automatic Pruning
• Given an overall compression ratio, how do we choose per-layer pruning ratios?
• Sensitivity analysis ignores the interaction between layers → sub-optimal
• Conventionally, this process relies on human expertise and trial and error

[Figure: engineers iterating with customers on compressed models]

AMC: AutoML for Model Compression and Acceleration on Mobile Devices [He et al., ECCV 2018]
Automatic Pruning
• Can we develop a push-the-button solution?

[Figure: Hardware-Centric AutoML bridges the gap — combining the machine learning expert and the hardware expert with AutoML lets non-experts obtain efficient neural nets]

AMC: AutoML for Model Compression and Acceleration on Mobile Devices [He et al., ECCV 2018]
AMC: AutoML for Model Compression
Pruning as a reinforcement learning problem

• Model compression by human: labor-consuming, sub-optimal
• Model compression by AI: automated, higher compression rate, faster

[Figure 1: Overview of the AutoML for Model Compression (AMC) engine. A DDPG agent (actor-critic) receives a layer embedding s_t = [N, C, H, W, i, ...], outputs an action — the sparsity ratio a_t (e.g., 50%) with which to compress layer t — and receives Reward = -Error * log(FLOP). The environment is channel pruning, turning the original NN into the compressed NN layer by layer (e.g., 50% for layer t, 30% for layer t-1).]

AMC: AutoML for Model Compression and Acceleration on Mobile Devices [He et al., ECCV 2018]
AMC: AutoML for Model Compression
• AMC uses the following setup for the reinforcement learning problem:
  • State: features including layer indices, channel numbers, kernel sizes, FLOPs, ...
  • Action: a continuous number (the pruning ratio) a ∈ [0, 1)
  • Agent: a DDPG agent, since it supports continuous action output
  • Reward:
        R = { -Error,  if it satisfies the constraints
            { -∞,      if not
• We can also optimize latency constraints with a pre-built lookup table (LUT)

AMC: AutoML for Model Compression and Acceleration on Mobile Devices [He et al., ECCV 2018]
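The reward in code, for concreteness (a sketch with illustrative names, not the paper's implementation). The overview figure also shows a FLOPs-aware variant, Reward = -Error * log(FLOP), included here as an option:

    import math

    def amc_reward(error, meets_constraints, flops=None):
        """Reward for one episode of AMC's compression search."""
        if not meets_constraints:
            return float("-inf")             # reject models that violate the budget
        if flops is not None:
            return -error * math.log(flops)  # FLOPs-aware variant from Fig. 1
        return -error                        # constrained variant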
AMC: AutoML for Model Compression

[Chart: model size vs. accuracy — AutoML-compressed models dominate the human-compressed baselines* (smaller is better)]

* Efficient Methods and Hardware for Deep Learning [Han, thesis]
AMC: AutoML for Model Compression
• Peaks: our RL agent automatically learns that 1×1 convolutions have less redundancy and can be pruned less.
• Crests: our RL agent automatically learns that 3×3 convolutions have more redundancy and can be pruned more.

[Figure: the pruning policy (sparsity ratio) given by the reinforcement learning agent for ResNet-50, across residual blocks 1–4 — ResNet-50 density pruned by AMC is lower (the lower the better) than density pruned by a human expert]
AMC: AutoML for Model Compression

Model            | MACs | Top-1 | Latency* | Speedup | Memory
1.0 MobileNet    | 569M | 70.6% | 119.0 ms | 1×      | 20.1 MB
AMC (50% FLOPs)  | 285M | 70.5% | 64.4 ms  | 1.8×    | 14.3 MB
AMC (50% time)   | 272M | 70.2% | 59.7 ms  | 2.0×    | 13.2 MB
0.75 MobileNet   | 325M | 68.4% | 69.5 ms  | 1.7×    | 14.8 MB

* Measured with TF-Lite on a Samsung Galaxy S7 Edge (Qualcomm Snapdragon SoC), single core, batch size = 1 (mobile, latency-oriented)
NetAdapt
A rule-based iterative/progressive method
• The goal of NetAdapt is to find per-layer pruning ratios that meet a global resource constraint (e.g., latency, energy, ...)
• The process is done iteratively
• We take the latency constraint as an example

NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications [Yang et al., ECCV 2018]
NetAdapt
• For each iteration, we aim to reduce the latency by a certain amount ΔR (manually defined); a sketch of the loop follows this list.
  • For each layer L_k (k in A–Z in the figure):
    • Prune the layer s.t. the latency reduction meets ΔR (based on a pre-built lookup table)
    • Short-term fine-tune the model (10k iterations); measure accuracy after fine-tuning (Acc_A, Acc_B, ..., Acc_Z)
  • Choose and prune the layer with the highest accuracy
• Repeat until the total latency reduction satisfies the constraint
• Long-term fine-tune to recover accuracy

[Figure: original model → prune each layer to reduce ΔR → short-term fine-tune → pick the best candidate → repeat → long-term fine-tune → final model]

NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications [Yang et al., ECCV 2018]
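A schematic sketch of the NetAdapt loop under the stated assumptions — prune_layer_to_meet, short_term_finetune, evaluate, long_term_finetune, and model.prunable_layers() are hypothetical stand-ins for the paper's components, with the latency lookup table hidden inside prune_layer_to_meet:

    import copy

    def netadapt(model, latency_budget, delta_r, measure_latency,
                 prune_layer_to_meet, short_term_finetune, evaluate,
                 long_term_finetune):
        """Iteratively shrink `model` until it meets `latency_budget`."""
        while measure_latency(model) > latency_budget:
            candidates = []
            for layer in model.prunable_layers():        # layers A..Z
                cand = copy.deepcopy(model)
                # Prune this one layer just enough to cut latency by delta_r.
                prune_layer_to_meet(cand, layer, delta_r)
                short_term_finetune(cand, iters=10_000)
                candidates.append((evaluate(cand), cand))
            # Keep the candidate with the highest post-fine-tune accuracy.
            _, model = max(candidates, key=lambda c: c[0])
        long_term_finetune(model)
        return model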
NetAdapt
• The iterative nature allows us to obtain a series of models with different costs
  • #models = #iterations

[Figure: the model series produced across iterations, trading accuracy for latency]

NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications [Yang et al., ECCV 2018]
Neural Network Pruning
• Introduction to Pruning
  • What is pruning?
  • How should we formulate pruning?
• Determine the Pruning Granularity
  • In what pattern should we prune the neural network?
• Determine the Pruning Criterion
  • What synapses/neurons should we prune?
• Determine the Pruning Ratio
  • What should the target sparsity be for each layer?
• Fine-tune/Train Pruned Neural Network ← this section
  • How should we improve the performance of pruned models?

argmin_{W_P} L(x; W_P)   s.t.   ‖W_P‖_0 ≤ N

Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Section 2: Fine-tuning / Training
How should we improve the performance of sparse models?
Fine-tuning Pruned Neural Networks
• After pruning, the model accuracy may decrease, especially at larger pruning ratios.
• Fine-tuning the pruned neural network helps recover the accuracy and push the pruning ratio higher.
• The learning rate for fine-tuning is usually 1/100 or 1/10 of the original learning rate.

[Plot: accuracy loss vs. pruning ratio (40%–100% of parameters pruned away), pruning alone vs. pruning + fine-tuning — fine-tuning keeps the accuracy loss near zero up to much higher ratios. Pipeline: Train Connectivity → Prune Connections → Train Weights]

Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Iterative Pruning
• Consider pruning followed by fine-tuning as one iteration.
• Iterative pruning gradually increases the target sparsity in each iteration (e.g., 30% pruned → 50% pruned → 70% pruned); a sketch follows below.
• It boosts the pruning ratio from 5✕ to 9✕ on AlexNet compared to single-step aggressive pruning.

[Plot: accuracy loss vs. pruning ratio — pruning alone vs. pruning + fine-tuning vs. iterative pruning and fine-tuning; the iterative schedule keeps the accuracy loss near zero at the highest pruning ratios. Pipeline: Train Connectivity → Prune Connections → Train Weights, repeated iteratively]

Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
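A minimal sketch of the schedule, reusing the magnitude_prune helper from earlier (the finetune helper is assumed; a real implementation would also keep a 0/1 mask so that weights pruned in earlier iterations stay zero during fine-tuning):

    def iterative_prune(model, finetune, sparsities=(0.3, 0.5, 0.7)):
        """Alternate pruning and fine-tuning, raising sparsity each iteration."""
        for sparsity in sparsities:
            for name, W in model.named_parameters():
                if W.dim() < 2:               # skip biases / norm params
                    continue
                n_keep = int(W.numel() * (1 - sparsity))
                W.data = magnitude_prune(W.data, n_keep)
            # Fine-tune at 1/10 to 1/100 of the original learning rate
            # to recover accuracy before pruning further.
            finetune(model)
        return model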
Regularization
• When training neural networks or fine-tuning pruned neural networks, a regularization term is added to the loss to
  • penalize non-zero parameters
  • encourage smaller parameters
• The most common regularizers for improving the performance of pruning are L1/L2 regularization:
  • L1 regularization: L′ = L(x; W) + λ|W|
  • L2 regularization: L′ = L(x; W) + λ‖W‖²
• Examples:
  • Magnitude-based fine-grained pruning applies L2 regularization on weights
  • Network Slimming applies smooth-L1 regularization on channel scaling factors

Learning Efficient Convolutional Networks through Network Slimming [Liu et al., ICCV 2017]
Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
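A sketch of adding the regularizer to a PyTorch training loss (λ is a hyperparameter; the default value here is illustrative, not prescribed by the slides):

    import torch

    def regularized_loss(task_loss, model, lam=1e-4, kind="l2"):
        """L' = L + lam * |W|  (L1)   or   L' = L + lam * ||W||^2  (L2)."""
        reg = torch.zeros((), device=task_loss.device)
        for W in model.parameters():
            if kind == "l1":
                reg = reg + W.abs().sum()   # pushes weights to exactly zero
            else:
                reg = reg + W.pow(2).sum()  # shrinks weights smoothly
        return task_loss + lam * reg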
System Support for Sparsity

System & Hardware Support for Sparsity
• EIE: Weight Sparsity + Activation Sparsity
• NVIDIA Tensor Core: M:N Weight Sparsity
• TorchSparse & PointAcc: Activation Sparsity

Proposed Paradigm

Conventional: Training → Inference (slow, power-hungry)
Proposed: Training → Model Compression → Accelerated Inference (fast, power-efficient)

Han et al., NeurIPS’15; Han et al., ICLR’16 (best paper award); Han et al., ISCA’16; Han et al., ICLR’17; Han et al., FPGA’17 (best paper award)
EIE: Efficient Inference Engine
The first DNN accelerator for sparse, compressed models

Sparse Weight            | Sparse Activation     | Weight Sharing
0 * A = 0                | W * 0 = 0             | 2.09, 1.92 ⇒ 2
90% static sparsity      | 70% dynamic sparsity  | 4-bit weights
10× less computation     | 3× less computation   |
5× less memory footprint |                       | 8× less memory footprint

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
EIE: Parallelization on Sparsity

Example: ã = (0, a1, 0, a3), b = ReLU(W × ã), with the rows of W interleaved across four processing elements (PE0–PE3):

    PE0 │ w0,0  w0,1  0     w0,3 │        │ b0 │         │ b0 │
    PE1 │ 0     0     w1,2  0    │        │ b1 │         │ b1 │
    PE2 │ 0     w2,1  0     w2,3 │        │ b2 │         │ 0  │
    PE3 │ 0     0     0     0    │ × ã =  │ b3 │  ReLU ⇒ │ b3 │
        │ 0     0     w4,2  w4,3 │        │ b4 │         │ 0  │
        │ w5,0  0     0     0    │        │ b5 │         │ b5 │
        │ 0     0     0     w6,3 │        │ b6 │         │ b6 │
        │ 0     w7,1  0     0    │        │ b7 │         │ 0  │

[Figure: 4×4 array of PEs with central control]

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
EIE: Parallelization on Sparsity

Logically: the same W × ã = b example as above, with rows interleaved across PE0–PE3.

Physically: each PE stores only its own nonzeros in a compressed, CSC-like format. For PE0 (which owns rows 0 and 4):

    Virtual Weight   W0,0  W0,1  W4,2  W0,3  W4,3
    Relative Index   0     1     2     0     0
    Column Pointer   0     1     2     3     5

The relative index counts the zeros skipped since the previous stored entry (column-major over the PE's local rows), and the column pointer marks where each column's entries start.

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
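A pure-Python sketch of the multiply one PE performs over this format (our reading of the slide's encoding, not EIE's RTL; activations carry only nonzero (column, value) pairs, so both zero weights and zero activations are skipped — 0 * A = 0 and W * 0 = 0):

    def eie_pe_spmv(values, rel_index, col_ptr, activations, out_local,
                    rows_per_pe=2):
        """One PE's share of b = W @ a, weights in the compressed format above."""
        # Recover each entry's position from the relative indices.
        pos, abs_pos = -1, []
        for rel in rel_index:
            pos += rel + 1                  # skip `rel` zeros, then land on entry
            abs_pos.append(pos)
        for col, a in activations:          # only nonzero activations arrive
            for k in range(col_ptr[col], col_ptr[col + 1]):
                local_row = abs_pos[k] % rows_per_pe
                out_local[local_row] += values[k] * a

    # PE0 from the slide, with numeric placeholders for W0,0..W4,3:
    out = [0.0, 0.0]                        # accumulates b0 (row 0) and b4 (row 1)
    eie_pe_spmv(values=[1.0, 2.0, 3.0, 4.0, 5.0],
                rel_index=[0, 1, 2, 0, 0], col_ptr=[0, 1, 2, 3, 5],
                activations=[(1, 1.0), (3, 1.0)], out_local=out)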
Dataflow

[Animation over the same W × ã = b example: the nonzero activations (a1, then a3) are broadcast one at a time; each PE multiplies the activation by its nonzero weights in that column and accumulates into the output.]

Rule of thumb: 0 * A = 0, W * 0 = 0

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
Micro Architecture for each PE

[Diagram: PE datapath — activation queue (act value + act index) → pointer read (even/odd pointer SRAM banks) → sparse matrix access (column start/end, sparse-matrix SRAM, encoded weight + relative index) → weight decoder → arithmetic unit (address accumulation, bypass, dest/src act regs) → act R/W (act SRAM, ReLU, leading nonzero detect)]

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
Load Balance

[Diagram: the activation queue in front of each PE decouples PEs that hold different numbers of nonzero weights, balancing load across the 4×4 PE array]

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
Activation Sparsity

[Diagram: leading nonzero detection scans the activations and pushes only nonzero (value, index) pairs into the activation queue]

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
Weight Sparsity

[Diagram: for the current activation's column, pointer read (even/odd pointer SRAM banks) fetches the column start/end, and sparse matrix access streams only that column's nonzero weights from the sparse-matrix SRAM]

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
Weight Sharing

[Diagram: the weight decoder expands each 4-bit encoded weight into its 16-bit decoded value, and the relative index is accumulated into an absolute address]

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
Arithmetic & Write Back

[Diagram: the arithmetic unit multiply-accumulates the decoded weight and activation into the destination activation register (with bypass between dest/src regs), then writes back to the activation SRAM]

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
ReLU, Non-zero Detection

[Diagram: ReLU is applied on write-back, and leading nonzero detection finds the next nonzero activation to feed the queue]

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
What’s Special

[Diagram: the full PE datapath again — every stage exploits sparsity or compression: the activation queue (activation sparsity), pointer read + sparse matrix access (weight sparsity), the weight decoder (weight sharing), and ReLU + leading nonzero detection (producing sparse outputs)]

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
Benchmark
• CPU: Intel Core-i7 5930k
• GPU: NVIDIA TitanX
• Mobile GPU: NVIDIA Jetson TK1

Layer           | Size         | Weight Density | Activation Density | FLOP Reduction | Description
AlexNet-6       | 4096 × 9216  | 9%             | 35%                | 33×            | AlexNet for image classification
AlexNet-7       | 4096 × 4096  | 9%             | 35%                | 33×            |
AlexNet-8       | 1000 × 4096  | 25%            | 38%                | 10×            |
VGG-6           | 4096 × 25088 | 4%             | 18%                | 100×           | VGG-16 for image classification
VGG-7           | 4096 × 4096  | 4%             | 37%                | 50×            |
VGG-8           | 1000 × 4096  | 23%            | 41%                | 10×            |
NeuralTalk-We   | 600 × 4096   | 10%            | 100%               | 10×            | RNN and LSTM for image caption
NeuralTalk-Wd   | 8791 × 600   | 11%            | 100%               | 10×            |
NeuralTalk-LSTM | 2400 × 1201  | 10%            | 100%               | 10×            |

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
Comparison: Throughput

[Bar chart, throughput in layers/s (log scale, 1E+00 to 1E+06): Core-i7 5930k (22nm, CPU), TitanX (28nm, GPU), Tegra K1 (28nm, mGPU), A-Eye (28nm, FPGA), DaDianNao (28nm, ASIC), TrueNorth (28nm, ASIC), EIE 64PEs (45nm, ASIC), EIE 256PEs (28nm, ASIC) — EIE achieves the highest throughput]

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
Comparison: Energy Efficiency

[Bar chart, energy efficiency in layers/J (log scale, 1E+00 to 1E+06), same platforms as above — EIE is the most energy-efficient by orders of magnitude]

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
Top-5 most cited papers in 50 years of ISCA

Rank | Citations | Year | Title (★ means it won the ISCA Influential Paper Award) | First Author + HOF Authors | Type | Topic
1 | 5351 | 1995 | The SPLASH-2 programs: Characterization and methodological considerations | Stephen Woo, Anoop Gupta | Tool | Benchmark
2 | 4214 | 2017 | In-datacenter performance analysis of a Tensor Processing Unit | Norm Jouppi, David Patterson | Arch | Machine Learning
3 | 3834 | 2000 | ★ Wattch: A framework for architectural-level power analysis and optimizations | David Brooks, Margaret Martonosi | Tool | Power
4 | 3386 | 1993 | ★ Transactional memory: Architectural support for lock-free data structures | Maurice Herlihy | Micro | Parallelism
5 | 2690 | 2016 | EIE: Efficient inference engine on compressed deep neural network | Song Han, Bill Dally, Mark Horowitz | Arch | Machine Learning
6 | 2620 | 2007 | ★ Power provisioning for a warehouse-sized computer | Xiaobo Fan, Luiz Barroso | Micro | Power
7 | 2507 | 1992 | Active messages: a mechanism for integrated communication and computation | Thorsten von Eiken | Micro | Parallelism
8 | 2391 | 2011 | Dark silicon and the end of multicore scaling | Hadi Esmaeilzadeh, Doug Burger, Karthikeyan Sankaralingam | Micro | Parallelism
Pro:
• EIE demonstrated that special-purpose hardware can make it cost-effective to do sparse operations with matrices that are up to 50% dense
• EIE exploits both weight sparsity and activation sparsity: it not only saves energy by skipping zero weights, but also saves cycles by not computing on zero activations
• EIE supports fine-grained sparsity, which allows pruning to achieve a higher pruning ratio
• Aggressive weight quantization (4-bit) saves memory footprint; to maintain accuracy, EIE decodes the weights to 16-bit and uses 16-bit arithmetic. The W4A16 approach is reborn in LLMs: GPTQ, AWQ, llama.cpp, MLC LLM

Con:
• EIE isn’t as easily applied to arrays of vector processors — improvement: structured sparsity (N:M sparsity)
• EIE has control-flow overhead and storage overhead — improvement: coarse-grained sparsity
• EIE only supports FC layers — actually reborn in LLMs
• EIE fits everything in SRAM — practical for TinyML, not for LLMs
The first principle of efficient AI computing is to be lazy: avoid redundant computation, quickly reject the work, or delay the work.

• Generative AI: spatial sparsity [SIGE, NeurIPS’22]
• Transformer: token sparsity, progressive quantization [SpAtten, HPCA’21]
• Video: temporal sparsity [TSM, ICCV’19]
• Point cloud: spatial sparsity [TorchSparse, MLSys’22 & PointAcc, MICRO’21]

We envision future AI models will be sparse at various granularities and structures. Co-designed with specialized accelerators, sparse models will become more efficient and accessible.
System & Hardware Support for Sparsity
• EIE: Weight Sparsity + Activation Sparsity for GEMM
• NVIDIA Tensor Core: M:N Weight Sparsity
• TorchSparse & PointAcc: Activation Sparsity

M:N Sparsity

[Fig. 1 from Accelerating Sparse Deep Neural Networks: the fine-grained structured-sparse matrix storage format. An uncompressed R × C matrix W with 2:4 sparsity (two nonzero weights out of every four consecutive weights) is stored as R × C/2 nonzero values plus R × C/2 2-bit indices of metadata.]

Accelerating Sparse Deep Neural Networks [Mishra et al., arXiv 2021]
M:N Sparsity
Push all the nonzero elements to the left in memory: save storage and computation.

[Fig. 1 (again): each group of four weights keeps its two nonzero values, packed to the left, plus 2-bit indices recording their original positions.]

Accelerating Sparse Deep Neural Networks [Mishra et al., arXiv 2021]
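A toy NumPy sketch of this compression, keeping the two largest-magnitude weights in every group of four, as magnitude pruning would (the packed layout is illustrative, not NVIDIA's exact bit format):

    import numpy as np

    def compress_2to4(W: np.ndarray):
        """Prune each group of 4 consecutive weights to its 2 largest
        magnitudes; return packed values and their 2-bit position indices."""
        R, C = W.shape
        groups = W.reshape(R, C // 4, 4)
        # Positions of the 2 largest-magnitude entries per group, kept in order.
        keep = np.sort(np.argsort(-np.abs(groups), axis=-1)[..., :2], axis=-1)
        values = np.take_along_axis(groups, keep, axis=-1).reshape(R, C // 2)
        indices = keep.reshape(R, C // 2).astype(np.uint8)  # each fits in 2 bits
        return values, indices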
System Support for M:N Sparsity
Mapping M:N sparse matrices onto NVIDIA Tensor Cores

[Fig. 2 from Accelerating Sparse Deep Neural Networks: mapping an M×N×K GEMM onto a Tensor Core. In the dense operation, a dense M×K matrix A multiplies a dense K×N matrix B into the M×N accumulator C. After pruning A with 2:4 sparsity, A shrinks to M×K/2 nonzero values plus 2-bit indices; the Sparse Tensor Core uses the indices to select the matching K/2 elements out of every K elements of B, so only 2 multiplications are done out of every 4. In both the dense and sparse GEMMs, B and C stay dense.]

Accelerating Sparse Deep Neural Networks [Mishra et al., arXiv 2021]
System Support for M:N Sparsity
Pruning CNNs with 2:4 sparsity brings a large speedup for GEMM workloads and does not incur an accuracy drop for DNN models.

[Fig. 3: comparison of sparse (cuSPARSELt) and dense (cuBLAS) INT8 GEMMs on NVIDIA A100 Tensor Cores, with GEMM-M = GEMM-N = 10240 and GEMM-K swept from 1280 to 20480 — larger GEMMs have higher arithmetic intensity and get closer to the full 2× speedup afforded by Sparse Tensor Cores.]

From the paper: for language modeling, N is often the sequence length times the batch size (for a sequence length of 256, a batch size of 40 reaches N = 10K), while M and K are related to the hidden dimensions of the layers, which are typically scaled up to increase accuracy — GPT-3, for example, uses a hidden size of 12,288. Image classification networks are trained in a single phase, so retraining simply repeats the training schedule (with the exact same hyper-parameters and learning-rate schedule), starting from the pruned trained weights.

Table 2. Top-1 accuracy of image classification networks on the ImageNet ILSVRC2012 dataset with 2:4 sparsity:

Network                 | Dense FP16 | Sparse FP16 | Sparse INT8
ResNet-34               | 73.7 | 73.9 | 73.7
ResNet-50               | 76.1 | 76.2 | 76.2
ResNet-50 (SWSL)        | 81.1 | 80.9 | 80.9
ResNet-101              | 77.7 | 78.0 | 77.9
ResNeXt-50-32x4         | 77.6 | 77.7 | 77.7
ResNeXt-101-32x16       | 79.7 | 79.9 | 79.9
ResNeXt-101-32x16 (WSL) | 84.2 | 84.0 | 84.2
DenseNet-121            | 75.5 | 75.3 | 75.3
DenseNet-161            | 78.8 | 78.8 | 78.9
Wide ResNet-50          | 78.5 | 78.6 | 78.5
Wide ResNet-101         | 78.9 | 79.2 | 79.1
Inception v3            | 77.1 | 77.1 | 77.1
Xception                | 79.2 | 79.2 | 79.2
VGG-11                  | 70.9 | 70.9 | 70.8
VGG-16                  | 74.0 | 74.1 | 74.1
VGG-19                  | 75.0 | 75.0 | 75.0
SUNet-128               | 75.6 | 76.0 | 75.4
SUNet-7-128             | 76.4 | 76.5 | 76.3
DRN-26                  | 75.2 | 75.3 | 75.3
DRN-105                 | 79.4 | 79.5 | 79.4

Accelerating Sparse Deep Neural Networks [Mishra et al., arXiv 2021]
System & Hardware Support for Sparsity
• EIE: Weight Sparsity + Activation Sparsity
• NVIDIA Tensor Core: M:N Weight Sparsity
• TorchSparse & PointAcc: Activation Sparsity
  • TorchSparse: Sparse Convolution Library
  • PointAcc: Hardware Accelerator for Sparse Convolution
Sparse Convolution on Sparse Inputs

Conventional convolution: input sparsity comes from ReLU; the nonzeros dilate as convolutions are stacked.
Sparse convolution: input sparsity (~0.01% nonzeros) comes from the point distribution in physical space; the nonzeros will not dilate.

[Figure: the nonzero pattern dilating under conventional convolution vs. staying fixed under submanifold sparse convolution]

Submanifold Sparse Convolutional Neural Networks [Graham, BMVC 2015]
Sparse convolution computation
A sparse set of dense MMAs, with rules defined by maps

Maps (In, Out, Wgt): conventional convolution vs. sparse convolution
(P0, Q0, W1,1)      → no compute
(P0, Q1, W1,0)      → no compute
(P0, Q2, W1,-1)     → no compute
(P0, Q3, W0,1)      → no compute
(P0, Q4, W0,0)      → (P0, Q0, W0,0)
(P0, Q5, W0,-1)     → no compute
(P0, Q8, W-1,1)     → no compute
(P0, Q9, W-1,0)     → no compute
(P0, Q10, W-1,-1)   → (P0, Q1, W-1,-1)

Computation: fOut = fOut + fIn × WWgt for each entry in the maps.
Conventional convolution: 9 matrix multiplications; sparse convolution: 2 matrix multiplications.

TorchSparse: Efficient Point Cloud Inference Engine [Tang et al., MLSys 2022]
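A minimal NumPy sketch of this execution rule (feature sizes are made up; only the two surviving map entries from the slide are computed):

```python
import numpy as np

Cin, Cout = 16, 32
feats_in = {"P0": np.random.randn(Cin).astype(np.float32)}
weights = {(0, 0): np.random.randn(Cin, Cout).astype(np.float32),
           (-1, -1): np.random.randn(Cin, Cout).astype(np.float32)}

maps = [("P0", "Q0", (0, 0)), ("P0", "Q1", (-1, -1))]  # surviving entries

feats_out = {}
for inp, out, wgt in maps:   # fOut = fOut + fIn x WWgt, one entry at a time
    acc = feats_out.get(out, np.zeros(Cout, dtype=np.float32))
    feats_out[out] = acc + feats_in[inp] @ weights[wgt]
```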
Existing GPU implementation of sparse convolution
Weight-stationary computation: a separate matmul for each weight offset

Workload maps (In, Out, Wgt):
(P0, Q1, W-1,-1), (P3, Q4, W-1,-1)
(P1, Q3, W-1,0)
(P0, Q0, W0,0), (P1, Q1, W0,0), (P2, Q2, W0,0), (P3, Q3, W0,0), (P4, Q4, W0,0)
(P3, Q1, W1,0)
(P1, Q0, W1,1), (P4, Q3, W1,1)

For each weight offset, gather the mapped input features into a buffer, multiply by the Cin × Cout weight, and scatter-accumulate the partial sums into the output features:
• W-1,-1: gather [P0, P3] (2 × Cin) → f1 = f1 + f0 × W-1,-1; f4 = f4 + f3 × W-1,-1
• W-1,0: gather [P1] (1 × Cin) → f3 = f3 + f1 × W-1,0
• W0,0: its maps contain all entries, so gather [P0 … P4] (5 × Cin) → fi = fi + fi × W0,0 for i = 0, 1, 2, 3, 4
• W1,0: gather [P3] (1 × Cin) → f1 = f1 + f3 × W1,0
• W1,1: gather [P1, P4] (2 × Cin) → f0 = f0 + f1 × W1,1; f3 = f3 + f4 × W1,1

TorchSparse: Efficient Point Cloud Inference Engine [Tang et al., MLSys 2022]
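A minimal NumPy sketch of this weight-stationary baseline (a sketch of the dataflow, not the actual GPU kernels), driven by the toy workload above:

```python
import numpy as np

def weight_stationary_sparse_conv(feats_in, weights, maps, n_out):
    """One gather -> matmul -> scatter-accumulate per weight offset."""
    c_out = next(iter(weights.values())).shape[1]
    feats_out = np.zeros((n_out, c_out), dtype=feats_in.dtype)
    for off in {w for _, _, w in maps}:
        ins = [i for i, _, w in maps if w == off]     # gather indices
        outs = [o for _, o, w in maps if w == off]    # scatter indices
        psum = feats_in[ins] @ weights[off]           # (n_off, Cout) matmul
        np.add.at(feats_out, outs, psum)              # scatter-accumulate
    return feats_out

maps = [(0, 1, (-1, -1)), (3, 4, (-1, -1)), (1, 3, (-1, 0)),
        (0, 0, (0, 0)), (1, 1, (0, 0)), (2, 2, (0, 0)),
        (3, 3, (0, 0)), (4, 4, (0, 0)), (3, 1, (1, 0)),
        (1, 0, (1, 1)), (4, 3, (1, 1))]
feats_in = np.random.randn(5, 16).astype(np.float32)
weights = {off: np.random.randn(16, 32).astype(np.float32)
           for off in {w for _, _, w in maps}}
out = weight_stationary_sparse_conv(feats_in, weights, maps, n_out=5)
```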
TorchSparse optimization overview

Dataflow: Gather → Matrix-Matrix Multiplication → Scatter-Accumulate
• Locality-aware access for the gather and scatter-accumulate stages
• Adaptive grouping for the matrix multiplications: small maps are zero-padded to a common size and batched into one BMM (here, the two-entry maps W-1,-1 and W1,1 are batched with the padded one-entry maps W-1,0 and W1,0), while W0,0, whose maps cover all points, is applied as a single MM

TorchSparse: Efficient Point Cloud Inference Engine [Tang et al., MLSys 2022]
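A sketch of the padding idea behind adaptive grouping (an illustration, not TorchSparse's implementation): gathered buffers are zero-padded to a common row count so one batched matmul replaces several tiny ones; the all-zero padded rows produce zero partial sums, which are simply dropped before scatter-accumulation.

```python
import numpy as np

def grouped_bmm(buffers, weight_stack):
    """buffers: list of (n_i, Cin) gathered inputs; weight_stack: (B, Cin, Cout)."""
    n_max = max(b.shape[0] for b in buffers)
    batch = np.stack([np.pad(b, ((0, n_max - b.shape[0]), (0, 0)))
                      for b in buffers])        # (B, n_max, Cin), zero-padded
    psums = np.einsum('bnc,bco->bno', batch, weight_stack)
    return [psums[k, :buffers[k].shape[0]] for k in range(len(buffers))]

# e.g., batch the two-entry maps (W-1,-1, W1,1) with the padded one-entry
# maps (W-1,0, W1,0); W0,0 would stay a single dense MM
Cin, Cout = 16, 32
bufs = [np.random.randn(n, Cin).astype(np.float32) for n in (2, 1, 1, 2)]
ws = np.random.randn(4, Cin, Cout).astype(np.float32)
partial_sums = grouped_bmm(bufs, ws)
```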
Trading computation for regularity

• Separate computation (baseline): one matmul per weight offset (7 × MM). No wasted computation, but many small kernel calls and low device utilization: worst regularity, zero computation overhead.
• Dense convolution: a single BMM with batch = 7. Best regularity, but the largest computation overhead, since every map is padded to full size.
• Computation with grouping: a mix, e.g., MM + BMM (batch = 4) + BMM (batch = 2). Small extra computation (2/28 in this example) with much better regularity.

TorchSparse: Efficient Point Cloud Inference Engine [Tang et al., MLSys 2022]
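The padding overhead of a grouping can be computed directly: each group runs a BMM at the size of its largest map, so smaller maps in the group waste rows. A sketch using the per-offset map sizes of the toy workload above and a hypothetical grouping:

```python
def extra_computation(map_sizes, groups):
    """Fraction of wasted MACs when each group is padded to its largest map.
    `groups` must partition the map indices."""
    total = sum(map_sizes)
    padded = sum(max(map_sizes[i] for i in g) * len(g) for g in groups)
    return (padded - total) / total

sizes = [2, 1, 5, 1, 2]   # entries per weight offset in the toy workload
# hypothetical grouping: W0,0 alone; the four small maps padded to 2 rows each
print(extra_computation(sizes, groups=[[2], [0, 1], [3, 4]]))  # 2/11 ≈ 0.18
```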
Trading computation for regularity
Searching a customized strategy for different models and datasets

[Plot: speedup over baseline vs. number of groups, annotated at 26, 13, 6, 3, and 1 groups (26 = 3³ − 1 weight offsets, assuming offset (0,0,0) is computed separately). Increasing regularity helps improve latency, but padding overhead hurts latency at the extreme of a single group.]
[Histograms: kernel map size vs. weight index (1-27) on SemanticKITTI and nuScenes; map sizes span roughly 10² to 10⁵ entries, so the best grouping depends on the model and dataset.]
Results on matrix multiplication optimizations
SemanticKITTI

[Bar chart: baseline 8.7 TFLOP/s (1.00× speedup); fixed grouping 8.1 TFLOP/s (0.87×); adaptive grouping 11.9 TFLOP/s (1.39×).]

TorchSparse: Efficient Point Cloud Inference Engine [Tang et al., MLSys 2022]
Results on matrix multiplication optimizations
nuScenes: fixed grouping has the best TFLOP/s, but adaptive grouping is faster

[Bar chart: baseline 10.4 TFLOP/s (1.00× speedup); fixed grouping 21.1 TFLOP/s (1.50×); adaptive grouping 16.9 TFLOP/s (1.54×).]

Fixed grouping reaches higher TFLOP/s only because it introduces a large amount of redundant computation.

TorchSparse: Efficient Point Cloud Inference Engine [Tang et al., MLSys 2022]
TorchSparse++: Overlapped memory and computation

• TorchSparse++ = Sparse Kernel Generator + Sparse Autotuner
• Applications: 3D segmentation, 3D detection, and 3D scene reconstruction (autonomous vehicles, iPhone, Vision Pro)
[Bar chart: geomean speed of MinkowskiEngine, SpConv 1.2.1 (FP16), TorchSparse (FP16), and SpConv 2.3.5 (FP16), normalized to TorchSparse++ (FP16) at 1.00, on A100, 3090, Orin, 2080 Ti, 3090-TF32, and 1080 Ti-FP32 for inference and on A100 and 2080 Ti for training; all baselines fall below 1.00.]

TorchSparse++ [Tang and Yang et al., MICRO 2023]
TorchSparse++: Overlapped memory and computation
Vanilla → + Row Reordering → + Column Splitting

[Figure: redundant computation in grouped matmuls drops from 12 (vanilla) to 10 (+ row reordering) to 8 (+ column splitting).]

TorchSparse++ [Tang and Yang et al., MICRO 2023]
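A sketch of why row reordering helps (toy numbers, not the paper's): grouping similar-sized maps together, e.g., by sorting them by size first, reduces the padding redundancy of fixed-size groups.

```python
def redundancy(sizes, group):
    """Wasted rows when consecutive maps are padded into groups of `group`."""
    chunks = [sizes[i:i + group] for i in range(0, len(sizes), group)]
    return sum(max(c) * len(c) - sum(c) for c in chunks)

sizes = [5, 1, 4, 2, 6, 3]                  # hypothetical per-map row counts
print(redundancy(sizes, group=2))           # vanilla order: 9 wasted rows
print(redundancy(sorted(sizes), group=2))   # after reordering: 3 wasted rows
```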
Mapping Unit
Merge sort can be used to find mappings in sparse convolution

Input point cloud: P0 (1,1), P1 (2,2), P2 (2,4), P3 (3,2), P4 (4,3)
Output point cloud (stride = 1): Q0 (1,1), Q1 (2,2), Q2 (2,4), Q3 (3,2), Q4 (4,3)

Shift the input by +(1, 1) for W-1,-1: (2,2), (3,3), (3,5), (4,3), (5,4)
Merge sort the shifted inputs with the outputs, then intersect equal coordinates: shifted P0 = (2,2) = Q1 and shifted P3 = (4,3) = Q4.

Maps (In, Out, Wgt):
(P0, Q1, W-1,-1)
(P3, Q4, W-1,-1)

PointAcc: Efficient Point Cloud Accelerator [Lin et al., MICRO 2021]
Mapping Unit
Merge sort can be used to find mappings in sparse convolution

Shift the input by (-1, -1) for W1,1: (0,0), (1,1), (1,3), (2,1), (3,2)
Merge sort the shifted inputs with the outputs, then intersect equal coordinates: shifted P1 = (1,1) = Q0 and shifted P4 = (3,2) = Q3.

Maps (In, Out, Wgt):
(P0, Q1, W-1,-1)
(P3, Q4, W-1,-1)
…
(P1, Q0, W1,1)
(P4, Q3, W1,1)

PointAcc: Efficient Point Cloud Accelerator [Lin et al., MICRO 2021]
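A Python sketch of the same procedure (the mapping unit does the sorting and intersection in hardware; here, sorting plus a two-pointer merge). It reproduces the maps from both slides; note that shifting the input by +(dx, dy) finds the matches for weight W-dx,-dy.

```python
def find_maps(in_coords, out_coords, shift):
    """Return (input_idx, output_idx, weight_offset) maps via sorted merge."""
    dx, dy = shift
    shifted = sorted((x + dx, y + dy, i) for i, (x, y) in enumerate(in_coords))
    outs = sorted((x, y, j) for j, (x, y) in enumerate(out_coords))
    maps, a, b = [], 0, 0
    while a < len(shifted) and b < len(outs):       # two-pointer merge
        if shifted[a][:2] == outs[b][:2]:           # intersection: a match
            maps.append((shifted[a][2], outs[b][2], (-dx, -dy)))
            a += 1; b += 1
        elif shifted[a][:2] < outs[b][:2]:
            a += 1
        else:
            b += 1
    return maps

pts = [(1, 1), (2, 2), (2, 4), (3, 2), (4, 3)]      # P0..P4 (= Q0..Q4)
print(find_maps(pts, pts, (1, 1)))    # [(0, 1, (-1, -1)), (3, 4, (-1, -1))]
print(find_maps(pts, pts, (-1, -1)))  # [(1, 0, (1, 1)), (4, 3, (1, 1))]
```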
PointAcc: Speedup and Energy Saving

[Bar charts: speedup and energy saving of PointAcc over an NVIDIA RTX 2080 Ti, an Intel Xeon Skylake + TPU V3, and an Intel Xeon Gold 6130, on PointNet, PointNet++ (c/p/s variants), DGCNN, F-PointNet++, MinkowskiNet (indoor and outdoor), and their geomean. Speedups range from about 2.4× to 269×; energy savings range from about 13× to 1,319×.]

PointAcc: Efficient Point Cloud Accelerator [Lin et al., MICRO 2021]
Summary of Today's Lecture
In this lecture, we introduced:
• Automated ways to find pruning ratios
• System and hardware support for different pruning granularities
We will cover in the next lecture:
• Numeric data types in modern computer systems
• Basic concept of neural network quantization
• Common neural network quantization methods
References
1. Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
2. Exploring the Granularity of Sparsity in Convolutional Neural Networks [Mao et al., CVPR-W]
3. Learning Structured Sparsity in Deep Neural Networks [Wen et al., NeurIPS 2016]
4. Learning Efficient Convolutional Networks through Network Slimming [Liu et al., ICCV 2017]
5. A Systematic DNN Weight Pruning Framework using Alternating Direction Method of Multipliers [Zhang et al., ECCV 2018]
6. AMC: AutoML for Model Compression and Acceleration on Mobile Devices [He et al., ECCV 2018]
7. Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT
8. EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
9. ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA [Han et al., FPGA 2017]
10. Block Sparse Format [NVIDIA, 2021]
11. Accelerating Sparse Deep Neural Networks [Mishra et al., arXiv 2021]