Lecture 04: Pruning and Sparsity, Part II

Song Han
Associate Professor, MIT
Distinguished Scientist, NVIDIA
@SongHan_MIT

MIT 6.5940: TinyML and Efficient Deep Learning Computing — https://efficientml.ai
Neural Network Pruning

• Introduction to Pruning
  • What is pruning?
  • How should we formulate pruning?

[Figure: a network before and after pruning; the number of pruning/sparsity publications per year from 1989 to 2019, growing from Optimal Brain Damage (1989) through Deep Compression and EIE to roughly 2,400 publications. Right: 2:4 sparsity in the A100 GPU — 2× peak performance, 1.5× measured BERT speedup.]

Source: https://github.com/mit-han-lab/pruning-sparsity-publications
Neural Network Pruning

• In general, we can formulate pruning as follows:

$$\arg\min_{W_P} L(x; W_P) \quad \text{subject to} \quad \lVert W_P \rVert_0 < N$$

where $L$ is the loss, $x$ is the input, $W_P$ are the weights of the pruned model, and $N$ is the target number of nonzero weights.
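To make the formulation concrete, here is a minimal sketch (assuming PyTorch; `prune_to_budget` is an illustrative name, not from the lecture) that satisfies the $\ell_0$ constraint by keeping only the $N$ largest-magnitude weights — magnitude being the importance criterion recapped on the next slides:

```python
import torch

def prune_to_budget(weight: torch.Tensor, num_nonzero: int) -> torch.Tensor:
    """Keep only the num_nonzero largest-magnitude entries of `weight`,
    zeroing the rest, so that ||W_P||_0 <= num_nonzero."""
    if num_nonzero <= 0:
        return torch.zeros_like(weight)
    flat = weight.abs().flatten()
    if num_nonzero >= flat.numel():
        return weight.clone()
    # Magnitude of the smallest weight that survives pruning.
    threshold = flat.topk(num_nonzero).values.min()
    mask = weight.abs() >= threshold  # ties at the threshold may keep a few extra
    return weight * mask
```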
Pruning at Different Granularities

The case of convolutional layers

• Some of the commonly used pruning granularities, ranging from irregular (fine-grained) to regular (structured):

[Figure: pruning patterns for a convolutional weight tensor with c_i = 2, c_o = 3, k_h = k_w = 3; preserved vs. pruned weights shown for granularities from irregular to regular.]

Exploring the Granularity of Sparsity in Convolutional Neural Networks [Mao et al., CVPR-W 2017]
Selection of Synapses to Prune

• When removing parameters from a neural network model, the less important the removed parameters are, the better the pruned network performs.
• Magnitude-based pruning considers weights with larger absolute values to be more important than other weights.
• For element-wise pruning,

Importance = |W|

• Example: for W = [1, −2, 0.1, 3] at 50% sparsity, the two smallest-magnitude weights (0.1 and 1) are pruned, leaving [0, −2, 0, 3].
Selection of Neurons to Prune

• When removing neurons from a neural network model, the less useful the removed neurons are, the better the pruned network performs.
• Recall that neuron pruning is a form of coarse-grained pruning.

[Figure: neuron pruning in a linear layer removes an entire row of the weight matrix; channel pruning in a convolution layer removes all the weights of a channel.]
Neural Network Pruning

• Introduction to Pruning
  • What is pruning?
  • How should we formulate pruning?
• Determine the Pruning Granularity
  • In what pattern should we prune the neural network?
• Determine the Pruning Criterion
  • What synapses/neurons should we prune?
• Determine the Pruning Ratio
  • What should the target sparsity be for each layer? (prune 30%? 50%? 70%?)
Recap

Non-uniform pruning is better than uniform shrinking.

[Figure: at the same latency (ms), a channel-pruned model with non-uniform per-layer ratios outperforms a uniformly scaled ("uniform shrink") model.]

AMC: AutoML for Model Compression and Acceleration on Mobile Devices [He et al., ECCV 2018]
Finding Pruning Ratios

Analyze the sensitivity of each layer

• We need different pruning ratios for each layer, since different layers have different sensitivity:
  • Some layers are more sensitive (e.g., the first layer)
  • Some layers are more redundant
• We can perform sensitivity analysis to determine the per-layer pruning ratio.
Finding Pruning Ratios

Analyze the sensitivity of each layer

• The process of sensitivity analysis (*VGG-11 on the CIFAR-10 dataset):
  • Pick a layer $L_i$ in the model.
  • Prune the layer $L_i$ with pruning ratio $r \in \{0, 0.1, 0.2, \ldots, 0.9\}$ (or other strides).
  • Observe the accuracy degradation $\Delta \mathrm{Acc}_r^i$ for each pruning ratio — the higher the pruning rate, the larger the accuracy loss.
  • Repeat the process for all layers.
• Some layers are less sensitive to pruning than others.

[Figure: accuracy (%) vs. pruning rate (percentage of weights pruned away, 10%–90%) for layers L0–L5; each layer's accuracy degrades at a different rate.]
Finding Pruning Ratios

Analyze the sensitivity of each layer

• Pick a degradation threshold $T$ such that the overall pruning rate is as desired.
• Each layer's pruning rate is then the largest rate whose accuracy stays above the threshold.

[Figure: the same per-layer accuracy curves with a horizontal threshold T; the intersection of each layer's curve with T determines that layer's pruning rate.]
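A minimal sketch of this scan, assuming a user-supplied `evaluate(model) -> accuracy` function and magnitude-based pruning of `nn.Conv2d`/`nn.Linear` weights (the helper names are illustrative, not from the slides):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def sensitivity_scan(model: nn.Module, evaluate, ratios=None):
    """Prune one layer at a time at each ratio, record validation
    accuracy, then restore the layer before moving on."""
    ratios = ratios or [r / 10 for r in range(1, 10)]         # 10% ... 90%
    curves = {}
    for name, module in model.named_modules():
        if not isinstance(module, (nn.Conv2d, nn.Linear)):
            continue
        original = module.weight.detach().clone()
        curve = []
        for r in ratios:
            k = max(1, int(original.numel() * (1 - r)))       # weights to keep
            thresh = original.abs().flatten().topk(k).values.min()
            module.weight.copy_(original * (original.abs() >= thresh))
            curve.append((r, evaluate(model)))
            module.weight.copy_(original)                     # restore the layer
        curves[name] = curve
    return curves

def pick_ratios(curves, acc_threshold):
    """Largest pruning ratio per layer whose accuracy stays above threshold T."""
    return {name: max([r for r, acc in curve if acc >= acc_threshold], default=0.0)
            for name, curve in curves.items()}
```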
Finding Pruning Ratios

Analyze the sensitivity of each layer

• Is this optimal?
  • Maybe not: sensitivity analysis does not consider the interaction between layers.
• Can we go beyond such heuristics?
  • Yes!
Automatic Pruning

• Given an overall compression ratio, how do we choose the per-layer pruning ratios?
  • Sensitivity analysis ignores the interaction between layers → sub-optimal.
• Conventionally, this process has relied on human expertise and trial and error.

AMC: AutoML for Model Compression and Acceleration on Mobile Devices [He et al., ECCV 2018]
Automatic Pruning

• Can we develop a push-the-button solution?
• AMC: model compression by AI — automated, higher compression rate, faster.
  • Agent: DDPG (an actor–critic reinforcement learning agent). At each step, the actor outputs an action for layer $t$: compress it with sparsity ratio $a_t$ (e.g., 50%), then move on to layer $t+1$.
  • Environment: channel pruning. The compressed network is evaluated to produce the reward.
  • Reward:

$$R = \begin{cases} -\text{Error}, & \text{if the model satisfies the constraints} \\ -\infty, & \text{if not} \end{cases}$$

  • A FLOPs-aware variant, $R = -\text{Error} \cdot \log(\text{FLOPs})$, trades accuracy against computation.
• We can also optimize for latency constraints with a pre-built lookup table (LUT).

AMC: AutoML for Model Compression and Acceleration on Mobile Devices [He et al., ECCV 2018]
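The two reward formulations from the slide are simple enough to state directly in code (a sketch; the function names are illustrative):

```python
import math

def amc_reward(error: float, meets_constraints: bool) -> float:
    """Resource-constrained reward from the slide:
    R = -Error if the compressed model satisfies the constraints, else -inf."""
    return -error if meets_constraints else float("-inf")

def flops_aware_reward(error: float, flops: float) -> float:
    """Alternative reward from the slide, R = -Error * log(FLOPs):
    both lower error and fewer FLOPs increase the reward."""
    return -error * math.log(flops)
```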
AMC: AutoML for Model Compression

[Figure: the pruning policy (sparsity ratio per layer) given by the reinforcement learning agent for ResNet-50 — ResNet-50 density pruned by a human expert vs. pruned by AMC (the lower, the better).]
NetAdapt

A rule-based iterative/progressive method

• The goal of NetAdapt is to find a per-layer pruning ratio that meets a global resource constraint (e.g., latency, energy, …).
• The process is done iteratively.
• We take a latency constraint as an example below.

NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications [Yang et al., ECCV 2018]
NetAdapt

• In each iteration, we aim to reduce the latency by a certain amount ΔR (manually defined).
• For each layer $L_k$ (k in A–Z in the figure), starting from the original model:
  • Prune the layer so that the latency reduction meets ΔR (based on a pre-built latency lookup table).
  • Short-term fine-tune the pruned candidate and measure its accuracy.
• Keep the candidate model with the highest accuracy, and repeat until the latency target is met; finally, long-term fine-tune to recover accuracy.
• The iterative nature allows us to obtain a series of models with different costs: #models = #iterations.

NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications [Yang et al., ECCV 2018]
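The loop can be sketched as below. All helpers (`latency_of`, `prune_layer_to_meet`, `short_term_finetune`, `evaluate`, `long_term_finetune`) are assumed to be supplied by the user; this is an outline of the procedure, not the paper's implementation:

```python
import copy

def netadapt(model, layers, delta_r, target_latency, latency_of,
             prune_layer_to_meet, short_term_finetune, evaluate,
             long_term_finetune):
    """One NetAdapt-style search: per iteration, try pruning each layer to
    meet the latency budget, keep the most accurate candidate."""
    series = []                                        # one model per iteration
    while latency_of(model) > target_latency:
        budget = latency_of(model) - delta_r           # latency goal this iteration
        candidates = []
        for layer in layers:
            cand = copy.deepcopy(model)
            prune_layer_to_meet(cand, layer, budget)   # uses the latency LUT
            short_term_finetune(cand)
            candidates.append((evaluate(cand), cand))
        _, model = max(candidates, key=lambda t: t[0]) # keep the best candidate
        series.append(model)                           # #models == #iterations
    long_term_finetune(model)
    return model, series
```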
Fine-tuning Pruned Neural Networks

• After pruning, the model accuracy may decrease, especially at larger pruning ratios.
• Fine-tuning the pruned neural network helps recover the accuracy and push the pruning ratio higher.
• The learning rate for fine-tuning is usually 1/100 or 1/10 of the original learning rate.

[Figure: the Train Connectivity → Prune Connections pipeline, and accuracy loss (0.5% to −3.5%) vs. pruning ratio for pruning alone vs. pruning + fine-tuning; fine-tuning substantially reduces the accuracy loss.]

Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
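A minimal sketch of masked fine-tuning, assuming a precomputed binary mask per weight tensor; re-applying the masks after each optimizer step keeps pruned weights at zero (names are illustrative):

```python
import torch

def finetune_pruned(model, masks, loader, loss_fn, base_lr=0.01, epochs=1):
    """Fine-tune a pruned model while keeping pruned weights at zero.

    `masks` maps parameter name -> binary tensor (1 = kept, 0 = pruned).
    LR is 1/10 of the original, per the slide's rule of thumb."""
    opt = torch.optim.SGD(model.parameters(), lr=base_lr / 10, momentum=0.9)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
            with torch.no_grad():              # re-apply masks after the step
                for name, p in model.named_parameters():
                    if name in masks:
                        p.mul_(masks[name])
    return model
```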
Iterative Pruning

• Consider one round of pruning followed by fine-tuning as one iteration.
• Iterative pruning gradually increases the target sparsity in each iteration (e.g., 30% pruned → 50% pruned → 70% pruned).
• Compared with single-step aggressive pruning, this boosts the pruning ratio from 5× to 9× on AlexNet.

[Figure: Train Connectivity → Prune Connections → fine-tune, iterated; accuracy loss vs. pruning ratio for pruning alone vs. pruning + fine-tuning.]

Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
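A sketch of the iterative schedule, reusing `finetune_pruned` from the sketch above (again illustrative, not the paper's code):

```python
import torch

def iterative_prune(model, loader, loss_fn, sparsities=(0.3, 0.5, 0.7)):
    """One pruning + fine-tuning round per target sparsity,
    gradually increasing sparsity (30% -> 50% -> 70%)."""
    for s in sparsities:
        masks = {}
        with torch.no_grad():
            for name, p in model.named_parameters():
                if p.dim() < 2:                        # skip biases / norm params
                    continue
                k = max(1, int(p.numel() * (1 - s)))   # weights to keep
                thresh = p.abs().flatten().topk(k).values.min()
                masks[name] = (p.abs() >= thresh).float()
                p.mul_(masks[name])                    # prune in place
        finetune_pruned(model, masks, loader, loss_fn)  # recover accuracy
    return model
```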
Regularization

• When training neural networks or fine-tuning pruned neural networks, a regularization term is added to the loss to:
  • penalize non-zero parameters
  • encourage smaller parameters
• The most common regularizers for improving pruning are L1 and L2 regularization:
  • L1 regularization: $L' = L(x; W) + \lambda |W|$
  • L2 regularization: $L' = L(x; W) + \lambda \lVert W \rVert^2$
• Examples:
  • Magnitude-based fine-grained pruning applies L2 regularization on weights.
  • Network Slimming applies smooth-L1 regularization on channel scaling factors.

Learning Efficient Convolutional Networks through Network Slimming [Liu et al., ICCV 2017]
Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
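Both regularizers amount to one extra term on the task loss; a minimal PyTorch sketch (illustrative helper name):

```python
import torch

def loss_with_regularization(task_loss, model, lam=1e-4, kind="l1"):
    """L' = L + lambda * |W|  (L1)  or  L' = L + lambda * ||W||^2  (L2).

    L1 pushes weights toward exact zero, which directly helps pruning;
    L2 keeps weights small, matching the magnitude-based criterion."""
    reg = sum(
        p.abs().sum() if kind == "l1" else p.pow(2).sum()
        for p in model.parameters()
        if p.dim() > 1  # regularize weight matrices/tensors only
    )
    return task_loss + lam * reg
```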
System & Hardware Support for Sparsity

• EIE: Weight Sparsity + Activation Sparsity
• NVIDIA Tensor Core: M:N Weight Sparsity
• TorchSparse & PointAcc: Activation Sparsity
Proposed Paradigm

• Conventional: Training → Inference (slow, power-hungry)
• Proposed: Training → Model Compression → Accelerated Inference (fast, power-efficient)

Han et al., NeurIPS'15; Han et al., ICLR'16 (best paper award); Han et al., ISCA'16; Han et al., FPGA'17 (best paper award); Han et al., ICLR'17
EIE: Efficient Inference Engine

The First DNN Accelerator for Sparse, Compressed Models

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
EIE: Parallelization on Sparsity

• The rows of the sparse weight matrix are interleaved across processing elements (PE0–PE3 below, with rows 4–7 wrapping around), coordinated by a central control unit; each PE computes only the output entries it owns:

$$
\begin{pmatrix}
w_{0,0} & w_{0,1} & 0 & w_{0,3}\\
0 & 0 & w_{1,2} & 0\\
0 & w_{2,1} & 0 & w_{2,3}\\
0 & 0 & 0 & 0\\
0 & 0 & w_{4,2} & w_{4,3}\\
w_{5,0} & 0 & 0 & 0\\
0 & 0 & 0 & w_{6,3}\\
0 & w_{7,1} & 0 & 0
\end{pmatrix}
\begin{pmatrix} 0 \\ a_1 \\ 0 \\ a_3 \end{pmatrix}
=
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \\ b_7 \end{pmatrix}
\xrightarrow{\ \mathrm{ReLU}\ }
\begin{pmatrix} b_0 \\ b_1 \\ 0 \\ b_3 \\ 0 \\ b_5 \\ b_6 \\ 0 \end{pmatrix}
$$

• Logically, each PE stores only its own nonzero weights in a compressed sparse column (CSC) layout; e.g., PE0's slice (rows 0 and 4) has column pointers 0, 1, 2, 3, 5.

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
Dataflow

• Rule of thumb: 0 × A = 0 and W × 0 = 0.
• Zero activations are skipped entirely, and zero weights are never stored or fetched, so only (nonzero activation) × (nonzero weight) pairs are ever computed. In the example above, only $a_1$ and $a_3$ are scanned, and each PE touches only the stored nonzeros in the corresponding columns.

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
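A software sketch of this dataflow, assuming CSC storage (`col_ptr[j]:col_ptr[j+1]` delimits column j's nonzeros), exploiting both sparsities in NumPy:

```python
import numpy as np

def sparse_matvec_csc(n_rows, col_ptr, row_idx, values, a):
    """Compute b = ReLU(W @ a) with W in compressed sparse column (CSC) form.

    Zero activations are skipped (0 * A = 0), and CSC storage means the
    zeros of W are never even read (W * 0 = 0) -- the two rules of thumb."""
    b = np.zeros(n_rows)
    for j, a_j in enumerate(a):
        if a_j == 0.0:                        # activation sparsity: skip column
            continue
        for k in range(col_ptr[j], col_ptr[j + 1]):
            b[row_idx[k]] += values[k] * a_j  # weight sparsity: stored nonzeros only
    return np.maximum(b, 0.0)                 # ReLU, as in the example
```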
Micro Architecture for Each PE

[Figure: PE pipeline — Pointer Read (even/odd pointer SRAM banks, column start/end), Sparse Matrix Access (sparse matrix SRAM, weight decoder, address accumulation turning relative indices into absolute addresses), Arithmetic Unit (accumulation with a bypass path), and Act R/W (source/destination activation registers, activation SRAM, ReLU).]

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
Load Balance

[Figure: an activation queue (activation value + index) feeds the 4×4 PE array under central control, decoupling the PEs so that the uneven number of nonzeros per PE does not stall the array.]

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
Pipeline Walk-through

[Figures: step-by-step walk-through of the PE pipeline — Activation Sparsity (pointer read), Weight Sparsity, Weight Sharing, Arithmetic & Write Back, ReLU and Non-zero Detection, and What's Special about EIE.]

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
Benchmark

• CPU: Intel Core i7-5930K
• GPU: NVIDIA TitanX
• Mobile GPU: NVIDIA Jetson TK1

| Layer           | Size         | Weight Density | Activation Density | FLOP Reduction | Description                       |
|-----------------|--------------|----------------|--------------------|----------------|-----------------------------------|
| AlexNet-6       | 4096 × 9216  | 9%             | 35%                | 33×            | AlexNet for image classification  |
| AlexNet-7       | 4096 × 4096  | 9%             | 35%                | 33×            |                                   |
| AlexNet-8       | 1000 × 4096  | 25%            | 38%                | 10×            |                                   |
| VGG-6           | 4096 × 25088 | 4%             | 18%                | 100×           | VGG-16 for image classification   |
| VGG-7           | 4096 × 4096  | 4%             | 37%                | 50×            |                                   |
| VGG-8           | 1000 × 4096  | 23%            | 41%                | 10×            |                                   |
| NeuralTalk-We   | 600 × 4096   | 10%            | 100%               | 10×            | RNN and LSTM for image captioning |
| NeuralTalk-Wd   | 8791 × 600   | 11%            | 100%               | 10×            |                                   |
| NeuralTalk-LSTM | 2400 × 1201  | 10%            | 100%               | 10×            |                                   |

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
Comparison: Throughput

[Figure: throughput (log scale) across platforms — Core-i7 5930k CPU (22nm), TitanX GPU (28nm), Tegra K1 mGPU (28nm), A-Eye FPGA (28nm), DaDianNao ASIC (28nm), TrueNorth ASIC (28nm), EIE ASIC (45nm, 64 PEs), and EIE ASIC (28nm, 256 PEs); EIE achieves the highest throughput.]

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
Comparison: Energy Efficiency

[Figure: energy efficiency (layers/J, log scale) across the same platforms; EIE is orders of magnitude more energy-efficient than the CPU, GPU, mGPU, and prior ASICs.]

EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
Top-5 Most Cited Papers in 50 Years of ISCA

| Rank | Citations | Year | Title (★ means it won the ISCA Influential Paper Award)                        | First Author + HOF Authors          | Type  | Topic            |
|------|-----------|------|---------------------------------------------------------------------------------|-------------------------------------|-------|------------------|
| 1    | 5351      | 1995 | The SPLASH-2 programs: Characterization and methodological considerations        | Stephen Woo, Anoop Gupta            | Tool  | Benchmark        |
| 2    | 4214      | 2017 | In-datacenter performance analysis of a Tensor Processing Unit                   | Norm Jouppi, David Patterson        | Arch  | Machine Learning |
| 3    | 3834      | 2000 | ★ Wattch: A framework for architectural-level power analysis and optimizations   | David Brooks, Margaret Martonosi    | Tool  | Power            |
| 4    | 3386      | 1993 | ★ Transactional memory: Architectural support for lock-free data structures      | Maurice Herlihy                     | Micro | Parallelism      |
| 5    | 2690      | 2016 | EIE: Efficient inference engine on compressed deep neural network                | Song Han, Bill Dally, Mark Horowitz | Arch  | Machine Learning |
| 6    | 2620      | 2007 | ★ Power provisioning for a warehouse-sized computer                              | Xiaobo Fan, Luiz Barroso            | Micro | Power            |
The first principle of efficient AI computing is to be lazy: avoid redundant computation, quickly reject the work, or delay the work.

We envision that future AI models will be sparse at various granularities and structures. Co-designed with specialized accelerators, sparse models will become more efficient and accessible.
System & Hardware Support for Sparsity

• EIE: Weight Sparsity + Activation Sparsity for GEMM
• NVIDIA Tensor Core: M:N Weight Sparsity
• TorchSparse & PointAcc: Activation Sparsity
M:N Sparsity

• Fine-grained structured-sparse matrix format: two weights are zero out of every four consecutive weights (2:4 sparsity).
• Storage: the uncompressed matrix is of dimension R × C; the compressed matrix is of dimension R × C/2, stored as R × C/2 nonzero values plus R × C/2 two-bit metadata indices.
• Push all the nonzero elements to the left in memory: this saves both storage and computation.

[Figure: structured-sparse matrix (W) storage format — zero entries are dropped, leaving only the nonzero values and their 2-bit column indices.]

• Sparse operation on Tensor Core: the A matrix (sparse, M × K, compressed to M × K/2) multiplies the B matrix (dense, K × N). A Select unit uses the 2-bit metadata to choose the matching K/2 elements out of each K elements of B, halving the work of the dense operation.

Accelerating Sparse Deep Neural Networks [Mishra et al., arXiv 2021]
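A NumPy sketch of producing a 2:4 pattern by magnitude (illustrative; the compressed values + 2-bit indices layout would be generated from the resulting mask):

```python
import numpy as np

def prune_2_of_4(w: np.ndarray) -> np.ndarray:
    """Zero the two smallest-magnitude weights in every group of four
    consecutive weights along each row, producing a 2:4 sparse matrix."""
    rows, cols = w.shape
    assert cols % 4 == 0, "row length must be a multiple of 4"
    groups = w.reshape(rows, cols // 4, 4).copy()
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]  # 2 smallest |w| per group
    np.put_along_axis(groups, drop, 0.0, axis=-1)
    return groups.reshape(rows, cols)
```

The surviving two values per group, packed to the left with their 2-bit in-group positions, form the R × C/2 compressed matrix described above.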
System & Hardware Support for Sparsity

• EIE: Weight Sparsity + Activation Sparsity
• NVIDIA Tensor Core: M:N Weight Sparsity
• TorchSparse & PointAcc: Activation Sparsity
  • TorchSparse: Sparse Convolution Library
  • PointAcc: Hardware Accelerator for Sparse Convolution
Sparse Convolution on Sparse Inputs

• Point-cloud inputs are extremely sparse (density can be around 0.01%), so conventional dense convolution wastes almost all of its work on empty space.
• Sparse convolution instead builds maps — a list of (In, Out, Wgt) tuples — and performs the computation (f_Out = f_Out + f_In × W_Wgt) for each entry in the maps.

TorchSparse: Efficient Point Cloud Inference Engine [Tang et al., MLSys 2022]
Sparse Convolution Computation

A sparse set of dense MMAs, with rules defined by the maps

• Unlike conventional convolution, which slides the kernel over every location, sparse convolution only computes (f_Out = f_Out + f_In × W_Wgt) for the (In, Out, Wgt) entries recorded in the maps.

TorchSparse: Efficient Point Cloud Inference Engine [Tang et al., MLSys 2022]
Existing GPU Implementation of Sparse Convolution

Weight-stationary computation: a separate matmul for each weight offset

• Workload — maps of (In, Out, Wgt):
  (P0, Q1, W-1,-1), (P3, Q4, W-1,-1), (P1, Q3, W-1,0), (P0, Q0, W0,0), (P1, Q1, W0,0), (P2, Q2, W0,0), (P3, Q3, W0,0), (P4, Q4, W0,0), (P3, Q1, W1,0), (P1, Q0, W1,1), (P4, Q3, W1,1)
• For each weight offset, gather the input features that use it into an input buffer, multiply by that offset's Cin × Cout weight matrix, and scatter-accumulate the partial sums into the output features:
  • W-1,-1 (2 × Cin buffer): f1 = f1 + f0 × W-1,-1 and f4 = f4 + f3 × W-1,-1
  • W-1,0 (1 × Cin buffer): f3 = f3 + f1 × W-1,0
  • W0,0 (5 × Cin buffer): fi = fi + fi × W0,0 for i = 0, 1, 2, 3, 4
  • W1,0 (1 × Cin buffer): f1 = f1 + f3 × W1,0
  • W1,1 (2 × Cin buffer): f0 = f0 + f1 × W1,1 and f3 = f3 + f4 × W1,1

TorchSparse: Efficient Point Cloud Inference Engine [Tang et al., MLSys 2022]
TorchSparse Optimization Overview

[Figure: TorchSparse gathers the input features used by each weight offset, pads the gathered batches to a common size, applies a batched matrix multiply (BMM) for the off-center weights, applies a single dense MM for W0,0, and scatters the partial sums back into the output features.]

TorchSparse: Efficient Point Cloud Inference Engine [Tang et al., MLSys 2022]
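A NumPy sketch of the gather → matmul → scatter structure (shown here in its per-offset, weight-stationary form, which TorchSparse then batches; names are illustrative):

```python
import numpy as np

def sparse_conv_layer(feats_in, maps, weights, n_out):
    """Sparse convolution as gather -> matmul -> scatter.

    feats_in: (n_in, c_in) features of the input points
    maps:     {offset: list of (in_idx, out_idx)} built from the coordinates
    weights:  {offset: (c_in, c_out) weight matrix for that kernel offset}
    """
    c_out = next(iter(weights.values())).shape[1]
    feats_out = np.zeros((n_out, c_out))
    for offset, pairs in maps.items():
        in_idx = [i for i, _ in pairs]
        out_idx = [o for _, o in pairs]
        gathered = feats_in[in_idx]           # gather the rows this offset uses
        psum = gathered @ weights[offset]     # one dense matmul per offset
        np.add.at(feats_out, out_idx, psum)   # scatter-accumulate partial sums
    return feats_out
```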
Trading Computation for Regularity

• Separate computation (baseline): one matmul per weight offset — no redundant computation, but many small kernel calls and low device utilization (the worst regularity).
• Dense convolution: batch all offsets into a single BMM (batch = 7 in the example) — the best regularity, but a large computation overhead from padding every offset to the same size.
• Computation with grouping: pad offsets into a few equal-sized groups, balancing overhead and regularity — in the example, the extra computation is only 2/28 (a small overhead).

TorchSparse: Efficient Point Cloud Inference Engine [Tang et al., MLSys 2022]
Trading Computation for Regularity

Searching a customized grouping strategy for each model and dataset

[Figure: overhead vs. number of groups (from 1 to 27), and the distribution of map sizes per weight index on SemanticKITTI and nuScenes; different datasets favor different group counts.]
Results on Matrix Multiplication Optimizations

[Figure: on SemanticKITTI, achieved TFLOP/s (up to ~11.9 vs. ~8.1 for the baseline) and normalized speedup (up to ~1.39×) for the different matrix multiplication strategies.]

TorchSparse: Efficient Point Cloud Inference Engine [Tang et al., MLSys 2022]
Results on Matrix Multiplication Optimizations

nuScenes: fixed grouping has the best TFLOP/s, but adaptive grouping is faster

[Figure: on nuScenes, fixed grouping reaches the highest raw throughput (~21.1 TFLOP/s), yet adaptive grouping achieves the best normalized speedup (~1.54×).]

• This is because fixed grouping introduces a large amount of redundant computation.

TorchSparse: Efficient Point Cloud Inference Engine [Tang et al., MLSys 2022]
TorchSparse++: Overlapped Memory and Computation

[Figure: TorchSparse++ adds a sparse autotuner and overlaps memory access with computation; example applications include 3D segmentation, 3D detection, and 3D scene reconstruction on sparse point clouds.]
Mapping Unit

Merge sort can be used to find the mappings in sparse convolution

• To build the (In, Out, Wgt) maps for a weight offset, shift the input coordinates by that offset (e.g., add (−1, −1) for w1,1 with stride = 1), merge-sort the shifted input coordinates together with the output coordinates, and emit a map entry wherever a shifted input coordinate equals an output coordinate.
• Example: input points P0–P4 at (1,1), (2,2), (2,4), (3,2), (4,3) shifted by (−1, −1) become (0,0), (1,1), (1,3), (2,1), (3,2); merging with outputs Q0–Q4 at (1,1), (2,2), (2,4), (3,2), (4,3) matches P1 with Q0 and P4 with Q3, yielding the map entries (P1, Q0, W1,1) and (P4, Q3, W1,1).

PointAcc: Efficient Point Cloud Accelerator [Lin et al., MICRO 2021]
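A small Python sketch of the coordinate-matching step. PointAcc's hardware performs a merge sort over the two sorted coordinate lists; a hash lookup is the software stand-in used here for the same match (names are illustrative):

```python
def build_maps(in_coords, out_coords, offsets):
    """Build (in_idx, out_idx, offset) map entries for sparse convolution."""
    out_index = {c: i for i, c in enumerate(out_coords)}
    maps = []
    for dy, dx in offsets:
        for i, (y, x) in enumerate(in_coords):
            shifted = (y + dy, x + dx)        # e.g., add (-1, -1) for w_{1,1}
            if shifted in out_index:          # shifted input lands on an output
                maps.append((i, out_index[shifted], (dy, dx)))
    return maps

# The slide's example: P0-P4 and Q0-Q4 share coordinates; offset (-1, -1)
# (for w_{1,1}) matches P1 -> Q0 and P4 -> Q3.
points = [(1, 1), (2, 2), (2, 4), (3, 2), (4, 3)]
print(build_maps(points, points, offsets=[(-1, -1)]))
# -> [(1, 0, (-1, -1)), (4, 3, (-1, -1))]
```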
PointAcc: Speedup and Energy Saving

[Figure: PointAcc's speedup and energy saving over an NVIDIA RTX 2080Ti, an Intel Xeon Skylake + TPU V3, and an Intel Xeon Gold 6130, across PointNet, PointNet++ variants, DGCNN, F-PointNet++, MinkowskiNet variants, and their geometric mean — up to two to three orders of magnitude on some workloads.]

PointAcc: Efficient Point Cloud Accelerator [Lin et al., MICRO 2021]
Summary of Today's Lecture

In this lecture, we introduced:
• Automated ways to find per-layer pruning ratios (sensitivity analysis, AMC, NetAdapt)
• Fine-tuning, iterative pruning, and regularization to recover accuracy and push sparsity higher
• System & hardware support for sparsity: EIE, M:N sparsity on NVIDIA Tensor Cores, and TorchSparse & PointAcc
References

1. Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
2. Exploring the Granularity of Sparsity in Convolutional Neural Networks [Mao et al., CVPR-W 2017]
3. Learning Structured Sparsity in Deep Neural Networks [Wen et al., NeurIPS 2016]
4. Learning Efficient Convolutional Networks through Network Slimming [Liu et al., ICCV 2017]
5. A Systematic DNN Weight Pruning Framework using Alternating Direction Method of Multipliers [Zhang et al., ECCV 2018]
6. AMC: AutoML for Model Compression and Acceleration on Mobile Devices [He et al., ECCV 2018]
7. Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT
8. EIE: Efficient Inference Engine on Compressed Deep Neural Network [Han et al., ISCA 2016]
9. ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA [Han et al., FPGA 2017]
10. Block Sparse Format [NVIDIA, 2021]
11. Accelerating Sparse Deep Neural Networks [Mishra et al., arXiv 2021]