Tutorial on DNN 6 of 9: Network and Hardware Co-Design
Cost of Operations
Operation              Energy (pJ)   Area (µm²)
8b Add                 0.03          36
16b Add                0.05          67
32b Add                0.1           137
16b FP Add             0.4           1360
32b FP Add             0.9           4184
8b Mult                0.2           282
32b Mult               3.1           3495
16b FP Mult            1.1           1640
32b FP Mult            3.7           7700
32b SRAM Read (8KB)    5             N/A
32b DRAM Read          640           N/A
(Relative energy and area costs span roughly 1 to 10^4 on a log scale.)
[Horowitz, “Computing’s Energy Problem (and what we can do about it)”, ISSCC 2014]
Number Representation
Format   Bit layout                              Range              Accuracy
FP32     1 sign, 8 exponent, 23 mantissa bits    10^-38 – 10^38     0.000006%
FP16     1 sign, 5 exponent, 10 mantissa bits    6x10^-5 – 6x10^4   0.05%
Int32    1 sign, 31 mantissa bits                0 – 2x10^9         1/2
Int16    1 sign, 15 mantissa bits                0 – 6x10^4         1/2
Int8     1 sign, 7 mantissa bits                 0 – 127            1/2
Fixed Point
Layout: sign (1-bit) | mantissa (7-bits)
8-bit fixed example: 0 1100110 → s = 0, m = 102
Splitting the mantissa into 4 integer bits (1100) and 3 fractional bits (110) gives
1100.110 in binary = 102 / 2^3 = 12.75
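As a quick illustration, here is a minimal Python sketch (assuming the same 8-bit layout as above: 1 sign bit, 4 integer bits, 3 fractional bits) of converting a real value to and from this fixed-point format:

```python
def to_fixed(x, frac_bits=3, total_bits=8):
    """Quantize x to signed fixed point: the stored mantissa is x * 2^frac_bits,
    rounded and clipped to the range representable in total_bits bits."""
    scale = 1 << frac_bits                       # 2^3 = 8
    max_m = (1 << (total_bits - 1)) - 1          # 127 for 8 bits
    m = int(round(x * scale))
    return max(-max_m - 1, min(max_m, m))

def from_fixed(m, frac_bits=3):
    """Recover the real value from the stored mantissa."""
    return m / (1 << frac_bits)

m = to_fixed(12.75)                  # m = 102, i.e. binary 1100110
print(m, bin(m), from_fixed(m))      # 102 0b1100110 12.75
```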
N-bit Precision
An N-bit weight and an N-bit activation feed an NxN multiplier, producing a 2N-bit
product. Products are accumulated in a wider (2N+M)-bit register, and the accumulated
sum is quantized back to N bits at the output.
For no loss in precision, M is determined based on the largest filter size
(in the range of 10 to 16 bits for popular DNNs).
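A minimal sketch of that datapath in Python (N = 8 here; the per-layer output shift is an assumed requantization parameter, not something the slide specifies):

```python
import numpy as np

N = 8   # weight / activation bit-width

def mac_and_quantize(weights, activations, out_shift):
    """Accumulate N-bit x N-bit products in a wide register, then requantize
    back to N bits with a per-layer right shift (out_shift is assumed to be
    chosen offline so the output fits in N bits)."""
    acc = 0                                   # models the (2N+M)-bit accumulator
    for w, a in zip(weights, activations):
        acc += int(w) * int(a)                # each product fits in 2N bits
    out = acc >> out_shift                    # quantize back down to N bits
    return int(np.clip(out, -(1 << (N - 1)), (1 << (N - 1)) - 1))

w = np.random.randint(-128, 128, size=100)    # N-bit weights
a = np.random.randint(-128, 128, size=100)    # N-bit activations
print(mac_and_quantize(w, a, out_shift=12))
```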
Dynamic Fixed Point
Floating point: sign | exponent (8-bits) | mantissa (23-bits)
Fixed point: sign | mantissa (7-bits); with dynamic fixed point, the scale factor
(radix point position) can differ between groups of values.
[Figure: Top-1 accuracy of CaffeNet on ImageNet vs. bit-width]
Reduce bit-width → shorter critical path → reduce voltage:
power reduction of 2.56x vs. 16-bit fixed on AlexNet Layer 2.
Scale factors (α, β_i) can change per layer or position in the filter.
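A minimal sketch of dynamic fixed point in Python (the per-tensor rule for picking the scale factor is my assumption): every value in a group shares the 8-bit sign + mantissa format, but the scale factor is chosen per group (e.g. per layer) from the group's data range.

```python
import numpy as np

def dynamic_fixed_point(x, bits=8):
    """Quantize a tensor to `bits`-bit dynamic fixed point: one shared
    power-of-two scale factor per tensor, chosen so the largest magnitude
    still fits in the signed mantissa."""
    max_m = (1 << (bits - 1)) - 1                             # 127 for 8 bits
    frac_bits = int(np.floor(np.log2(max_m / np.max(np.abs(x)))))
    scale = 2.0 ** frac_bits
    q = np.clip(np.round(x * scale), -max_m - 1, max_m)
    return q.astype(np.int8), frac_bits                       # mantissas + shared scale

weights = np.random.randn(256) * 0.05                         # stand-in for one layer
q, frac_bits = dynamic_fixed_point(weights)
dequant = q.astype(np.float32) / 2.0 ** frac_bits
print(frac_bits, np.max(np.abs(weights - dequant)))           # scale and max error
```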
Non-Linear Quantization
• Precision refers to the number of levels
– Number of bits = log2 (number of levels)
Computed Non-Linear Quantization
Log-domain quantization: values are quantized to powers of two, so multiplications
become shifts. Two variants: only the activations in the log domain, or both weights
and activations in the log domain.
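A minimal Python sketch of the "only activations in the log domain" variant (function names and the 4-bit exponent width are illustrative, not from a specific implementation): quantizing one operand to a power of two turns each multiply into a shift.

```python
import numpy as np

def log_quantize(x, bits=4):
    """Quantize positive values to the nearest power of two, storing only the
    `bits`-bit exponent; this is the log-domain representation."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    exp = np.clip(np.round(np.log2(np.maximum(x, 1e-12))), lo, hi)
    return exp.astype(np.int8)

def log_domain_multiply(weight_int, act_exp):
    """With the activation stored as an exponent, weight * activation reduces
    to a shift of the integer weight."""
    return weight_int << int(act_exp) if act_exp >= 0 else weight_int >> int(-act_exp)

act = 6.3                                   # e.g. a post-ReLU activation
a_exp = log_quantize(np.array([act]))[0]    # -> 3, i.e. activation ~ 2^3 = 8
print(log_domain_multiply(5, a_exp))        # 5 * 8 = 40, computed with a shift
```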
Reduce Precision Overview
• Non-linear quantization can be implemented with a lookup table
• Additional properties
  – Fixed or variable (across data types, layers, channels, etc.)
Non-Linear Quantization: Table Lookup
Trained quantization: find K unique weights per layer via K-means clustering, so that
many weights share the same value (weight sharing).
Example: AlexNet (no accuracy loss)
• 256 unique weights per CONV layer
• 16 unique weights per FC layer
Hardware view: the weight memory only stores a log2(K)-bit index per weight
(C·R·S·M x log2(K) bits in total), and a small weight decoder/dequant table (K x 16b)
expands each index back to a 16-bit weight before the MAC with the 16-bit input
activation.
Benefit: a much narrower weight memory. Overhead: a second access into the (small)
decoder table, and it does not reduce the precision of the MAC itself.
[Han et al., Deep Compression, ICLR 2016]
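Deep Compression's trained quantization clusters each layer's weights with K-means; a rough sketch of that step (Python with scikit-learn; the library choice and names are mine, not from the paper):

```python
import numpy as np
from sklearn.cluster import KMeans

def share_weights(weights, k=16):
    """Cluster a layer's weights into k centroids (weight sharing).
    Returns the per-weight indices (log2(k) bits each) and the k-entry codebook
    used to dequantize them at inference time."""
    km = KMeans(n_clusters=k, n_init=10).fit(weights.reshape(-1, 1))
    codebook = km.cluster_centers_.ravel()           # the small decoder table
    indices = km.labels_.reshape(weights.shape)      # what the weight memory stores
    return indices, codebook

layer = np.random.randn(64, 64).astype(np.float32)   # stand-in FC weight matrix
idx, codebook = share_weights(layer, k=16)            # 16 unique weights -> 4-bit indices
dequantized = codebook[idx]                           # table lookup before the MAC
print(codebook.shape, np.abs(layer - dequantized).mean())
```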
Summary of Reduce Precision
Category                      Method                               Weights (# bits)   Activations (# bits)   Accuracy loss vs. 32-bit float (%)
Dynamic Fixed Point           w/o fine-tuning                      8                  10                     0.4
                              w/ fine-tuning                       8                  8                      0.6
Reduce weight                 Ternary Weight Networks (TWN)        2*                 32                     3.7
                              Trained Ternary Quantization (TTQ)   2*                 32                     0.6
                              Binary Connect (BC)                  1                  32                     19.2
                              Binary Weight Net (BWN)              1*                 32                     0.8
Reduce weight and activation  Binarized Neural Net (BNN)           1                  1                      29.8
                              XNOR-Net                             1*                 1                      11
Non-Linear                    LogNet                               5 (conv), 4 (fc)   4                      3.2
                              Weight Sharing                       8 (conv), 4 (fc)   16                     0
* first and last layers are 32-bit float
Sparsity in Fmaps
Many zeros in output fmaps after ReLU
ReLU example:
   9 -1 -3        9 0 0
   1 -5  5   →    1 0 5
  -2  6 -1        0 6 0
[Figure: fraction of non-zero output activations per CONV layer (layers 1-5)]
I/O Compression in Eyeriss
[Figure: Eyeriss DNN accelerator block diagram - a 14×12 PE array with a 108KB on-chip
SRAM; filters, input images, and psums move between the SRAM and off-chip DRAM over a
64-bit interface (separate link and core clock domains), with run-length compression
(Comp/Decomp) on the image data and ReLU applied on the output path]
Run-Length Compression (RLC)
Example input:     0, 0, 12, 0, 0, 0, 0, 53, 0, 0, 22, ...
Encoded 64b word:  Run=2, Level=12 | Run=4, Level=53 | Run=2, Level=22 | Term=0
                   (5b)   (16b)      (5b)   (16b)      (5b)   (16b)      (1b)
[Chen et al., ISSCC 2016]
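A sketch of this RLC format in Python (field widths as on the slide: 5-bit run, 16-bit level, three pairs per 64-bit word; the packing details beyond that, e.g. the term flag and trailing zeros, are simplified):

```python
def rlc_encode(values, run_bits=5, max_pairs=3):
    """Encode a mostly-zero stream as (run-of-zeros, level) pairs, grouped into
    words of up to `max_pairs` pairs, as in Eyeriss's 64-bit RLC words."""
    max_run = (1 << run_bits) - 1
    pairs, run = [], 0
    for v in values:
        if v == 0 and run < max_run:
            run += 1
        else:
            pairs.append((run, v))   # `run` zeros followed by one non-zero level
            run = 0
    # Group pairs into words; a real encoder also emits the 1-bit term flag
    return [pairs[i:i + max_pairs] for i in range(0, len(pairs), max_pairs)]

stream = [0, 0, 12, 0, 0, 0, 0, 53, 0, 0, 22]
print(rlc_encode(stream))   # [[(2, 12), (4, 53), (2, 22)]], matching the slide example
```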
Compression Reduces DRAM BW
[Figure: DRAM access (MB) per AlexNet CONV layer, uncompressed vs. RLC-compressed
fmaps + weights; compression reduces DRAM accesses by 1.2x, 1.4x, 1.7x, 1.8x, and
1.9x on CONV layers 1-5 respectively]
[Figure: PE datapath detail - input and psum paths with a partial sum scratch pad
(24x16b REG) and reset logic]
Pruning Example: AlexNet
Weight reduction: CONV layers 2.7x, FC layers 9.9x
(most of the reduction comes from the fully connected layers)
Overall: 9x weight reduction, 3x MAC reduction (batch size = 1)
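The reductions above come from magnitude-based pruning [Han et al., NIPS 2015]; a minimal sketch of that idea (the sparsity target and names are illustrative):

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    """Zero out the smallest-magnitude weights so that a `sparsity` fraction of
    them is removed (FC layers tolerate the highest sparsity)."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) > threshold
    return weights * mask, mask        # pruned weights + mask kept for fine-tuning

fc = np.random.randn(1024, 1024).astype(np.float32)   # stand-in FC weight matrix
pruned, mask = magnitude_prune(fc, sparsity=0.9)
print(1.0 - mask.mean())                               # ~0.9 of the weights removed
```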
Energy-Aware Pruning
Energy estimation: for each data type (e.g. weights, activations),
E_data = Σ over memory levels n of (# of accesses at memory level n) x (energy per
access at level n); this energy model drives the pruning optimization.
[Figure: Top-5 accuracy vs. normalized energy consumption (5E+08 to 5E+10) for AlexNet,
SqueezeNet, and GoogLeNet - original DNNs vs. magnitude-based pruning [6]
[Han et al., NIPS 2015]]
Energy-Aware Pruning
[Figure: Top-5 accuracy vs. normalized energy consumption for AlexNet, SqueezeNet,
GoogLeNet, VGG-16, and ResNet-50 - original DNNs, magnitude-based pruning [6], and
energy-aware pruning (this work); a 1.74x energy reduction is marked for AlexNet]
Input Stationary Dataflow (SCNN)
In each PE, nonzero weights (a, b) and nonzero input activations (x, y, z) are multiplied
all-to-all in the PE frontend (a*x, a*y, a*z, b*x, b*y, b*z); a scatter network in the PE
backend routes each product to the accumulator for its output position.
[Parashar et al., ISCA 2017]
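A minimal sketch of that Cartesian-product style sparse multiply (Python; the output-index rule is a stand-in for the real convolution coordinate bookkeeping):

```python
import numpy as np
from itertools import product

def sparse_cartesian_mac(weights, activations, out_size):
    """Multiply every nonzero weight with every nonzero activation and
    scatter-accumulate each product into its output position.
    Each operand is a list of (index, value) pairs; using (w_idx + a_idx) as the
    output index is a simplification of real convolution indexing."""
    out = np.zeros(out_size)
    for (wi, wv), (ai, av) in product(weights, activations):
        out[wi + ai] += wv * av      # scatter network: route product to its accumulator
    return out

w = [(0, 2.0), (3, -1.0)]            # nonzero weights only
a = [(1, 4.0), (2, 5.0), (5, 1.0)]   # nonzero activations only
print(sparse_cartesian_mac(w, a, out_size=10))
```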
Structured/Coarse-Grained Pruning
• Scalpel
– Prune to match the underlying data-parallel hardware organization for speedup
Network Architecture Design
Build Network with series of Small Filters
GoogleNet/Inception v3: decompose a 5x5 filter into separable filters, a 5x1 filter
followed by a 1x5 filter, applied sequentially.
VGG-16: decompose a 5x5 filter into two 3x3 filters, applied sequentially.
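A quick check of the weight counts per filter position (ignoring the channel dimensions, which scale all three options equally):

```python
# Weights needed to cover a 5x5 receptive field, per input/output channel pair
full_5x5  = 5 * 5            # 25 weights
two_3x3   = 2 * (3 * 3)      # 18 weights: two 3x3 filters applied sequentially (VGG-style)
separable = 5 * 1 + 1 * 5    # 10 weights: 5x1 followed by 1x5 (Inception v3-style)
print(full_5x5, two_3x3, separable)
```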
Network Architecture Design
Reduce size and computation with 1x1 Filter (bottleneck)
[Figures: 1x1 "bottleneck" layers compress the number of channels before an expensive
convolution and expand them again afterwards, as used in GoogLeNet and ResNet modules.
Figure source: Stanford cs231n]
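To make the savings concrete, here is a small worked calculation (the 28x28 spatial size and the 256 → 64 channel counts are illustrative choices, not taken from a specific network):

```python
def conv_macs(h, w, c_in, c_out, k):
    """MACs for a k x k convolution over an h x w x c_in input producing c_out
    output channels (stride 1, 'same' padding)."""
    return h * w * c_in * c_out * k * k

H, W = 28, 28
direct = conv_macs(H, W, 256, 256, 3)                       # plain 3x3 conv on 256 channels
bottleneck = (conv_macs(H, W, 256, 64, 1)                   # 1x1 "compress" to 64 channels
              + conv_macs(H, W, 64, 64, 3)                  # 3x3 conv on the squeezed tensor
              + conv_macs(H, W, 64, 256, 1))                # 1x1 "expand" back to 256
print(direct, bottleneck, direct / bottleneck)              # roughly 8x fewer MACs
```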
SqueezeNet
Reduce weights by reducing the number of input channels, "squeezing" with 1x1 filters
(the Fire module): 50x fewer weights than AlexNet with no accuracy loss.
[Figure: Top-5 accuracy vs. normalized energy consumption of the original DNNs
(AlexNet, SqueezeNet, GoogLeNet)]
[Kim et al., ICLR 2016]
Knowledge Distillation
A (smaller) student DNN is trained so that the class probabilities from its softmax try
to match those produced by the softmax outputs of one or more (larger) teacher DNNs
(DNN A, DNN B).
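The slide only shows the matching of softmax outputs; the temperature and the mix with a hard-label loss in the sketch below follow the common distillation recipe and are my additions:

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Cross-entropy against hard labels plus a 'soft' term that pushes the
    student's softened softmax toward the teacher's softened softmax."""
    p_student = softmax(student_logits, T)
    p_teacher = softmax(teacher_logits, T)
    soft = -np.sum(p_teacher * np.log(p_student + 1e-12), axis=-1).mean()
    hard = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * hard + (1 - alpha) * soft

s = np.random.randn(8, 10)                  # student logits (batch of 8, 10 classes)
t = np.random.randn(8, 10)                  # teacher logits (e.g. averaged over teachers)
y = np.random.randint(0, 10, size=8)        # hard labels
print(distillation_loss(s, t, y))
```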