Tutorial on DNN (6 of 9): Network and Hardware Co-Design

The document discusses various approaches to optimize Deep Neural Network (DNN) models through hardware co-design, focusing on reducing operand size and the number of operations. It highlights techniques such as quantization, network pruning, and the use of compact architectures to enhance efficiency while maintaining accuracy. Additionally, it presents energy costs associated with different operations and the impact of precision on model performance.


DNN Model and Hardware Co-Design
ISCA Tutorial (2017)

Website: http://eyeriss.mit.edu/tutorial.html
Joel Emer, Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang 1
Approaches

• Reduce size of operands for storage/compute


– Floating point → Fixed point
– Bit-width reduction
– Non-linear quantization

• Reduce number of operations for storage/compute


– Exploit Activation Statistics (Compression)
– Network Pruning
– Compact Network Architectures

2
Cost of Operations

Operation | Energy (pJ) | Area (µm²)
8b Add | 0.03 | 36
16b Add | 0.05 | 67
32b Add | 0.1 | 137
16b FP Add | 0.4 | 1360
32b FP Add | 0.9 | 4184
8b Mult | 0.2 | 282
32b Mult | 3.1 | 3495
16b FP Mult | 1.1 | 1640
32b FP Mult | 3.7 | 7700
32b SRAM Read (8KB) | 5 | N/A
32b DRAM Read | 640 | N/A

(The slide also plots relative energy cost, spanning roughly 1–10^4, and relative area cost, spanning roughly 1–10^3, on log scales.)

[Horowitz, “Computing’s Energy Problem (and what we can do about it)”, ISSCC 2014]
3
Number Representation

Format | Bits (sign / exponent / mantissa) | Range | Accuracy
FP32 | 1 / 8 / 23 | 10^-38 – 10^38 | .000006%
FP16 | 1 / 5 / 10 | 6x10^-5 – 6x10^4 | .05%
Int32 | 1 / – / 31 | 0 – 2x10^9 | ½
Int16 | 1 / – / 15 | 0 – 6x10^4 | ½
Int8 | 1 / – / 7 | 0 – 127 | ½

Image Source: B. Dally 4


Floating Point → Fixed Point

Floating Point
32-bit float: sign (1 bit), exponent (8 bits), mantissa (23 bits)
Example: 10100101000000000101000000000100 represents -1.42122425 x 10^-13 (s=1, e=70, m=20482)

Fixed Point
8-bit fixed: sign (1 bit), mantissa (7 bits), split into integer (4 bits) and fractional (3 bits)
Example: 01100110 represents 12.75 (s=0, m=102)

5
N-bit Precision

Datapath: an N-bit weight and an N-bit activation go through an NxN multiply (2N-bit product), the products are accumulated in a 2N+M-bit register, and the result is quantized back to N bits for the output activation.

For no loss in precision, M is determined based on the largest filter size (in the range of 10 to 16 bits for popular DNNs).

6
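As a rough illustration of this datapath, here is a minimal numpy sketch (the function names, default bit-widths, and fractional split are illustrative assumptions, not taken from the slide): N-bit operands are multiplied into 2N-bit products, accumulated in a wider register, and the partial sum is re-quantized to N bits at the output.

```python
import numpy as np

def quantize_to_nbits(x, n_bits, frac_bits):
    """Round to an n_bits signed fixed-point value with frac_bits fractional bits."""
    scale = 1 << frac_bits
    q = np.round(np.asarray(x, dtype=np.float64) * scale).astype(np.int64)
    lo, hi = -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1
    return np.clip(q, lo, hi)

def fixed_point_dot(weights, activations, n_bits=8, frac_bits=4):
    """N-bit weight x N-bit activation -> 2N-bit products, accumulated in a wide
    (2N+M-bit) register, then re-quantized back to N bits for the output."""
    w_q = quantize_to_nbits(weights, n_bits, frac_bits)      # N-bit operands
    a_q = quantize_to_nbits(activations, n_bits, frac_bits)
    acc = 0                                                   # plays the role of the 2N+M-bit accumulator
    for w, a in zip(w_q, a_q):
        acc += int(w) * int(a)                                # each product fits in 2N bits
    out = acc >> frac_bits                                    # drop the extra fractional bits
    lo, hi = -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1
    return int(np.clip(out, lo, hi))                          # N-bit output activation
```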
Dynamic Fixed Point

Floating Point
32-bit float: sign (1 bit), exponent (8 bits), mantissa (23 bits)
Example: 10100101000000000101000000000100 represents -1.42122425 x 10^-13 (s=1, e=70, m=20482)

Fixed Point
8-bit fixed: sign (1 bit), integer ([7-f] bits), fractional (f bits)
Example: 01100110 with f=3 represents 12.75 (s=0, m=102)

8-bit dynamic fixed: sign (1 bit), fractional (f bits), where f is allowed to change
Example: 01100110 with f=9 represents 0.19921875 (s=0, m=102)

Allow f to vary based on data type and layer 7
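As an illustration of letting f vary, the following numpy sketch (a heuristic rule for picking f, not the exact procedure used by Ristretto or in the example above) chooses a per-tensor fractional length so that the largest magnitude still fits in the remaining integer bits:

```python
import numpy as np

def dynamic_fixed_point(x, n_bits=8):
    """Quantize x to n_bits with a fractional length f derived from the data;
    in a DNN, f would be chosen separately per layer and per data type."""
    x = np.asarray(x, dtype=np.float64)
    max_abs = np.abs(x).max()
    int_bits = max(0, int(np.ceil(np.log2(max_abs + 1e-12))) + 1)  # sign + integer part
    f = n_bits - int_bits                                          # remaining bits go to the fraction
    scale = 2.0 ** f
    lo, hi = -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1
    q = np.clip(np.round(x * scale), lo, hi)
    return q / scale, f   # dequantized values and the chosen fractional length
```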


Impact on Accuracy

(Plot: Top-1 accuracy of CaffeNet on ImageNet, w/o fine tuning)

[Gysel et al., Ristretto, ICLR 2016] 8


Avoiding Dynamic Fixed Point

Batch normalization ‘centers’ dynamic range

(Figure: AlexNet Layer 6. Image Source: Moons et al., WACV 2016)

‘Centered’ dynamic ranges might reduce the need for dynamic fixed point
9
Nvidia PASCAL

“New half-precision, 16-bit floating point instructions deliver over 21 TeraFLOPS for unprecedented training performance. With 47 TOPS (tera-operations per second) of performance, new 8-bit integer instructions in Pascal allow AI algorithms to deliver real-time responsiveness for deep learning inference.”

– Nvidia.com (April 2016)


10
Google’s Tensor Processing Unit (TPU)

“With its TPU Google has seemingly focused on delivering the data really quickly by cutting down on precision. Specifically, it doesn’t rely on floating point precision like a GPU…. Instead the chip uses integer math…TPU used 8-bit integer.”

– Next Platform (May 19, 2016)

[Jouppi et al., ISCA 2017] 11


Precision Varies from Layer to Layer

[Judd et al., ArXiv 2016] [Moons et al., WACV 2016] 12


Bitwidth Scaling (Speed)
Bit-Serial Processing: Reduce Bit-width → Skip Cycles
Speed up of 2.24x vs. 16-bit fixed

[Judd et al., Stripes, CAL 2016] 13


Bitwidth Scaling (Power)

Reduce Bit-width → Shorter Critical Path → Reduce Voltage

Power reduction of
2.56x vs. 16-bit fixed
On AlexNet Layer 2

[Moons et al., VLSI 2016] 14


Binary Nets
Binary Filters
• Binary Connect (BC)
– Weights {-1,1}, Activations 32-bit float
– MAC → addition/subtraction
– Accuracy loss: 19% on AlexNet
[Courbariaux, NIPS 2015]

• Binarized Neural Networks (BNN)


– Weights {-1,1}, Activations {-1,1}
– MAC → XNOR
– Accuracy loss: 29.8% on AlexNet
[Courbariaux, arXiv 2016]
15
Scale the Weights and Activations
• Binary Weight Nets (BWN)
– Weights {-α, α} → except first and last layers are 32-bit float
– Activations: 32-bit float
– α determined by the l1-norm of all weights in a layer
– Accuracy loss: 0.8% on AlexNet

• XNOR-Net
– Weights {-α, α}
– Activations {-βi, βi} → except first and last layers are 32-bit float
– βi determined by the l1-norm of all activations across channels for given position i of the input feature map
– Accuracy loss: 11% on AlexNet

(Hardware needs to support both activation precisions.)

Scale factors (α, βi) can change per layer or position in filter

[Rastegari et al., BWN & XNOR-Net, ECCV 2016] 16
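A small numpy sketch of the BWN-style weight scaling (the helper name is made up, and using the mean absolute value as α follows the paper's formulation rather than anything shown on the slide):

```python
import numpy as np

def binarize_weights_bwn(w):
    """Approximate full-precision weights w by {-alpha, +alpha}, where alpha is the
    layer's l1-norm of weights divided by their count (i.e., the mean |w|)."""
    alpha = np.mean(np.abs(w))            # per-layer scale factor
    w_bin = np.where(w >= 0, 1.0, -1.0)   # binary weights in {-1, +1}
    return alpha, w_bin                    # effective weights are alpha * w_bin

# With binary weights, the MAC degenerates to additions/subtractions of activations,
# followed by a single multiply by alpha per output.
```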


XNOR-Net

[Rastegari et al., BWN & XNOR-Net, ECCV 2016] 17


Ternary Nets

• Allow for weights to be zero


– Increase sparsity, but also increase number of bits (2-bits)

• Ternary Weight Nets (TWN) [Li et al., arXiv 2016]


– Weights {-w, 0, w} → except first and last layers are 32-bit float
– Activations: 32-bit float
– Accuracy loss: 3.7% on AlexNet
• Trained Ternary Quantization (TTQ) [Zhu et al., ICLR 2017]
– Weights {-w1, 0, w2} → except first and last layers are 32-bit float
– Activations: 32-bit float
– Accuracy loss: 0.6% on AlexNet

18
Non-Linear Quantization
• Precision refers to the number of levels
– Number of bits = log2 (number of levels)

• Quantization: mapping data to a smaller set of levels


– Linear, e.g., fixed-point
– Non-linear
• Computed
• Table lookup

Objective: Reduce size to improve speed and/or reduce energy while preserving accuracy

19
Computed Non-linear Quantization

Log Domain Quantization

Linear quantization: Product = X * W.  Log domain quantization: Product = X << W (the multiply becomes a shift).

[Lee et al., LogNet, ICASSP 2017] 20
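A toy sketch (numpy; base-2 log quantization of the weights only, with made-up helper names) of why the multiply turns into a shift:

```python
import numpy as np

def quantize_weight_log2(w, n_bits=4):
    """Store a scalar weight as (sign, integer exponent) so that w ≈ sign * 2**exp."""
    sign = 1 if w >= 0 else -1
    exp = int(np.clip(np.round(np.log2(abs(w) + 1e-12)),
                      -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1))
    return sign, exp

def log_domain_multiply(x_int, sign, exp):
    """Multiply an integer activation by a log-quantized weight using only a shift."""
    shifted = x_int << exp if exp >= 0 else x_int >> (-exp)   # X << W replaces X * W
    return sign * shifted

# Example: a weight near 0.25 quantizes to exp = -2, so x * 0.25 becomes x >> 2.
sign, exp = quantize_weight_log2(0.27)
print(sign, exp, log_domain_multiply(64, sign, exp))   # 1 -2 16
```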


Log Domain Computation

Two variants: (1) only the activations are in the log domain; (2) both the weights and the activations are in the log domain, so computation reduces to max, bitshifts, and adds/subs.

[Miyashita et al., arXiv 2016]


21
Log Domain Quantization
• Weights: 5-bits for CONV, 4-bit for FC; Activations: 4-bits
• Accuracy loss: 3.2% on AlexNet

(Figure: shift-and-add (WS) implementation)

[Miyashita et al., arXiv 2016],


[Lee et al., LogNet, ICASSP 2017]

22
Reduce Precision Overview

• Learned mapping of data to quantization levels (e.g., k-means)

Implement with look-up table

[Han et al., ICLR 2016]

• Additional Properties
– Fixed or Variable (across data types, layers, channels, etc.)
23
Non-Linear Quantization Table Lookup
Trained Quantization: Find K weights via K-means clustering
to reduce number of unique weights per layer (weight sharing)
Example: AlexNet (no accuracy loss)
256 unique weights for CONV layer
16 unique weights for FC layer

Weight memory stores a log2(U)-bit index for each of the CRSM weights; a small weight decoder/dequantizer table (U x 16b) expands each index to a 16-bit weight, which then feeds a standard 16-bit MAC together with the 16-bit input activation, producing a 16-bit output activation.

Benefit: smaller weight memory. Consequences: narrow weight memory plus a second access from the (small) table; does not reduce the precision of the MAC.

24
[Han et al., Deep Compression, ICLR 2016]
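A toy sketch of the trained-quantization idea (using scikit-learn's KMeans purely for illustration; the function names and sizes are assumptions, and the real method also fine-tunes the shared weights):

```python
import numpy as np
from sklearn.cluster import KMeans

def share_weights(w, num_unique=16):
    """Cluster a layer's weights into num_unique values (weight sharing).
    Returns per-weight indices (log2(num_unique) bits each) and the small codebook."""
    km = KMeans(n_clusters=num_unique, n_init=10).fit(w.reshape(-1, 1))
    indices = km.labels_.astype(np.uint8)      # what the narrow weight memory stores
    codebook = km.cluster_centers_.ravel()     # the U x 16b decoder/dequant table
    return indices, codebook

def dequantize(indices, codebook, shape):
    """Table lookup that expands indices back to full weights before the MAC."""
    return codebook[indices].reshape(shape)

# 4-bit indices (U=16) for an FC layer, 8-bit indices (U=256) for a CONV layer,
# mirroring the AlexNet example above.
w = np.random.randn(512, 256).astype(np.float32)
idx, book = share_weights(w, num_unique=16)
w_hat = dequantize(idx, book, w.shape)
```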
Summary of Reduce Precision

Category | Method | Weights (# of bits) | Activations (# of bits) | Accuracy Loss vs. 32-bit float (%)
Dynamic Fixed Point | w/o fine-tuning | 8 | 10 | 0.4
Dynamic Fixed Point | w/ fine-tuning | 8 | 8 | 0.6
Reduce weight | Ternary Weight Networks (TWN) | 2* | 32 | 3.7
Reduce weight | Trained Ternary Quantization (TTQ) | 2* | 32 | 0.6
Reduce weight | Binary Connect (BC) | 1 | 32 | 19.2
Reduce weight | Binary Weight Net (BWN) | 1* | 32 | 0.8
Reduce weight and activation | Binarized Neural Net (BNN) | 1 | 1 | 29.8
Reduce weight and activation | XNOR-Net | 1* | 1 | 11
Non-Linear | LogNet | 5 (conv), 4 (fc) | 4 | 3.2
Non-Linear | Weight Sharing | 8 (conv), 4 (fc) | 16 | 0

* first and last layers are 32-bit float

Full list @ [Sze et al., arXiv, 2017] 25


Reduce Number of Ops and Weights

• Exploit Activation Statistics


• Network Pruning
• Compact Network Architectures
• Knowledge Distillation

26
Sparsity in Fmaps
Many zeros in output fmaps after ReLU
ReLU example:
  9 -1 -3        9 0 0
  1 -5  5   →    1 0 5
 -2  6 -1        0 6 0

(Bar chart: normalized # of activations vs. # of non-zero activations for CONV layers 1–5)
27
I/O Compression in Eyeriss

(Block diagram: DCNN accelerator with a 14x12 PE array and a 108KB SRAM; filter, input image, and psum data move between the SRAM and the 64-bit off-chip DRAM interface, with RLC decompression on the way in and compression after ReLU on the way out.)

Run-Length Compression (RLC) example:
Input: 0, 0, 12, 0, 0, 0, 0, 53, 0, 0, 22, …
Output (one 64b word): Run=2, Level=12 | Run=4, Level=53 | Run=2, Level=22 | Term=0
Field widths: 5b run, 16b level (three run/level pairs per word), 1b termination flag

[Chen et al., ISSCC 2016] 28
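A plain-Python sketch of this RLC format (the exact bit positions inside the 64-bit word are an assumption for illustration, not the published Eyeriss layout):

```python
def rlc_encode(values, max_run=31):
    """Encode a stream of activations as (run, level) pairs, where run counts the
    zeros (up to 31, i.e. 5 bits) preceding each stored 16-bit level."""
    pairs, run = [], 0
    for v in values:
        if v == 0 and run < max_run:
            run += 1
        else:
            pairs.append((run, v))   # run of zeros followed by a non-zero (or a saturated run)
            run = 0
    return pairs

def pack_word(three_pairs, last=False):
    """Pack up to three (5b run, 16b level) pairs plus a 1b terminate flag into 64 bits."""
    word = 0
    for i, (run, level) in enumerate(three_pairs):
        word |= (run & 0x1F) << (i * 21)           # 5-bit run
        word |= (level & 0xFFFF) << (i * 21 + 5)   # 16-bit level
    return word | (int(last) << 63)                # 1-bit termination flag

print(rlc_encode([0, 0, 12, 0, 0, 0, 0, 53, 0, 0, 22]))   # [(2, 12), (4, 53), (2, 22)]
```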
Compression Reduces DRAM BW

(Bar chart: DRAM access (MB) for uncompressed vs. RLC-compressed fmaps + weights across AlexNet CONV layers 1–5; compression reduces accesses by 1.2x to 1.9x, with larger savings in the later layers.)

Simple RLC within 5% - 10% of theoretical entropy limit

[Chen et al., ISSCC 2016] 29


Data Gating / Zero Skipping in Eyeriss

Skip the MAC and memory reads when the image data is zero: a zero buffer flags zero values in the image scratch pad (12x16b REG), and the resulting enable signal gates the 2-stage pipelined multiplier as well as the reads of the filter scratch pad (225x16b SRAM) and the partial sum scratch pad (24x16b REG). Reduces PE power by 45%.

[Chen et al., ISSCC 2016] 30


Cnvlutin
• Process Convolution Layers
• Built on top of DaDianNao (4.49% area overhead)
• Speed up of 1.37x (1.52x with activation pruning)

[Albericio et al., ISCA 2016] 31


Pruning Activations
Remove small activation values
Cnvlutin [Albericio et al., ISCA 2016]: speed up of 11% (ImageNet). Minerva [Reagen et al., ISCA 2016]: power reduction of 2x (MNIST).

32


Pruning – Make Weights Sparse

• Optimal Brain Damage


1. Choose a reasonable network
architecture
2. Train network until reasonable
solution obtained
3. Compute the second derivative for each weight
4. Compute saliencies (i.e., impact on training error) for each weight
5. Sort weights by saliency and delete low-saliency weights
6. Iterate to step 2 (retraining)

[Lecun et al., NIPS 1989] 33


Pruning – Make Weights Sparse
Prune based on magnitude of weights

Example: AlexNet
Weight Reduction: CONV layers 2.7x, FC layers 9.9x
(Most reduction on fully connected layers)
Overall: 9x weight reduction, 3x MAC reduction

[Han et al., NIPS 2015] 34
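A minimal numpy sketch of magnitude-based pruning (the threshold rule and sparsity target are illustrative; in the actual method the pruned network is then retrained with the removed weights held at zero):

```python
import numpy as np

def prune_by_magnitude(w, sparsity=0.9):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude.
    Returns the pruned weights and the binary mask used during fine-tuning."""
    threshold = np.quantile(np.abs(w), sparsity)    # magnitude cutoff
    mask = (np.abs(w) > threshold).astype(w.dtype)
    return w * mask, mask

w = np.random.randn(4096, 4096).astype(np.float32)
w_pruned, mask = prune_by_magnitude(w, sparsity=0.9)   # ~10x fewer non-zero weights
```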


Speed up of Weight Pruning on CPU/GPU
On Fully Connected Layers Only
Average Speed up of 3.2x on GPU, 3x on CPU, 5x on mGPU

Intel Core i7 5930K: MKL CBLAS GEMV, MKL SPBLAS CSRMV


NVIDIA GeForce GTX Titan X: cuBLAS GEMV, cuSPARSE CSRMV
NVIDIA Tegra K1: cuBLAS GEMV, cuSPARSE CSRMV

Batch size = 1

[Han et al., NIPS 2015] 35


Key Metrics for Embedded DNN

• Accuracy → Measured on Dataset
• Speed → Number of MACs
• Storage Footprint → Number of Weights
• Energy → ?

36
Energy-Aware Pruning

• # of Weights alone is not a good metric for energy
– Example (AlexNet):
• # of Weights (FC Layer) > # of Weights (CONV layer)
• Energy (FC Layer) < Energy (CONV layer)

• Use energy evaluation method to estimate DNN energy
– Account for data movement

[Yang et al., CVPR 2017] 37


Energy-Evaluation Methodology

Inputs: the CNN shape configuration (# of channels, # of filters, etc.), the CNN weights and input data (e.g., [0.3, 0, -0.4, 0.7, 0, 0, 0.1, …]), and the hardware energy costs of each MAC and memory access.

Memory access optimization estimates the # of accesses at each memory level (level 1 … level n), which gives the data energy Edata; the # of MACs gives the computation energy Ecomp. Summing these per layer (L1, L2, L3, …) yields the CNN energy consumption.

Evaluation tool available at http://eyeriss.mit.edu/energy.html 38
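The estimate itself boils down to a weighted sum. A sketch of the calculation (the per-access cost numbers below are placeholders normalized to one MAC, not the published Eyeriss values):

```python
def estimate_layer_energy(num_macs, accesses_per_level, cost_per_level, mac_cost=1.0):
    """Energy = Ecomp + Edata: Ecomp scales with the # of MACs, and Edata sums the
    # of accesses at each memory level weighted by that level's energy per access."""
    e_comp = num_macs * mac_cost
    e_data = sum(n * c for n, c in zip(accesses_per_level, cost_per_level))
    return e_comp + e_data

# Example with made-up relative costs (register file : buffer : DRAM ~ 1 : 6 : 200):
energy = estimate_layer_energy(
    num_macs=1.1e8,                         # MACs in one CONV layer
    accesses_per_level=[3e8, 4e7, 5e6],     # level 1 (RF), level 2 (buffer), level n (DRAM)
    cost_per_level=[1.0, 6.0, 200.0],
)
print(f"normalized layer energy: {energy:.3e}")
```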
Key Observations

• Number of weights alone is not a good metric for energy
• All data types should be considered

Energy consumption breakdown of GoogLeNet: Computation 10%, Input Feature Map 25%, Weights 22%, Output Feature Map 43%

[Yang et al., CVPR 2017] 39


Energy Consumption of Existing DNNs
(Scatter plot: Top-5 accuracy (77%–93%) vs. normalized energy consumption (5E+08 – 5E+10, log scale) for the original DNNs: AlexNet, SqueezeNet, GoogLeNet, VGG-16, ResNet-50)

Deeper CNNs with fewer weights do not necessarily consume less energy than shallower CNNs with more weights

[Yang et al., CVPR 2017] 40
Magnitude-based Weight Pruning
(Same scatter plot, adding AlexNet and SqueezeNet after magnitude-based pruning [Han et al., NIPS 2015])

Reduce number of weights by removing small magnitude weights

41
Energy-Aware Pruning
(Same scatter plot, adding AlexNet, SqueezeNet, and GoogLeNet after energy-aware pruning; a 1.74x gap is marked for AlexNet between magnitude-based and energy-aware pruning)

Remove weights from layers in order of highest to lowest energy
3.7x energy reduction in AlexNet / 1.6x reduction in GoogLeNet

DNN Models available at http://eyeriss.mit.edu/energy.html 42
Energy Estimation Tool
Website: https://energyestimation.mit.edu/
Input DNN Configuration File

Output DNN energy breakdown across layers

[Yang et al., CVPR 2017] 43


Compression of Weights & Activations
• Compress weights and activations between DRAM
and accelerator
• Variable Length / Huffman Coding
Example:
Value: 16’b0 → Compressed Code: {1’b0}
Value: 16’bx → Compressed Code: {1’b1, 16’bx}
• Tested on AlexNet → 2× overall BW Reduction

[Moons et al., VLSI 2016; Han et al., ICLR 2016] 44


Sparse Matrix-Vector DSP
• Use CSC rather than CSR for SpMxV
Compressed Sparse Row (CSR) vs. Compressed Sparse Column (CSC) for an M x N matrix

Reduce memory bandwidth (when not M >> N)
For DNN, M = # of filters, N = # of weights per filter
[Dorrance et al., FPGA 2014] 45
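A small scipy-based sketch (illustrative sizes and sparsity) contrasting the two layouts for a sparse matrix-vector product; the CSC layout is convenient when the input vector is also sparse, since each non-zero input touches exactly one column of weights:

```python
import numpy as np
from scipy.sparse import csr_matrix, csc_matrix

# M x N weight matrix (M filters, N weights per filter), ~90% zeros
rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 4096)) * (rng.random((1024, 4096)) < 0.1)
x = rng.standard_normal(4096)

y_csr = csr_matrix(W) @ x   # row-major traversal of the non-zeros
y_csc = csc_matrix(W) @ x   # column-major: column j is only needed when x[j] != 0

assert np.allclose(y_csr, y_csc)
```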
EIE: A Sparse Linear Algebra Engine
• Process Fully Connected Layers (after Deep Compression)
• Store weights column-wise in Run Length format
• Read relative column when input is non-zero

Supports Fully Connected Layers Only

Example: a sparse input vector a = (0, a1, 0, a3) is multiplied by a sparse weight matrix W and passed through ReLU to produce the output vector b. The rows of W are interleaved across PE0–PE3, each PE dequantizes its stored weights, and the engine keeps track of the location of the non-zero entries.
Output Stationary Dataflow

[Han et al., ISCA 2016] 46


Sparse CNN (SCNN)
Supports Convolutional Layers

Three pieces: (1) densely packed storage of weights and activations; (2) all-to-all multiplication of the non-zero weights and activations in the PE frontend (e.g., weights a, b against activations x, y, z produce a*x, a*y, a*z, b*x, b*y, b*z); (3) a scatter network that routes the products to accumulators in the PE backend, which add the scattered partial sums.
Input Stationary Dataflow
[Parashar et al., ISCA 2017] 47
Structured/Coarse-Grained Pruning
• Scalpel
– Prune to match the underlying data-parallel hardware
organization for speed up

Example: 2-way SIMD

Dense weights Sparse weights

[Yu et al., ISCA 2017] 48


Compact Network Architectures

• Break large convolutional layers into a series of smaller convolutional layers
– Fewer weights, but same effective receptive field

• Before Training: Network Architecture Design

• After Training: Decompose Trained Filters

49
Network Architecture Design
Build Network with series of Small Filters

GoogleNet/Inception v3: decompose a 5x5 filter into a 5x1 filter followed by a 1x5 filter (separable filters), applied sequentially.
VGG-16: decompose a 5x5 filter into two 3x3 filters, applied sequentially.

50
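A quick sanity check (pure Python, single input/output channel for simplicity) of the weight savings these decompositions give for the same 5x5 effective receptive field:

```python
# Weights per output position for one input/output channel (illustrative):
w_5x5 = 5 * 5                   # 25 weights: direct 5x5 filter
w_two_3x3 = 2 * (3 * 3)         # 18 weights: two stacked 3x3 filters (VGG-style)
w_5x1_plus_1x5 = 5 + 5          # 10 weights: separable 5x1 then 1x5 (Inception v3-style)

print(w_5x5, w_two_3x3, w_5x1_plus_1x5)   # 25 18 10
```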
Network Architecture Design
Reduce size and computation with 1x1 Filter (bottleneck)

Figure Source:
Stanford cs231n

Used in Network In Network(NiN) and GoogLeNet


[Lin et al., ArXiV 2013 / ICLR 2014] [Szegedy et al., ArXiV 2014 / CVPR 2015]
51
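A back-of-envelope sketch (channel counts chosen to match a ResNet-style block; purely illustrative) of why the 1x1 bottleneck cuts both weights and MACs, as used in the modules on the next slide:

```python
def conv_weights(c_in, c_out, k):
    """Number of weights in a k x k convolution from c_in to c_out channels (no bias)."""
    return c_in * c_out * k * k

# Direct 3x3 conv on 256 channels vs. 1x1 compress -> 3x3 -> 1x1 expand
direct = conv_weights(256, 256, 3)
bottleneck = (conv_weights(256, 64, 1)      # 1x1 "compress" to 64 channels
              + conv_weights(64, 64, 3)     # cheaper 3x3 on the reduced channels
              + conv_weights(64, 256, 1))   # 1x1 "expand" back to 256 channels
print(direct, bottleneck)                   # 589824 vs. 69632, roughly 8.5x fewer weights
```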
Bottleneck in Popular DNN models

(Diagrams: the ResNet bottleneck block uses a 1x1 compress before and a 1x1 expand after its 3x3 convolution; the GoogleNet Inception module uses 1x1 compress layers before its larger filters.)

54
SqueezeNet
Reduce weights by reducing number of input
channels by “squeezing” with 1x1
50x fewer weights than AlexNet
(no accuracy loss)
Fire Module

[F.N. Iandola et al., ArXiv 2016] 55


Energy Consumption of Existing DNNs
(Scatter plot repeated from earlier: Top-5 accuracy vs. normalized energy consumption for AlexNet, SqueezeNet, GoogLeNet, VGG-16, ResNet-50)

Deeper CNNs with fewer weights do not necessarily consume less energy than shallower CNNs with more weights

[Yang et al., CVPR 2017] 56
Decompose Trained Filters
After training, perform low-rank approximation by applying tensor
decomposition to weight kernel; then fine-tune weights for accuracy

R = canonical rank [Lebedev et al., ICLR 2015] 57


Decompose Trained Filters
(Figure: visualization of original vs. approximated filters)

• Speed up by 1.6 – 2.7x on CPU/GPU for CONV1, CONV2 layers
• Reduce size by 5 – 13x for FC layer
• < 1% drop in accuracy
[Denton et al., NIPS 2014] 58
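A minimal numpy sketch of the low-rank idea (here a plain SVD on a flattened FC weight matrix, rather than the CP/tensor decomposition applied to conv kernels in the papers; rank and sizes are illustrative, and the factors would then be fine-tuned):

```python
import numpy as np

def low_rank_approx(w, rank):
    """Replace an M x N weight matrix by two factors of rank R,
    cutting storage and MACs from M*N to R*(M+N) when R is small."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]    # M x R factor (singular values folded in)
    b = vt[:rank, :]              # R x N factor
    return a, b

w = np.random.randn(4096, 4096).astype(np.float32)
a, b = low_rank_approx(w, rank=256)
print(w.size, a.size + b.size)    # ~16.8M vs. ~2.1M parameters (~8x smaller)
rel_err = np.linalg.norm(w - a @ b) / np.linalg.norm(w)
```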
Decompose Trained Filters on Phone
Tucker Decomposition

59
[Kim et al., ICLR 2016]
Knowledge Distillation

One or more teacher DNNs (DNN A, DNN B) each produce class probabilities through a softmax; a smaller student DNN is trained so that its softmax output tries to match the teachers' class probabilities.

[Bucilu et al., KDD 2006], [Hinton et al., arXiv 2015] 60
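A short numpy sketch of the "try to match" objective (the temperature value and the cross-entropy form follow the distillation literature in general, not anything specified on this slide; in practice it is combined with the usual hard-label loss):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)           # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Cross-entropy between the teacher's softened class probabilities and the
    student's, encouraging the student's softmax output to match the teacher's."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -np.sum(p_teacher * np.log(p_student + 1e-12), axis=-1).mean()
```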
