
The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)

DPFPS: Dynamic and Progressive Filter Pruning for Compressing Convolutional Neural Networks from Scratch

Xiaofeng Ruan1,2*, Yufan Liu1,2*, Bing Li1,4†, Chunfeng Yuan1, Weiming Hu1,2,3
1 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
2 School of Artificial Intelligence, University of Chinese Academy of Sciences
3 CAS Center for Excellence in Brain Science and Intelligence Technology
4 PeopleAI Inc.
{ruanxiaofeng2017, yufan.liu}@ia.ac.cn, {bli, cfyuan, wmhu}@nlpr.ia.ac.cn

* Equal contribution.
† Corresponding author.
Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Filter pruning is a commonly used method for compressing Convolutional Neural Networks (ConvNets), due to its hardware-friendly support and flexibility. However, existing methods mostly need a cumbersome procedure, which brings many extra hyper-parameters and training epochs. This is because only using sparsity and pruning stages cannot obtain a satisfying performance. Besides, many works do not consider the difference of pruning ratio across different layers. To overcome these limitations, we propose a novel Dynamic and Progressive Filter Pruning (DPFPS) scheme that directly learns a structured sparsity network from scratch. In particular, DPFPS imposes a new structured sparsity-inducing regularization specifically upon the expected pruning parameters in a dynamic sparsity manner. The dynamic sparsity scheme determines the sparsity allocation ratios of different layers, and a Taylor-series-based channel sensitivity criterion is presented to identify the expected pruning parameters. Moreover, we increase the structured sparsity-inducing penalty in a progressive manner. This helps the model to become sparse gradually instead of forcing it to be sparse from the beginning. Our method solves the pruning-ratio-based optimization problem by an iterative soft-thresholding algorithm (ISTA) with dynamic sparsity. At the end of training, we only need to remove the redundant parameters, without other stages such as fine-tuning. Extensive experimental results show that the proposed method is competitive with 11 state-of-the-art methods on both small-scale and large-scale datasets (i.e., CIFAR and ImageNet). Specifically, on ImageNet, we achieve a 44.97% pruning ratio of FLOPs by compressing ResNet-101, even with an increase of 0.12% in Top-5 accuracy. Our pruned models and codes are released at https://github.com/taoxvzi/DPFPS.

Figure 1: Block diagrams of two types of filter pruning. (a) This pruning procedure needs pre-trained models and fine-tuning after pruning, even a multi-pass scheme. (b) Our method directly learns a structured sparsity network from scratch, without any fine-tuning or multi-pass.

Introduction

With the development of computer science and artificial intelligence, ConvNets have significantly improved performance in several fields, such as image classification (Krizhevsky, Sutskever, and Hinton 2012), object detection (Girshick et al. 2014), object tracking (Bertinetto et al. 2016), and video understanding (Venugopalan et al. 2015). However, a large number of parameters and FLoating-point OPerations (FLOPs) make it difficult to deploy ConvNets on embedded or mobile devices. Therefore, model compression and acceleration of ConvNets is a fundamental problem that has been extensively studied in recent years. Several techniques, including weight sparsity, low-rank approximation, parameter quantization, and filter pruning, have been applied to model compression and acceleration. Among them, filter pruning methods have been widely explored and achieve excellent compression and acceleration ratios because they are hardware friendly and require no special libraries. For example, by using SSL (Wen et al. 2016), the convolutional layer computation of AlexNet is accelerated by average speedups of 5.1x and 3.1x against CPU and GPU, respectively, with off-the-shelf libraries.
Though previous filter pruning algorithms have achieved promising results, there still exist several problems. Specifically, a common procedure of filter pruning (Liu et al. 2019b) includes three stages: (1) training, (2) pruning, and (3) fine-tuning, as shown in Figure 1(a). Firstly, well-trained models usually need to be obtained. In some studies (Wen et al. 2016; Liu et al. 2017), structured sparsity networks are learned by imposing structured sparsity regularization. Some other studies (Li et al. 2016; Luo, Wu, and Lin 2017) directly use available pre-trained models. Then, when well-trained models are pruned, there are two typical issues: (1) it is often difficult to match the pre-set pruning ratio with the sparse rate of the network (after training with a sparsity-inducing constraint), and (2) a dedicated handcrafted pruning criterion needs to be designed to identify the unimportant neurons and remove them. In order to achieve the pre-set pruning ratio, some non-zero parameters are also removed, which is bound to bring an accuracy drop. Hence, to restore performance, it is critical to fine-tune pruned models. However, when pruned models are fine-tuned, many hyper-parameters (e.g., the learning rate, the number of fine-tuning epochs, the number of multi-pass rounds) need to be set again. This makes the multi-stage procedure even more cumbersome.

Figure 2: An illustration of DPFPS. (1) Sparsity allocation ratios and expected pruning parameters are dynamically updated. (2) The sparsity penalty increases gradually as training proceeds.

To address the above issues, we propose a novel Dynamic and Progressive Filter Pruning (DPFPS) scheme that directly learns a structured sparsity network from scratch, without pre-training, fine-tuning, or multi-pass, as shown in Figure 1(b). In our proposed method, to meet the pre-set pruning ratio, structured sparsity-inducing regularization is only imposed upon the expected pruning parameters, as shown in Figure 2. Different from fixed sparsity in each layer, our method dynamically updates the sparsity allocation ratios of different layers. The dynamic sparsity scheme determines the sparsity allocation ratios of different layers, and a Taylor-series-based channel sensitivity criterion is presented to identify the expected pruning parameters. In this manner, the reserved filters have more space to learn sufficient information, leading to better performance. For the redundant filters, the constraint pushes them towards zero and boosts the generation of a compact network. As a result, the dynamic sparsity allocation scheme makes it unnecessary to pre-define the pruning architecture. Moreover, a group Lasso based progressive penalty is designed as the structured sparsity regularization, which increases gradually as training proceeds (i.e., from zero to a proper value). This encourages all the filters to learn useful information at the beginning and forces the redundant filters to be sparse in the late period. Our method solves the pruning-ratio-based optimization problem by an iterative soft-thresholding algorithm (ISTA) with dynamic sparsity. At the end of training, we only need to remove the redundant parameters without other stages, such as fine-tuning, and the additional time consumption is also acceptable.

Our contributions are listed as follows:
• Our method dynamically updates sparsity allocation ratios in different layers and only imposes regularization upon the expected pruning parameters.
• Moreover, we design a group Lasso based progressive sparsity penalty to increasingly induce sparsity in the expected pruning parameters.
• Our method solves the pruning-ratio-based optimization problem by an iterative soft-thresholding algorithm (ISTA) with dynamic sparsity. At the end of training, we only need to remove the redundant parameters without other stages, such as fine-tuning. Extensive experimental results show that the proposed method is competitive with 11 state-of-the-art methods.

Related Work

Recent works on weight sparsity, parameter quantization, low-rank decomposition, filter pruning, and knowledge distillation can be found in (Idelbayev and Carreira-Perpinan 2020; Jin, Yang, and Liao 2020; Liu et al. 2019a, 2020; Ruan et al. 2020). In this section, we review related works in filter pruning, focusing on two aspects: structured sparsity networks and fine-tuning.

Structured Sparsity Networks. Learning a structured sparsity network is a straightforward pruning method, which is widely used to compress and accelerate neural networks (Wen et al. 2016; Alvarez and Salzmann 2016; Singh et al. 2019; Lin et al. 2020b). Wen et al. (Wen et al. 2016) used group Lasso regularization to achieve structured sparsity learning (SSL). Furthermore, a sparse group Lasso idea was applied to automatically determine the number of neurons in each layer of the network during learning (Alvarez and Salzmann 2016). Although a structured sparsity network was directly learned, the pruning ratio depended on the penalty term, and the excessively pruned model needed to be fine-tuned to regain accuracy. In this paper, given a pre-set pruning ratio, we directly learn a structured sparsity network from scratch and do not need any fine-tuning after pruning.

Fine-tuning. Because pruning the network hurts performance, the original performance needs to be restored by fine-tuning techniques (Han et al. 2015). Some methods pruned well-trained models by minimizing reconstruction error and then fine-tuned the pruned models (He, Zhang, and Sun 2017; Jiang et al. 2018). Although these methods have excellent convergence performance, the pruning process is often cumbersome, including obtaining the pre-trained model, pruning and recovering accuracy by minimizing layer-wise reconstruction error, and fine-tuning. Additionally, although the fine-tuning techniques (Ding et al. 2018; Yu et al. 2018; Molchanov et al. 2019; Peng et al. 2019; Ding et al. 2019a) are helpful for regaining the original accuracy, many hyper-parameters need to be set, which is cumbersome.
The Proposed Method

In this section, we first introduce the problem formulation. Then, we provide details of the progressive structured sparsity regularization. Finally, we present the dynamic sparsity algorithm.

Preliminary and Problem Formulation

In an L-layer ConvNet, the parameters of the i-th convolutional layer (ignoring biases) can be represented as a 4-dimensional tensor W^(i) ∈ R^(c^(i) × r^(i) × k1^(i) × k2^(i)), where c^(i) and r^(i) are the numbers of output and input channels (feature maps), and k1^(i) × k2^(i) corresponds to the 2-dimensional spatial kernel. We reorganize the parameters from the original space W^(i) ∈ R^(c^(i) × r^(i) × k1^(i) × k2^(i)) into (W^(i))_2D_out ∈ R^(c^(i) × r^(i)k1^(i)k2^(i)) or (W^(i))_2D_in ∈ R^(r^(i) × c^(i)k1^(i)k2^(i)), where (W^(i))_2D_out and (W^(i))_2D_in denote the matrix forms of tensor W^(i) along the output and input channels, respectively. Then W_j^(i) can be rewritten as a 3-dimensional tensor K_j^(i) ∈ R^(r^(i) × k1^(i) × k2^(i)) (j = 1, ..., c^(i)), representing the j-th filter kernel in the i-th convolutional layer, which corresponds to the weights of the j-th output channel. Besides convolutional layers, the parameters of BatchNorm layers are denoted as {γ_j^(i), β_j^(i)} ∈ R, which scale and shift the normalized value. To integrate the above parameters, we define Θ as the set of all parameters in the network, i.e., Θ = ∪_{i∈N1, j∈N2} (K_j^(i) ∪ {γ_j^(i), β_j^(i)}), where N1 = {1, ..., L} and N2 = {1, ..., c^(i)}.

We consider pruning the network as a constrained optimization problem (Boyd and Vandenberghe 2004). Given input-output pairs (x, y) from the data set D, it has the form

\min_{\Theta} \sum_{(x,y) \in D} L(F(\Theta, x), y), \quad \text{s.t.} \quad PR_0 - PR(\Theta) \le 0,   (1)

where F(·) is the ConvNet forward function, L(·) is the standard loss function (e.g., the cross-entropy loss for classification tasks), PR(·) is an evaluation metric, e.g., the pruning ratio of parameters or FLOPs, and PR_0 is a pre-set pruning ratio. It is not convenient to constrain PR(Θ) directly, so we use a layer-wise form to rewrite problem (1) as the unconstrained problem (Zhang et al. 2018)

\min_{\Theta} \sum_{(x,y) \in D} L(F(\Theta, x), y) + \sum_{i=1}^{L} t_i(W^{(i)}),   (2)

where t_i(·) is an indicator function of W^(i):

t_i(W^{(i)}) = \begin{cases} 0, & \mathrm{card}(W^{(i)}) \le p^{(i)}, \\ +\infty, & \text{otherwise}, \end{cases}   (3)

where card(·) returns the number of filters with non-zero ℓ2-norm in the layer, i.e., the number of rows j with ||(W_{j,:}^{(i)})||_2 ≠ 0 (the j-th row of (W^(i))_2D_out), and p^(i) denotes the desired number of preserved filters in the i-th layer. Because the second term of Equation (2) is not differentiable, the problem cannot be directly addressed by stochastic gradient descent.

Progressive Structured Sparsity Regularization

In previous works (Wen et al. 2016; Liu et al. 2017), structured sparsity regularization is widely used to turn the objective into an unconstrained optimization problem, formulated as

\min_{\Theta} \sum_{(x,y) \in D} L(F(\Theta, x), y) + \lambda R(\Theta),   (4)

where R(·) is the penalty function that makes the ConvNet sparse and λ is the penalty coefficient. However, the hyper-parameter λ does not directly control the pruning ratio. To meet the pre-set pruning ratio, the pruned model may need to be fine-tuned to recover accuracy, or λ must be obtained from empirical knowledge (e.g., several attempts).

To address this weakness, and different from the previous sparsity strategies (Wen et al. 2016; Liu et al. 2017), during training we control the pruning ratio as in Equation (2) and impose the structured sparsity regularization upon the expected pruning filters as in (4). Specifically, we divide the parameter set Θ into two subsets, Θ_p and its complement, where Θ_p contains the expected pruning filters that are "unimportant" and the complement contains the preserved filters. Therefore, based on our proposed structured sparsity regularization, the loss function can be expressed as

\mathrm{Loss} = \sum_{(x,y) \in D} L(F(\Theta, x), y) + \lambda R_{DPSS}(\Theta_p),   (5)

where R_DPSS(·) is our designed dynamic and progressive structured sparsity regularization (DPSS), which only acts on the expected pruning parameters Θ_p.

Inspired by the Group Lasso (Yuan and Lin 2006) method, we use the ℓ21-norm as the sparsity regularization, which has the form

R_{DPSS}(\Theta_p) = \sum_{i=1}^{L} \sum_{j=1}^{c^{(i)} - p^{(i)}} \left\| (W_{j,:}^{(i)})_{2D_{out}} \right\|_2 + \sum_{i=1}^{L-1} \sum_{j=1}^{c^{(i)} - p^{(i)}} \left\| (W_{j,:}^{(i+1)})_{2D_{in}} \right\|_2.   (6)

When channels are removed, parameters of both the current and the next layer are pruned. So, we impose the regularization upon the corresponding parameters in both the current and the next layers.
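To make the notation above concrete, the following is a minimal PyTorch-style sketch (not the authors' released code) of the 2D reshapings, the card(·) count used in Equation (3), and the ℓ21 group-Lasso term of Equation (6). The helper names and the assumption that consecutive convolutional layers are directly chained (so the output channels of layer i are the input channels of layer i+1) are ours:

import torch

def to_2d_out(w: torch.Tensor) -> torch.Tensor:
    # (c_out, r, k1, k2) -> (c_out, r*k1*k2); row j is the j-th output filter W_{j,:}.
    return w.reshape(w.shape[0], -1)

def to_2d_in(w: torch.Tensor) -> torch.Tensor:
    # (c_out, r, k1, k2) -> (r, c_out*k1*k2); row j gathers the weights fed by input channel j.
    return w.permute(1, 0, 2, 3).reshape(w.shape[1], -1)

def card(w: torch.Tensor, eps: float = 1e-12) -> int:
    # Number of output filters with non-zero l2-norm, as counted in Equation (3).
    return int((to_2d_out(w).norm(p=2, dim=1) > eps).sum())

def dpss_l21(weights, prune_rows):
    """l21 (group Lasso) value of Equation (6).

    weights:    list of conv weight tensors W^(1..L), assumed directly chained.
    prune_rows: prune_rows[i] is a LongTensor with the output-channel indices of
                layer i that belong to the expected pruning set Theta_p.
    Note: in DPFPS this term is handled by the ISTA proximal step (Equation (7))
    rather than by adding it to the loss and back-propagating through it.
    """
    reg = weights[0].new_zeros(())
    for i, (w, rows) in enumerate(zip(weights, prune_rows)):
        reg = reg + to_2d_out(w)[rows].norm(p=2, dim=1).sum()                      # current layer
        if i + 1 < len(weights):
            reg = reg + to_2d_in(weights[i + 1])[rows].norm(p=2, dim=1).sum()      # next layer
    return reg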
In Equation (5), the second term R_DPSS(Θ_p) is convex but not differentiable. In order to solve this non-smooth unconstrained optimization problem, we make use of ISTA (Beck and Teboulle 2009) and update each (W_{j,m}^(i))_2D ∈ Θ_p by {(W_{j,m}^(i))_2D}^(n+1) = S_λ({(W_{j,m}^(i))_2D}^(n)), where

S_{\lambda}\big((W_{j,m}^{(i)})_{2D}\big) = \begin{cases} (W_{j,m}^{(i)})_{2D} - \dfrac{\lambda\,(W_{j,m}^{(i)})_{2D}}{\|(W_{j,:}^{(i)})_{2D}\|_2}, & \text{if } \|(W_{j,:}^{(i)})_{2D}\|_2 > \lambda, \\ 0, & \text{otherwise}. \end{cases}   (7)

Because we train the network from scratch, to avoid excessive restraint in the early training phase, we introduce a penalty coefficient λ with a gradually increasing value (i.e., from zero to a proper value). In particular, for simplicity, we use the sigmoid function to set the value of λ at the t-th iteration, and λ(t) is updated by

\lambda(t) = \frac{\lambda_{max}}{1 + e^{-(30t/T - 15)}},   (8)

where λ_max is the maximum of the penalty coefficient and T is the total number of iterations.

Dynamic Sparsity Algorithm

Channel Sensitivity Criteria. To identify the unimportant parameter set Θ_p, it is critical to calculate the sensitivity of channels. From the perspective of the loss function, the sensitivity of a parameter w can be represented by

S(w) = \big| L(F(\Theta_{w \to 0}, x), y) - L(F(\Theta_{w}, x), y) \big|,   (9)

where Θ_{w→0} denotes that parameter w tends to 0.

Like (Ding et al. 2019b), using a Taylor series, the loss function can be represented as

L(F(\Theta_{w}, x), y) = L(F(\Theta_{w \to 0}, x), y) - \frac{\partial L(F(\Theta, x), y)}{\partial w}\, w + o(w^2),   (10)

where o(w^2) is a remainder term. Because w is very small, o(w^2) can be ignored.

Filter pruning removes filter-wise parameters rather than a single parameter, so the sensitivity of channel j with respect to the current (i-th) layer is

S_c\big((W_{j,:}^{(i)})_{2D_{out}}\big) = \big| L(F(\Theta_{(W_{j,:}^{(i)})_{2D_{out}} \to 0}, x), y) - L(F(\Theta_{(W_{j,:}^{(i)})_{2D_{out}}}, x), y) \big| \approx \Big| \frac{\partial L(F(\Theta, x), y)}{\partial (W_{j,:}^{(i)})_{2D_{out}}}\, (W_{j,:}^{(i)})_{2D_{out}} \Big| = \Big| \sum_{m=1}^{r^{(i)} k_1^{(i)} k_2^{(i)}} \frac{\partial L(F(\Theta, x), y)}{\partial (W_{j,m}^{(i)})_{2D_{out}}}\, (W_{j,m}^{(i)})_{2D_{out}} \Big|.   (11)

When channels are removed in the current layer, the corresponding input channels are pruned in the next layer (Li et al. 2019). So, the sensitivity with respect to the next layer is

S_n\big((W_{j,:}^{(i+1)})_{2D_{in}}\big) \approx \Big| \sum_{m=1}^{c^{(i+1)} k_1^{(i+1)} k_2^{(i+1)}} \frac{\partial L(F(\Theta, x), y)}{\partial (W_{j,m}^{(i+1)})_{2D_{in}}}\, (W_{j,m}^{(i+1)})_{2D_{in}} \Big|.   (12)

We consider the impact of both the current and the next layer, and the total sensitivity of channel j in the i-th layer is

S_t(W_{j}^{(i)}) = \Big| \sum_{m=1}^{r^{(i)} k_1^{(i)} k_2^{(i)}} \frac{\partial L(F(\Theta, x), y)}{\partial (W_{j,m}^{(i)})_{2D_{out}}}\, (W_{j,m}^{(i)})_{2D_{out}} + \sum_{m=1}^{c^{(i+1)} k_1^{(i+1)} k_2^{(i+1)}} \frac{\partial L(F(\Theta, x), y)}{\partial (W_{j,m}^{(i+1)})_{2D_{in}}}\, (W_{j,m}^{(i+1)})_{2D_{in}} \Big|.   (13)

Dynamic Sparsity Ratio Allocation. To adaptively learn a structured sparsity network, our method dynamically updates the sparsity allocation ratios of the different layers. Specifically, at the end of every epoch, we leverage the sparsity result of that epoch to recalculate the sparsity allocation ratio of every layer for the subsequent epoch. We denote the relative sparsity allocation ratio of the i-th layer as sr^(i), the sparsity ratio (the ratio of zero parameters) as sr_zero^(i), and the sparsity-inducing ratio (the ratio of redundant non-zero parameters) as sr_inducing^(i). To avoid introducing extra layer-wise parameters, we use a uniform sparsity-inducing ratio across all layers. The sparsity allocation ratio of a layer is then the sum of its sparsity ratio and the sparsity-inducing ratio. During training, as the sparsity ratios of the different layers keep changing, our sparsity allocation ratios are dynamically updated. Hence, our method does not require a pre-defined pruning architecture and can adaptively learn an optimal structured sparsity network.

Algorithm 1: The Proposed Method
Input: network, dataset D, λ_max, learning rate α, pruning ratio PR_0.
1: e = 0;
2: For each layer i, initialize the sparsity allocation ratio sr^(i);
3: while e < Epoch do
4:     for each iteration in epoch e do
5:         Update Θ ← Θ − α∇L(Θ) using dataset D;
6:         For each layer i, calculate the channel sensitivity set S_channel^(i) = {S_t(K_1^(i)), S_t(K_2^(i)), ..., S_t(K_{c^(i)}^(i))} via Equation (13);
7:         For each layer i, sort S_channel^(i) and take the smallest ⌈sr^(i) · c^(i)⌉ entries to identify Θ_p and its index set;
8:         Update λ via Equation (8);
9:         for each (W^(i))_2D[index^(i)] ∈ Θ_p do
10:            Update (W^(i))_2D[index^(i)] via Equation (7);
11:        end
12:    end
13:    For each layer i, recalculate sr^(i);
14:    e = e + 1;
15: end
16: Prune the redundant filters (those with ||W_{j,:}^(i)||_2 = 0) and return the compressed network with acceptable accuracy.
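The following PyTorch-style sketch mirrors the main computational steps of Algorithm 1. It is only an illustration under simplifying assumptions (a plainly chained ConvNet with one weight tensor per layer, cross-entropy loss); helper names such as recalc_allocation_ratios and the uniform inducing_ratio value are ours, not identifiers or settings from the released code:

import math
import torch

def lambda_schedule(t: int, T: int, lam_max: float) -> float:
    # Progressive penalty of Equation (8): rises smoothly from ~0 to lam_max.
    return lam_max / (1.0 + math.exp(-(30.0 * t / T - 15.0)))

def channel_sensitivity(w: torch.Tensor, w_next: torch.Tensor = None) -> torch.Tensor:
    # First-order Taylor criterion of Equation (13); call after loss.backward().
    s = (w.grad.reshape(w.shape[0], -1) * w.detach().reshape(w.shape[0], -1)).sum(dim=1)
    if w_next is not None and w_next.grad is not None:
        g_in = w_next.grad.permute(1, 0, 2, 3).reshape(w_next.shape[1], -1)
        v_in = w_next.detach().permute(1, 0, 2, 3).reshape(w_next.shape[1], -1)
        s = s + (g_in * v_in).sum(dim=1)
    return s.abs()

@torch.no_grad()
def ista_update(w: torch.Tensor, rows: torch.Tensor, lam: float) -> None:
    # Group soft-thresholding of Equation (7), applied in place to the selected filters.
    w2d = w.view(w.shape[0], -1)
    norms = w2d[rows].norm(p=2, dim=1, keepdim=True)
    shrink = (1.0 - lam / norms.clamp_min(1e-12)).clamp_min(0.0)
    w2d[rows] = w2d[rows] * shrink      # filters with ||W_{j,:}||_2 <= lam become exactly zero

@torch.no_grad()
def recalc_allocation_ratios(weights, inducing_ratio: float = 0.05, eps: float = 1e-12):
    # Per-epoch update of sr^(i): observed zero-filter ratio plus a uniform sparsity-inducing ratio.
    ratios = []
    for w in weights:
        zero = (w.reshape(w.shape[0], -1).norm(p=2, dim=1) <= eps).float().mean()
        ratios.append(float(zero) + inducing_ratio)
    return ratios

def train_dpfps(model, convs, loader, optimizer, epochs, lam_max, sr):
    # convs: list of the chained conv weight tensors; sr: initial allocation ratios.
    T = epochs * len(loader)
    t = 0
    for epoch in range(epochs):
        for x, y in loader:
            loss = torch.nn.functional.cross_entropy(model(x), y)
            optimizer.zero_grad(); loss.backward(); optimizer.step()
            lam = lambda_schedule(t, T, lam_max); t += 1
            for i, w in enumerate(convs):
                w_next = convs[i + 1] if i + 1 < len(convs) else None
                s = channel_sensitivity(w, w_next)
                k = math.ceil(sr[i] * w.shape[0])
                rows = torch.argsort(s)[:k]          # the k least sensitive filters -> Theta_p
                ista_update(w, rows, lam)
        sr = recalc_allocation_ratios(convs)
    # After training, filters whose l2-norm is zero can be physically removed (Algorithm 1, line 16).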
Computational Complexity Analysis. In our proposed method, the additional computation mainly comes from identifying and updating the expected pruning parameters. Because the gradients of the parameters are already obtained by the backward pass of the SGD algorithm, and the parameter updates are performed by ISTA, the additional time consumption of DPFPS is acceptable. The detailed procedure of DPFPS is presented in Algorithm 1.

Table 1: The pruning results of VGG-Small and ResNet56 on CIFAR-10. "Pre-defined?" refers to whether a fixed pruning ratio is pre-set in each layer through empirical studies. "--" indicates that the result is not listed in the original paper.

VGG-Small:
Method                        | Pre-defined? | Pre-trained? | Fine-tuning? | Test Accuracy  | PR (Params) | PR (FLOPs)
Baseline                      | N/A          | N/A          | N/A          | (93.85±0.07)%  | 0           | 0
Li et al. (Li et al. 2016)    | Yes          | Yes          | Yes          | 93.40%         | 63.90%      | 34.21%
NRE (Jiang et al. 2018)       | No           | Yes          | Yes          | 93.40%         | 92.72%      | 67.64%
Slimming (Liu et al. 2017)    | No           | No           | Yes          | 93.48%         | 86.65%      | 43.50%
HRank (Lin et al. 2020a)      | No           | Yes          | Yes          | 93.43%         | 82.90%      | 53.50%
SFP (He et al. 2018a)         | Yes          | No           | No           | 92.66%         | 69.17%      | 69.24%
SSS (Huang and Wang 2018)     | No           | No           | No           | 93.20%         | 66.67%      | 69.70%
Ours                          | No           | No           | No           | (93.52±0.15)%  | 93.32%      | 70.85%

ResNet56:
Method                        | Pre-defined? | Pre-trained? | Fine-tuning? | Test Accuracy  | PR (Params) | PR (FLOPs)
Baseline                      | N/A          | N/A          | N/A          | (93.81±0.14)%  | 0           | 0
Li et al. (Li et al. 2016)    | Yes          | Yes          | Yes          | 93.06%         | 13.70%      | 27.60%
CP (He, Zhang, and Sun 2017)  | Yes          | Yes          | Yes          | 91.80%         | --          | 50.00%
AMC (He et al. 2018b)         | No           | Yes          | Yes          | 91.90%         | --          | 50.00%
HRank (Lin et al. 2020a)      | No           | Yes          | Yes          | 93.17%         | 42.40%      | 50.00%
SFP (He et al. 2018a)         | Yes          | No           | No           | (92.26±0.31)%  | --          | 52.60%
FPGM (He et al. 2019)         | Yes          | No           | No           | (92.93±0.49)%  | --          | 52.60%
ASFP (He et al. 2020)         | Yes          | No           | No           | (92.44±0.07)%  | --          | 52.60%
Ours                          | No           | No           | No           | (93.20±0.11)%  | 46.84%      | 52.86%

Experiments

Experimental Setup

1) Datasets and Networks: We evaluate DPFPS on two datasets: CIFAR (Krizhevsky and Hinton 2009) and ImageNet (Russakovsky et al. 2015). The same data augmentation strategies are used as in the PyTorch official examples (Paszke et al. 2017). On CIFAR-10, we evaluate the proposed method using VGG-16 (Simonyan and Zisserman 2014) and ResNet56 (He et al. 2016). As the original VGG-16 is specially designed for ImageNet classification, we use a variant (i.e., VGG-Small) taken from (Zagoruyko 2015) in our experiments. On the ImageNet dataset, we evaluate DPFPS on ResNets (including ResNet 34, 50, and 101) and MobileNet v2 (Sandler et al. 2018).

2) Implementation Details: All networks are trained from scratch. Training takes 200 and 100 epochs on the CIFAR-10 and ImageNet datasets, with an initial learning rate of 0.1 and mini-batch sizes of 64 and 256, respectively. The learning rate is multiplied by 0.1 at 50% and 75% of the training epochs on CIFAR-10, and at epochs 30, 60, and 90 on ImageNet. We use an SGD optimizer with a weight decay of 10^-4 and a momentum of 0.9. For MobileNet v2 on ImageNet, we use settings similar to AMC (He et al. 2018b). All experiments are implemented with PyTorch on multiple NVIDIA RTX 2080 Ti GPUs and an Intel(R) Xeon(R) Gold 5118 CPU. On CIFAR-10, considering fluctuations caused by different random seeds, we report the mean and standard deviation over 5 runs of the same experiment.

3) Parameter Settings: In our experiments, the parameter λ_max impacts the structured sparsity results, so we study the performance under different λ_max settings. As shown in Figure 3, we tune λ_max exponentially over a relatively wide range (i.e., [0.001, 10]). When λ_max is set to a small value, the test accuracy is low. This is because inadequate sparsity produces low precision after pruning the network, as in (Wen et al. 2016; Liu et al. 2017), so the pruned network would need fine-tuning to recover the precision. When λ_max is set to a large value, excessive sparsity prevents the network parameters from being learned well. In our deployment, λ_max = 0.01 meets the experimental requirements. For simplicity, we set λ_max to 0.01 in all experiments, though better results could be obtained by tuning λ_max for each experiment.

Figure 3: Test accuracy under different λ_max settings (VGG-Small on CIFAR-10, PR_Params = 90%).
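As a concrete reading of the CIFAR-10 training settings in the Implementation Details above, a possible PyTorch setup is sketched below; this is our own illustration and the released code may differ in details:

import torch

def cifar10_optimizer(model: torch.nn.Module, epochs: int = 200):
    # SGD with lr 0.1, momentum 0.9, weight decay 1e-4; lr x0.1 at 50% and 75% of training.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[epochs // 2, (3 * epochs) // 4], gamma=0.1)
    return optimizer, scheduler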
Table 2: The pruning results of ResNets and MobileNet v2 on ImageNet.

ResNet34:
Method                      | Pre-defined? | Pre-trained? | Fine-tuning? | Top-1  | Top-5  | PR (FLOPs)
Baseline                    | N/A          | N/A          | N/A          | 73.92% | 91.62% | 0
Li et al. (Li et al. 2016)  | Yes          | Yes          | Yes          | 72.17% | --     | 24.20%
SFP (He et al. 2018a)       | Yes          | No           | No           | 71.83% | 90.33% | 41.10%
FPGM (He et al. 2019)       | Yes          | No           | No           | 72.11% | 90.69% | 41.10%
ASFP (He et al. 2020)       | Yes          | No           | No           | 71.72% | 90.65% | 41.10%
Ours                        | No           | No           | No           | 72.25% | 90.80% | 43.29%

ResNet50:
Method                        | Pre-defined? | Pre-trained? | Fine-tuning? | Top-1  | Top-5  | PR (FLOPs)
Baseline                      | N/A          | N/A          | N/A          | 76.15% | 92.87% | 0
ThiNet (Luo et al. 2019)      | Yes          | Yes          | Yes          | 74.03% | 92.11% | 36.80%
CP (He, Zhang, and Sun 2017)  | Yes          | Yes          | Yes          | --     | 90.80% | 50.00%
HRank (Lin et al. 2020a)      | No           | Yes          | Yes          | 74.98% | 92.33% | 43.77%
SFP (He et al. 2018a)         | Yes          | No           | No           | 74.61% | 92.06% | 41.80%
FPGM (He et al. 2019)         | Yes          | No           | No           | 75.03% | 92.40% | 42.20%
ASFP (He et al. 2020)         | Yes          | No           | No           | 74.88% | 92.39% | 41.80%
Ours                          | No           | No           | No           | 75.55% | 92.54% | 46.20%

ResNet101:
Method                 | Pre-defined? | Pre-trained? | Fine-tuning? | Top-1  | Top-5  | PR (FLOPs)
Baseline               | N/A          | N/A          | N/A          | 77.37% | 93.56% | 0
SFP (He et al. 2018a)  | Yes          | No           | No           | 77.03% | 93.46% | 42.20%
Ours                   | No           | No           | No           | 77.27% | 93.68% | 44.97%

MobileNet v2:
Method                                    | Pre-defined? | Pre-trained? | Fine-tuning? | Top-1  | Top-5 | PR (FLOPs)
Baseline                                  | N/A          | N/A          | N/A          | 72.00% | --    | 0
0.75x MobileNet v2 (Sandler et al. 2018)  | No           | No           | No           | 69.80% | --    | 26.54%
AMC (He et al. 2018b)                     | No           | Yes          | Yes          | 70.80% | --    | 26.54%
Ours                                      | No           | No           | No           | 71.10% | --    | 24.89%

4) Evaluation Metrics: To evaluate the performance of DPFPS, we use the pruning ratio of parameters or FLOPs:

PR_{Params/FLOPs} = 1 - \frac{\#Pruned\ Params/FLOPs}{\#Original\ Params/FLOPs},   (14)

where "#Pruned Params/FLOPs" denotes the parameters or FLOPs remaining after the model has been pruned.
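Equation (14) is straightforward to apply once the parameter or FLOP counts of the original and pruned models are available; a small helper (ours, for illustration only) is shown below. The example uses the ResNet-101 result reported in this paper: keeping about 55% of the original FLOPs corresponds to a pruning ratio of roughly 45%.

def pruning_ratio(pruned: float, original: float) -> float:
    # Equation (14): fraction of parameters/FLOPs removed by pruning.
    return 1.0 - pruned / original

print(round(pruning_ratio(55.03, 100.0), 4))  # 0.4497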

Performance Comparison with Different Baselines

We use VGG-Small on CIFAR-10 and compare DPFPS with three baselines: SFP, SSS, and Slimming. As shown in Figure 4, when PR_Params is less than 50%, SFP is inferior to the other methods. This is because SFP adopts a fixed pruning structure and does not learn an optimal pruning network. As PR_Params increases, SSS requires a larger penalty to be imposed upon the parameters. When PR_Params reaches 65%, the test accuracy of SSS drops sharply. So, when a large pruning ratio is required, SSS is inappropriate due to the excessive restraint, and fine-tuning is needed to recover the precision, as with Slimming. Among all the baselines, DPFPS achieves the best trade-off curve between test accuracy and pruning ratio, exceeding even the fine-tuning method (i.e., Slimming). These results show that our dynamic and progressive structured sparsity regularization is effective. More importantly, our proposed method alleviates the cumbersome pruning process, working without any pre-training, fine-tuning, or multi-pass scheme, which is efficient.

Figure 4: Performance comparison with different baselines.

Figure 5: Analysis of training convergence.

Performance Comparison with State-of-the-Art Methods

We also compare DPFPS with 11 state-of-the-art methods.

On CIFAR-10, we compare our method with Li et al., NRE, Slimming, SFP, CP, AMC, FPGM, SSS, ASFP, and HRank, as shown in Table 1. For VGG-Small, the proposed method achieves an accuracy of 93.52% with a 93.32% pruning ratio of parameters and a 70.85% pruning ratio of FLOPs, much better than all the competing methods. Compared with the baseline, our method prunes more than 93% of the parameters and 70% of the FLOPs of VGG-Small, with only a slight accuracy loss. For the more compact ResNet56 architecture, our method prunes more than 52% of the FLOPs and outperforms the other methods in performance.
On ImageNet, our method is compared with Li et al., ThiNet, CP, SFP, FPGM, ASFP, AMC, and HRank. The results are reported in Table 2. For ResNet34, the proposed method achieves the best performance. For ResNet50, our method obtains the highest accuracy, with the second highest pruning ratio of FLOPs (46.21% vs 50.00%). For the deeper ResNet101, we obtain the highest pruning ratio, even with an increase of 0.12% in Top-5 accuracy. For lightweight network compression (i.e., MobileNet v2), our method also obtains the best accuracy. Thus, it can be concluded that the proposed method is superior to all the competing methods, owing to the introduction of the dynamic and progressive structured sparsity regularization.

Analysis and Discussion

Training Convergence Analysis. In terms of training convergence, Figure 5 plots the training loss and test accuracy curves of our proposed method, WM, and the baseline on VGG-Small@CIFAR-10. To achieve thinner models, WM uniformly prunes the channels of each layer, as in MobileNet. We set PR_Params = 90% for both our method and WM. Because our regularization value increases gradually, at early training epochs our training loss and test accuracy are consistent with the original network and better than WM. As the penalty continues to increase, our training loss and test accuracy become slightly worse than the baseline but remain better than WM. After 100 epochs, our method achieves nearly the same training loss and test accuracy as the baseline, still better than WM. It can be concluded that the proposed method has convergence speed and final performance comparable to the baseline and better than WM.

Efficacy Analysis. To evaluate the effectiveness of our method, on the one hand, we analyze the sparsity ratio over epochs and the pruned structure on VGG-Small@CIFAR-10. As shown in Figures 6(a) and 6(b), during training our method dynamically allocates the sparsity ratio to each layer and adaptively learns the sparsity of each layer in the network, which is beneficial for learning a structured sparsity network with good performance, as shown in Figure 6(c). Additionally, we visualize the channel number of each layer for our method and the baseline in Figure 6(d), which shows that deep layers have more redundancy than shallow layers. These results show that our method is effective thanks to the dynamic structured sparsity regularization. On the other hand, we also analyze the effectiveness of the progressive penalty. As shown in Table 3, our method is superior to the same regularization without the progressive penalty. These results show that our method is effective thanks to the progressive structured sparsity regularization.

Figure 6: Visualization analysis of the different layers and of the whole network. (a) Sparsity allocation ratio. (b) Sparsity ratio. (c) Sparsity ratio of the network. (d) Channel number.

Table 3: Comparison results with and without the progressive penalty.

Method                          | VGG-Small@CIFAR-10 Accuracy | VGG-Small@CIFAR-10 PR (FLOPs) | ResNet34@ImageNet Top-1 | ResNet34@ImageNet PR (FLOPs)
Ours (w/o progressive penalty)  | 93.01%                      | 76.26%                        | 72.22%                  | 38.88%
Ours                            | 93.26%                      | 77.43%                        | 72.25%                  | 43.29%

Its Application to Detection Tasks. As described above, our method is effective for image classification. To further analyze its generalization, we use ResNet50 as the backbone of Faster R-CNN (Ren et al. 2015) for object detection, and compress Faster R-CNN by reducing the FLOPs of the backbone network by 30%. We evaluate the performance with mean Average Precision (mAP) on the PASCAL VOC 2007 (Everingham et al. 2015) dataset. As shown in Table 4, our pruned model achieves a good result: the mAP of our method is slightly lower than the baseline, but the inference speed is faster. This demonstrates that our method generalizes well to other tasks.

Table 4: Results on the PASCAL VOC 2007 dataset.

Dataset          | Method       | FLOPs (Backbone) | mAP
PASCAL VOC 2007  | Faster R-CNN | 4.08G            | 0.734
PASCAL VOC 2007  | Ours         | 2.81G            | 0.726

Conclusion

In this paper, we have proposed a novel Dynamic and Progressive Filter Pruning (DPFPS) scheme that directly learns a structured sparsity network from scratch. It imposes a new structured sparsity-inducing regularization specifically upon the expected pruning parameters in a dynamic sparsity manner. Moreover, we have designed a group Lasso based progressive penalty regularization, which makes the sparsification process soft and alleviates the harm to model performance. Extensive experiments have shown that our method is competitive with state-of-the-art methods.
Acknowledgments

This work was supported in part by the National Key Research and Development Program of China (Grant No. 2020AAA0106800, No. 2018AAA0102802, No. 2018AAA0102803, and No. 2018AAA0102800), in part by the Natural Science Foundation of China (Grant No. 62036011, No. 61902401, No. 61972071, No. 61906052, No. 61772225, No. 61721004, No. U1803119, No. U1736106, and No. 6187610), in part by the NSFC-General Technology Collaborative Fund for Basic Research (Grant No. U1936204), in part by the Beijing Natural Science Foundation (Grant No. JQ18018), in part by the Key Research Program of Frontier Sciences, CAS (Grant No. QYZDJ-SSW-JSC040), in part by the National Natural Science Foundation of Guangdong (No. 2018B030311046), and in part by the CAS External Cooperation Key Project. The work of Bing Li was also supported by the Youth Innovation Promotion Association, CAS.

References

Alvarez, J. M.; and Salzmann, M. 2016. Learning the Number of Neurons in Deep Networks. In Advances in Neural Information Processing Systems, 2270–2278.

Beck, A.; and Teboulle, M. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1): 183–202.

Bertinetto, L.; Valmadre, J.; Henriques, J. F.; Vedaldi, A.; and Torr, P. H. 2016. Fully-convolutional siamese networks for object tracking. In Proceedings of European Conference on Computer Vision, 850–865.

Boyd, S.; and Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press.

Ding, X.; Ding, G.; Guo, Y.; Han, J.; and Yan, C. 2019a. Approximated Oracle Filter Pruning for Destructive CNN Width Optimization. In Proceedings of International Conference on Machine Learning, 1607–1616.

Ding, X.; Ding, G.; Han, J.; and Tang, S. 2018. Auto-balanced Filter Pruning for Efficient Convolutional Neural Networks. In Proceedings of AAAI Conference on Artificial Intelligence, 6797–6804.

Ding, X.; Ding, G.; Zhou, X.; Guo, Y.; Han, J.; and Liu, J. 2019b. Global Sparse Momentum SGD for Pruning Very Deep Neural Networks. In Advances in Neural Information Processing Systems, 6382–6394.

Everingham, M.; Eslami, S. M.; Van Gool, L.; Williams, C. K. I.; Winn, J.; and Zisserman, A. 2015. The Pascal Visual Object Classes Challenge: A Retrospective. International Journal of Computer Vision 111(1): 98–136.

Girshick, R.; Donahue, J.; Darrell, T.; and Malik, J. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 580–587.

Han, S.; Pool, J.; Tran, J.; and Dally, W. 2015. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, 1135–1143.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

He, Y.; Dong, X.; Kang, G.; Fu, Y.; Yan, C.; and Yang, Y. 2020. Asymptotic Soft Filter Pruning for Deep Convolutional Neural Networks. IEEE Transactions on Cybernetics 50(8): 3594–3604.

He, Y.; Kang, G.; Dong, X.; Fu, Y.; and Yang, Y. 2018a. Soft filter pruning for accelerating deep convolutional neural networks. In Proceedings of International Joint Conference on Artificial Intelligence, 2234–2240.

He, Y.; Lin, J.; Liu, Z.; Wang, H.; Li, L.-J.; and Han, S. 2018b. AMC: AutoML for Model Compression and Acceleration on Mobile Devices. In Proceedings of European Conference on Computer Vision, 815–832.

He, Y.; Liu, P.; Wang, Z.; Hu, Z.; and Yang, Y. 2019. Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4340–4349.

He, Y.; Zhang, X.; and Sun, J. 2017. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, 1389–1397.

Huang, Z.; and Wang, N. 2018. Data-Driven Sparse Structure Selection for Deep Neural Networks. In Proceedings of European Conference on Computer Vision, 317–334.

Idelbayev, Y.; and Carreira-Perpinan, M. A. 2020. Low-Rank Compression of Neural Nets: Learning the Rank of Each Layer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8049–8059.

Jiang, C.; Li, G.; Qian, C.; and Tang, K. 2018. Efficient DNN Neuron Pruning by Minimizing Layer-wise Nonlinear Reconstruction Error. In Proceedings of International Joint Conference on Artificial Intelligence, 2298–2304.

Jin, Q.; Yang, L.; and Liao, Z. 2020. AdaBits: Neural Network Quantization With Adaptive Bit-Widths. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2146–2156.

Krizhevsky, A.; and Hinton, G. 2009. Learning multiple layers of features from tiny images. Technical report, Citeseer.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–1105.

Li, H.; Kadav, A.; Durdanovic, I.; Samet, H.; and Graf, H. P. 2016. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710.

Li, J.; Qi, Q.; Wang, J.; Ge, C.; Li, Y.; Yue, Z.; and Sun, H. 2019. OICSR: Out-In-Channel Sparsity Regularization for Compact Deep Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7046–7055.
Lin, M.; Ji, R.; Wang, Y.; Zhang, Y.; Zhang, B.; Tian, Y.; and Shao, L. 2020a. HRank: Filter Pruning Using High-Rank Feature Map. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1529–1538.

Lin, S.; Ji, R.; Li, Y.; Deng, C.; and Li, X. 2020b. Toward Compact ConvNets via Structure-Sparsity Regularized Filter Pruning. IEEE Transactions on Neural Networks and Learning Systems 31(2): 574–588.

Liu, N.; Ma, X.; Xu, Z.; Wang, Y.; Tang, J.; and Ye, J. 2020. AutoCompress: An Automatic DNN Structured Pruning Framework for Ultra-High Compression Rates. In Proceedings of AAAI Conference on Artificial Intelligence, 4876–4883.

Liu, Y.; Cao, J.; Li, B.; Yuan, C.; Hu, W.; Li, Y.; and Duan, Y. 2019a. Knowledge Distillation via Instance Relationship Graph. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7096–7104.

Liu, Z.; Li, J.; Shen, Z.; Huang, G.; Yan, S.; and Zhang, C. 2017. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, 2736–2744.

Liu, Z.; Sun, M.; Zhou, T.; Huang, G.; and Darrell, T. 2019b. Rethinking the Value of Network Pruning. In International Conference on Learning Representations.

Luo, J.; Zhang, H.; Zhou, H.; Xie, C.; Wu, J.; and Lin, W. 2019. ThiNet: Pruning CNN Filters for a Thinner Net. IEEE Transactions on Pattern Analysis and Machine Intelligence 41(10): 2525–2538.

Luo, J.-H.; Wu, J.; and Lin, W. 2017. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, 5058–5066.

Molchanov, P.; Mallya, A.; Tyree, S.; Frosio, I.; and Kautz, J. 2019. Importance Estimation for Neural Network Pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 11264–11272.

Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in PyTorch. In Advances in Neural Information Processing Systems Workshop.

Peng, H.; Wu, J.; Chen, S.; and Huang, J. 2019. Collaborative Channel Pruning for Deep Networks. In Proceedings of the International Conference on Machine Learning, 5113–5122.

Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 91–99.

Ruan, X.; Liu, Y.; Yuan, C.; Li, B.; Hu, W.; Li, Y.; and Maybank, S. 2020. EDP: An Efficient Decomposition and Pruning Scheme for Convolutional Neural Network Compression. IEEE Transactions on Neural Networks and Learning Systems 1–15.

Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; Berg, A. C.; and Fei-Fei, L. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115(3): 211–252.

Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; and Chen, L.-C. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4510–4520.

Simonyan, K.; and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Singh, P.; Verma, V. K.; Rai, P.; and Namboodiri, V. P. 2019. Play and Prune: Adaptive filter pruning for deep model compression. In Proceedings of International Joint Conference on Artificial Intelligence, 3460–3466.

Venugopalan, S.; Rohrbach, M.; Donahue, J.; Mooney, R.; Darrell, T.; and Saenko, K. 2015. Sequence to sequence – video to text. In Proceedings of the IEEE International Conference on Computer Vision, 4534–4542.

Wen, W.; Wu, C.; Wang, Y.; Chen, Y.; and Li, H. 2016. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, 2074–2082.

Yu, R.; Li, A.; Chen, C.-F.; Lai, J.-H.; Morariu, V. I.; Han, X.; Gao, M.; Lin, C.-Y.; and Davis, L. S. 2018. NISP: Pruning networks using neuron importance score propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 9194–9203.

Yuan, M.; and Lin, Y. 2006. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68(1): 49–67.

Zagoruyko, S. 2015. 92.45% on CIFAR-10 in Torch. Torch Blog.

Zhang, T.; Ye, S.; Zhang, K.; Tang, J.; Wen, W.; Fardad, M.; and Wang, Y. 2018. A Systematic DNN Weight Pruning Framework using Alternating Direction Method of Multipliers. In Proceedings of European Conference on Computer Vision, 184–199.
