Dynamic and Progressive Filter Pruning for Compressing Convolutional Neural Networks from Scratch

* Equal contribution. † Corresponding author.
Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Filter pruning is a commonly used method for compressing Convolutional Neural Networks (ConvNets) because of its hardware friendliness and flexibility. However, existing methods mostly require a cumbersome procedure that introduces many extra hyper-parameters and training epochs, since sparsity and pruning stages alone cannot reach satisfying performance. Besides, many works do not consider that the pruning ratio should differ across layers. To overcome these limitations, we propose a novel Dynamic and Progressive Filter Pruning (DPFPS) scheme that directly learns a structured sparsity network from scratch. In particular, DPFPS imposes a new structured sparsity-inducing regularization specifically upon the expected pruning parameters in a dynamic sparsity manner. The dynamic sparsity scheme determines the sparsity allocation ratios of different layers, and a Taylor-series-based channel sensitivity criterion is presented to identify the expected pruning parameters. Moreover, we increase the structured sparsity-inducing penalty in a progressive manner, which helps the model become sparse gradually instead of being forced to be sparse from the beginning. Our method solves the pruning-ratio-based optimization problem with an iterative soft-thresholding algorithm (ISTA) with dynamic sparsity. At the end of training, we only need to remove the redundant parameters, without any further stage such as fine-tuning. Extensive experimental results show that the proposed method is competitive with 11 state-of-the-art methods on both small-scale and large-scale datasets (i.e., CIFAR and ImageNet). Specifically, on ImageNet we achieve a 44.97% FLOPs pruning ratio on ResNet-101, even with an increase of 0.12% in Top-5 accuracy. Our pruned models and code are released at https://github.com/taoxvzi/DPFPS.

Figure 1: Block diagrams of two types of filter pruning. (a) This pruning procedure needs pre-trained models and fine-tuning after pruning, or even a multi-pass scheme. (b) Our method directly learns a structured sparsity network from scratch, without any fine-tuning or multi-pass.

Introduction

With the development of computer science and artificial intelligence, ConvNets have significantly improved performance in several fields, such as image classification (Krizhevsky, Sutskever, and Hinton 2012), object detection (Girshick et al. 2014), object tracking (Bertinetto et al. 2016), and video understanding (Venugopalan et al. 2015). However, the large number of parameters and FLoating-point OPerations (FLOPs) makes it difficult to deploy ConvNets on embedded or mobile devices. Therefore, model compression and acceleration of ConvNets is a fundamental problem that has been extensively studied in recent years. Several techniques, including weight sparsity, low-rank approximation, parameter quantization, and filter pruning, have been applied to model compression and acceleration. Among them, filter pruning has been widely explored and delivers excellent compression and acceleration ratios because it is hardware friendly and needs no special libraries. For example, with SSL (Wen et al. 2016), convolutional layer computation of AlexNet is accelerated by 5.1x and 3.1x on average against CPU and GPU, respectively, with off-the-shelf libraries.

Though previous filter pruning algorithms have achieved promising results, several problems remain.
Specifically, a common filter pruning procedure (Liu et al. 2019b) includes three stages: (1) training, (2) pruning, and (3) fine-tuning, as shown in Figure 1(a). First, a well-trained model usually needs to be obtained. In some studies (Wen et al. 2016; Liu et al. 2017), structured sparsity networks are learned by imposing structured sparsity regularization; other studies (Li et al. 2016; Luo, Wu, and Lin 2017) directly use available pre-trained models. Then, when well-trained models are pruned, two typical issues arise: (1) it is often difficult to match the pre-set pruning ratio with the sparsity rate of the network (after training with a sparsity-inducing constraint), and (2) a dedicated handcrafted pruning criterion needs to be designed to identify the unimportant neurons and remove them. To achieve the pre-set pruning ratio, some non-zero parameters are also removed, which is bound to cause an accuracy drop. Hence, to restore performance, it is critical to fine-tune the pruned model. However, when pruned models are fine-tuned, many hyper-parameters (e.g., the learning rate, the number of fine-tuning epochs, the number of multi-pass rounds) need to be set again, which makes the multi-stage procedure even more cumbersome.

To address the above issues, we propose a novel Dynamic and Progressive Filter Pruning (DPFPS) scheme that directly learns a structured sparsity network from scratch, without pre-training, fine-tuning, or multi-pass, as shown in Figure 1(b). In our method, to meet the pre-set pruning ratio, structured sparsity-inducing regularization is imposed only upon the expected pruning parameters, as shown in Figure 2. Different from fixed per-layer sparsity, our method dynamically updates the sparsity allocation ratios of different layers; the dynamic sparsity scheme determines these ratios, and a Taylor-series-based channel sensitivity criterion is presented to identify the expected pruning parameters. In this manner, the reserved filters have more capacity to learn sufficient information, leading to better performance, while the constraint drives the redundant filters towards zero and boosts the generation of a compact network. As a result, the dynamic sparsity allocation scheme makes it unnecessary to pre-define the pruned architecture. Moreover, a group-Lasso-based progressive penalty is designed as the structured sparsity regularization, which increases gradually as training proceeds (i.e., from zero to a proper value). This encourages all filters to learn useful information at the beginning and forces the redundant filters to become sparse in the late period. Our method solves the pruning-ratio-based optimization problem with an iterative soft-thresholding algorithm (ISTA) with dynamic sparsity. At the end of training, we only need to remove the redundant parameters without any other stage such as fine-tuning, and the additional time consumption is also acceptable.

Figure 2: An illustration of DPFPS. (1) Sparsity allocation ratios and expected pruning parameters are dynamically updated (dynamic sparsity). (2) The sparsity penalty increases gradually as training proceeds (progressive sparsity): the total loss is the standard loss $L_{cls}$ plus the regularization $\lambda(t) R_{DPSS}(\Theta_p)$.

Our contributions are listed as follows:
• Our method dynamically updates sparsity allocation ratios in different layers and imposes regularization only upon the expected pruning parameters.
• We design a group-Lasso-based progressive sparsity penalty to increasingly induce sparsity in the expected pruning parameters.
• Our method solves the pruning-ratio-based optimization problem with an iterative soft-thresholding algorithm (ISTA) with dynamic sparsity. At the end of training, we only need to remove the redundant parameters without any other stage such as fine-tuning. Extensive experimental results show that the proposed method is competitive with 11 state-of-the-art methods.

Related Work

Recent works on weight sparsity, parameter quantization, low-rank decomposition, filter pruning, and knowledge distillation can be found in, e.g., Idelbayev and Carreira-Perpinan (2020); Jin, Yang, and Liao (2020); Liu et al. (2019a, 2020); Ruan et al. (2020). In this section, we review related work on filter pruning, focusing on two aspects: structured sparsity networks and fine-tuning.

Structured Sparsity Networks. Learning a structured sparsity network is a straightforward pruning method that is widely used to compress and accelerate neural networks (Wen et al. 2016; Alvarez and Salzmann 2016; Singh et al. 2019; Lin et al. 2020b). Wen et al. (2016) used group Lasso regularization to achieve structured sparsity learning (SSL). Furthermore, a sparse group Lasso was applied to automatically determine the number of neurons in each layer of the network during learning (Alvarez and Salzmann 2016). Although a structured sparsity network is learned directly, the pruning ratio depends on the penalty term, and an excessively pruned model needs to be fine-tuned to regain accuracy. In this paper, given a pre-set pruning ratio, we directly learn a structured sparsity network from scratch and do not need any fine-tuning after pruning.

Fine-tuning. Because pruning hurts the performance of a network, the original performance needs to be restored by fine-tuning techniques (Han et al. 2015). Some methods prune well-trained models by minimizing reconstruction error and then fine-tune the pruned models (He, Zhang, and Sun 2017; Jiang et al. 2018). Although these methods converge well, the pruning process is often cumbersome, including obtaining the pre-trained model, pruning and recovering accuracy by minimizing layer-wise reconstruction error, and fine-tuning.
Additionally, although fine-tuning techniques (Ding et al. 2018; Yu et al. 2018; Molchanov et al. 2019; Peng et al. 2019; Ding et al. 2019a) help regain the original accuracy, many hyper-parameters need to be set, which is cumbersome.

The Proposed Method

In this section, we first introduce the problem formulation. Then, we provide details of the progressive structured sparsity regularization. Finally, we present the dynamic sparsity algorithm.

Preliminary and Problem Formulation

In an $L$-layer ConvNet, the parameters of the $i$-th convolutional layer (ignoring biases) can be represented as a 4-dimensional tensor $W^{(i)} \in \mathbb{R}^{c^{(i)} \times r^{(i)} \times k_1^{(i)} \times k_2^{(i)}}$, where $c^{(i)}$ and $r^{(i)}$ are the numbers of output and input channels (feature maps), and $k_1^{(i)} \times k_2^{(i)}$ is the 2-dimensional spatial kernel. We reorganize the parameters from the original space into $(W^{(i)})_{2D_{out}} \in \mathbb{R}^{c^{(i)} \times r^{(i)} k_1^{(i)} k_2^{(i)}}$ or $(W^{(i)})_{2D_{in}} \in \mathbb{R}^{r^{(i)} \times c^{(i)} k_1^{(i)} k_2^{(i)}}$, where $(W^{(i)})_{2D_{out}}$ and $(W^{(i)})_{2D_{in}}$ denote the matrix forms of the tensor $W^{(i)}$ along the output and input channels, respectively. Then $W_j^{(i)}$ can be rewritten as a 3-dimensional tensor $K_j^{(i)} \in \mathbb{R}^{r^{(i)} \times k_1^{(i)} \times k_2^{(i)}}$ ($j = 1, \dots, c^{(i)}$), the $j$-th filter kernel of the $i$-th convolutional layer, which corresponds to the weights of the $j$-th output channel. Besides the convolutional layers, the parameters of the BatchNorm layers are denoted as $\{\gamma_j^{(i)}, \beta_j^{(i)}\} \in \mathbb{R}$, which scale and shift the normalized value. To integrate the above parameters, we define $\Theta$ as the set of all parameters in the network, i.e., $\Theta = \bigcup_{i \in N_1, j \in N_2} (K_j^{(i)} \cup \{\gamma_j^{(i)}, \beta_j^{(i)}\})$, where $N_1 = \{1, \dots, L\}$ and $N_2 = \{1, \dots, c^{(i)}\}$.
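To make this notation concrete, the following sketch (our illustration in PyTorch, not code from the released DPFPS repository) reshapes a convolutional weight tensor into the $(W^{(i)})_{2D_{out}}$ and $(W^{(i)})_{2D_{in}}$ matrices and computes the per-filter norms $\|W_{j,:}^{(i)}\|_2$ used throughout the method.

```python
# A minimal sketch (not from the official DPFPS repository) of the weight
# reorganization described above, using PyTorch tensors.
import torch

def to_2d_out(w: torch.Tensor) -> torch.Tensor:
    """Flatten a conv weight of shape (c, r, k1, k2) into (W)_2D_out of shape (c, r*k1*k2).
    Row j holds all weights of the j-th output channel (filter)."""
    return w.reshape(w.shape[0], -1)

def to_2d_in(w: torch.Tensor) -> torch.Tensor:
    """Flatten a conv weight of shape (c, r, k1, k2) into (W)_2D_in of shape (r, c*k1*k2).
    Row m holds all weights that read from the m-th input channel."""
    return w.permute(1, 0, 2, 3).reshape(w.shape[1], -1)

# Example: a layer with c = 64 filters, r = 32 input channels, 3x3 kernels.
w = torch.randn(64, 32, 3, 3)
w_out, w_in = to_2d_out(w), to_2d_in(w)            # shapes (64, 288) and (32, 576)
filter_l2 = w_out.norm(p=2, dim=1)                 # ||W_{j,:}||_2 for every filter j
num_nonzero_filters = int((filter_l2 > 0).sum())   # filters that would survive pruning
```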
We consider pruning the network as a constrained optimization problem (Boyd and Vandenberghe 2004). Given input-output pairs $(x, y)$ from a dataset $D$, it has the form

$$\min_{\Theta} \sum_{(x,y) \in D} L(F(\Theta, x), y), \quad \text{s.t.} \;\; PR_0 - PR(\Theta) \le 0, \tag{1}$$

where $F(\cdot)$ is the ConvNet forward function, $L(\cdot)$ is the standard loss function (e.g., the cross-entropy loss for classification tasks), $PR(\cdot)$ is an evaluation metric such as the pruning ratio of parameters or FLOPs, and $PR_0$ is a pre-set pruning ratio. It is not convenient to constrain $PR(\Theta)$ directly, so we use a layer-wise form to rewrite problem (1) as the unconstrained problem (Zhang et al. 2018)

$$\min_{\Theta} \sum_{(x,y) \in D} L(F(\Theta, x), y) + \sum_{i=1}^{L} t_i(W^{(i)}), \tag{2}$$

where $t_i(\cdot)$ is an indicator function of $W^{(i)}$:

$$t_i(W^{(i)}) = \begin{cases} 0, & \text{card}(W^{(i)}) \le p^{(i)}, \\ +\infty, & \text{otherwise}, \end{cases} \tag{3}$$

where $\text{card}(\cdot)$ returns the number of filters with nonzero $\ell_2$-norm in the layer, i.e., the number of rows $j$ with $\|W_{j,:}^{(i)}\|_2 \ne 0$, and $p^{(i)}$ denotes the desired number of preserved filters in the $i$-th layer. Because the second term of Equation (2) is not differentiable, the problem cannot be addressed directly by stochastic gradient descent.

Progressive Structured Sparsity Regularization

In previous works (Wen et al. 2016; Liu et al. 2017), structured sparsity regularization is widely used to turn the objective into an unconstrained optimization problem, formulated as

$$\min_{\Theta} \sum_{(x,y) \in D} L(F(\Theta, x), y) + \lambda R(\Theta), \tag{4}$$

where $R(\cdot)$ is the penalty function that makes the ConvNet sparse and $\lambda$ is the penalty coefficient. However, the hyper-parameter $\lambda$ does not directly control the pruning ratio. To meet the pre-set pruning ratio, the pruned model may need to be fine-tuned to recover accuracy, or the right $\lambda$ has to be found empirically (e.g., by several attempts).

To address this weakness, different from the previous sparsity strategies (Wen et al. 2016; Liu et al. 2017), during training we control the pruning ratio as in Equation (2) and impose the structured sparsity regularization upon the expected pruning filters as in (4). Specifically, we divide the parameter set $\Theta$ into two subsets, $\Theta_p$ and $\Theta_{\bar{p}}$, where $\Theta_p$ contains the expected pruning filters that are "unimportant" and $\Theta_{\bar{p}}$ contains the preserved filters. Based on our proposed structured sparsity regularization, the loss function can then be expressed as

$$Loss = \sum_{(x,y) \in D} L(F(\Theta, x), y) + \lambda R_{DPSS}(\Theta_p), \tag{5}$$

where $R_{DPSS}(\cdot)$ is our dynamic and progressive structured sparsity regularization (DPSS), which acts only on the expected pruning parameters $\Theta_p$.

Inspired by the Group Lasso (Yuan and Lin 2006), we use the $\ell_{21}$-norm as the sparsity regularization:

$$R_{DPSS}(\Theta_p) = \sum_{i=1}^{L} \sum_{j=1}^{c^{(i)} - p^{(i)}} \left\| (W_{j,:}^{(i)})_{2D_{out}} \right\|_2 + \sum_{i=1}^{L-1} \sum_{j=1}^{c^{(i)} - p^{(i)}} \left\| (W_{j,:}^{(i+1)})_{2D_{in}} \right\|_2. \tag{6}$$

When a channel is removed, parameters of both the current and the next layer are pruned, so we impose the regularization upon the corresponding parameters in both layers.
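As a concrete reading of Equation (6), the sketch below (illustrative code with hypothetical names, not the authors' implementation) evaluates the $\ell_{21}$ penalty contributed by one layer pair, covering the out-channel rows of layer $i$ and the corresponding in-channel rows of layer $i+1$.

```python
# Illustrative sketch of the per-layer l21 penalty in Eq. (6); variable names are ours.
import torch
from typing import Sequence

def dpss_layer_penalty(w_cur: torch.Tensor, w_next: torch.Tensor,
                       prune_idx: Sequence[int]) -> torch.Tensor:
    """Group-Lasso penalty of one layer pair, restricted to the expected pruning channels.

    w_cur:     weight of layer i,   shape (c_i, r_i, k1, k2)
    w_next:    weight of layer i+1, shape (c_{i+1}, c_i, k1', k2')
    prune_idx: indices of the c_i - p_i channels expected to be pruned.
    """
    idx = torch.as_tensor(prune_idx, dtype=torch.long)
    # rows of (W^(i))_2D_out that belong to the expected pruning filters
    rows_out = w_cur.reshape(w_cur.shape[0], -1)[idx]
    # rows of (W^(i+1))_2D_in that read the same channels in the next layer
    rows_in = w_next.permute(1, 0, 2, 3).reshape(w_next.shape[1], -1)[idx]
    return rows_out.norm(p=2, dim=1).sum() + rows_in.norm(p=2, dim=1).sum()

# Example: penalize 16 of the 64 output channels of a 64 -> 128 layer pair.
w1, w2 = torch.randn(64, 32, 3, 3), torch.randn(128, 64, 3, 3)
penalty = dpss_layer_penalty(w1, w2, prune_idx=list(range(16)))
```

Summing this quantity over all layers (omitting the next-layer term for the last layer) gives $R_{DPSS}(\Theta_p)$.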
In Equation (5), the second term $R_{DPSS}(\Theta_p)$ is convex but not differentiable. To solve this non-smooth unconstrained optimization problem, we make use of ISTA (Beck and Teboulle 2009) and update each $(W_{j,m}^{(i)})_{2D} \in \Theta_p$ by $\{(W_{j,m}^{(i)})_{2D}\}^{(n+1)} = S_{\lambda}(\{(W_{j,m}^{(i)})_{2D}\}^{(n)})$, where

$$S_{\lambda}\big((W_{j,m}^{(i)})_{2D}\big) = \begin{cases} (W_{j,m}^{(i)})_{2D} - \dfrac{\lambda\, (W_{j,m}^{(i)})_{2D}}{\|(W_{j,:}^{(i)})_{2D}\|_2}, & \text{if } \|(W_{j,:}^{(i)})_{2D}\|_2 > \lambda, \\ 0, & \text{otherwise.} \end{cases} \tag{7}$$

Because we train the network from scratch, to avoid excessive restraint in the early training phase, we introduce a penalty coefficient $\lambda$ with a gradually increasing value (i.e., from zero to a proper value). In particular, for simplicity, we use a sigmoid function to set the value of $\lambda$ at the $t$-th iteration, updating $\lambda(t)$ by

$$\lambda(t) = \frac{\lambda_{max}}{1 + e^{-\left(\frac{30t}{T} - 15\right)}}, \tag{8}$$

where $\lambda_{max}$ is the maximum of the penalty coefficient and $T$ is the total number of iterations.
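Both update rules are inexpensive to implement. The sketch below (our own illustration; the tensor layout and variable names are assumptions) computes the progressive coefficient $\lambda(t)$ of Equation (8) and applies the group soft-thresholding step of Equation (7) to the selected rows of a flattened weight matrix.

```python
# Sketch of Eq. (8) (progressive penalty) and Eq. (7) (group soft-thresholding); illustrative only.
import math
import torch

def lambda_at(t: int, T: int, lambda_max: float) -> float:
    """Eq. (8): sigmoid ramp of the penalty coefficient from ~0 up to lambda_max."""
    return lambda_max / (1.0 + math.exp(-(30.0 * t / T - 15.0)))

@torch.no_grad()
def ista_group_shrink(w2d: torch.Tensor, prune_idx: torch.Tensor, lam: float) -> None:
    """Eq. (7): soft-threshold, in place, the rows of a flattened weight matrix
    (e.g. (W^(i))_2D_out) that belong to the expected pruning filters."""
    rows = w2d[prune_idx]                              # (num_pruned, r*k1*k2)
    norms = rows.norm(p=2, dim=1, keepdim=True)        # ||W_{j,:}||_2 per selected row
    scale = torch.clamp(1.0 - lam / norms.clamp_min(1e-12), min=0.0)
    w2d[prune_idx] = rows * scale                      # rows with norm <= lam become exactly zero

# Example: shrink the 16 least important filters after an SGD step.
w2d = torch.randn(64, 288)                             # a flattened (W^(i))_2D_out
lam = lambda_at(t=500, T=10_000, lambda_max=0.01)
ista_group_shrink(w2d, torch.arange(16), lam)
```

Rows whose $\ell_2$-norm does not exceed $\lambda$ are zeroed out, which is what eventually produces exactly removable filters.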
Dynamic Sparsity Algorithm

Channel Sensitivity Criteria. To identify the unimportant parameter set $\Theta_p$, it is critical to calculate the sensitivity of channels. From the perspective of the loss function, the sensitivity of a parameter $w$ can be represented by

$$S(w) = \left| L(F(\Theta_{w \to 0}, x), y) - L(F(\Theta_w, x), y) \right|, \tag{9}$$

where $\Theta_{w \to 0}$ denotes that parameter $w$ tends to 0. Like (Ding et al. 2019b), using a Taylor series, the loss function can be represented as

$$L(F(\Theta_w, x), y) = L(F(\Theta_{w \to 0}, x), y) - \frac{\partial L(F(\Theta, x), y)}{\partial w}\, w + o(w^2), \tag{10}$$

where $o(w^2)$ is a remainder term. Because $w$ is very small, $o(w^2)$ can be ignored.

Filter pruning removes filter-wise parameters rather than a single parameter, so the sensitivity of channel $j$ concerning the current layer in the $i$-th layer is

$$S_c\big((W_{j,:}^{(i)})_{2D_{out}}\big) = \left| L\big(F(\Theta_{(W_{j,:}^{(i)})_{2D_{out}} \to 0}, x), y\big) - L\big(F(\Theta_{(W_{j,:}^{(i)})_{2D_{out}}}, x), y\big) \right|.$$

Applying the first-order approximation of Equation (10), and further considering the impact of both the current and the next layer, the total sensitivity of channel $j$ in the $i$-th layer, $S_t(W_j^{(i)})$, is

$$S_t(W_j^{(i)}) = \left| \sum_{m=1}^{r^{(i)} k_1^{(i)} k_2^{(i)}} \frac{\partial L(F(\Theta, x), y)}{\partial (W_{j,m}^{(i)})_{2D_{out}}}\, (W_{j,m}^{(i)})_{2D_{out}} + \sum_{m=1}^{c^{(i+1)} k_1^{(i+1)} k_2^{(i+1)}} \frac{\partial L(F(\Theta, x), y)}{\partial (W_{j,m}^{(i+1)})_{2D_{in}}}\, (W_{j,m}^{(i+1)})_{2D_{in}} \right|. \tag{13}$$
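The sensitivity of Equation (13) needs only the weights and their gradients, which are available after ordinary back-propagation. A minimal sketch (our reconstruction; the released code may organize this differently) is:

```python
# Sketch of the Taylor-series channel sensitivity in Eq. (13); illustrative, not the released code.
import torch

def channel_sensitivity(w_cur: torch.Tensor, g_cur: torch.Tensor,
                        w_next: torch.Tensor, g_next: torch.Tensor) -> torch.Tensor:
    """First-order sensitivity of every output channel of layer i.

    w_cur/g_cur:   weight and gradient of layer i,   shape (c_i, r_i, k1, k2)
    w_next/g_next: weight and gradient of layer i+1, shape (c_{i+1}, c_i, k1', k2')
    Returns c_i scores; channels with the smallest scores are the expected pruning filters.
    """
    c = w_cur.shape[0]
    # contribution of the out-channel rows of layer i: sum_m dL/dW * W
    cur_term = (g_cur * w_cur).reshape(c, -1).sum(dim=1)
    # contribution of the matching in-channel rows of layer i+1
    nxt_term = (g_next * w_next).permute(1, 0, 2, 3).reshape(c, -1).sum(dim=1)
    return (cur_term + nxt_term).abs()

# Example with dummy gradients from a stand-in loss:
w1 = torch.randn(64, 32, 3, 3, requires_grad=True)
w2 = torch.randn(128, 64, 3, 3, requires_grad=True)
loss = (w1 ** 2).sum() + (w2 ** 2).sum()
loss.backward()
scores = channel_sensitivity(w1.detach(), w1.grad, w2.detach(), w2.grad)   # shape (64,)
```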
Algorithm 1: The Proposed Method.

Input: Network, dataset $D$, $\lambda_{max}$, learning rate $\alpha$, pruning ratio $PR_0$.
1  $e = 0$;
2  For each layer $i$, initialize the sparsity allocation ratio $sr^{(i)}$;
3  while $e <$ Epoch do
4      for each iteration in epoch $e$ do
5          Update: $\Theta \leftarrow \Theta - \alpha \nabla L(\Theta)$ using dataset $D$;
6          For each layer $i$, calculate the sensitivity set of channels $S_{channel}^{(i)} = \{S_t(K_1^{(i)}), S_t(K_2^{(i)}), \dots, S_t(K_{c^{(i)}}^{(i)})\}$ via Equation (13);
7          For each layer $i$, sort $S_{channel}^{(i)}$ and keep the smallest $\lceil sr^{(i)} \cdot c^{(i)} \rceil$ entries to identify $\Theta_p$ and the index set;
8          Update $\lambda$ via Equation (8);
9          for each $(W^{(i)})_{2D}[index^{(i)}] \in \Theta_p$ do
10             Update $(W^{(i)})_{2D}[index^{(i)}]$ via Equation (7);
11         end
12     end
13     For each layer $i$, recalculate $sr^{(i)}$;
14     $e = e + 1$;
15 end
16 Prune the redundant filters ($\|W_{j,:}^{(i)}\|_2 = 0$) and return the compressed network with acceptable accuracy.
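Putting the pieces together, one training iteration of Algorithm 1 (lines 5-10) can be sketched as follows for a single layer pair, reusing the helper functions from the sketches above. This is our simplified reconstruction rather than the released implementation; in practice the loop runs over all prunable layers of the network.

```python
# Simplified reconstruction of one DPFPS iteration (Algorithm 1, lines 5-10) for one layer pair.
# Reuses lambda_at, ista_group_shrink, and channel_sensitivity defined in the sketches above.
# forward_fn is a hypothetical closure that maps (x, w1, w2) to class logits.
import math
import torch
import torch.nn.functional as F

def dpfps_iteration(w1, w2, batch, forward_fn, sr, t, T, lambda_max=0.01, lr=0.1):
    x, y = batch
    loss = F.cross_entropy(forward_fn(x, w1, w2), y)        # standard task loss
    g1, g2 = torch.autograd.grad(loss, (w1, w2))             # gradients for SGD and Eq. (13)
    with torch.no_grad():
        w1 -= lr * g1                                         # line 5: plain SGD step
        w2 -= lr * g2
        scores = channel_sensitivity(w1, g1, w2, g2)          # line 6: Eq. (13)
        k = math.ceil(sr * w1.shape[0])                       # line 7: the smallest sr*c channels
        prune_idx = torch.argsort(scores)[:k]
        lam = lambda_at(t, T, lambda_max)                     # line 8: Eq. (8)
        ista_group_shrink(w1.reshape(w1.shape[0], -1), prune_idx, lam)   # lines 9-10: Eq. (7)
        # (the matching in-channel rows of w2 are shrunk analogously; omitted for brevity)
    return loss.item(), prune_idx
```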
In this way, the sparsity allocation ratios $sr^{(i)}$ are recalculated at the end of every epoch and the expected pruning parameters are dynamically updated. So, our method does not require pre-defining the pruning architecture and can adaptively learn an optimal structured sparsity network.

Computational Complexity Analysis. In our proposed method, the additional computation mainly comes from identifying the expected pruning parameters in each iteration (computing the channel sensitivities and sorting them), and the additional time consumption is acceptable.

Table 1: The pruning results of VGG-Small and ResNet56 on CIFAR-10. "Pre-defined?" refers to whether a fixed pruning ratio is pre-set in each layer through empirical studies. "--" indicates that the result is not listed in the original paper.

VGG-Small:
| Method                       | Pre-defined? | Pre-trained? | Fine-tuning? | Test Accuracy | PR_Params | PR_FLOPs |
| Baseline                     | N/A | N/A | N/A | (93.85±0.07)% | 0      | 0      |
| Li et al. (Li et al. 2016)   | ✓   | ✓   | ✓   | 93.40%        | 63.90% | 34.21% |
| NRE (Jiang et al. 2018)      | ✗   | ✓   | ✓   | 93.40%        | 92.72% | 67.64% |
| Slimming (Liu et al. 2017)   | ✗   | ✗   | ✓   | 93.48%        | 86.65% | 43.50% |
| HRank (Lin et al. 2020a)     | ✗   | ✓   | ✓   | 93.43%        | 82.90% | 53.50% |
| SFP (He et al. 2018a)        | ✓   | ✗   | ✗   | 92.66%        | 69.17% | 69.24% |
| SSS (Huang and Wang 2018)    | ✗   | ✗   | ✗   | 93.20%        | 66.67% | 69.70% |
| Ours                         | ✗   | ✗   | ✗   | (93.52±0.15)% | 93.32% | 70.85% |

ResNet56:
| Method                       | Pre-defined? | Pre-trained? | Fine-tuning? | Test Accuracy | PR_Params | PR_FLOPs |
| Baseline                     | N/A | N/A | N/A | (93.81±0.14)% | 0      | 0      |
| Li et al. (Li et al. 2016)   | ✓   | ✓   | ✓   | 93.06%        | 13.70% | 27.60% |
| CP (He, Zhang, and Sun 2017) | ✓   | ✓   | ✓   | 91.80%        | --     | 50.00% |
| AMC (He et al. 2018b)        | ✗   | ✓   | ✓   | 91.90%        | --     | 50.00% |
| HRank (Lin et al. 2020a)     | ✗   | ✓   | ✓   | 93.17%        | 42.40% | 50.00% |
| SFP (He et al. 2018a)        | ✓   | ✗   | ✗   | (92.26±0.31)% | --     | 52.60% |
| FPGM (He et al. 2019)        | ✓   | ✗   | ✗   | (92.93±0.49)% | --     | 52.60% |
| ASFP (He et al. 2020)        | ✓   | ✗   | ✗   | (92.44±0.07)% | --     | 52.60% |
| Ours                         | ✗   | ✗   | ✗   | (93.20±0.11)% | 46.84% | 52.86% |

[Figure 3: test accuracy of VGG-Small on CIFAR-10 (PR_Params = 90%) under different $\lambda_{max}$ settings.]
Experiments
Experimental Setup

1) Datasets and Networks: We evaluate DPFPS on two datasets: CIFAR (Krizhevsky and Hinton 2009) and ImageNet (Russakovsky et al. 2015). The same data augmentation strategies are used as in the PyTorch official examples (Paszke et al. 2017). On CIFAR-10, we evaluate the proposed method using VGG-16 (Simonyan and Zisserman 2014) and ResNet56 (He et al. 2016). As the original VGG-16 is specially designed for ImageNet classification, we use a variant (i.e., VGG-Small) taken from (Zagoruyko 2015) in our experiments. On the ImageNet dataset, we evaluate DPFPS on ResNets (ResNet 34, 50, and 101) and MobileNet v2 (Sandler et al. 2018).

2) Implementation Details: All networks are trained from scratch. Training takes 200 epochs on CIFAR-10 and 100 epochs on ImageNet, with an initial learning rate of 0.1 and mini-batch sizes of 64 and 256, respectively. The learning rate is multiplied by 0.1 at 50% and 75% of the training epochs on CIFAR-10, and at epochs 30, 60, and 90 on ImageNet. We use an SGD optimizer with a weight decay of $10^{-4}$ and a momentum of 0.9. For MobileNet v2 on ImageNet, we use settings like those of AMC (He et al. 2018b). All experiments are implemented in PyTorch on multiple NVIDIA RTX 2080 Ti GPUs and an Intel(R) Xeon(R) Gold 5118 CPU. On CIFAR-10, considering fluctuations caused by different random seeds, we report the mean and standard deviation over 5 runs of the same experiment.

3) Parameter Settings: In our experiments, the parameter $\lambda_{max}$ impacts the structured sparsity results, so we study the performance under different $\lambda_{max}$ settings. As shown in Figure 3, we tune $\lambda_{max}$ exponentially over a relatively wide range (i.e., [0.001, 10]). When $\lambda_{max}$ is set to a small value, the test accuracy is low, because inadequate sparsity produces low precision after pruning the network, as in (Wen et al. 2016; Liu et al. 2017), so the pruned network would need fine-tuning to recover precision. When $\lambda_{max}$ is set to a large value, excessive sparsity prevents the network parameters from being learned well. In our deployment, $\lambda_{max} = 0.01$ meets the experimental requirements. For simplicity, we set $\lambda_{max}$ to 0.01 in all experiments, though better results could be obtained by fine-tuning $\lambda_{max}$ in each experiment.
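For reference, the CIFAR-10 schedule above corresponds to a standard PyTorch configuration along the following lines (a sketch under the stated hyper-parameters; the model constructor is a placeholder and the released training script may differ):

```python
# Sketch of the CIFAR-10 optimization schedule described above (placeholder model).
import torch
import torchvision

model = torchvision.models.resnet18(num_classes=10)   # stand-in; the paper uses VGG-Small / ResNet56
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
# 200 epochs, learning rate divided by 10 at 50% and 75% of training
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)
epochs, batch_size, lambda_max = 200, 64, 0.01
```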
Table 2: The pruning results on ImageNet.

ResNet34:
| Method                      | Pre-defined? | Pre-trained? | Fine-tuning? | Top-1  | Top-5  | PR_FLOPs |
| Baseline                    | N/A | N/A | N/A | 73.92% | 91.62% | 0      |
| Li et al. (Li et al. 2016)  | ✓   | ✓   | ✓   | 72.17% | --     | 24.20% |
| SFP (He et al. 2018a)       | ✓   | ✗   | ✗   | 71.83% | 90.33% | 41.10% |
| FPGM (He et al. 2019)       | ✓   | ✗   | ✗   | 72.11% | 90.69% | 41.10% |
| ASFP (He et al. 2020)       | ✓   | ✗   | ✗   | 71.72% | 90.65% | 41.10% |
| Ours                        | ✗   | ✗   | ✗   | 72.25% | 90.80% | 43.29% |

ResNet50:
| Method                        | Pre-defined? | Pre-trained? | Fine-tuning? | Top-1  | Top-5  | PR_FLOPs |
| Baseline                      | N/A | N/A | N/A | 76.15% | 92.87% | 0      |
| ThiNet (Luo et al. 2019)      | ✓   | ✓   | ✓   | 74.03% | 92.11% | 36.80% |
| CP (He, Zhang, and Sun 2017)  | ✓   | ✓   | ✓   | --     | 90.80% | 50.00% |
| HRank (Lin et al. 2020a)      | ✗   | ✓   | ✓   | 74.98% | 92.33% | 43.77% |
| SFP (He et al. 2018a)         | ✓   | ✗   | ✗   | 74.61% | 92.06% | 41.80% |
| FPGM (He et al. 2019)         | ✓   | ✗   | ✗   | 75.03% | 92.40% | 42.20% |
| ASFP (He et al. 2020)         | ✓   | ✗   | ✗   | 74.88% | 92.39% | 41.80% |
| Ours                          | ✗   | ✗   | ✗   | 75.55% | 92.54% | 46.20% |

ResNet101:
| Method                 | Pre-defined? | Pre-trained? | Fine-tuning? | Top-1  | Top-5  | PR_FLOPs |
| Baseline               | N/A | N/A | N/A | 77.37% | 93.56% | 0      |
| SFP (He et al. 2018a)  | ✓   | ✗   | ✗   | 77.03% | 93.46% | 42.20% |
| Ours                   | ✗   | ✗   | ✗   | 77.27% | 93.68% | 44.97% |

MobileNet v2:
| Method                                    | Pre-defined? | Pre-trained? | Fine-tuning? | Top-1  | Top-5 | PR_FLOPs |
| Baseline                                  | N/A | N/A | N/A | 72.00% | -- | 0      |
| 0.75x MobileNet v2 (Sandler et al. 2018)  | ✗   | ✗   | ✗   | 69.80% | -- | 26.54% |
| AMC (He et al. 2018b)                     | ✗   | ✓   | ✓   | 70.80% | -- | 26.54% |
| Ours                                      | ✗   | ✗   | ✗   | 71.10% | -- | 24.89% |
Figure 4: Performance comparison with different baselines (test accuracy versus PR_Params for VGG-Small on CIFAR-10).

Figure 5: Analysis of training convergence (test accuracy and training loss of our method).
4) Evaluation Metrics: To evaluate the performance of DPFPS, we use the pruning ratio of parameters or FLOPs:

$$PR_{Params/FLOPs} = 1 - \frac{\#Pruned\ Params/FLOPs}{\#Original\ Params/FLOPs}, \tag{14}$$

where $\#Pruned\ Params/FLOPs$ denotes the number of parameters or FLOPs remaining after the model is pruned, and $\#Original\ Params/FLOPs$ is that of the original model.
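Equation (14) is simply the relative reduction in parameters or FLOPs; a small helper (our own illustration) makes the bookkeeping explicit:

```python
# Helper for Eq. (14); counts can be parameters or FLOPs (ours, illustrative).
def pruning_ratio(pruned_count: float, original_count: float) -> float:
    """Fraction of parameters (or FLOPs) removed by pruning."""
    return 1.0 - pruned_count / original_count

# Example: a model whose FLOPs drop to ~55% of the original has PR_FLOPs of about 44.97%,
# matching the ResNet-101 result reported for ImageNet.
print(f"PR_FLOPs = {pruning_ratio(55.03, 100.0):.2%}")
```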
Performance Comparison with Different Baselines

We use VGG-Small on CIFAR-10 and compare DPFPS with three baselines: SFP, SSS, and Slimming. As shown in Figure 4, when $PR_{Params}$ is less than 50%, SFP is inferior to the other methods, because SFP adopts a fixed pruning structure and does not learn an optimal pruned network. As $PR_{Params}$ increases, SSS requires a larger penalty to be imposed upon the parameters; when $PR_{Params}$ reaches 65%, the test accuracy of SSS drops sharply. So, when a large pruning ratio is needed, SSS is inappropriate due to its excessive restraint, and fine-tuning is needed to recover precision, as with Slimming. Among all the baselines, DPFPS achieves the best trade-off curve between test accuracy and pruning ratio, exceeding even the fine-tuning method (i.e., Slimming). These results show that our dynamic and progressive structured sparsity regularization is effective. More importantly, our proposed method avoids the cumbersome pruning process, without any pre-training, fine-tuning, or multi-pass scheme, which is efficient.

Performance Comparison with State-of-the-Art Methods

We also compare DPFPS with 11 state-of-the-art methods. On CIFAR-10, we compare our method with Li et al., NRE, Slimming, SFP, CP, AMC, FPGM, SSS, ASFP, and HRank, as shown in Table 1. For VGG-Small, the proposed method achieves an accuracy of 93.52% with a 93.32% pruning ratio of parameters and a 70.85% pruning ratio of FLOPs.
[Figure: (a) Sparsity allocation ratio. (b) Sparsity ratio. (c) Sparsity ratio of the network. (d) Channel number (baseline vs. ours, per layer and over training epochs).]
Acknowledgments

This work was supported in part by the National Key Research and Development Program of China (Grant No. 2020AAA0106800, No. 2018AAA0102802, No. 2018AAA0102803, and No. 2018AAA0102800), in part by the Natural Science Foundation of China (Grant No. 62036011, No. 61902401, No. 61972071, No. 61906052, No. 61772225, No. 61721004, No. U1803119, No. U1736106, and No. 6187610), in part by the NSFC-General Technology Collaborative Fund for Basic Research (Grant No. U1936204), in part by the Beijing Natural Science Foundation (Grant No. JQ18018), in part by the Key Research Program of Frontier Sciences, CAS (Grant No. QYZDJ-SSW-JSC040), in part by the National Natural Science Foundation of Guangdong (No. 2018B030311046), and in part by the CAS External Cooperation Key Project. The work of Bing Li was also supported by the Youth Innovation Promotion Association, CAS.

References

Alvarez, J. M.; and Salzmann, M. 2016. Learning the Number of Neurons in Deep Networks. In Advances in Neural Information Processing Systems, 2270–2278.

Beck, A.; and Teboulle, M. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1): 183–202.

Bertinetto, L.; Valmadre, J.; Henriques, J. F.; Vedaldi, A.; and Torr, P. H. 2016. Fully-convolutional siamese networks for object tracking. In Proceedings of European Conference on Computer Vision, 850–865.

Boyd, S.; and Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press.

Ding, X.; Ding, G.; Guo, Y.; Han, J.; and Yan, C. 2019a. Approximated Oracle Filter Pruning for Destructive CNN Width Optimization. In Proceedings of International Conference on Machine Learning, 1607–1616.

Ding, X.; Ding, G.; Han, J.; and Tang, S. 2018. Auto-balanced Filter Pruning for Efficient Convolutional Neural Networks. In Proceedings of AAAI Conference on Artificial Intelligence, 6797–6804.

Ding, X.; Ding, G.; Zhou, X.; Guo, Y.; Han, J.; and Liu, J. 2019b. Global Sparse Momentum SGD for Pruning Very Deep Neural Networks. In Advances in Neural Information Processing Systems, 6382–6394.

Everingham, M.; Eslami, S. M.; Van Gool, L.; Williams, C. K. I.; Winn, J.; and Zisserman, A. 2015. The Pascal Visual Object Classes Challenge: A Retrospective. International Journal of Computer Vision 111(1): 98–136.

Girshick, R.; Donahue, J.; Darrell, T.; and Malik, J. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 580–587.

Han, S.; Pool, J.; Tran, J.; and Dally, W. 2015. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, 1135–1143.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

He, Y.; Dong, X.; Kang, G.; Fu, Y.; Yan, C.; and Yang, Y. 2020. Asymptotic Soft Filter Pruning for Deep Convolutional Neural Networks. IEEE Transactions on Cybernetics 50(8): 3594–3604.

He, Y.; Kang, G.; Dong, X.; Fu, Y.; and Yang, Y. 2018a. Soft filter pruning for accelerating deep convolutional neural networks. In Proceedings of International Joint Conference on Artificial Intelligence, 2234–2240.

He, Y.; Lin, J.; Liu, Z.; Wang, H.; Li, L.-J.; and Han, S. 2018b. AMC: AutoML for Model Compression and Acceleration on Mobile Devices. In Proceedings of European Conference on Computer Vision, 815–832.

He, Y.; Liu, P.; Wang, Z.; Hu, Z.; and Yang, Y. 2019. Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4340–4349.

He, Y.; Zhang, X.; and Sun, J. 2017. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, 1389–1397.

Huang, Z.; and Wang, N. 2018. Data-Driven Sparse Structure Selection for Deep Neural Networks. In Proceedings of European Conference on Computer Vision, 317–334.

Idelbayev, Y.; and Carreira-Perpinan, M. A. 2020. Low-Rank Compression of Neural Nets: Learning the Rank of Each Layer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8049–8059.

Jiang, C.; Li, G.; Qian, C.; and Tang, K. 2018. Efficient DNN Neuron Pruning by Minimizing Layer-wise Nonlinear Reconstruction Error. In Proceedings of International Joint Conference on Artificial Intelligence, 2298–2304.

Jin, Q.; Yang, L.; and Liao, Z. 2020. AdaBits: Neural Network Quantization With Adaptive Bit-Widths. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2146–2156.

Krizhevsky, A.; and Hinton, G. 2009. Learning multiple layers of features from tiny images. Technical report, Citeseer.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–1105.

Li, H.; Kadav, A.; Durdanovic, I.; Samet, H.; and Graf, H. P. 2016. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710.
Li, J.; Qi, Q.; Wang, J.; Ge, C.; Li, Y.; Yue, Z.; and Sun, H. 2019. OICSR: Out-In-Channel Sparsity Regularization for Compact Deep Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7046–7055.

Lin, M.; Ji, R.; Wang, Y.; Zhang, Y.; Zhang, B.; Tian, Y.; and Shao, L. 2020a. HRank: Filter Pruning Using High-Rank Feature Map. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1529–1538.

Lin, S.; Ji, R.; Li, Y.; Deng, C.; and Li, X. 2020b. Toward Compact ConvNets via Structure-Sparsity Regularized Filter Pruning. IEEE Transactions on Neural Networks and Learning Systems 31(2): 574–588.

Liu, N.; Ma, X.; Xu, Z.; Wang, Y.; Tang, J.; and Ye, J. 2020. AutoCompress: An Automatic DNN Structured Pruning Framework for Ultra-High Compression Rates. In Proceedings of AAAI Conference on Artificial Intelligence, 4876–4883.

Liu, Y.; Cao, J.; Li, B.; Yuan, C.; Hu, W.; Li, Y.; and Duan, Y. 2019a. Knowledge Distillation via Instance Relationship Graph. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7096–7104.

Liu, Z.; Li, J.; Shen, Z.; Huang, G.; Yan, S.; and Zhang, C. 2017. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, 2736–2744.

Liu, Z.; Sun, M.; Zhou, T.; Huang, G.; and Darrell, T. 2019b. Rethinking the Value of Network Pruning. In International Conference on Learning Representations.

Luo, J.; Zhang, H.; Zhou, H.; Xie, C.; Wu, J.; and Lin, W. 2019. ThiNet: Pruning CNN Filters for a Thinner Net. IEEE Transactions on Pattern Analysis and Machine Intelligence 41(10): 2525–2538.

Luo, J.-H.; Wu, J.; and Lin, W. 2017. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, 5058–5066.

Molchanov, P.; Mallya, A.; Tyree, S.; Frosio, I.; and Kautz, J. 2019. Importance Estimation for Neural Network Pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 11264–11272.

Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in PyTorch. In Advances in Neural Information Processing Systems Workshop.

Peng, H.; Wu, J.; Chen, S.; and Huang, J. 2019. Collaborative Channel Pruning for Deep Networks. In Proceedings of the International Conference on Machine Learning, 5113–5122.

Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 91–99.

Ruan, X.; Liu, Y.; Yuan, C.; Li, B.; Hu, W.; Li, Y.; and Maybank, S. 2020. EDP: An Efficient Decomposition and Pruning Scheme for Convolutional Neural Network Compression. IEEE Transactions on Neural Networks and Learning Systems 1–15.

Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; Berg, A. C.; and Fei-Fei, L. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115(3): 211–252.

Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; and Chen, L.-C. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4510–4520.

Simonyan, K.; and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Singh, P.; Verma, V. K.; Rai, P.; and Namboodiri, V. P. 2019. Play and Prune: Adaptive filter pruning for deep model compression. In Proceedings of International Joint Conference on Artificial Intelligence, 3460–3466.

Venugopalan, S.; Rohrbach, M.; Donahue, J.; Mooney, R.; Darrell, T.; and Saenko, K. 2015. Sequence to sequence – video to text. In Proceedings of the IEEE International Conference on Computer Vision, 4534–4542.

Wen, W.; Wu, C.; Wang, Y.; Chen, Y.; and Li, H. 2016. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, 2074–2082.

Yu, R.; Li, A.; Chen, C.-F.; Lai, J.-H.; Morariu, V. I.; Han, X.; Gao, M.; Lin, C.-Y.; and Davis, L. S. 2018. NISP: Pruning networks using neuron importance score propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 9194–9203.

Yuan, M.; and Lin, Y. 2006. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68(1): 49–67.

Zagoruyko, S. 2015. 92.45% on CIFAR-10 in Torch. Torch Blog.

Zhang, T.; Ye, S.; Zhang, K.; Tang, J.; Wen, W.; Fardad, M.; and Wang, Y. 2018. A Systematic DNN Weight Pruning Framework using Alternating Direction Method of Multipliers. In Proceedings of European Conference on Computer Vision, 184–199.