EDP: An Efficient Decomposition and Pruning Scheme
Abstract— Model compression methods have become popular in recent years; they aim to alleviate the heavy load of deep neural networks (DNNs) in real-world applications. However, most of the existing compression methods have two limitations: 1) they usually adopt a cumbersome process, including pretraining, training with a sparsity constraint, pruning/decomposition, and fine-tuning, and the last three stages are usually iterated multiple times; 2) the models are pretrained under explicit sparsity or low-rank assumptions, whose wide appropriateness is difficult to guarantee. In this article, we propose an efficient decomposition and pruning (EDP) scheme via constructing a compressed-aware block that can automatically minimize the rank of the weight matrix and identify the redundant channels. Specifically, we embed the compressed-aware block by decomposing one network layer into two layers: a new weight matrix layer and a coefficient matrix layer. By imposing regularizers on the coefficient matrix, the new weight matrix learns to become a low-rank basis weight, and its corresponding channels become sparse. In this way, the proposed compressed-aware block simultaneously achieves low-rank decomposition and channel pruning with only one single data-driven training stage. Moreover, the network architecture is further compressed and optimized by a novel Pruning & Merging (PM) module which prunes redundant channels and merges redundant decomposed layers. Experimental results (17 competitors) on different data sets and networks demonstrate that the proposed EDP achieves a high compression ratio with acceptable accuracy degradation and outperforms state-of-the-art methods on compression rate, accuracy, inference time, and run-time memory.

Index Terms— Data-driven, low-rank decomposition, model compression and acceleration, structured pruning.

Manuscript received November 28, 2019; revised June 24, 2020; accepted August 8, 2020. This work was supported in part by the National Key Research and Development Program of China under Grant 2018AAA0102802, Grant 2018AAA0102803, Grant 2018AAA0102800, Grant 2018YFC0823003, and Grant 2017YFB1002801; in part by the Natural Science Foundation of China under Grant 61902401, Grant 61972071, Grant 61751212, Grant 61721004, Grant 61972397, Grant 61772225, Grant 61906052, and Grant U1803119; in part by the NSFC-General Technology Collaborative Fund for basic research under Grant U1636218, Grant U1936204, and Grant U1736106; in part by the Beijing Natural Science Foundation under Grant L172051, Grant JQ18018, and Grant L182058; in part by the CAS Key Research Program of Frontier Sciences under Grant QYZDJ-SSW-JSC040; in part by the CAS External Cooperation Key Project; and in part by the NSF of Guangdong under Grant 2018B030311046. The work of Bing Li was supported by the Youth Innovation Promotion Association, CAS. (Xiaofeng Ruan and Yufan Liu contributed equally to this work.) (Corresponding authors: Chunfeng Yuan; Bing Li.)

Xiaofeng Ruan and Yufan Liu are with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and also with the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China (e-mail: [email protected]; [email protected]).

Chunfeng Yuan is with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China (e-mail: [email protected]).

Bing Li is with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and also with PeopleAI Inc., Beijing 100190, China (e-mail: [email protected]).

Weiming Hu is with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, also with the CAS Center for Excellence in Brain Science and Intelligence Technology, Beijing 100190, China, and also with the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China (e-mail: [email protected]).

Yangxi Li is with the National Computer Network Emergency Response Technical Team/Coordination Center of China (CNCERT/CC), Beijing 100029, China (e-mail: [email protected]).

Stephen Maybank is with the Department of Computer Science and Information Systems, Birkbeck College, University of London, London WC1E 7HX, U.K. (e-mail: [email protected]).

Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TNNLS.2020.3018177

I. INTRODUCTION

COMPARED with traditional machine learning algorithms, deep neural network (DNN) models [1]–[4] have achieved better performance in several fields such as image classification [1], object detection [5], object tracking [6], and video understanding [7]. However, their large number of parameters and FLoating-point OPerations (FLOPs) makes it difficult to deploy them on mobile and embedded devices. Therefore, the model compression of DNNs is a fundamental problem that has been studied extensively in recent years.

Recently, low-rank decomposition methods [8]–[10], such as singular value decomposition (SVD) [8], have been used for model compression by decomposing an original network layer into two lightweight layers. However, these low-rank decomposition methods have several limitations. First, an appropriate hyperparameter for the rank of the filters must be selected [8], [11], at the cost of many validation experiments. Second, the compression rates of these methods are limited if the DNN is pretrained without a low-rank constraint. Third, these methods fail to remove all the redundant channels and occupy much run-time memory. Moreover, the decomposition of each layer makes the network too deep, which may influence the accuracy.
Fig. 1. Illustration of the proposed method. (a) Blue blocks comprise the original weight matrix at the lth layer, with size c_{l+1} × c_l. The orange blocks represent the decomposed matrices: a basis weight matrix of size r_l × c_l, and a coefficient matrix of size c_{l+1} × r_l. Low-rank decomposition is achieved by minimizing r_l, whereas the sparse output channels are obtained by minimizing c_{l+1}. (b) Whole optimization process of the proposed method.
Channel pruning methods [12], [13] have been increasingly popular in recent years. They prune unimportant input–output channel-wise connections of DNNs and are more hardware-friendly than traditional weight pruning methods. However, the learning procedure is cumbersome, as it includes four stages: pretraining, training with a sparsity constraint, pruning, and fine-tuning. To achieve better performance, the whole procedure may be iterated several times. In addition, channel pruning methods change the input–output dimensions of a layer, which may not perform well in some blocks, such as element-wise addition blocks.

To overcome the above limitations, we propose an efficient decomposition and pruning (EDP) scheme via constructing a compressed-aware block. The proposed block can be used to replace every layer of a convolutional neural network (CNN) in a plug-and-play way so as to efficiently compress the network. The proposed compression procedure contains one single efficient training process and adaptively learns an optimal architecture. Fig. 1(a) shows the architecture and optimization process of the proposed method. Specifically, we embed the compressed-aware block by decomposing the original network layer into two layers: one is a basis weight matrix and the other is a coefficient matrix. By imposing L_{2,1}-based constraints on the coefficient matrix, the block can learn to be compact adaptively. On the one hand, the proposed constraints make the columns (i.e., the input space) of the coefficient matrix sparse. After pruning the redundant connections, the remaining channels of the basis weight matrix can be regarded as the bases of the original weight space. Thus, the low-rank purpose is achieved. On the other hand, the proposed constraints push redundant rows (i.e., the output space) of the coefficient matrix to zero. As a result, unimportant connections to the next layer are identified and then pruned. Thus, the remaining channels are sparse. Furthermore, the proposed method can degenerate into two special cases: a low-rank decomposition case and a channel pruning case. The two special cases can be utilized separately in some situations. For example, the low-rank decomposition case does not change the output dimension, so it can be used in element-wise addition blocks.

The optimization of the proposed method is based on a proximal gradient algorithm. By setting appropriate penalty coefficients for the regularizers, the tradeoff between accuracy and pruned ratio can be adaptively controlled. Moreover, we also find that early stopping (ES) of these constraints can achieve a better performance. When the network has learned to be sparse and low rank, a Pruning & Merging (PM) module is deployed to prune the redundant filters and merge the redundant decomposed layers. Note that the merging operation can reduce the redundant layers, avoiding the depth of the network becoming too large. Finally, a lightweight and hardware-friendly DNN model with high performance is obtained. The whole process of the proposed method is shown in Fig. 1(b).

Experiments on several data sets and a range of network architectures show the effectiveness of the proposed method. We can obtain a DNN model with up to 95.59% parameter compression, 80.11% FLOPs reduction, 3.3× speedup of inference time, and 1.8× run-time memory saving with the VGG-small architecture on the CIFAR-10 data set, while keeping acceptable accuracy.

The main contributions are summarized as follows.

1) We propose a compressed-aware block, which can be put in any CNN layer as a plug-in to efficiently compress the network. Low rank and channel sparsity are adaptively achieved by introducing L_{2,1}-based constraints on the coefficient matrix.
2) A PM module is presented to remove not only redundant filters but also redundant decomposed layers. Thus, an optimal architecture can be adaptively obtained.
3) The proposed method is flexible. It can achieve low-rank decomposition and channel pruning, either separately or together, when compressing a network.

II. RELATED WORKS

Related works for compressing neural networks can be grouped into four categories: weight sparsity (nonstructured) pruning, parameter (weight) quantization, low-rank decomposition, and structured sparsity pruning.
Most of the early studies [14]–[17] on weight sparsity pruning focus on the importance of weights. They pruned the unimportant weights via different criteria. Han et al. [14] and Guo et al. [16] pruned the weights based on the magnitude of the parameters. LeCun et al. [15] used second-derivative information to identify insignificant weights. In order to make these weights sparse, some regularizations [18], e.g., L_1 and L_2, were imposed on the loss function during the training process. After that, to further improve the performance of the compressed model, Ding et al. [19] gradually reduced the redundant weights to zero by directly altering the gradient flow based on momentum stochastic gradient descent (SGD). Besides, some recent works [20]–[22] articulated the lottery ticket hypothesis and learned very sparse networks (winning tickets) which obtained almost the same performance as the original network. Although the storage space can be dramatically reduced, these methods rely on specific operation libraries and hardware. The run-time memory saving is limited because most memory space is consumed by the feature maps rather than the weights.

Parameter (weight) quantization methods [23]–[27] express the floating-point weights by a few bits. For example, XNOR-Net [23], binarized neural networks [24], and BinaryConnect [25] were proposed to quantize the floating-point weights into binary values, and quantized neural networks [26] were trained with low-precision weights and activations. Han et al. [28] proposed a three-stage compression pipeline, in which they first pruned the model, then quantized the weights, and finally adopted Huffman coding to further increase the compression rate. However, parameter (weight) quantization methods usually result in a moderate accuracy degradation in large DNNs.

Low-rank decomposition methods [8], [10], [29]–[32] have been explored over the past few years. They decomposed the weight matrix of a DNN into several pieces, using techniques such as SVD [8], [10] and canonical decomposition/parallel factors decomposition (CP decomposition) [32]. In [29], a decomposition method was presented by separating k × k filters into k × 1 and 1 × k filters. Kim et al. [30] utilized Tucker decomposition on a kernel tensor to compress the networks. It consists of three steps: rank selection, low-rank tensor decomposition, and fine-tuning. This method requires additional experiments to select an appropriate rank. Most works did not consider the low-rank constraint in the training process, and thus, the compression rate is limited. Recently, Alvarez and Salzmann [31] introduced the low-rank constraint (i.e., the nuclear norm) and the sparse group least absolute shrinkage and selection operator (LASSO) regularizer to train the network, before SVD-based decomposition. However, after training, the SVD operation is time-consuming, and the decomposed layers occupy much run-time memory.

Structured pruning methods [12], [13], [33]–[35] directly remove redundant neurons and channels rather than irregular weights. Some works [12], [13] imposed LASSO regression to learn the importance of each channel, whereas several works [33], [35] ranked filters by criteria such as the absolute values and pruned the unimportant ones. In addition, some works take the correlations between filters into account. For example, [35] tied any strongly correlated neurons to a common value. Since these methods prune parts of the network structures (e.g., channels) instead of individual weights, they do not need extra libraries or hardware, unlike weight pruning methods. However, almost all these methods need a pretraining stage and a fine-tuning stage, and some even iterate these stages multiple times to further enhance accuracy. Furthermore, channel pruning methods change the input–output dimensions of a layer, which may not match the dimensions in the case of element-wise addition blocks. For instance, when ResNets are compressed, due to the unequal dimensions of the input–output channels, channel pruning methods cannot prune the last layer of each residual block. This may lead to insufficient compression.

In order to overcome the above weaknesses, the proposed method integrates low-rank decomposition and structured sparsity pruning into a unified framework. The two components can be performed either separately or together in an efficient training process, without cumbersome stages.

III. PROPOSED METHOD

In this section, we first introduce the EDP algorithm, which integrates low-rank decomposition and channel pruning to reduce the redundancy. Second, a proximal gradient method is introduced to deal with the optimization of EDP. Then, a PM module is presented to further compact the network. The learning procedure is summarized at the end.

A. EDP Algorithm

In the original DNN, the weights of the lth layer are denoted as θ_l ∈ R^{c_{l+1} × c_l × k_l × k_l}, where c_l and c_{l+1} are the numbers of input and output channels, respectively, and k_l is the kernel size. Without loss of generality, we reorganize the parameters from the original space θ_l ∈ R^{c_{l+1} × c_l × k_l × k_l} to θ_l^{2D} ∈ R^{c_{l+1} × c_l k_l k_l}. Given the input feature map f_l, the output response is obtained by

f_{l+1} = σ(f_l ∗ θ_l)   (1)

where σ(·) is an activation function, such as the rectified linear unit (ReLU), and ∗ is the 2-D convolution operator. Fig. 2(a) shows an illustration of the original DNN layer. The input is the feature map f_l with c_l channels, and the output is the feature map f_{l+1} with c_{l+1} channels. The parameter θ_l has c_{l+1} filters of size c_l × k_l × k_l.

In this approach, the original layer is decomposed into two layers: one is a basis weight matrix θ'_l^{2D} ∈ R^{r_l × c_l k_l k_l} (tensor θ'_l ∈ R^{r_l × c_l × k_l × k_l}), and the other is a coefficient matrix β_l^{2D} ∈ R^{c_{l+1} × r_l} (tensor β_l ∈ R^{c_{l+1} × r_l × 1 × 1}), which represents the coefficients of the bases. Therefore, we compute the response of the decomposed layers by

f_{l+1} = σ(f_l ∗ θ'_l ∗ β_l).   (2)

Note that the size of β_l^{2D} reflects the number of ranks (i.e., r_l) and channels (i.e., c_{l+1}). By imposing L_{2,1} regularization on the coefficient matrix β_l^{2D} and its transpose (β_l^{2D})^T, we obtain the basis weight θ'_l^{2D} with fewer rows r_l and fewer
Ω(Θ) = Σ_{l=1}^{L} λ_1 ||β_l^{2D}||_{2,1}.   (5)
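To make the block and the penalty above concrete, the following PyTorch sketch (our own naming and hyperparameters, not the authors' released code) implements one compressed-aware block as a k_l × k_l basis convolution followed by a 1 × 1 coefficient convolution, together with an L_{2,1}-based penalty on the coefficient matrix and its transpose; the column term encourages a low-rank basis, and the row term encourages sparse output channels.

```python
import torch
import torch.nn as nn

class CompressedAwareBlock(nn.Module):
    """One decomposed layer in the spirit of (2): a k x k 'basis' convolution
    producing r bases, followed by a 1 x 1 'coefficient' convolution that maps
    the r bases to c_out output channels."""

    def __init__(self, c_in: int, c_out: int, k: int, r: int,
                 stride: int = 1, padding: int = 0):
        super().__init__()
        self.basis = nn.Conv2d(c_in, r, kernel_size=k, stride=stride,
                               padding=padding, bias=False)      # plays the role of theta'_l
        self.coeff = nn.Conv2d(r, c_out, kernel_size=1, bias=False)  # plays the role of beta_l

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # sigma(f_l * theta'_l * beta_l), with ReLU standing in for sigma
        return torch.relu(self.coeff(self.basis(x)))


def l21_penalty(block: CompressedAwareBlock,
                lam1: float, lam2: float) -> torch.Tensor:
    """L_{2,1}-based penalty on the coefficient matrix beta (c_out x r):
    sum of column l2 norms (rank reduction) plus sum of row l2 norms
    (output-channel sparsity)."""
    beta = block.coeff.weight.flatten(1)          # shape (c_out, r)
    return lam1 * beta.norm(dim=0).sum() + lam2 * beta.norm(dim=1).sum()


# Example: stand-in for a 3 x 3 convolution with 64 inputs and 128 outputs,
# allowing at most r = 64 bases (all sizes here are illustrative).
block = CompressedAwareBlock(c_in=64, c_out=128, k=3, r=64, padding=1)
x = torch.randn(2, 64, 32, 32)
y = block(x)                                      # shape (2, 128, 32, 32)
reg = l21_penalty(block, lam1=1e-4, lam2=1e-4)    # added to the task loss during training
```

Columns of the coefficient matrix that are driven to zero correspond to basis filters that can be removed (reducing r_l), and zeroed rows correspond to prunable output channels, matching the two effects described in Section I.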
Fig. 3. Analysis of the ES strategy of optimization with regularizations. Note that the red dashed line marks the ES epoch. No: training without the ES strategy; ES: training with the ES strategy. (a) Sparse ratio of parameters. (b) Sparse ratio of FLOPs. (c) Test accuracy comparison.
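As a companion to Fig. 3, the schematic loop below illustrates the ES idea analyzed there: the structured penalty is applied only up to a preset epoch (the article uses Epoch_ES = 120 on CIFAR-10/100, with weight decay 10^{-4} and momentum 0.9), after which training simply continues with the same, continuously decaying LR. The toy model, the cosine LR schedule, the total epoch count, and the penalty coefficients are our own assumptions, and for brevity the penalty is added to the loss rather than being handled by the proximal update described in the text that follows.

```python
import torch
import torch.nn as nn

EPOCH_ES, TOTAL_EPOCHS = 120, 160   # Epoch_ES = 120 follows the article; 160 is illustrative
LAM1 = LAM2 = 1e-4                  # illustrative penalty coefficients (lambda_1, lambda_2)

# Toy decomposed layer: 3x3 basis conv (r = 8 bases) followed by a 1x1 coefficient conv.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1, bias=False),   # basis weights
    nn.Conv2d(8, 16, 1, bias=False),             # coefficient matrix beta
    nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
)
coeff_convs = [model[1]]

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=TOTAL_EPOCHS)
criterion = nn.CrossEntropyLoss()

def structured_penalty() -> torch.Tensor:
    # L_{2,1} penalty on each coefficient matrix and its transpose, cf. (5).
    total = torch.zeros(())
    for m in coeff_convs:
        beta = m.weight.flatten(1)               # (c_out, r)
        total = total + LAM1 * beta.norm(dim=0).sum() + LAM2 * beta.norm(dim=1).sum()
    return total

for epoch in range(TOTAL_EPOCHS):
    # Stand-in batch; a real run would iterate over the CIFAR-10 loader.
    x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
    loss = criterion(model(x), y)
    if epoch < EPOCH_ES:                 # ES: drop the constraint after Epoch_ES ...
        loss = loss + structured_penalty()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                     # ... but never reset the LR, unlike a separate fine-tuning run
```

A full implementation would additionally prune the channels that the regularization has driven to zero at the ES point and continue training the pruned model, as the article describes.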
For the convenience of subsequent analysis, (7) is rewritten as

min_Θ g(Θ) + Ω(Θ),  i.e.,  g(Θ) = (1/|D|) Σ_{(x,y)∈D} ℓ(F(Θ, x), y).   (8)

The term g(Θ) is a widely used DNN loss function (e.g., the cross-entropy loss for classification tasks), which is smooth and convex. In this task, ℓ(·) = −Σ_{c=1}^{M} p_c(x_i) log q_c(x_i), which is the cross-entropy loss. Note that M denotes the number of possible class labels, q_c(x_i) is the predicted probability, after the soft-max function, that observation i is of class c, and p_c(x_i) denotes a binary indicator (0 or 1) of whether class label c is the correct classification for observation i. The second term, Ω(Θ), is convex but not differentiable. In order to solve the resulting nonsmooth unconstrained optimization problem, we make use of the proximal gradient descent method [36].

For the first term g(Θ), the gradient update can be obtained by the quadratic approximation

Θ^+ = argmin_z (1/(2α)) ||z − (Θ − α∇g(Θ))||_F^2 + Ω(z)   (9)

in which α is the learning rate (LR), Θ^+ is the next estimate of the network parameters, z ranges over the candidate estimates of the network parameters, and Θ is the estimate from the previous iteration.

As Ω(Θ) is only relevant to {β_l}_{l=1}^{L}, we update {θ_l}_{l=1}^{L} and {β_l}_{l=1}^{L} separately. In detail, we choose the initial Θ^{(0)} and then repeat

{θ_l}^{(n+1)} = {θ_l}^{(n)} − α∇g(Θ^{(n)}),
{β_l}^{(n+1/2)} = {β_l}^{(n)} − α∇g(Θ^{(n)}),   n = 0, 1, 2, …   (10)

{β_l}^{(n+1)} = S_{αλ_1}({β_l}^{(n+1/2)}) + S_{αλ_2}(({β_l}^{(n+1/2)})^T),   n = 0, 1, 2, …   (11)

where {β_l}^{(n+1/2)} denotes the plain gradient step on the coefficient matrices and S(·) is a group soft-thresholding operator defined column-wise and row-wise as

[S_{αλ_1}(β_l)]_j =
    β_l(:, j) − αλ_1 β_l(:, j) / ||β_l(:, j)||_2,   if ||β_l(:, j)||_2 > αλ_1,
    0,                                              otherwise,           (12)

where j denotes the column index, and β_l(:, j) represents the jth column of β_l. Similarly,

[S_{αλ_2}(β_l^T)]_i =
    β_l(i, :) − αλ_2 β_l(i, :) / ||β_l(i, :)||_2,   if ||β_l(i, :)||_2 > αλ_2,
    0,                                              otherwise.           (13)

Substituting (12) and (13) into (11), the optimization problem can be solved.

Early Stopping: According to the experiments, imposing the constraints throughout the whole training process is probably not optimal. Fig. 3 shows the curves of the sparse ratios at different epochs. It shows that both the parameters' sparse ratio and the FLOPs' sparse ratio reach saturation after a certain number of epochs. In addition, when we stop the regularizations early, at the saturation point, and continue training the pruned model during the remaining epochs, the accuracy further increases, as shown in Fig. 3(c). Therefore, we leverage the ES strategy for the regularizations in the training process. Note that ES is different from fine-tuning. In most of the existing methods, such as Slimming [13] and GrOWL [35], training with regularizations and fine-tuning are two separate training processes, and hyperparameters (e.g., the LR) are reset in each training process. On the contrary, ES is only a small training period within the whole compression scheme, where the LR continuously decays and the other parameters are continuously updated.

C. PM Module

After optimization, it is usual to prune the redundant parameters of the network. However, in EDP, it is not enough to only conduct a pruning operation. The decomposition component decomposes one layer into two layers, which brings about a deeper network and some redundant blocks. On the one hand, the deeper network may influence the training convergence. On the other hand, the parameters of the two decomposed layers may be more than those of one single layer. This is because the row number r_l of β_l and the output channel number c_{l+1}
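To make the preceding updates and the PM module more concrete, the standalone helpers below (with names of our own choosing) sketch the column- and row-wise group soft-thresholding of (12) and (13), with tau standing for αλ_1 or αλ_2, and the exact merging of a decomposed pair back into a single convolution, which is valid because no nonlinearity or normalization sits between the two layers in (2).

```python
import torch

def prox_l21_columns(beta: torch.Tensor, tau: float) -> torch.Tensor:
    """Column-wise group soft-thresholding, cf. (12): columns whose l2 norm is
    at most tau are zeroed (removable bases); the rest are shrunk toward zero."""
    norms = beta.norm(dim=0, keepdim=True)                   # (1, r_l)
    return beta * torch.clamp(1.0 - tau / norms.clamp_min(1e-12), min=0.0)

def prox_l21_rows(beta: torch.Tensor, tau: float) -> torch.Tensor:
    """Row-wise group soft-thresholding, cf. (13): zeroed rows correspond to
    prunable output channels."""
    norms = beta.norm(dim=1, keepdim=True)                   # (c_{l+1}, 1)
    return beta * torch.clamp(1.0 - tau / norms.clamp_min(1e-12), min=0.0)

def merge_decomposed_convs(theta: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    """Merge a basis kernel theta (r_l x c_l x k x k) and a 1x1 coefficient
    kernel beta (c_{l+1} x r_l x 1 x 1) into a single equivalent kernel of
    shape c_{l+1} x c_l x k x k (assumes no bias and stride 1 in the 1x1 conv)."""
    return torch.einsum("or,rcuv->ocuv", beta.flatten(1), theta)

# Toy shapes: r_l = 4 bases, c_l = 3 input channels, c_{l+1} = 6 output channels, k = 3.
theta = torch.randn(4, 3, 3, 3)
beta = torch.randn(6, 4, 1, 1)
beta2d = beta.flatten(1)
beta_cols = prox_l21_columns(beta2d, tau=0.1)   # shrink/remove basis directions, cf. (12)
beta_rows = prox_l21_rows(beta2d, tau=0.1)      # shrink/remove output channels, cf. (13)
merged = merge_decomposed_convs(theta, beta)    # single 6 x 3 x 3 x 3 kernel
```

Given the shapes above, the decomposed pair holds r_l(c_l k_l^2 + c_{l+1}) parameters versus c_{l+1} c_l k_l^2 for a single layer, so when r_l is not reduced enough the pair costs more than one layer; this is the situation in which merging the decomposed layers, as the PM module does, pays off.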
TABLE I
TEST ACCURACY OF DIFFERENT TRAINING STRATEGY IN DIFFERENT (λ1, λ2) SETTINGS

TABLE II
VGG ARCHITECTURE VARIANTS FOR THE CIFAR DATA SET IN THIS ARTICLE. THE CONVOLUTIONAL KERNEL IS 3 × 3
TABLE III
PERFORMANCE COMPARISON OF VGG-SMALL AND RESNET56 ON CIFAR-10. NOTE THAT THE TOP TWO RESULTS ARE HIGHLIGHTED WITH RED AND PINK FONTS, RESPECTIVELY. THE “--” INDICATES THAT THE RESULTS ARE NOT LISTED IN THE ORIGINAL ARTICLE

TABLE IV
PRUNED RESULTS OF VGG19 ON CIFAR-100
on CIFAR-10/100, and 30 and 60 epochs on the ImageNet data set. We utilize a weight decay of 10^{-4} and a momentum of 0.9. Besides, we set Epoch_ES to 120 (for the CIFAR-10/100 data sets) and 10 (for the ImageNet data set) in all experiments.

4) Evaluation Metrics: We evaluate the proposed method using the number of network parameters and FLOPs (multiply-adds). Note that the FLOPs are counted for the operations on convolutional and FC layers; some calculations, such as BN and other overheads, are not accounted for. We also report the ratio of parameter or FLOPs reduction, as given by

R_Params = Pruned Params / Original Params,
R_FLOPs = Pruned FLOPs / Original FLOPs.   (15)

In addition, the inference time and run-time memory are leveraged to further evaluate the proposed method.

5) Competing Methods: To analyze the effectiveness of EDP, 17 state-of-the-art methods are taken into account for comparison, including Li et al. [43], SVD [31], ThiNet [52], Slimming [13], NRE [44], GrOWL [35], NISP [53], SFP [33], DCP [47], AMC [49], FPGM [48], SSR [54], ASFP [46], AOFP [55], KSE [50], HRank [45], and DMC [56].

B. Performance Comparisons

In this section, we compare the method on three data sets, including CIFAR-10, CIFAR-100, and ImageNet, with the VGG, ResNet, and MobileNet V2 architectures. The results of these competing methods are reported according to the original articles.

1) CIFAR-10: On CIFAR-10, we compare the EDP with the baseline, GrOWL [35], Li et al. [43], NRE [44], SVD [31], Slimming [13], SFP [33], ASFP [46], DCP [47], AMC [49], FPGM [48], KSE [50], and HRank [45] in Table III. Among these compared methods, Li et al. [43], NRE [44], Slimming [13], SFP [33], DCP [47], ASFP [46], FPGM [48], and HRank [45] are state-of-the-art channel pruning methods. SVD [31] is a low-rank decomposition method, and
TABLE V
PRUNED RESULTS OF RESNETS ON ILSVRC 2012 IMAGENET
TABLE VII
DETAILED RESULTS OF PARAMETERS AND FLOPs OF VGG-SMALL ON CIFAR-10
TABLE VIII
COMPARISON RESULTS OF ADDING THE REGULARIZATION TO θ AND β WITH VGG-SMALL ON THE CIFAR-10 DATA SET. NOTE THAT “OURS (θ)” INDICATES ADDING THE REGULARIZATION TO θ

Fig. 7. Test accuracy at different hyperparameter settings. (a) Loose grid search of λ1. (b) Fine grid search of λ1. (c) Loose grid search of λ2. (d) Fine grid search of λ2.

TABLE IX
TEST ACCURACY AT DIFFERENT PRESET PRUNED RATIOS. IN THE EXPERIMENTS, WE USE VGG-SMALL AND RESNET56 ON THE CIFAR-10 DATA SET AND RESNET50 ON THE IMAGENET DATA SET
TABLE X
TRAINING TIME RESULTS BETWEEN BASELINE AND THE PROPOSED METHOD. IN THE EXPERIMENTS, WE USE VGG-SMALL ON THE CIFAR-10 DATA SET AND RESNET50 ON THE IMAGENET DATA SET. TOTAL TIME↓(%) DENOTES THE TOTAL TIME DROP PERCENTAGE, THE HIGHER THE BETTER
TABLE XI
TRAINING EPOCHS’ COMPARISON OF VGG-SMALL ON THE CIFAR-10 DATA SET. WE CONDUCT THE EXPERIMENTS ONLY USING 150 EPOCHS AT DIFFERENT PRUNED RATIOS
TABLE XII
OBJECT DETECTION COMPARISON FOR THE FASTER R-CNN [57] METHOD ON THE PASCAL VOC2007 DATA SET [58]

9) Generalization Ability on Detection Tasks: As described above, the method is effective in image classification. To further analyze the generalization of the method, we conduct experiments with the compressed model. Specifically, we use ResNet50 as the backbone network to deploy Faster R-CNN [57] for object detection and then compress Faster R-CNN by reducing 40% of the FLOPs of the backbone network. In the experimental implementation, we evaluate the performance with inference time (FPS) and mean average precision (mAP) on the PASCAL VOC 2007 data set [58], which consists of about 5K training/validation images and 5K testing images. As shown in Table XII, the pruned model shows a good result. The mAP of the method is slightly lower than that of the baseline, but the inference speed is faster than the baseline. This demonstrates that the method generalizes well to other tasks.

V. CONCLUSION

In this article, we have proposed an efficient decomposition and pruning (EDP) scheme via a compressed-aware block. The block can be used to replace every layer of a CNN in a plug-and-play way so as to efficiently compress the network. EDP takes only one single process to train a DNN and obtain a high compression rate with barely declining accuracy. To be specific, we have embedded the compressed-aware block by decomposing the original network layer into two layers: one represents the weight basis of this layer and the other represents the coefficient matrix. By adding regularizers on the coefficient matrix during training, a low-rank weight basis and sparse channels have been obtained. Moreover, a PM module has been presented to prune redundant channels and merge redundant decomposed layers, which can further compress and optimize the DNN. Experiments have shown that EDP outperforms the state of the art on pruned ratio, test accuracy, inference time, and run-time memory and is generally applicable to different data sets and networks.

In the future, we will explore layer pruning for deeper networks. Moreover, we will consider integrating the proposed method with other compression methods (e.g., quantization) in real applications.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014, arXiv:1409.1556. [Online]. Available: http://arxiv.org/abs/1409.1556
[3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[4] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4700–4708.
[5] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580–587.
[6] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr, “Fully-convolutional siamese networks for object tracking,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 850–865.
[7] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, “Sequence to sequence-video to text,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 4534–4542.
[8] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, “Exploiting linear structure within convolutional networks for efficient evaluation,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 1269–1277.
[9] W. Wen, C. Xu, C. Wu, Y. Wang, Y. Chen, and H. Li, “Coordinating filters for faster deep neural networks,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 658–666.
[10] X. Zhang, J. Zou, K. He, and J. Sun, “Accelerating very deep convolutional networks for classification and detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 10, pp. 1943–1955, Oct. 2016.
[11] S. Lin, R. Ji, C. Chen, and F. Huang, “ESPACE: Accelerating convolutional neural networks via eliminating spatial and channel redundancy,” in Proc. AAAI Conf. Artif. Intell., 2017, pp. 1424–1430.
[12] Y. He, X. Zhang, and J. Sun, “Channel pruning for accelerating very deep neural networks,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 1389–1397.
[13] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, “Learning efficient convolutional networks through network slimming,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2736–2744.
[14] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 1135–1143.
[15] Y. LeCun, J. S. Denker, and S. A. Solla, “Optimal brain damage,” in Proc. Adv. Neural Inf. Process. Syst., 1990, pp. 598–605.
[16] Y. Guo, A. Yao, and Y. Chen, “Dynamic network surgery for efficient DNNs,” in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 1379–1387.
[17] B. Hassibi and D. G. Stork, “Second order derivatives for network pruning: Optimal brain surgeon,” in Proc. Adv. Neural Inf. Process. Syst., 1993, pp. 164–171.
[18] A. Krogh and J. A. Hertz, “A simple weight decay can improve generalization,” in Proc. Adv. Neural Inf. Process. Syst., 1992, pp. 950–957.
[19] X. Ding, G. Ding, X. Zhou, Y. Guo, J. Han, and J. Liu, “Global sparse momentum SGD for pruning very deep neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 6382–6394.
[20] J. Frankle and M. Carbin, “The lottery ticket hypothesis: Finding sparse, trainable neural networks,” 2018, arXiv:1803.03635. [Online]. Available: http://arxiv.org/abs/1803.03635
[21] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, “Rethinking the value of network pruning,” 2018, arXiv:1810.05270. [Online]. Available: http://arxiv.org/abs/1810.05270
[22] N. Lee, T. Ajanthan, and P. H. S. Torr, “SNIP: Single-shot network pruning based on connection sensitivity,” 2018, arXiv:1810.02340. [Online]. Available: http://arxiv.org/abs/1810.02340
[23] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet classification using binary convolutional neural networks,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 525–542.
[24] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1,” 2016, arXiv:1602.02830. [Online]. Available: http://arxiv.org/abs/1602.02830
[25] M. Courbariaux, Y. Bengio, and J.-P. David, “BinaryConnect: Training deep neural networks with binary weights during propagations,” in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 3123–3131.
[26] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations,” J. Mach. Learn. Res., vol. 18, no. 1, pp. 6869–6898, 2017.
[27] C. Zhu, S. Han, H. Mao, and W. J. Dally, “Trained ternary quantization,” 2016, arXiv:1612.01064. [Online]. Available: http://arxiv.org/abs/1612.01064
[28] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” 2015, arXiv:1510.00149. [Online]. Available: http://arxiv.org/abs/1510.00149
[29] M. Jaderberg, A. Vedaldi, and A. Zisserman, “Speeding up convolutional neural networks with low rank expansions,” 2014, arXiv:1405.3866. [Online]. Available: http://arxiv.org/abs/1405.3866
[30] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin, “Compression of deep convolutional neural networks for fast and low power mobile applications,” 2015, arXiv:1511.06530. [Online]. Available: http://arxiv.org/abs/1511.06530
[31] J. M. Alvarez and M. Salzmann, “Compression-aware training of deep networks,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 856–867.
[32] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky, “Speeding-up convolutional neural networks using fine-tuned CP-decomposition,” 2014, arXiv:1412.6553. [Online]. Available: http://arxiv.org/abs/1412.6553
[33] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang, “Soft filter pruning for accelerating deep convolutional neural networks,” 2018, arXiv:1808.06866. [Online]. Available: http://arxiv.org/abs/1808.06866
[34] G. Huang, S. Liu, L. V. D. Maaten, and K. Q. Weinberger, “CondenseNet: An efficient DenseNet using learned group convolutions,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 2752–2761.
[35] D. Zhang, H. Wang, M. Figueiredo, and L. Balzano, “Learning to share: Simultaneous parameter tying and sparsification in deep learning,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2018, pp. 1–14.
[36] M. Schmidt, N. L. Roux, and F. R. Bach, “Convergence rates of inexact proximal-gradient methods for convex optimization,” in Proc. Adv. Neural Inf. Process. Syst., 2011, pp. 1458–1466.
[37] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J. Imag. Sci., vol. 2, no. 1, pp. 183–202, Jan. 2009.
[38] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Dept. Comput. Sci., Univ. Toronto, Toronto, ON, Canada, Tech. Rep., 2009.
[39] O. Russakovsky et al., “ImageNet large scale visual recognition challenge,” Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015.
[40] S. Zagoruyko, “92.45% on CIFAR-10 in Torch,” Torch Blog, 2015. [Online]. Available: http://torch.ch/blog/2015/07/30/cifar.html
[41] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” 2015, arXiv:1502.03167. [Online]. Available: http://arxiv.org/abs/1502.03167
[42] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4510–4520.
[43] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. Peter Graf, “Pruning filters for efficient ConvNets,” 2016, arXiv:1608.08710. [Online]. Available: http://arxiv.org/abs/1608.08710
[44] C. Jiang, G. Li, C. Qian, and K. Tang, “Efficient DNN neuron pruning by minimizing layer-wise nonlinear reconstruction error,” in Proc. 27th Int. Joint Conf. Artif. Intell., Jul. 2018, pp. 2298–2304.
[45] M. Lin et al., “HRank: Filter pruning using high-rank feature map,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 1529–1538.
[46] Y. He, X. Dong, G. Kang, Y. Fu, C. Yan, and Y. Yang, “Asymptotic soft filter pruning for deep convolutional neural networks,” IEEE Trans. Cybern., vol. 50, no. 8, pp. 3594–3604, Aug. 2020.
[47] Z. Zhuang et al., “Discrimination-aware channel pruning for deep neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 875–886.
[48] Y. He, P. Liu, Z. Wang, Z. Hu, and Y. Yang, “Filter pruning via geometric median for deep convolutional neural networks acceleration,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 4340–4349.
[49] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, “AMC: AutoML for model compression and acceleration on mobile devices,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 784–800.
[50] Y. Li et al., “Exploiting kernel sparsity and entropy for interpretable CNN compression,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 2800–2809.
[51] A. Paszke et al., “Automatic differentiation in PyTorch,” in Proc. Adv. Neural Inf. Process. Syst. Workshop, 2017, pp. 1–4. [Online]. Available: pytorch.org
[52] J.-H. Luo, J. Wu, and W. Lin, “ThiNet: A filter level pruning method for deep neural network compression,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 5058–5066.
[53] R. Yu et al., “NISP: Pruning networks using neuron importance score propagation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 9194–9203.
[54] S. Lin, R. Ji, Y. Li, C. Deng, and X. Li, “Toward compact ConvNets via structure-sparsity regularized filter pruning,” IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 2, pp. 574–588, Feb. 2020.
[55] X. Ding, G. Ding, Y. Guo, J. Han, and C. Yan, “Approximated oracle filter pruning for destructive CNN width optimization,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 1607–1616.
[56] S. Gao, F. Huang, J. Pei, and H. Huang, “Discrete model compression with resource constraint for deep neural networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 1899–1908.
[57] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 91–99.
[58] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The Pascal visual object classes challenge: A retrospective,” Int. J. Comput. Vis., vol. 111, no. 1, pp. 98–136, Jan. 2015.

Xiaofeng Ruan received the B.E. degree in mechanical design, manufacturing, and automation from Dalian Maritime University, Dalian, China, in 2012, and the M.E. degree in aerospace manufacturing engineering from the Harbin Institute of Technology, Harbin, China, in 2014. He is currently pursuing the Ph.D. degree in pattern recognition and intelligent system with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China.
His current research interests include pattern recognition, deep learning, and model compression.

Yufan Liu received the B.S. degree from Zhejiang University, Hangzhou, China, in 2015, and the M.S. degree from Beihang University, Beijing, China, in 2018.
She is currently a Research Associate with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing. She has published several articles in international journals and conference proceedings, including the IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), the European Conference on Computer Vision (ECCV), and so on. Her research interests include computer vision, model compression, and saliency prediction.

Chunfeng Yuan received the Ph.D. degree from the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China, in 2010.
She is currently an Associate Professor with CASIA. Her research interests and publications range from statistics to computer vision, including sparse representation, deep learning, motion analysis, action recognition, and event detection.

Bing Li received the Ph.D. degree from the Department of Computer Science and Engineering, Beijing Jiaotong University, Beijing, China, in 2009.
He is currently a Professor with the Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing. His current research interests include video understanding, color constancy, visual saliency, multiinstance learning, and web content security.
Weiming Hu (Senior Member, IEEE) received the Ph.D. degree from the Department of Computer Science and Engineering, Zhejiang University, Zhejiang, China, in 1998.
From 1998 to 2000, he was a Postdoctoral Research Fellow with the Institute of Computer Science and Technology, Peking University, Beijing, China. He is currently a Professor with the Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing. His research interests are visual motion analysis, recognition of web objectionable information, and network intrusion detection.

Stephen Maybank (Fellow, IEEE) received the B.A. degree in mathematics from King’s College, Cambridge, U.K., in 1976, and the Ph.D. degree in computer science from Birkbeck, University of London, London, U.K., in 1988.
He is currently a Professor with the School of Computer Science and Information Systems, Birkbeck College, University of London. His research interests include the geometry of multiple images, camera calibration, and visual surveillance.