

EDP: An Efficient Decomposition and Pruning Scheme for Convolutional Neural Network Compression

Xiaofeng Ruan, Yufan Liu, Chunfeng Yuan, Bing Li, Weiming Hu, Senior Member, IEEE, Yangxi Li, and Stephen Maybank, Fellow, IEEE

Abstract— Model compression methods have become popular in recent years; they aim to alleviate the heavy load of deep neural networks (DNNs) in real-world applications. However, most existing compression methods have two limitations: 1) they usually adopt a cumbersome process, including pretraining, training with a sparsity constraint, pruning/decomposition, and fine-tuning, and the last three stages are usually iterated multiple times; and 2) the models are pretrained under explicit sparsity or low-rank assumptions, which cannot be guaranteed to hold widely. In this article, we propose an efficient decomposition and pruning (EDP) scheme via constructing a compressed-aware block that can automatically minimize the rank of the weight matrix and identify the redundant channels. Specifically, we embed the compressed-aware block by decomposing one network layer into two layers: a new weight matrix layer and a coefficient matrix layer. By imposing regularizers on the coefficient matrix, the new weight matrix learns to become a low-rank basis weight, and its corresponding channels become sparse. In this way, the proposed compressed-aware block simultaneously achieves low-rank decomposition and channel pruning in only one single data-driven training stage. Moreover, the network architecture is further compressed and optimized by a novel Pruning & Merging (PM) module, which prunes redundant channels and merges redundant decomposed layers. Experimental results (17 competitors) on different data sets and networks demonstrate that the proposed EDP achieves a high compression ratio with acceptable accuracy degradation and outperforms the state of the arts on compression rate, accuracy, inference time, and run-time memory.

Index Terms— Data-driven, low-rank decomposition, model compression and acceleration, structured pruning.

Manuscript received November 28, 2019; revised June 24, 2020; accepted August 8, 2020. This work was supported in part by the National Key Research and Development Program of China under Grant 2018AAA0102802, Grant 2018AAA0102803, Grant 2018AAA0102800, Grant 2018YFC0823003, and Grant 2017YFB1002801; in part by the Natural Science Foundation of China under Grant 61902401, Grant 61972071, Grant 61751212, Grant 61721004, Grant 61972397, Grant 61772225, Grant 61906052, and Grant U1803119; in part by the NSFC-General Technology Collaborative Fund for basic research under Grant U1636218, Grant U1936204, and Grant U1736106; in part by the Beijing Natural Science Foundation under Grant L172051, Grant JQ18018, and Grant L182058; in part by the CAS Key Research Program of Frontier Sciences under Grant QYZDJ-SSW-JSC040; in part by the CAS External Cooperation Key Project; and in part by the NSF of Guangdong under Grant 2018B030311046. The work of Bing Li was supported by the Youth Innovation Promotion Association, CAS. (Xiaofeng Ruan and Yufan Liu contributed equally to this work.) (Corresponding authors: Chunfeng Yuan; Bing Li.)

Xiaofeng Ruan and Yufan Liu are with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and also with the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China (e-mail: [email protected]; [email protected]).
Chunfeng Yuan is with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China (e-mail: [email protected]).
Bing Li is with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and also with PeopleAI Inc., Beijing 100190, China (e-mail: [email protected]).
Weiming Hu is with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, also with the CAS Center for Excellence in Brain Science and Intelligence Technology, Beijing 100190, China, and also with the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China (e-mail: [email protected]).
Yangxi Li is with the National Computer Network Emergency Response Technical Team/Coordination Center of China (CNCERT/CC), Beijing 100029, China (e-mail: [email protected]).
Stephen Maybank is with the Department of Computer Science and Information Systems, Birkbeck College, University of London, London WC1E 7HX, U.K. (e-mail: [email protected]).
Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNNLS.2020.3018177

I. INTRODUCTION

COMPARED with traditional machine learning algorithms, deep neural network (DNN) models [1]–[4] have achieved better performance in several fields such as image classification [1], object detection [5], object tracking [6], and video understanding [7]. However, their large numbers of parameters and FLoating-point OPerations (FLOPs) make them difficult to deploy on mobile and embedded devices. Therefore, model compression of DNNs is a fundamental problem that has been studied extensively in recent years.

Recently, low-rank decomposition methods [8]–[10], such as singular value decomposition (SVD) [8], have been used for model compression by decomposing an original network layer into two lightweight layers. However, these low-rank decomposition methods have some limitations. First, an appropriate hyperparameter for the rank of the filters must be selected [8], [11], at the cost of many validation experiments. Second, the compression rates of these methods are limited if the DNN is pretrained without a low-rank constraint. Third, these methods fail to remove all the redundant channels and occupy much run-time memory. Moreover, the decomposition of each layer makes the network too deep, which may harm accuracy.

Fig. 1. Illustration of the proposed method. (a) Blue blocks comprise the original weight matrix at the lth layer, with size c_{l+1} × c_l. The orange blocks represent the decomposed matrices: a basis weight matrix of size r_l × c_l and a coefficient matrix of size c_{l+1} × r_l. Low-rank decomposition is achieved by minimizing r_l, whereas sparse output channels are obtained by minimizing c'_{l+1}. (b) Whole optimization process of the proposed method.

Channel pruning methods [12], [13] have become increasingly popular in recent years. They prune unimportant input–output channel-wise connections of DNNs and are more hardware-friendly than traditional weight pruning methods. However, the learning procedure is cumbersome, as it includes four stages: pretraining, training with a sparsity constraint, pruning, and fine-tuning. To achieve better performance, the whole procedure may be iterated several times. In addition, channel pruning methods change the input–output dimensions of a layer, which may not work well in some blocks, such as element-wise addition blocks.

To overcome the above limitations, we propose an efficient decomposition and pruning (EDP) scheme via constructing a compressed-aware block. The proposed block can replace every layer of a convolutional neural network (CNN) in a plug-and-play way so as to efficiently compress the network. The proposed compression procedure consists of one single efficient training process and adaptively learns an optimal architecture. Fig. 1(a) shows the architecture and optimization process of the proposed method. Specifically, we embed the compressed-aware block by decomposing the original network layer into two layers: one is a basis weight matrix and the other is a coefficient matrix. By imposing L2,1-based constraints on the coefficient matrix, the block learns to be compact adaptively. On the one hand, the proposed constraints make the columns (i.e., the input space) of the coefficient matrix sparse. After pruning the redundant connections, the remaining channels of the basis weight matrix can be regarded as the bases of the original weight space; thus, the low-rank purpose is achieved. On the other hand, the proposed constraints push redundant rows (i.e., the output space) of the coefficient matrix to zero. As a result, unimportant connections to the next layer are identified and then pruned; thus, the remaining channels are sparse. Furthermore, the proposed method can degenerate into two special cases: a low-rank decomposition case and a channel pruning case. The two special cases can be used separately in some situations. For example, the low-rank decomposition case does not change the output dimension, so it can be used in element-wise addition blocks.

The optimization of the proposed method is based on a proximal gradient algorithm. By setting appropriate penalty coefficients for the regularizers, the tradeoff between accuracy and pruned ratio can be controlled adaptively. Moreover, we also find that early stopping (ES) of these constraints achieves better performance. Once the network has been learned to be sparse and low rank, a Pruning & Merging (PM) module is deployed to prune the redundant filters and merge the redundant decomposed layers. Note that the merging operation removes redundant layers, preventing the network from becoming too deep. Finally, a lightweight and hardware-friendly DNN model with high performance is obtained. The whole process of the proposed method is shown in Fig. 1(b). Experiments on several data sets and a range of network architectures show the effectiveness of the proposed method. We can obtain a DNN model with up to 95.59% parameter compression, 80.11% FLOPs reduction, 3.3× speedup of inference time, and 1.8× run-time memory saving with the VGG-small architecture on the CIFAR-10 data set, while keeping acceptable accuracy.

The main contributions are summarized as follows.
1) We propose a compressed-aware block, which can be put in any CNN layer as a plug-in to efficiently compress the network. Low rank and channel sparsity are adaptively achieved by introducing L2,1-based constraints on the coefficient matrix.
2) A PM module is presented to remove not only redundant filters but also redundant decomposed layers. Thus, an optimal architecture can be adaptively obtained.
3) The proposed method is flexible. It can achieve low-rank decomposition and channel pruning, either separately or together, when compressing a network.

II. RELATED WORKS

Related works for compressing neural networks can be grouped into four categories: weight sparsity (nonstructured) pruning, parameter (weight) quantization, low-rank decomposition, and structured sparsity pruning.


Most of the early studies [14]–[17] on weight sparsity pruning focus on the importance of weights. They pruned the unimportant weights according to different criteria. Han et al. [14] and Guo et al. [16] pruned weights based on the magnitude of the parameters. LeCun et al. [15] used second-derivative information to identify insignificant weights. In order to make the weights sparse, regularizations [18], e.g., L1 and L2, were imposed on the loss function during training. Later, to further improve the performance of the compressed model, Ding et al. [19] gradually reduced the redundant weights to zero by directly altering the gradient flow based on momentum stochastic gradient descent (SGD). Besides, some recent works [20]–[22] articulated the lottery ticket hypothesis and learned very sparse networks (winning tickets) that obtain almost the same performance as the original network. Although the storage space can be dramatically reduced, these methods rely on specific operation libraries and hardware. The run-time memory saving is limited because most memory space is consumed by the feature maps rather than the weights.

Parameter (weight) quantization methods [23]–[27] express the floating-point weights with a few bits. For example, XNOR-Net [23], binarized neural networks [24], and BinaryConnect [25] were proposed to quantize the floating-point weights into binary values, and quantized neural networks [26] were trained with low-precision weights and activations. Han et al. [28] proposed a three-stage compression pipeline, in which they first pruned the model, then quantized the weights, and finally adopted Huffman coding to further increase the compression rate. However, parameter (weight) quantization methods usually result in a moderate accuracy degradation on large DNNs.

Low-rank decomposition methods [8], [10], [29]–[32] have been explored over the past few years. They decompose the weight matrix of a DNN into several pieces, using techniques such as SVD [8], [10] and canonical decomposition/parallel factors decomposition (CP decomposition) [32]. In [29], a decomposition method was presented by separating k × k filters into k × 1 and 1 × k filters. Kim et al. [30] utilized Tucker decomposition on a kernel tensor to compress the networks. It consists of three steps: rank selection, low-rank tensor decomposition, and fine-tuning, and it requires additional experiments to select an appropriate rank. Most works did not consider the low-rank constraint in the training process, and thus the compression rate is limited. Recently, Alvarez and Salzmann [31] introduced the low-rank constraint (i.e., the nuclear norm) and the sparse group least absolute shrinkage and selection operator (LASSO) regularizer to train the network before SVD-based decomposition. However, after training, the SVD operation is time-consuming, and the decomposed layers occupy much run-time memory.

Structured pruning methods [12], [13], [33]–[35] directly remove redundant neurons and channels rather than irregular weights. Some works [12], [13] imposed LASSO regression to learn the importance of each channel, whereas several works [33], [35] ranked filters by criteria such as their absolute values and pruned the unimportant ones. In addition, some works take the correlations between filters into account; for example, [35] tied strongly correlated neurons to a common value. Since these methods prune parts of the network structure (e.g., channels) instead of individual weights, they do not need extra libraries or hardware, unlike weight pruning methods. However, almost all of these methods need a pretraining stage and a fine-tuning stage, and some even iterate the multiple stages to further enhance accuracy. Furthermore, channel pruning methods change the input–output dimensions of a layer, which may not match the dimensions in element-wise addition blocks. For instance, when ResNets are compressed, due to the unequal dimensions of the input–output channels, channel pruning methods cannot prune the last layer of each residual block. This may lead to insufficient compression.

To overcome the above weaknesses, the proposed method integrates low-rank decomposition and structured sparsity pruning into a unified framework. The two components can be performed either separately or together in one efficient training process, without cumbersome stages.

III. PROPOSED METHOD

In this section, we first introduce the EDP algorithm, which integrates low-rank decomposition and channel pruning to reduce redundancy. Second, a proximal gradient method is introduced to deal with the optimization of EDP. Then, a PM module is presented to further compact the network. The learning procedure is summarized at the end.

A. EDP Algorithm

In the original DNN, the weights of the lth layer are denoted as θ_l ∈ R^{c_{l+1} × c_l × k_l × k_l}, where c_l and c_{l+1} are the numbers of input and output channels, respectively, and k_l is the kernel size. Without loss of generality, we reorganize the parameters from the original space θ_l ∈ R^{c_{l+1} × c_l × k_l × k_l} into θ_l^{2D} ∈ R^{c_{l+1} × c_l k_l k_l}. Given the input feature map f_l, the output response is obtained by

f_{l+1} = σ(f_l ∗ θ_l)   (1)

where σ(·) is an activation function, such as the rectified linear unit (ReLU), and ∗ is the 2-D convolution operator. Fig. 2(a) shows an illustration of the original DNN layer. The input is the feature map f_l with c_l channels, and the output is the feature map f_{l+1} with c_{l+1} channels. The parameter θ_l has c_{l+1} filters of size c_l × k_l × k_l.

In this approach, the original layer is decomposed into two layers: one is a basis weight matrix θ'_l^{(2D)} ∈ R^{r_l × c_l k_l k_l} (tensor θ'_l ∈ R^{r_l × c_l × k_l × k_l}), and the other is a coefficient matrix β_l^{2D} ∈ R^{c'_{l+1} × r_l} (tensor β_l ∈ R^{c'_{l+1} × r_l × 1 × 1}), which holds the coefficients of the bases. Therefore, we compute the response of the decomposed layers by

f_{l+1} = σ(f_l ∗ θ'_l ∗ β_l).   (2)

Note that the size of β_l^{2D} reflects the number of ranks (i.e., r_l) and channels (i.e., c'_{l+1}). By imposing L2,1 regularization on the coefficient matrix β_l^{2D} and its transpose (β_l^{2D})^T, we obtain a basis weight θ'_l^{(2D)} with fewer rows r_l and fewer output channels c'_{l+1}, simultaneously.
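To make the decomposition in (2) concrete, the following is a minimal PyTorch sketch of one possible compressed-aware block: a k_l × k_l basis convolution with r_l filters followed by a 1 × 1 coefficient convolution. The class and argument names are ours for illustration and are not taken from the authors' code; the ReLU activation and the bias-free convolutions are assumptions.

```python
import torch.nn as nn


class CompressedAwareBlock(nn.Module):
    """Decomposed layer of (2): f_{l+1} = sigma(f_l * theta'_l * beta_l)."""

    def __init__(self, c_in, c_out, kernel_size, stride=1, padding=0):
        super().__init__()
        r = c_out  # rank r_l starts at c_{l+1} and is shrunk by the L2,1 constraints
        # basis weight theta'_l: r_l filters of size c_l x k_l x k_l
        self.basis = nn.Conv2d(c_in, r, kernel_size, stride=stride,
                               padding=padding, bias=False)
        # coefficient matrix beta_l: c'_{l+1} x r_l x 1 x 1
        self.coeff = nn.Conv2d(r, c_out, kernel_size=1, bias=False)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.coeff(self.basis(x)))
```

A plain nn.Conv2d(c_in, c_out, k) layer can be swapped for such a block one-for-one; only the 1 × 1 coefficient convolution carries the constraints introduced below.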


In this manner, given input–output pairs (x, y) from the data set D, the loss function is formulated as

Loss = (1/|D|) Σ_{(x,y)∈D} ℓ(F(Θ, x), y) + Ω(Θ)   (3)

where Θ = {θ'_l, β_l}_{l=1}^L encompasses all network parameters and F(·) is the DNN forward function. The first term ℓ(·) is a standard loss function, such as the cross-entropy loss. The second term Ω(Θ) is a regularization term. In order to encourage the parameters in each layer to have low rank and sparse channels, Ω(Θ) is computed as

Ω(Θ) = λ1 Σ_{l=1}^L ||β_l^{2D}||_{2,1} + λ2 Σ_{l=1}^L ||(β_l^{2D})^T||_{2,1},
where ||β_l^{2D}||_{2,1} = Σ_{j=1}^{r_l} ( Σ_{i=1}^{c'_{l+1}} (β_l^{2D}(i, j))^2 )^{1/2}.   (4)

Note that λ1 and λ2 are the penalty coefficients, and i and j denote the ith row and jth column, respectively. It is worth mentioning that adding the constraints on β means fewer parameters to optimize and a joint optimization, which helps to promote the training performance. In the ablation analysis, it is verified that adding regularizations to β is more effective and efficient than adding regularizations to θ, and that joint optimization outperforms separate optimization.

Fig. 2. Illustration of the proposed algorithm. The blocks shown only by outlines are eventually pruned. (a) Original layer. (b) Our layer: low-rank decomposition. (c) Our layer: channel pruning.

By setting the values of λ1 and λ2, we obtain different special cases, which are discussed in detail as follows.

1) λ1 ≠ 0, λ2 = 0 (Low-Rank Decomposition Component): In this case, (4) reduces to

Ω(Θ) = λ1 Σ_{l=1}^L ||β_l^{2D}||_{2,1}.   (5)

Hence, after training the model, a number of columns of β_l^{2D} are forced to be zero. The columns of β_l^{2D} with zero values are pruned, and the corresponding rows of θ'_l^{(2D)} are also pruned. Fig. 2(b) shows this case: the original parameter θ_l^{2D} is decomposed into a basis weight matrix θ'_l^{(2D)} with r_l filters and a coefficient matrix β_l^{2D} with c_{l+1} filters (each filter represents a row of the matrix). Note that each filter of β_l has size r_l × 1 × 1, and the initial value of r_l is c_{l+1}. During the optimization process, the filter number (i.e., r_l) of θ'_l^{(2D)} is trained to become smaller and smaller, so that θ'_l^{(2D)} becomes a low-rank basis weight matrix. In this case, we achieve low-rank decomposition of a DNN layer by training with (5). It is applicable to every type of DNN, including ResNet [3], since it does not change the input–output dimensions. By using the coefficient matrix β_l^{2D} with the L2,1 regularizer, the low-rank basis weight θ'_l^{(2D)} is generated automatically, without manual rank selection.

When the input–output dimensions must be kept invariant, we use this decomposition case to compress the layer, because the decomposition case degenerated from the proposed method only decomposes one layer into two layers and does not influence the input–output dimensions.

2) λ2 ≠ 0, λ1 = 0 (Channel Pruning Component): In this case, (4) reduces to

Ω(Θ) = λ2 Σ_{l=1}^L ||(β_l^{2D})^T||_{2,1}.   (6)

Similarly, β_l^{2D} is trained by (6) to have sparse rows. Pruning the rows with zero values reduces the number of output channels. As shown in Fig. 2(c), the number of output channels c'_{l+1} is initialized to c_{l+1} and becomes smaller as training proceeds. Hence, a more compact output f_{l+1} is obtained, saving much run-time memory. In comparison with existing channel pruning methods, such as Slimming [13], this case needs only one whole training process without multistage pruning and achieves high performance.

3) λ1 ≠ 0, λ2 ≠ 0 (EDP): When λ1 and λ2 are both nonzero, we make use of the low-rank decomposition component and the channel pruning component simultaneously. The initial values of r_l and c'_{l+1} are both set to c_{l+1}. The DNN is then trained by minimizing the loss function in (3). After pruning, a low-rank basis weight θ'_l with a small number (i.e., r_l) of filters and an output feature map f_{l+1} with a small number (i.e., c'_{l+1}) of channels are obtained. In EDP, the two components compensate for each other's drawbacks. This ensures an extreme compression rate with only a small decline in performance.
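For reference, a small sketch of how the penalty Ω(Θ) in (4)–(6) can be evaluated over the coefficient matrices. The list `coeff_convs` (the 1 × 1 coefficient convolutions) and the function names are our own illustration; note that in Section III-B the penalty is handled by a proximal step rather than by backpropagating through this term, so this sketch only evaluates (4).

```python
import torch


def l21_norm(mat):
    """L2,1 norm in (4): sum over columns of the column-wise L2 norms."""
    return mat.norm(p=2, dim=0).sum()


def edp_regularizer(coeff_convs, lambda1, lambda2):
    """Omega(Theta) of (4); set lambda1 or lambda2 to zero to obtain (6) or (5)."""
    reg = 0.0
    for conv in coeff_convs:
        beta = conv.weight.view(conv.weight.size(0), -1)   # c'_{l+1} x r_l
        reg = reg + lambda1 * l21_norm(beta)               # sparse columns -> low rank
        reg = reg + lambda2 * l21_norm(beta.t())           # sparse rows -> pruned channels
    return reg
```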


B. Optimization

Optimization: The optimization objective is to minimize the loss defined in (3)

min_Θ (1/|D|) Σ_{(x,y)∈D} ℓ(F(Θ, x), y) + Ω(Θ).   (7)

For the convenience of the subsequent analysis, (7) is rewritten as

min_Θ g(Θ) + Ω(Θ), with g(Θ) = (1/|D|) Σ_{(x,y)∈D} ℓ(F(Θ, x), y).   (8)

The term g(Θ) is a widely used DNN loss function (e.g., the cross-entropy loss for classification tasks), which is smooth and convex. In this task, ℓ(·) = −Σ_{c=1}^M p_c(x_i) log q_c(x_i), which is the cross-entropy loss. Here, M denotes the number of possible class labels, q_c(x_i) is the predicted probability (after the soft-max function) that observation i belongs to class c, and p_c(x_i) is a binary indicator (0 or 1) of whether class label c is the correct classification for observation i. The second term, Ω(Θ), is convex but not differentiable. To solve the resulting nonsmooth unconstrained optimization problem, we make use of the proximal gradient descent method [36]. For the first term g(Θ), the gradient update can be obtained from the quadratic approximation

Θ^+ = argmin_z (1/(2α)) ||z − (Θ − α∇g(Θ))||_F^2 + Ω(z)   (9)

in which α is the learning rate (LR), Θ^+ is the next estimate of the network parameters, z ranges over the candidate solutions for the network parameters, and Θ is the estimate from the previous iteration.

As Ω(Θ) is only relevant to {β_l}_{l=1}^L, we update {θ'_l}_{l=1}^L and {β_l}_{l=1}^L separately. In detail, we choose an initial Θ^{(0)} and then repeat

{θ'_l}^{(n+1)} = {θ'_l}^{(n)} − α∇g(Θ^{(n)}),  {β'_l}^{(n)} = {β_l}^{(n)} − α∇g(Θ^{(n)}),  n = 0, 1, 2, . . .   (10)

{β_l}^{(n+1)} = S_{αλ1}({β'_l}^{(n)}) + S_{αλ2}(({β'_l}^T)^{(n)}),  n = 0, 1, 2, . . .   (11)

Simultaneously, using the soft-thresholding algorithm [37], we obtain

[S_{αλ1}(β_l)]_j = β_l(:, j) − αλ1 β_l(:, j)/||β_l(:, j)||_2, if ||β_l(:, j)||_2 > αλ1; and 0 otherwise,   (12)

where j denotes the column index and β_l(:, j) represents the jth column of β_l. Similarly,

[S_{αλ2}(β_l^T)]_i = β_l(i, :) − αλ2 β_l(i, :)/||β_l(i, :)||_2, if ||β_l(i, :)||_2 > αλ2; and 0 otherwise.   (13)

Substituting (12) and (13) into (11), the optimization problem can be solved.

Fig. 3. Analysis of the ES strategy of optimization with regularizations. Note that the red dashed line marks the ES epoch. No: training without the ES strategy; ES: training with the ES strategy. (a) Sparse ratio of parameters. (b) Sparse ratio of FLOPs. (c) Test accuracy comparison.

Early Stopping: According to the experiments, imposing the constraints throughout the whole training process is probably not optimal. Fig. 3 shows the curves of the sparse ratios at different epochs. Both the parameters' sparse ratio and the FLOPs' sparse ratio reach saturation after a certain number of epochs. In addition, when we early stop the regularizations at the saturation point and continue training the pruned model for the remaining epochs, the accuracy further increases, as shown in Fig. 3(c). Therefore, we leverage the ES strategy for the regularizations in the training process. Note that ES is different from fine-tuning. In most existing methods, such as Slimming [13] and GrOWL [35], training with regularizations and fine-tuning are two separate training processes, and hyperparameters (e.g., the LR) are reset in each training process. On the contrary, ES is only a small training period within the whole compression scheme, during which the LR continues to decay and the other parameters are continuously updated.
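The proximal update (11)–(13) can be written as a short post-step applied to every coefficient matrix right after the plain SGD step of (10). The sketch below is illustrative: the helper names are ours, and applying the column-wise and row-wise shrinkages one after the other, rather than combining them exactly as written in (11), is a simplification we assume.

```python
import torch


@torch.no_grad()
def group_soft_threshold(beta, thresh, dim):
    """S_t of (12)/(13): shrink each group (column if dim=0, row if dim=1)
    toward zero by `thresh`, zeroing groups whose L2 norm is below it."""
    norms = beta.norm(p=2, dim=dim, keepdim=True)
    scale = torch.clamp(1.0 - thresh / norms.clamp_min(1e-12), min=0.0)
    return beta * scale


@torch.no_grad()
def proximal_step(coeff_convs, lr, lambda1, lambda2):
    """Apply the shrinkage to every beta_l after the gradient update."""
    for conv in coeff_convs:
        w = conv.weight.data
        beta = w.view(w.size(0), -1)                             # c'_{l+1} x r_l
        beta = group_soft_threshold(beta, lr * lambda1, dim=0)   # columns, (12)
        beta = group_soft_threshold(beta, lr * lambda2, dim=1)   # rows, (13)
        w.copy_(beta.view_as(w))
```

In a training loop, `proximal_step` would be called once per iteration, immediately after the SGD update, and no longer called after Epoch_ES, consistent with the ES strategy above.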


C. PM Module

After optimization, it is usual to prune the redundant parameters of the network. However, in EDP, it is not enough to conduct only a pruning operation. The decomposition component splits one layer into two, which results in a deeper network and some redundant blocks. On the one hand, the deeper network may hinder training convergence. On the other hand, the two decomposed layers may together contain more parameters than the single original layer, because the row number r_l of β_l and the output channel number c'_{l+1} are determined by learning, unlike existing low-rank decomposition methods, in which r_l is set to be much smaller than c'_{l+1}. In this case, it is better to merge the two layers and obtain a more compact architecture.

Therefore, we present a PM module to prune the redundant channels and to merge the redundant decomposed layers. Specifically, we first prune the zero-valued columns and rows of {β_l}_{l=1}^L to obtain a pruned model. Based on the pruned model, if

c_l × c'_{l+1} × k_l k_l ≤ c_l × r_l × k_l k_l + r_l × c'_{l+1} × (1 · 1),  i.e.,  r_l ≥ c_l c'_{l+1} k_l k_l / (c_l k_l k_l + c'_{l+1}),   (14)

we merge the two decomposed layers into a single convolutional layer of size c'_{l+1} × c_l × k_l × k_l. Experimental results in Table I show that the PM module further improves the model performance and yields a smaller model size. It compensates for the drawback that ES slightly increases the model size.

TABLE I. Test accuracy of different training strategies under different (λ1, λ2) settings.
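A sketch of the merging test (14) and of folding the 1 × 1 coefficient convolution back into the basis convolution when the decomposed pair is no smaller than a single layer. The inequality direction follows the reconstruction above; the function is our illustration rather than the authors' implementation, and it assumes bias-free convolutions with the activation applied only after the pair, as in (2). The zero-valued rows and columns of β_l are assumed to have been pruned already.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def prune_and_maybe_merge(basis: nn.Conv2d, coeff: nn.Conv2d) -> nn.Module:
    """Return the decomposed pair, or a single merged convolution if (14) holds."""
    r, c_in, k, _ = basis.weight.shape           # r_l x c_l x k_l x k_l
    c_out = coeff.weight.shape[0]                # c'_{l+1}
    single_params = c_in * c_out * k * k
    decomposed_params = c_in * r * k * k + r * c_out
    if single_params > decomposed_params:        # decomposition still pays off: keep it
        return nn.Sequential(basis, coeff)
    # Fold the 1x1 conv into the basis: W_merged[o, i, :, :] = sum_r beta[o, r] * theta'[r, i, :, :]
    beta = coeff.weight.view(c_out, r)
    merged_weight = torch.einsum('or,rikl->oikl', beta, basis.weight)
    merged = nn.Conv2d(c_in, c_out, k, stride=basis.stride,
                       padding=basis.padding, bias=False)
    merged.weight.copy_(merged_weight)
    return merged
```

Because both convolutions are linear and the activation comes only after the pair, the folded weight computes exactly the same response as the pair while meeting the parameter bound in (14).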
D. Learning Procedure

We summarize the proposed method in Algorithm 1. In the whole training process, the original network is first decomposed and then updated repeatedly by jointly optimizing (3). At Epoch_ES, we early stop the regularization Ω(Θ) and apply the PM module to prune the zero-valued channels and merge the redundant decomposed layers. After that, we continue training the network until the end. At this point, the whole training process is finished and we obtain a lightweight network with only a slight accuracy degradation.

Algorithm 1: Proposed Method EDP
Input: model network, data set D, LR α.
Output: a lightweight network.
1: Decompose network and initialize the model parameters Θ^(0);
2: t = 0;
3: while t < Epoch_ES do
4:   for each iteration ∈ epoch t do
5:     Update: Θ ← Θ − α∇g(Θ) using data set D;
6:     for each β_l ∈ {β_l}_{l=1}^L do
7:       Update β_l via (11), (12), and (13);
8:     end
9:   end
10:  t = t + 1;
11: end
12: Remove the redundant parameters using the PM module;
13: Continue training until the end (Epoch_end);
14: Obtain a lightweight network with acceptable accuracy degradation.

IV. EXPERIMENT

A. Settings

1) Data Sets: We evaluate the proposed method on three data sets: CIFAR-10, CIFAR-100 [38], and ILSVRC 2012 ImageNet [39]. The CIFAR-10/100 data sets contain 50 000 training images and 10 000 test images with resolution 32 × 32. CIFAR-10 is drawn from 10 classes, whereas CIFAR-100 is split into 100 categories. All images in CIFAR-10 are normalized using mean = [0.4914, 0.4822, 0.4465] and std = [0.2470, 0.2435, 0.2616], and CIFAR-100 is normalized using mean = [0.5071, 0.4867, 0.4408] and std = [0.2675, 0.2565, 0.2761]. ImageNet is a large-scale data set that contains 1.28 million training images and 50 000 validation images from 1 000 classes. Images are resized to 256 × 256 and normalized with mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225].

2) Networks: On the CIFAR-10 data set, we evaluate the proposed method using VGG-16 [2] and ResNet56 [3]. As the original VGG-16 [2] is specially designed for ImageNet classification, we use a variant (i.e., VGG-small) taken from [40]. To be specific, the architecture consists of 13 convolutional layers and 2 much smaller fully connected (FC) layers. Considering convergence and performance, we adopt batch normalization (BN) [41] after each convolutional layer and remove dropout after the FC layer. On the CIFAR-100 data set, we use VGG-19 as in [2], with one FC layer. These VGG architecture variants are shown in Table II. On the ImageNet data set, we evaluate the method on ResNets [3] (including ResNet 34, 50, and 101) and MobileNet V2 [42].

TABLE II. VGG architecture variants for the CIFAR data sets used in this article. The convolutional kernel is 3 × 3.

3) Implementation Details: We implement all experiments using PyTorch [51] on multiple NVIDIA GTX 1080 Ti GPUs for training and an Intel Core i7-6850K for testing. During training, all networks are trained using SGD. The batch size is 100 on CIFAR-10/100 and 256 on ImageNet. Training takes 300 epochs on CIFAR-10/100 and 90 epochs on ImageNet. The models are trained with an initial LR of 0.05 on CIFAR-10 and 0.01 on CIFAR-100/ImageNet. The LR is multiplied by 0.1 at 50% and 75% of the training epochs on CIFAR-10/100, and at epochs 30 and 60 on ImageNet. We use a weight decay of 10^-4 and a momentum of 0.9. Besides, we set Epoch_ES to 120 (for the CIFAR-10/100 data sets) and 10 (for the ImageNet data set) in all experiments.
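For concreteness, the CIFAR-10 schedule just described maps onto a few lines of PyTorch. The stand-in model and the variable names below are placeholders, not the authors' code; only the numeric settings (batch of epochs, initial LR, decay milestones, momentum, weight decay) come from the text above.

```python
import torch
import torch.nn as nn

# Stand-in for the decomposed network; in practice this is the VGG-small or
# ResNet variant built from compressed-aware blocks.
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                      nn.Flatten(), nn.Linear(64 * 32 * 32, 10))

epochs = 300                                           # CIFAR-10/100 budget
optimizer = torch.optim.SGD(model.parameters(), lr=0.05,   # initial LR on CIFAR-10
                            momentum=0.9, weight_decay=1e-4)
# LR multiplied by 0.1 at 50% and 75% of the training epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[epochs // 2, 3 * epochs // 4], gamma=0.1)
```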


TABLE III. Performance comparison of VGG-small and ResNet56 on CIFAR-10. Note that the top two results are highlighted in red and pink fonts, respectively. The "--" indicates that the result is not listed in the original article.

TABLE IV. Pruned results of VGG-19 on CIFAR-100.

4) Evaluation Metrics: We evaluate the proposed method using the number of network parameters and FLOPs (multiply-adds). Note that the FLOPs are counted only for the operations of the convolutional and FC layers; some calculations, such as BN and other overheads, are not accounted for. We also report the ratios of parameter and FLOPs reduction, given by

R_Params = Pruned Params / Original Params,  R_FLOPs = Pruned FLOPs / Original FLOPs.   (15)

In addition, the inference time and run-time memory are used to further evaluate the proposed method.
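As a small illustration of (15), the sketch below counts only the parameters of convolutional and FC layers, matching the convention above; counting FLOPs additionally requires each layer's output resolution and is omitted. The function names are ours.

```python
import torch.nn as nn


def count_params(model: nn.Module) -> int:
    """Parameters of Conv2d and Linear layers only (BN and other overheads ignored)."""
    return sum(p.numel()
               for m in model.modules()
               if isinstance(m, (nn.Conv2d, nn.Linear))
               for p in m.parameters())


def params_reduction_ratio(baseline: nn.Module, compressed: nn.Module) -> float:
    """R_Params of (15): fraction of parameters removed by compression."""
    orig = count_params(baseline)
    return (orig - count_params(compressed)) / orig
```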
5) Competing Methods: To analyze the effectiveness of EDP, 17 state-of-the-art methods are taken into account for comparison, including Li et al. [43], SVD [31], ThiNet [52], Slimming [13], NRE [44], GrOWL [35], NISP [53], SFP [33], DCP [47], AMC [49], FPGM [48], SSR [54], ASFP [46], AOFP [55], KSE [50], HRank [45], and DMC [56].

B. Performance Comparisons

In this section, we compare the method on three data sets, CIFAR-10, CIFAR-100, and ImageNet, with the VGG, ResNet, and MobileNet V2 architectures. The results of the competing methods are reported according to their original articles.

1) CIFAR-10: On CIFAR-10, we compare EDP with the baseline, GrOWL [35], Li et al. [43], NRE [44], SVD [31], Slimming [13], SFP [33], ASFP [46], DCP [47], AMC [49], FPGM [48], KSE [50], and HRank [45] in Table III. Among these compared methods, Li et al. [43], NRE [44], Slimming [13], SFP [33], DCP [47], ASFP [46], FPGM [48], and HRank [45] are state-of-the-art channel pruning methods. SVD [31] is a low-rank decomposition method, and GrOWL [35] learns to group parameters with high correlation.


TABLE V. Pruned results of ResNets on ILSVRC 2012 ImageNet.

GrOWL+L2 denotes the GrOWL method with an additional L2 regularizer in the optimization. AMC [49] is an AutoML-based pruning method. KSE [50] is a kernel-level redundancy-based compression method. For VGG-small, as shown in Table III, the proposed method achieves an accuracy of 93.52% and a pruned parameter ratio of 95.59%, the best among the compared methods. Moreover, Slimming [13], the best competing method, achieves a comparable accuracy of 93.48% but a much lower pruned parameter ratio of 86.65%. Compared with the baseline, EDP removes more than 95% of the parameters and 80% of the FLOPs of VGG-small, with less than 0.1% accuracy loss. For the more compact ResNet56 architecture, EDP still outperforms all the compared methods and removes more than 54% of the parameters and 57% of the FLOPs, without any accuracy degradation.

2) CIFAR-100: On CIFAR-100, EDP is compared with the baseline, Li et al. [43], Slimming [13], and SVD [31]. As shown in Table IV, the results exceed all the competing methods, not only in accuracy but also in the ratios of parameter/FLOPs reduction. Among these methods, Li et al. [43] achieves high accuracy but a low compression ratio, whereas Slimming [13] obtains a relatively small compressed model but its accuracy is not high. On the contrary, EDP, combining the advantages of both decomposition and channel pruning, shows superior effectiveness.

3) ImageNet: On the ImageNet data set, EDP is further compared with Li et al. [43], ThiNet [52], NISP [53], AMC [49], SFP [33], SSR [54], ASFP [46], AOFP [55], FPGM [48], HRank [45], and DMC [56] on ResNets with different depths and on MobileNet V2, as shown in Tables V and VI. For ResNet34, although ASFP [46] achieves the highest accuracy, its pretrained model (i.e., baseline) also has the highest accuracy, and its compression ratio is low. On the contrary, the proposed method obtains the second highest accuracy but achieves the highest compression ratio with a low accuracy loss. For ResNet50, the compressed model achieves the highest test accuracy with a high compression ratio. In particular, it reaches a very high FLOPs reduction ratio, far larger than most other methods (by close to or even more than 10%). For ResNet101, a deep residual network, EDP obtains the highest compression ratio among the four methods (SFP [33], AOFP [55], FPGM [48], and EDP). The accuracy of the compressed model even outperforms the baseline, improving by 0.46% in top-1 and 0.14% in top-5. For lightweight network compression (i.e., MobileNet V2), as shown in Table VI, the method also obtains the best performance. These results indicate that the method is effective and superior to the compared methods on both small-scale and large-scale data sets, owing to the fused decomposition and pruning scheme and the adaptively learned compression.

TABLE VI. Performance comparison of the compressed MobileNet V2 (a lightweight network) on the ImageNet data set.

C. Ablation Analysis

In this section, we conduct extensive experiments to further analyze and discuss the effectiveness of the proposed method. Unless otherwise stated, the results are obtained by compressing VGG-small on the CIFAR-10 data set. Moreover, the "baseline" results are trained from scratch.


Fig. 4. Accuracy curves at different compression rates. The proposed method works best among the different compression methods.

Fig. 5. Layer-wise comparison of rank number and channel number. (a) Channel number. (b) Rank number.

1) Effectiveness of Each Component in EDP: To verify the effectiveness of EDP, we analyze each component of the proposed method. Fig. 4 shows the accuracy of EDP and of its individual components at different compression rates. The results of compared methods, namely the low-rank decomposition method SVD [31], the channel pruning method Slimming [13], and SVD + Slimming, are also reported. In particular, SVD performs SVD on the layers of a DNN model that is pretrained with a nuclear-norm regularizer; different compression rates are achieved by choosing different ranks of the parameter matrix. Slimming imposes L1 regularization on the scaling factors in BN and then prunes the channels with low scaling factors. Further, we combine both SVD and Slimming on one model as a competing method for the full EDP. According to the results, both components of EDP outperform the competing methods, and the whole proposed method outperforms all the other methods by a large margin.

Concretely, in Fig. 4, "the low-rank decomposition component" is one of the components of the method and corresponds to case 1) in Section III-A. It takes only a single training process to obtain the low-rank decomposition component, with less accuracy loss. On the contrary, SVD suffers a rapid performance drop when the network has a high compression rate. Fig. 4 shows that, at a similar performance, the model size of the low-rank decomposition component is only 50%–65% of that of the SVD model. Thus, the low-rank decomposition component is superior to the typical low-rank decomposition method SVD.

"The channel pruning component" is the second component of the proposed method and corresponds to case 2) in Section III-A. Similar to the low-rank decomposition component, the channel pruning component is obtained automatically during the optimization procedure. On the contrary, for Slimming, the performance of the compressed DNN without fine-tuning is extremely low (approximately 10% accuracy); thus, it takes three stages (i.e., training with the regularizer, pruning, and fine-tuning) for Slimming to compress a model. The experimental results verify that, at a similar performance, the model size of the channel pruning component is only 50% or even less of that of the Slimming model. Besides, at the same compression rate, the channel pruning component exceeds Slimming by 0.5%–2.0%.

The full EDP method consists of the low-rank decomposition component and the channel pruning component and has the best performance in Fig. 4. EDP integrates the two components by imposing the two regularizers on the proposed coefficient matrix β_l, and the overall framework is optimized uniformly to obtain an optimal model. Hence, the weight matrices of a DNN compressed by EDP naturally fulfill the low-rank and channel-sparsity properties, which enables a higher compression rate with a smaller performance gap. On the contrary, the competing method SVD+Slimming performs only similarly to, or even worse than, individual SVD and individual Slimming. In this case, an arbitrary combination of two methods only obtains a suboptimal model and severely harms the performance of the original model.

2) Layer-Wise Comparisons: The layer-wise channel numbers and rank numbers of the different compressed DNNs are shown in Fig. 5. For the comparison of channel numbers in Fig. 5(a), the proposed method prunes many more channels in deep layers than in shallow layers, which indicates that the last several layers are more redundant. Compared with the channel pruning method Slimming, the proposed method has sparser channels; even so, the pruned model with a higher compression rate achieves comparable accuracy. The layer-wise comparison of the rank of the weight matrix between SVD and the proposed method is shown in Fig. 5(b). The ranks of EDP are significantly lower than those of SVD, especially in the middle layers. Hence, the cooperation of EDP's two components is effective in making the weight matrix low rank.

3) Visualization: We also give some subjective analysis of the proposed method. Fig. 6 shows the filters of the first convolutional layer of VGG-small trained on CIFAR-10. In particular, the 64×3×3×3 tensor is represented by 64 RGB images of size 3 × 3, and the values of the filters are normalized to [0, 255]. As shown in Fig. 6, the filters of the proposed method are much sparser than those of the baseline. Since most of the filters are all zero, it is easier for EDP to achieve low rank as well as channel sparsity.


TABLE VII. Detailed results on the parameters and FLOPs of VGG-small on CIFAR-10.

Fig. 6. Filter visualization results. Note that the filters come from the first layer of VGG-small. (a) Baseline. (b) Proposed method.

We also give a detailed observation of the layer-wise compression results. Table VII shows the details of the compressed VGG-small on CIFAR-10. Note that the absolute compressed ratio of a block is the proportion of removed parameters relative to all network parameters, whereas the relative compressed ratio is the proportion of removed parameters relative to all the parameters of that block. Compared with the baseline, each layer of the compressed architecture has fewer parameters and FLOPs, which verifies the effectiveness of the compression method. Besides, the deeper layers have much higher relative compressed ratios of parameters and FLOPs than the shallower layers. This reflects that the redundant parameters are densely distributed in the deep layers, which is consistent with the conclusion in [35]. It may also mean that the network maintains the diversity and rich information of the inputs in the first few convolutional layers.

4) Sensitivity of Hyperparameter Settings: For the selection of the hyperparameters (i.e., λ1 and λ2) of the proposed method, we adopt a "grid-search" approach using the training data: the hyperparameters are first tuned exponentially and then selected by a fine grid search. Specifically, for λ1 and λ2, the grid search is divided into two steps: a loose grid search over [0.0001, 0.001, 0.01, 0.1] to find a relatively good FLOPs-accuracy tradeoff, and then a fine grid search over a small subset near the relatively good point. As shown in Fig. 7, λ1 = 0.01 and λ2 = 0.001 are selected in the loose grid search stage, and we then adjust λ1 over [0.006, 0.007, 0.008, 0.009, 0.01] and λ2 over [0.001, 0.002, 0.003, 0.004, 0.005] to choose the optimal hyperparameters. In the loose grid search stage, the hyperparameters are tuned exponentially over a relatively wide range. In the fine grid search stage, we find that the hyperparameters are not sensitive in this range, and the accuracy stays steady with slight fluctuations. These observations show that the λ values are not difficult to tune.
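The two-stage search can be expressed as a short loop. The sketch below uses the grids quoted above, but the scoring callback `train_and_eval` (returning, e.g., accuracy at a target FLOPs reduction) and the way the fine grid is built around the loose optimum are our assumptions rather than the authors' procedure.

```python
import itertools


def two_stage_grid_search(train_and_eval):
    """Pick (lambda1, lambda2) by a loose exponential grid, then a fine grid."""
    # Stage 1: loose, exponentially spaced grid.
    loose = [1e-4, 1e-3, 1e-2, 1e-1]
    l1, l2 = max(itertools.product(loose, loose),
                 key=lambda pair: train_and_eval(*pair))
    # Stage 2: fine grid near the loose optimum (mirrors the values quoted
    # above for the selected point lambda1 = 0.01, lambda2 = 0.001).
    fine1 = [0.6 * l1, 0.7 * l1, 0.8 * l1, 0.9 * l1, l1]
    fine2 = [l2, 2 * l2, 3 * l2, 4 * l2, 5 * l2]
    return max(itertools.product(fine1, fine2),
               key=lambda pair: train_and_eval(*pair))
```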


Fig. 7. Test accuracy at different hyperparameter settings. (a) Loose grid search of λ1. (b) Fine grid search of λ1. (c) Loose grid search of λ2. (d) Fine grid search of λ2.

Fig. 8. Sensitivity analysis of the ES epoch. (a) Accuracy versus Epoch_ES curve with VGG-small on the CIFAR-10 data set at a 90% pruned FLOPs rate. (b) Accuracy versus Epoch_ES curve with ResNet50 on the ImageNet data set at a 50% pruned FLOPs rate.

Fig. 9. Accuracy curves at different compression ratios for joint optimization and separate optimization.

TABLE VIII. Comparison results of adding the regularization to θ and to β with VGG-small on the CIFAR-10 data set. Note that "Ours (θ)" indicates adding the regularization to θ.

TABLE IX. Test accuracy at different preset pruned ratios. In the experiments, we use VGG-small and ResNet56 on the CIFAR-10 data set and ResNet50 on the ImageNet data set.

On the other hand, we also analyze the sensitivity of the ES epoch Epoch_ES. It is also selected by the "grid-search" approach. In the loose grid search, we search Epoch_ES over [1, 120, 240] for the CIFAR-10 data set and [1, 40, 80] for the ImageNet data set, and a fine grid search is then conducted over a small subset near a relatively good point. As shown in Fig. 8, Epoch_ES = 120 (for CIFAR-10) and Epoch_ES = 1 (for ImageNet) are selected in the loose grid search stage, and we then adjust Epoch_ES over [40, 60, 80, 100, 140] for CIFAR-10 and over [5, 8, 10, 12, 15] for ImageNet to choose the optimal value. When Epoch_ES = 120 and 10, respectively, the test accuracy is the best. Hence, we set Epoch_ES to 120 (for CIFAR-10) and 10 (for ImageNet) in all the experiments. The accuracy at different Epoch_ES values is shown in Fig. 8. It can be seen that Epoch_ES has low sensitivity in the fine grid range and is not difficult to tune.

5) Joint Optimization Versus Separate Optimization: The proposed EDP optimizes (7) jointly, with the low-rank decomposition component (i.e., the first term of Ω(Θ)) and the channel pruning component (i.e., the second term of Ω(Θ)). To evaluate the effectiveness of the joint optimization scheme, we compare it with separate optimization. In the separate optimization experiment, we optimize case 1) and case 2) in sequence. As shown in Fig. 9, joint optimization outperforms separate optimization. The test accuracy of joint optimization is 0.5% higher than that of separate optimization at the same pruning rate. Besides, joint optimization has fewer hyperparameters and a simpler pruning process, which helps EDP to be efficient and effective.

In addition, in the proposed joint optimization, all the constraints are imposed on β. In order to verify that it is more effective to impose the constraints on β instead of θ, we compare the performance of adding the regularization to β and to θ. As shown in Table VIII, the test accuracy of adding the regularizations to θ is 0.81% lower than that of adding them to β. Moreover, the pruning rate and training time are also inferior to those of the proposed method. Therefore, adding the regularizations to β is more effective and efficient than adding them to θ.


TABLE X. Training time comparison between the baseline and the proposed method. In the experiments, we use VGG-small on the CIFAR-10 data set and ResNet50 on the ImageNet data set. Total time↓(%) denotes the total-time drop percentage; the higher, the better.

TABLE XI. Training epochs' comparison of VGG-small on the CIFAR-10 data set. We conduct the experiments using only 150 epochs at different pruned ratios.

Fig. 10. Inference time and run-time memory comparison using VGG-small on CIFAR-10.

6) Preset Compressed Ratio: The proposed method adaptively learns a compact model with a proper compression ratio and high accuracy. When the compression ratio is preset, EDP can also be deployed and shows good performance, as reported in Table IX. Specifically, after obtaining a sparse network, we rank the parameter magnitudes at each layer and prune a given proportion of the lowest-ranked parameters. In Table IX, the compressed model even performs better than the baseline (i.e., 94.34% versus 93.60%, and 93.79% versus 93.61%). This may be because fewer parameters help to avoid over-fitting. At the same pruned FLOPs ratio on CIFAR-10, the test accuracy of ResNet56 decreases more evidently than that of VGG-small, because ResNet56 is more compact than VGG-small. Moreover, EDP prunes 70% of the FLOPs with an accuracy degradation of less than 0.5% for both VGG-small and ResNet56. On a larger data set (i.e., ImageNet), we prune ResNet50 and obtain better performance than the baseline at 40% pruned FLOPs, which further demonstrates the effectiveness of the proposed method.

7) Training Complexity Analysis: Although the imposed regularizers may influence the training complexity, the overall framework is efficient. For the training time, we compare the training time per epoch between the method and the baseline (the original network architecture). As shown in Table X, the baseline costs 0.53 h/epoch during [0, E) to train ResNet50 on the ImageNet data set, whereas the method spends 0.63 h/epoch and 0.43 h/epoch during the [0, Epoch_ES) and [Epoch_ES, E) epochs, respectively. Although the proposed method takes more time during the first [0, Epoch_ES) epochs, it costs less training time in the later epochs owing to the PM module and the ES strategy. Overall, the training time is only 10.8% longer than the baseline on CIFAR-10 and 14.4% shorter than the baseline on the ImageNet data set. This indicates that the training time complexity of the proposed method is comparable with that of the baseline. For the convergence speed, as shown in Table XI, the method spends fewer epochs on training and performs better than the others. Therefore, it can be concluded that the proposed training scheme is efficient.

8) Inference Time and Run-Time Memory Consumption: Besides the number of network parameters and FLOPs, we also evaluate the proposed method on inference time and run-time memory. Fig. 10 shows the results of VGG-small on CIFAR-10. For the inference time, Fig. 10 shows that the proposed method accelerates the DNN model by more than 3 times and outperforms all the competing methods. In terms of run-time memory, the consumption of the proposed method is the lowest among the competing methods. Moreover, SVD takes 138.50 M CPU memory, even more than the memory occupied by the baseline. As SVD is a decomposition method, it generates more layers with feature maps, and more feature maps inevitably result in more run-time memory consumption. Fortunately, the channel pruning component reduces the redundant feature maps, overcoming this shortcoming of decomposition.


TABLE XII
OBJECT DETECTION COMPARISON FOR THE FASTER R-CNN [57] METHOD ON THE PASCAL VOC 2007 DATA SET [58]

9) Generalization Ability on Detection Tasks: As described above, the method is effective for image classification. To further analyze its generalization ability, we conduct experiments with the compressed model on object detection. Specifically, we use ResNet50 as the backbone network to deploy Faster R-CNN [57] and then compress Faster R-CNN by reducing the FLOPs of the backbone network by 40%. In the experimental implementation, we evaluate the performance in terms of inference speed (FPS) and mean average precision (mAP) on the PASCAL VOC 2007 data set [58], which consists of about 5K training/validation images and 5K testing images. As shown in Table XII, the pruned model performs well: its mAP is slightly lower than that of the baseline, while its inference speed is higher. This demonstrates that the method generalizes well to other tasks.
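As a concrete illustration of this setup, the sketch below wires a ResNet50-style feature extractor into torchvision's Faster R-CNN implementation. It is only a minimal sketch under stated assumptions: the stock torchvision ResNet50 stands in for the EDP-compressed backbone, and the anchor and RoI-pooling settings are illustrative rather than the exact configuration used in our experiments.

```python
# Illustrative sketch: plugging a ResNet50-style feature extractor into
# torchvision's Faster R-CNN. The stock ResNet50 below is a stand-in for a
# pruned backbone; a compressed network with the same interface would be
# dropped in at build_backbone().
import torch
import torch.nn as nn
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator
from torchvision.ops import MultiScaleRoIAlign

def build_backbone():
    resnet = torchvision.models.resnet50()
    # keep everything up to (and including) the last residual stage
    body = nn.Sequential(*list(resnet.children())[:-2])
    body.out_channels = 2048  # FasterRCNN requires this attribute on the backbone
    return body

anchor_generator = AnchorGenerator(
    sizes=((32, 64, 128, 256, 512),),      # one tuple per feature map
    aspect_ratios=((0.5, 1.0, 2.0),),
)
roi_pooler = MultiScaleRoIAlign(featmap_names=["0"], output_size=7, sampling_ratio=2)

model = FasterRCNN(
    build_backbone(),
    num_classes=21,  # 20 PASCAL VOC classes + background
    rpn_anchor_generator=anchor_generator,
    box_roi_pool=roi_pooler,
)

model.eval()
with torch.no_grad():
    detections = model([torch.rand(3, 600, 800)])  # list of dicts: boxes, labels, scores
```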
V. CONCLUSION

In this article, we have proposed an efficient decomposition and pruning (EDP) scheme built on a compressed-aware block. The block can replace every layer of a CNN in a plug-and-play way so as to compress the network efficiently. EDP takes only a single training process to compress a DNN and obtains a high compression rate with barely declining accuracy. Specifically, we embed the compressed-aware block by decomposing the original network layer into two layers: one represents the weight basis of the layer and the other represents the coefficient matrix. By adding regularizers on the coefficient matrix during training, a low-rank weight basis and sparse channels are obtained. Moreover, a PM module has been presented to prune redundant channels and merge redundant decomposed layers, which further compresses and optimizes the DNN. Experiments have shown that EDP outperforms the state of the art in terms of pruning ratio, test accuracy, inference time, and run-time memory, and that it is generally applicable to different data sets and networks.

In the future, we will explore layer pruning for deeper networks. Moreover, we will consider integrating the proposed method with other compression methods (e.g., quantization) in real applications.
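To make the summarized decomposition more concrete, the following minimal sketch implements a compressed-aware block as a basis convolution followed by a 1x1 coefficient convolution, with a group-sparsity penalty on the columns of the coefficient matrix. The default rank choice and the regularizer form shown here are illustrative assumptions and do not reproduce the paper's exact formulation or the PM module.

```python
# Illustrative sketch (not the exact formulation of the paper): a layer is
# decomposed into a basis convolution and a 1x1 coefficient convolution, and a
# group penalty on the coefficient matrix encourages whole basis channels to
# become prunable.
import torch
import torch.nn as nn

class CompressedAwareBlock(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, rank=None, stride=1, padding=1):
        super().__init__()
        # assumed default rank; the effective rank is learned via regularization
        rank = rank if rank is not None else min(out_ch, in_ch * kernel_size * kernel_size)
        self.basis = nn.Conv2d(in_ch, rank, kernel_size, stride=stride,
                               padding=padding, bias=False)          # weight basis
        self.coeff = nn.Conv2d(rank, out_ch, kernel_size=1, bias=False)  # coefficient matrix

    def forward(self, x):
        return self.coeff(self.basis(x))

    def group_sparsity(self):
        # l2 norm of each column of the coefficient matrix (one column per basis
        # channel); summing the norms penalizes entire basis channels at once
        w = self.coeff.weight.view(self.coeff.out_channels, -1)  # (out_ch, rank)
        return w.norm(dim=0).sum()

block = CompressedAwareBlock(64, 128)
y = block(torch.randn(2, 64, 32, 32))
loss = y.mean() + 1e-4 * block.group_sparsity()  # task loss + sparsity regularizer
loss.backward()
```

In such a sketch, basis channels whose coefficient columns shrink toward zero could then be removed, which is the role played by the PM module in the full method.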
REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097-1105.
[2] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556. [Online]. Available: http://arxiv.org/abs/1409.1556
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770-778.
[4] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4700-4708.
[5] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580-587.
[6] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr, "Fully-convolutional siamese networks for object tracking," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 850-865.
[7] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, "Sequence to sequence-video to text," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 4534-4542.
[8] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, "Exploiting linear structure within convolutional networks for efficient evaluation," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 1269-1277.
[9] W. Wen, C. Xu, C. Wu, Y. Wang, Y. Chen, and H. Li, "Coordinating filters for faster deep neural networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 658-666.
[10] X. Zhang, J. Zou, K. He, and J. Sun, "Accelerating very deep convolutional networks for classification and detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 10, pp. 1943-1955, Oct. 2016.
[11] S. Lin, R. Ji, C. Chen, and F. Huang, "ESPACE: Accelerating convolutional neural networks via eliminating spatial and channel redundancy," in Proc. AAAI Conf. Artif. Intell., 2017, pp. 1424-1430.
[12] Y. He, X. Zhang, and J. Sun, "Channel pruning for accelerating very deep neural networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 1389-1397.
[13] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, "Learning efficient convolutional networks through network slimming," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2736-2744.
[14] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 1135-1143.
[15] Y. LeCun, J. S. Denker, and S. A. Solla, "Optimal brain damage," in Proc. Adv. Neural Inf. Process. Syst., 1990, pp. 598-605.
[16] Y. Guo, A. Yao, and Y. Chen, "Dynamic network surgery for efficient DNNs," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 1379-1387.
[17] B. Hassibi and D. G. Stork, "Second order derivatives for network pruning: Optimal brain surgeon," in Proc. Adv. Neural Inf. Process. Syst., 1993, pp. 164-171.
[18] A. Krogh and J. A. Hertz, "A simple weight decay can improve generalization," in Proc. Adv. Neural Inf. Process. Syst., 1992, pp. 950-957.
[19] X. Ding, G. Ding, X. Zhou, Y. Guo, J. Han, and J. Liu, "Global sparse momentum SGD for pruning very deep neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 6382-6394.
[20] J. Frankle and M. Carbin, "The lottery ticket hypothesis: Finding sparse, trainable neural networks," 2018, arXiv:1803.03635. [Online]. Available: http://arxiv.org/abs/1803.03635
[21] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, "Rethinking the value of network pruning," 2018, arXiv:1810.05270. [Online]. Available: http://arxiv.org/abs/1810.05270
[22] N. Lee, T. Ajanthan, and P. H. S. Torr, "SNIP: Single-shot network pruning based on connection sensitivity," 2018, arXiv:1810.02340. [Online]. Available: http://arxiv.org/abs/1810.02340
[23] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 525-542.
[24] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1," 2016, arXiv:1602.02830. [Online]. Available: http://arxiv.org/abs/1602.02830
[25] M. Courbariaux, Y. Bengio, and J.-P. David, "BinaryConnect: Training deep neural networks with binary weights during propagations," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 3123-3131.
[26] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Quantized neural networks: Training neural networks with low precision weights and activations," J. Mach. Learn. Res., vol. 18, no. 1, pp. 6869-6898, 2017.
[27] C. Zhu, S. Han, H. Mao, and W. J. Dally, "Trained ternary quantization," 2016, arXiv:1612.01064. [Online]. Available: http://arxiv.org/abs/1612.01064
[28] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," 2015, arXiv:1510.00149. [Online]. Available: http://arxiv.org/abs/1510.00149
[29] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Speeding up convolutional neural networks with low rank expansions," 2014, arXiv:1405.3866. [Online]. Available: http://arxiv.org/abs/1405.3866
[30] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin, "Compression of deep convolutional neural networks for fast and low power mobile applications," 2015, arXiv:1511.06530. [Online]. Available: http://arxiv.org/abs/1511.06530
[31] J. M. Alvarez and M. Salzmann, "Compression-aware training of deep networks," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 856-867.
[32] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky, "Speeding-up convolutional neural networks using fine-tuned CP-decomposition," 2014, arXiv:1412.6553. [Online]. Available: http://arxiv.org/abs/1412.6553
[33] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang, "Soft filter pruning for accelerating deep convolutional neural networks," 2018, arXiv:1808.06866. [Online]. Available: http://arxiv.org/abs/1808.06866
[34] G. Huang, S. Liu, L. V. D. Maaten, and K. Q. Weinberger, "CondenseNet: An efficient DenseNet using learned group convolutions," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 2752-2761.
[35] D. Zhang, H. Wang, M. Figueiredo, and L. Balzano, "Learning to share: Simultaneous parameter tying and sparsification in deep learning," in Proc. Int. Conf. Learn. Represent. (ICLR), 2018, pp. 1-14.
[36] M. Schmidt, N. L. Roux, and F. R. Bach, "Convergence rates of inexact proximal-gradient methods for convex optimization," in Proc. Adv. Neural Inf. Process. Syst., 2011, pp. 1458-1466.
[37] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM J. Imag. Sci., vol. 2, no. 1, pp. 183-202, Jan. 2009.
[38] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Dept. Comput. Sci., Univ. Toronto, Toronto, ON, Canada, Tech. Rep., 2009.
[39] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211-252, Dec. 2015.
[40] S. Zagoruyko, "92.45% on CIFAR-10 in Torch," Torch Blog, 2015. [Online]. Available: http://torch.ch/blog/2015/07/30/cifar.html
[41] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," 2015, arXiv:1502.03167. [Online]. Available: http://arxiv.org/abs/1502.03167
[42] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4510-4520.
[43] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. Peter Graf, "Pruning filters for efficient ConvNets," 2016, arXiv:1608.08710. [Online]. Available: http://arxiv.org/abs/1608.08710
[44] C. Jiang, G. Li, C. Qian, and K. Tang, "Efficient DNN neuron pruning by minimizing layer-wise nonlinear reconstruction error," in Proc. 27th Int. Joint Conf. Artif. Intell., Jul. 2018, pp. 2298-2304.
[45] M. Lin et al., "HRank: Filter pruning using high-rank feature map," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 1529-1538.
[46] Y. He, X. Dong, G. Kang, Y. Fu, C. Yan, and Y. Yang, "Asymptotic soft filter pruning for deep convolutional neural networks," IEEE Trans. Cybern., vol. 50, no. 8, pp. 3594-3604, Aug. 2020.
[47] Z. Zhuang et al., "Discrimination-aware channel pruning for deep neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 875-886.
[48] Y. He, P. Liu, Z. Wang, Z. Hu, and Y. Yang, "Filter pruning via geometric median for deep convolutional neural networks acceleration," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 4340-4349.
[49] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, "AMC: AutoML for model compression and acceleration on mobile devices," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 784-800.
[50] Y. Li et al., "Exploiting kernel sparsity and entropy for interpretable CNN compression," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 2800-2809.
[51] A. Paszke et al., "Automatic differentiation in PyTorch," in Proc. Adv. Neural Inf. Process. Syst. Workshop, 2017, pp. 1-4. [Online]. Available: pytorch.org
[52] J.-H. Luo, J. Wu, and W. Lin, "ThiNet: A filter level pruning method for deep neural network compression," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 5058-5066.
[53] R. Yu et al., "NISP: Pruning networks using neuron importance score propagation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 9194-9203.
[54] S. Lin, R. Ji, Y. Li, C. Deng, and X. Li, "Toward compact ConvNets via structure-sparsity regularized filter pruning," IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 2, pp. 574-588, Feb. 2020.
[55] X. Ding, G. Ding, Y. Guo, J. Han, and C. Yan, "Approximated oracle filter pruning for destructive CNN width optimization," in Proc. Int. Conf. Mach. Learn., 2019, pp. 1607-1616.
[56] S. Gao, F. Huang, J. Pei, and H. Huang, "Discrete model compression with resource constraint for deep neural networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 1899-1908.
[57] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 91-99.
[58] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The Pascal visual object classes challenge: A retrospective," Int. J. Comput. Vis., vol. 111, no. 1, pp. 98-136, Jan. 2015.

Xiaofeng Ruan received the B.E. degree in mechanical design, manufacturing, and automation from Dalian Maritime University, Dalian, China, in 2012, and the M.E. degree in aerospace manufacturing engineering from the Harbin Institute of Technology, Harbin, China, in 2014. He is currently pursuing the Ph.D. degree in pattern recognition and intelligent systems with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China.
His current research interests include pattern recognition, deep learning, and model compression.

Yufan Liu received the B.S. degree from Zhejiang University, Hangzhou, China, in 2015, and the M.S. degree from Beihang University, Beijing, China, in 2018.
She is currently a Research Associate with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing. She has published several articles in international journals and conference proceedings, including the IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), the European Conference on Computer Vision (ECCV), and so on. Her research interests include computer vision, model compression, and saliency prediction.

Chunfeng Yuan received the Ph.D. degree from the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China, in 2010.
She is currently an Associate Professor with CASIA. Her research interests and publications range from statistics to computer vision, including sparse representation, deep learning, motion analysis, action recognition, and event detection.

Bing Li received the Ph.D. degree from the Department of Computer Science and Engineering, Beijing Jiaotong University, Beijing, China, in 2009.
He is currently a Professor with the Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing. His current research interests include video understanding, color constancy, visual saliency, multi-instance learning, and web content security.
Weiming Hu (Senior Member, IEEE) received the Ph.D. degree from the Department of Computer Science and Engineering, Zhejiang University, Zhejiang, China, in 1998.
From 1998 to 2000, he was a Postdoctoral Research Fellow with the Institute of Computer Science and Technology, Peking University, Beijing, China. He is currently a Professor with the Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing. His research interests are visual motion analysis, recognition of web objectionable information, and network intrusion detection.

Yangxi Li received the Ph.D. degree from Peking University, Beijing, China, in 2012.
He is currently a Senior Engineer with the National Computer Network Emergency Response Technical Team/Coordination Center of China. His research interests lie primarily in multimedia search, information retrieval, and computer vision.

Stephen Maybank (Fellow, IEEE) received the B.A. degree in mathematics from King's College, Cambridge, U.K., in 1976, and the Ph.D. degree in computer science from Birkbeck, University of London, London, U.K., in 1988.
He is currently a Professor with the School of Computer Science and Information Systems, Birkbeck College, University of London. His research interests include the geometry of multiple images, camera calibration, and visual surveillance.