
Received April 13, 2022, accepted June 5, 2022, date of publication June 13, 2022, date of current version June 20, 2022.


Digital Object Identifier 10.1109/ACCESS.2022.3182659

Methods for Pruning Deep Neural Networks


SUNIL VADERA AND SALEM AMEEN
School of Science, Engineering and Environment, University of Salford, Salford M5 4WT, U.K.
Corresponding author: Sunil Vadera ([email protected])

ABSTRACT This paper presents a survey of methods for pruning deep neural networks. It begins by
categorising over 150 studies based on the underlying approach used and then focuses on three categories:
methods that use magnitude based pruning, methods that utilise clustering to identify redundancy, and
methods that use sensitivity analysis to assess the effect of pruning. Some of the key influencing studies
within these categories are presented to highlight the underlying approaches and results achieved. Most
studies present results which are distributed in the literature as new architectures, algorithms and data sets
have developed with time, making comparison across different studies difficult. The paper therefore provides
a resource for the community that can be used to quickly compare the results from many different methods
on a variety of data sets, and a range of architectures, including AlexNet, ResNet, DenseNet and VGG. The
resource is illustrated by comparing the results published for pruning AlexNet and ResNet50 on ImageNet
and ResNet56 and VGG16 on the CIFAR10 data to reveal which pruning methods work well in terms
of retaining accuracy whilst achieving good compression rates. The paper concludes by identifying some
research gaps and promising directions for future research.

INDEX TERMS Deep learning, neural networks, pruning deep networks.

I. INTRODUCTION

Deep learning and its use in high profile applications such as autonomous vehicles [1], predicting breast cancer [2], speech recognition [3] and natural language processing [4] have propelled interest in Artificial Intelligence to new heights, with most countries making it central to their industrial and commercial strategies for innovation.

Although there are different types of architectures [5], deep networks typically consist of layers of neurons that are connected to neurons in preceding layers via weighted links. Another characteristic, which is considered central to their predictive power [6], is that they have a large number of parameters that need to be learned, with networks such as ResNet50 [7] having more than 25 million parameters and VGG16 [8] having more than 138 million weights. An obvious question, therefore, is to ask whether it is possible to develop smaller, more efficient networks without compromising accuracy. One direction of work aimed at addressing this question has been to first train a large network and then to prune and fine-tune it. Although methods for pruning shallow neural networks were proposed in the 1980s and 90s [9]–[11], recent advances in deep learning and its potential for applications in embedded systems have led to an increasing number and variety of algorithms for pruning deep neural networks. Hence, this paper presents a survey of recent work on pruning neural networks that can be used to understand the types of algorithms developed, appreciate the key ideas underpinning the algorithms and gain familiarity with the major approaches and issues in the field. The paper aims to achieve this goal by presenting the progressive path from the earlier algorithms to the recent work, categorising algorithms based on the approach used, contrasting the similarities and differences between the algorithms and concluding with some directions for future research.

The studies on pruning methods all carry out empirical evaluations that compare the performance of algorithms on different architectures and benchmark data sets. These evaluations have evolved as new deep learning architectures have developed, as new data sets have become available and as new pruning algorithms have been proposed. This paper also provides a useful resource that brings together the reported results in one place, allowing researchers to quickly compare the reported results on different architectures and data sets.

The survey identified over 150 studies on pruning neural networks, which can be categorised into the following eight groups based on the underlying approach used:

1) Magnitude based pruning methods [12]–[15], which are based on the view that the saliency of weights and neurons can be determined by local measures such as their magnitude.
2) Similarity and clustering methods [16]–[21], which aim to identify duplicate or similar weights which are redundant and can be pruned without impacting accuracy.
3) Sensitivity analysis methods [9], [22]–[27], that assess the effect of removing or perturbing weights on the loss and then remove a proportion of the weights that have least impact on accuracy.
4) Knowledge distillation methods [28]–[31], which utilise the original model, termed the Teacher, to learn a more compact new model called the Student.
5) Low rank methods [27], [32], [33], that factor a weight matrix into a product of two smaller matrices which can then be used to perform an equivalent function more efficiently than the single larger weight matrix.
6) Quantization methods [34]–[39], which are based on using quantization, hashing, low precision and binary representations of the weights as a way of reducing the computations.
7) Architectural design methods [40]–[46], that utilise intelligent search and reinforcement learning methods to generate neural network architectures.
8) Hybrid methods [47]–[49], which utilise a combination of methods aimed at taking advantage of the cumulative compressing effects of the different types of methods.

Table 1 classifies over 150 studies identified by the survey into the 8 categories, enabling researchers working on a particular type of method to locate related studies. Given the range of studies, and availability of surveys already covering some of the above categories, this paper focuses on recent algorithms in the first three categories for pruning. Reed [11] provides an excellent survey of pruning methods prior to the deep learning era. Readers interested in the use of quantization, low rank and knowledge distillation methods are referred to the survey by Lebedev et al. [50] and readers interested in architectural design methods are referred to the comprehensive survey by Elsken et al. [51]. Pruning networks is just one step in developing efficient models and a recent survey by Menghan [52] summarises the full range of methods, from use of quantization and learning, to the available software and hardware infrastructure for efficient deployment of models. Another important direction of work, worthy of a survey in its own right, and not in the scope of this paper, is the use of variational Bayesian methods for regularization [53]–[59].

TABLE 1. Categorisation of studies on pruning.

FIGURE 1. A selection of pruning methods grouped in terms of the approach adopted.

Fig. 1 shows a selection of the methods covered in greater detail in this survey and includes a sub-categorization of magnitude and sensitivity analysis methods. The survey found relatively few methods that utilise similarity and clustering, and further sub-categorization is not useful. Magnitude based methods can be sub-categorised into: (i) data dependent methods that utilise a sample of examples to assess the extent to which removing weights impacts the outputs from the next layer; (ii) data independent methods, that utilise measures such as the magnitude of a weight; and (iii) the use of optimisation methods to reduce the number of weights in a layer whilst approximating the function of the layer. Methods that utilise sensitivity analysis can be sub-categorized into those that: (i) adopt a Taylor series approximation of the loss and (ii) use sampling to estimate the change in loss when weights are removed.

The rest of this paper is organised as follows. Section II presents the background. Sections III to V describe representative methods in the three categories: magnitude based pruning, clustering and similarity, and sensitivity analysis. Section III also includes coverage of the Lottery Hypothesis, an issue about the existence of smaller networks and fine-tuning, that cuts across the different methods. Section VI presents a comparison of the published results for pruning AlexNet, ResNet and VGG to illustrate the resource provided for comparing the methods. Section VII concludes by highlighting some key insights and suggesting directions for future research.

II. BACKGROUND
This section introduces the background knowledge assumed in the survey.1 Fig. 2 shows the structure of one of the earliest convolutional neural networks (CNNs), LeNet-5 [60], which recognises handwritten digits by applying convolutions and pooling operations to identify features. These features then provide the input to fully connected layers that classify the images. The pooling operation takes feature maps as input and reduces their size by applying an operation, such as the maximum value within a neighbourhood, while the convolution operation applies filters (or kernels) to the input channels (or feature maps) to produce the output feature maps. The filters are k × k matrices that slide over the input feature maps and convolve with the corresponding elements of the input feature maps to produce the output feature maps. The elements of a filter correspond to the weights (or parameters) that are used to transform regions in feature maps in one layer to the next and need to be learned through training. The weights (or parameters), either individually or collectively as filters, are therefore the primary candidates for pruning.

1 Readers unfamiliar with deep neural networks are referred to tutorial accounts such as by Goodfellow et al. [54] for further details.

FIGURE 2. The LeNet-5 network and how it processes an input image via convolutions (Conv.) and pooling operations to produce feature maps (FMs) and uses fully connected (FC) layers to perform classification.

The LeNet-5 model, with 60K parameters in 5 layers, achieved impressive results on a data set known as MNIST [61].2 In a breakthrough in 2012, AlexNet built upon the concepts in LeNet-5 and developed a deeper network with over 60M parameters in 8 layers to win a competition known as ImageNet by a significant margin [62]. This success was followed by the development of architectures like VGG, ResNet, and ResNeXT that used an increasing number of layers and parameters to gain further improvements in the ImageNet competition [5]. The huge number of parameters in these models does necessitate greater computational resources and inhibits their use in embedded systems, which has motivated the research on pruning that is surveyed in this paper.

2 Modified National Institute of Standards and Technology.

The pruning methods developed are evaluated on a range of architectures (e.g., ResNet, VGG, DenseNet) and data sets (e.g., ImageNet, CIFAR, SVHN). Khan et al. [63] present a tutorial on deep learning architectures and Appendix A summarises the data sets. When evaluating pruning methods, the surveyed papers use the following measures to report their results:
• The Top-1 and Top-5 accuracy, which report the proportion of times the correct classification appears first or in the top 5 list of ranked results. In the sections below, unless we explicitly qualify a measure, the Top-1 accuracy should be assumed.
• The compression rate, which is the ratio of parameters before and after a model is pruned.
• The computational efficiency in terms of the FLOPS (Floating Point Operations) required to perform a classification.

The notation used in the paper is defined where it is used and also summarised in Appendix B. With this background in place, Sections III to V describe and contrast key influential studies that bring out the features of the categories of methods surveyed in this paper.

III. MAGNITUDE BASED PRUNING
This section presents pruning methods that remove weights, nodes, and filters based on a measure of magnitude or the effect filters have on the next layer. Section III-A summarizes an early influential method for pruning weights and Section III-B presents a recent hot topic, termed the Lottery Hypothesis, that reinvigorates research on the existence of smaller networks and raises issues about fine-tuning a pruned network. Section III-C describes the key ideas behind methods that prune filters and feature maps.

A. NETWORK PRUNING OF WEIGHTS
One of the first studies to utilise magnitude based pruning for deep networks is due to Han et al. (2015) who adopt a process in which weights below a given threshold are pruned [64].3 Once pruned, the network is fine-tuned and the process repeated until its accuracy begins to deteriorate.

3 We use a number citation style, but include the name(date) format to highlight the date of the publication where we think it is relevant. The corresponding reference number is provided at the end of the sentence.

Han et al. [64] carry out several experiments to compare the merits of their magnitude based iterative pruning method. First, they apply their method on a fully connected network known as LeNet-300-100 and then on LeNet-5 (Fig. 2), both of which are trained on the MNIST data. Their results show that it is possible to reduce the number of weights by a factor of 12 without compromising accuracy. Second, they apply iterative pruning to AlexNet and VGG16 trained on the ImageNet data, and show that it is possible to reduce the number of weights by a factor of 9 and 12 respectively. Thirdly, they compare the merits of using regularisation to drive down the magnitude of weights to aid subsequent pruning. They explore regularisation with both the L1 and L2 norms and conclude that L1 is better immediately after pruning (without fine-tuning), but L2 is better if the weights of the pruned model are fine-tuned. Their experiments also suggest that the earlier layers (i.e., closer to the inputs) are the most sensitive to pruning and that iterative pruning is better than pruning the required proportion of weights in one cycle (i.e., one-shot pruning).

The study by Han et al. [64] is notable in that (i) it demonstrated that it was possible to reduce the size of deep networks significantly without compromising accuracy, (ii) it highlighted the benefits of iterative pruning and (iii) it prompted further research on questions such as whether retraining from scratch or fine-tuning is better following pruning.

Guo et al. [65] note that magnitude pruning can lead to premature removal of weights that can become important given removal of other weights. To address this, they propose a method known as Dynamic Network Surgery (Dyn Surg) which maintains a mask that indicates which weights should be removed and retained in each training cycle, thereby allowing reinstatement of weights previously marked to be pruned if they turn out to be important. They compare their method with magnitude pruning, with the results showing that it reduces the number of weights by a factor of over 17 for AlexNet on ImageNet.
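To make the prune-and-fine-tune cycle concrete, the sketch below shows one way iterative magnitude pruning could be implemented in PyTorch. The per-layer quantile threshold, the number of rounds and the user-supplied fine_tune routine are illustrative assumptions rather than details taken from Han et al. [64].

```python
import torch
import torch.nn as nn

def magnitude_prune_(model: nn.Module, fraction: float) -> dict:
    """Zero the smallest-magnitude weights in each prunable layer and return the masks."""
    masks = {}
    with torch.no_grad():
        for name, module in model.named_modules():
            if isinstance(module, (nn.Linear, nn.Conv2d)):
                w = module.weight
                threshold = torch.quantile(w.abs().flatten(), fraction)
                mask = (w.abs() > threshold).float()
                w.mul_(mask)                    # prune in place
                masks[name] = mask
    return masks

def iterative_prune(model, fine_tune, fraction=0.2, rounds=5):
    """Alternate pruning and fine-tuning; fine_tune is assumed to re-apply the masks
    after each optimiser step so that pruned weights stay at zero."""
    for _ in range(rounds):
        masks = magnitude_prune_(model, fraction)
        fine_tune(model, masks)                 # user-supplied training loop
    return model
```

In practice the loop would be stopped once validation accuracy starts to deteriorate, mirroring the stopping criterion described above.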
B. THE LOTTERY TICKET HYPOTHESIS
One of the most interesting observations in Han et al. [64] is that re-initialization of the weights does not lead to accurate models and, based on their trials, it was better to fine-tune the weights of the pruned model. Following on from this observation, Frankle and Carbin (2019) propose the Lottery Ticket Hypothesis which states that: a trained network contains a subnetwork, which can be trained to be at least as accurate as the original network using no more than the number of epochs used for training the original network [66]. This subnetwork is termed a winning lottery ticket, given that it was lucky to be initialised with suitable weights.

To test this hypothesis, they propose two pruning methods. First, in a one-shot method, they use magnitude pruning to prune p% of the weights, reset the remaining weights to their initial values and retrain. Second, they utilise an iterative pruning method with n cycles, with each cycle pruning p^{1/n} of the weights.
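A hedged sketch of how this iterative procedure could be reproduced in PyTorch is shown below: the initial weights are saved, and after each round of training the smallest surviving weights are masked out and the survivors are rewound to their initial values. The fixed per-round pruning fraction and the train routine are assumptions made for illustration.

```python
import copy
import torch

def iterative_lottery_ticket(model, train, rounds=5, prune_per_round=0.2):
    """Train, prune the smallest surviving weights by magnitude, rewind the
    survivors to their original initialisation, and repeat."""
    initial = copy.deepcopy(model.state_dict())       # candidate winning-ticket initialisation
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if p.dim() > 1}
    for _ in range(rounds):
        train(model, masks)                           # user-supplied training loop
        with torch.no_grad():
            for name, p in model.named_parameters():
                if name not in masks:
                    continue
                alive = p[masks[name].bool()].abs()
                cut = torch.quantile(alive, prune_per_round)
                masks[name] *= (p.abs() > cut).float()
                # rewind: survivors return to their initial values, pruned weights to zero
                p.copy_(initial[name] * masks[name])
    return model, masks
```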
They perform experiments on the fully connected LeNet-300-100 network for the MNIST data, and variants of VGG and ResNet for the CIFAR10 data. Their experiments on the LeNet-300-100 network prune a percent of the weights from each layer except the final layer, in which the percent pruned is reduced by half. Their results with iterative pruning show that: (i) a subnetwork that is only 3.6% of its original size performs just as well, (ii) random initialization of the pruned networks results in slower learning in comparison to use of the original weight initializations, (iii) the subnetworks (termed winning tickets) found learn faster than the original network, (iv) there is continual improvement in the rate of learning as the size of the network reduces, but only up to a point, after which learning slows down and begins to regress to the performance of the original network, and (v) iterative pruning tends to result in more accurate smaller networks than one-shot pruning.

Their experiments on the larger networks, VGG and ResNet, show that identification of winning lotteries depends on the learning rate, with a lower rate successfully identifying winning lottery subnets, and that pruning weights over all the network, as opposed to layer by layer, produces better results.

These results provide good empirical evidence for the Lottery Hypothesis, and the award of a best paper prize at the 2019 International Conference on Learning Representations is indicative of the significance of the paper and the attention it has attracted.

In their paper, "Rethinking the value of network pruning", Liu et al. (2019) challenge the claim that it is better to utilise the initial weights of a pruned model when compared with random initialization [67]. To test this, they carry out experiments on VGG, ResNet, and DenseNet using the CIFAR10, CIFAR100, and ImageNet data.

They define three types of pruning regime: structured pruning, where the proportion of channels that are pruned per layer is predefined; automatic pruning, where the proportion of channels pruned overall is predefined but the per layer rate is determined by the algorithm; and unstructured weight pruning, where only the proportion of weights pruned is predefined. Their results suggest that for structured and automatic pruning, random initialization is equally (if not more) effective. However, for unstructured networks, random initialization can achieve similar results on small data sets but for large scale data such as ImageNet, fine-tuning produces better results.

At first sight, their findings contradict the Lottery Hypothesis. However, in a follow up study, Frankle et al. (2019) acknowledge that setting the weights of pruned networks to their initial values does not work well on larger networks and suggest that methods for retraining from random initializations do not work well either, except for moderate levels of pruning (up to 30%) [68]. They therefore propose setting the weights to those obtained in a later iteration of training, which they then demonstrate to be beneficial in identifying good initialization of weights for larger scale problems such as ImageNet.

The above studies focus on empirical evaluations of networks trained and used on the same data sets, and primarily on image processing classification tasks. Morcos et al. (2019) explore a number of other interesting questions [69]:
• Are the lotteries found for one image classification task transferable to other tasks?
• Are lotteries observable in other tasks (such as natural language processing), and architectures?
• Are they transferable across different optimizers?

To explore these questions, they carry out experiments with VGG19 and ResNet50 using six data sets (Fashion-MNIST, SVHN, CIFAR10, CIFAR100, ImageNet, Places365), in which the lotteries (i.e., subnetworks with initializations) identified for one task are used for another task. Their experiments use iterative magnitude based pruning, selecting 20% of the weights over all the layers, and with late setting of weights (as proposed in Frankle et al. [66]). The results are interesting: in general, winning initializations carry across similar image processing tasks and winning tickets from larger scale tasks were more transferable than the tickets from the smaller scale tasks. In some cases, for example, the use of VGG19 on the Fashion-MNIST data, the winning tickets obtained from the use of VGG19 on the larger data sets (CIFAR100, ImageNet) performed better than those obtained directly from the Fashion-MNIST data.

Hubens et al. (2020) carry out empirical trials that confirm similar results on the size of the pruned networks [70]. They show that when a network is trained on a larger data set, such as ImageNet, and transferred and fine-tuned for a different task, pruning can result in a smaller network than if it was trained from scratch on the new task.

Morcos et al. (2019) carry out experiments in which lottery tickets are identified using one optimizer, ADAM (adaptive moment estimation), and then utilise a different optimizer, SGD (Stochastic Gradient Descent) with momentum, and vice versa on the CIFAR10 data. Their results suggest that, in general, winning tickets are optimizer independent [69].

To test if the lottery hypothesis holds in other types of problems, Yu et al. (2019) carry out experiments on natural language processing (NLP) and control tasks in games [71]. For NLP, they utilise LSTMs for the Wikitext-2 data [72] and Transformer models for translating news in English to German [73]. The experiments were carried out with 20 rounds of iterative pruning and with one-shot pruning. A pruning rate of 20% was used and, following pruning, weights were reset to those learned during a later round of training. For control tasks, they utilise Reinforcement Learning (RL) and carry out experiments on fully connected networks used for 3 OpenAI Gym environments [74] and 9 Atari games that utilise convolutional networks [75].

From their results on NLP and the RL control tasks, they conclude that both iterative pruning and late setting of weights are superior in comparison to random initialization of pruned networks, with iterative pruning being essential when a significant number of weights (i.e., more than two-thirds) are pruned. For the Atari games, the results varied: in one case, it led to improvements over the original network (Berzerk game) while in another, an initial improvement was followed by a significant drop in accuracy as the amount of pruning increased (Space Invaders game). In other cases, pruning resulted in a reduction in performance (e.g., Assault game). Thus in summary, Yu et al. [71] provide some evidence that the lottery hypothesis holds for NLP tasks and for some control tasks that utilise RL.

C. PRUNING FEATURE MAPS AND FILTERS
Although the kind of methods described in Section III-A result in fewer weights, they require specialist libraries or hardware for processing the resulting sparse weight matrices [76]–[78]. In contrast, pruning at higher levels of granularity, such as pruning filters and channels, benefits from the optimizations already available in many current toolkits. This has led to a number of methods for pruning feature maps and filters which are summarized in this section.

FIGURE 3. Illustration of how the feature maps are computed, where W_{j,i} are the k × k filters used on the input channels X_i to obtain output feature maps Y_j.

To appreciate the intuition and notation behind these methods, it is worth bearing in mind how filters are applied to the input channels to produce the output feature maps. Fig. 3 illustrates the process, showing how an image with 3 channels is taken as input and convolved with the filters to produce the 4 output feature maps. Given the visualisation offered by Fig. 3, how can one best prune the filters and channels? The survey revealed three main directions of research:
• Data dependent channel pruning methods, which are based on the view that when different inputs are presented, the output channels (i.e., feature maps) should vary given they are meant to detect discriminative features.
• Data independent pruning methods, that use properties of the filters and output channels, such as the proportion of zeros present, to decide which filters and channels should be pruned.
• Optimization based channel approximation pruning methods, that use optimization methods to recreate the filters to approximate the output feature maps.

The following describes and contrasts methods that typify these three directions.

1) PRUNING BASED ON VARIANCE OF CHANNELS AND FILTERS
Polyak & Wolf (2015) propose two methods for pruning channels: Inbound pruning, which aims to reduce the number of channels incoming to a filter, and Reduce and Reuse pruning, which aims to reduce the number of output channels [79]. The idea behind Inbound pruning is to assess the extent to which an input channel's contribution to producing an output feature map varies with different examples. This assessment is done by applying the network to a sample of the images and then using the variance in a feature map as a measure of its contribution.

More formally, given W_{j,i}, the jth filter for the ith input channel, and X_i^p, the input from the ith channel for the pth example, the contribution to the jth output feature map, Y_{j,i}^p, is defined by:

$Y_{j,i}^p = \| W_{j,i} \cdot X_i^p \|_F$   (1)

Given this definition, the measure used to assess the variation in its contribution, σ_{j,i}^2, from the N samples is:

$\sigma_{j,i}^2 = \mathrm{var}\{ Y_{j,i}^p \mid p = 1 \ldots N \}$   (2)

Inbound pruning uses this measure to rank the filters W_{j,i} and removes any that fall below a specified threshold.

The Reduce and Reuse pruning method focuses on assessing the variations in the output feature maps when different samples are presented. That is, the method first computes the variations in the output feature maps σ_j^2 using:

$\sigma_j^2 = \mathrm{var}\{ \| \sum_{i=1}^{m} Y_{j,i}^p \|_F \mid p = 1 \ldots N \}$   (3)

where m is the number of channels and N is the number of samples. Reduce and Reuse then uses this measure to retain a proportion of the output feature maps and corresponding filters that results in the greatest variation.

Removal of an output feature map is problematic given it is expected as an input channel in the next layer. To overcome this, they approximate a removed channel using the other channels. That is, if Y_i, Y_i' are the outputs of a layer before and after pruning a layer respectively, the aim is to find a matrix A such that:

$\min_A \sum_i \| Y_i - A Y_i' \|_2^2$   (4)

The matrix A is then included as an additional convolutional layer of 1 × 1 filters along the lines proposed by Lin et al. [80].

Polyak & Wolf [79] evaluate the above approach on the Scratch network, using the CASIA-WebFace and the Labeled Faces in the Wild (LFW) data sets. They utilise layer by layer pruning, where each layer is pruned and the network fine-tuned before moving on to the next layer. They experiment with their two pruning methods individually and in combination, and compare the results with the use of random pruning, a low rank approximation method [81] and Fitnets, a method that uses the Knowledge Distillation approach to learn smaller networks [82]. In the experiments with the Inbound pruning method, they prune channels where σ_{j,i}^2 is below a given threshold, selected such that the overall accuracy is maintained above 84%. For the experiments with the Reduce and Reuse method, they try different levels of pruning: 50%, 75%, and 90% for the earlier layers followed by 50% for the later layers. The adoption of a lower pruning rate for the later layers follows an observation that heavy pruning of the later layers results in a marked reduction in accuracy.

The results from their experiments show that: (i) the variance based method is more effective than use of random pruning, (ii) the use of fine-tuning does help in recovering accuracy, especially in the later layers, (iii) their methods result in greater compression than use of a low rank method and the use of Fitnets when applied to the Scratch network.
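The statistic in equation (3) can be estimated with a single forward pass over a sample of inputs. The sketch below is a rough PyTorch illustration for one convolutional layer; reducing each output map to its Frobenius norm per example is an assumption that mirrors the notation above, not the authors' code.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def channel_variance_scores(layer: nn.Conv2d, batches):
    """Estimate sigma_j^2 per output channel of `layer` from a sample of input batches."""
    norms = []
    for x in batches:                            # x: (B, C_in, H, W)
        y = layer(x)                             # output feature maps: (B, C_out, H', W')
        norms.append(y.flatten(2).norm(dim=2))   # Frobenius norm of each map, per example
    norms = torch.cat(norms, dim=0)              # (N, C_out) over all sampled examples
    return norms.var(dim=0)                      # low-variance channels are candidates for pruning
```

Channels whose variance falls below a chosen threshold (or outside the retained top fraction) would then be removed and approximated via equation (4), followed by fine-tuning.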
2) ENTROPY-BASED CHANNEL PRUNING
Instead of the variance, Luo & Wu (2017) propose an entropy-based metric to evaluate the importance of each filter [83]. In their filter pruning method, if a feature map contains less information, its corresponding filter is considered less important, and could be pruned. To compute the entropy of a particular feature map, they first sample the data and obtain a set of feature maps for each filter. Each feature map is reduced to a point measure using a global average pooling method, and the set of measures associated with each filter is discretized into q groups. The entropy of a filter, H_j, is then used to assess the discriminative power of the filter [83]:

$H_j = -\sum_{i=1}^{q} P_i \log(P_i)$   (5)

where P_i is the probability of an example being in group i.
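A minimal sketch of the entropy score in equation (5) is given below, assuming the layer's responses over a data sample have already been collected into a single tensor; the number of bins q and the histogram-based discretisation are assumptions made for illustration.

```python
import torch

@torch.no_grad()
def filter_entropy(feature_maps: torch.Tensor, q: int = 32) -> torch.Tensor:
    """feature_maps: responses of one layer over a sample, shape (N, C, H, W)."""
    pooled = feature_maps.mean(dim=(2, 3))          # global average pooling: (N, C)
    scores = []
    for j in range(pooled.shape[1]):                # one entropy value per filter
        hist = torch.histc(pooled[:, j], bins=q)
        p = hist / hist.sum()
        p = p[p > 0]                                # drop empty bins to avoid log(0)
        scores.append(-(p * p.log()).sum())
    return torch.stack(scores)                      # low entropy => filter carries little information
```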

They explore both one-shot pruning followed by fine-tuning and layer wise pruning in which they fine-tune with just one or two epochs of learning immediately after pruning a layer. Their layer wise strategy is an interesting compromise between fully fine-tuning after pruning each layer, which can be computationally expensive, and only fine-tuning at the end, which can fail to take account of the knock-on effects of pruning previous layers.

They evaluate the merits of the entropy-based method by applying it to VGG16 and ResNet-50 on the ImageNet data. For VGG16, they focus on the first 10 layers and also replace the fully connected layers by use of average pooling to obtain further reductions. They compare their results on VGG16 with those obtained by the magnitude based pruning method and the APoZ method (described below). Their results suggest that: (i) the entropy-based method achieves more than a 16 fold compression, though this is at the expense of a 1.56% reduction in accuracy, (ii) use of magnitude pruning results in a 13 fold compression, and (iii) APoZ results in a 2.7 fold compression. However, it should be noted that the higher compression rate achieved by the use of entropy includes the reduction due to the replacement of the fully connected layers by average pooling, without which the use of the entropy-based method leads to a lower compression rate than APoZ (Table 3 in [83]).

3) APoZ: NETWORK TRIMMING BASED ON ZEROS IN A CHANNEL
In contrast to the use of samples of data to compute the variance of a feature map or its entropy, Hu et al. (2016) suggest a direct method that is based on the view that the number of zeros in an output feature map is indicative of its redundancy [84]. Based on this view, they propose a method that uses the average number of zero activations (APoZ) in a feature map as a measure of the weakness of the filter that generates the feature map.

Their experiments are with LeNet5 on MNIST and VGG16 on ImageNet and are aimed at first finding the most appropriate layers to prune and then iteratively pruning these layers in a bespoke way that maintains or improves accuracy. Following pruning, they experiment with both retraining from scratch and fine-tuning the weights and prefer the latter given better results.

For LeNet-5, they observe that most of the parameters (over 90%) are in the 2nd convolution layer and the first fully connected layer and hence they focus on pruning these two layers in four iterations of pruning and fine-tuning, resulting in the size of the convolutional layer reducing from 50 to 24 filters and the number of neurons in the fully connected layer reducing from 500 to 252. Overall, this represents a compression rate of 3.85.

For VGG16, they also focus on one convolutional layer that has 512 filters and a fully connected layer with 4096 nodes. After 6 iterations, they reduce these to 390 filters and 1513 nodes, achieving a compression rate of 2.59.
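APoZ itself is straightforward to compute from post-ReLU activations gathered over a sample of data. The sketch below is a hedged illustration; the keep ratio is an assumed setting rather than one taken from [84].

```python
import torch

@torch.no_grad()
def apoz(activations: torch.Tensor) -> torch.Tensor:
    """Average Percentage of Zeros per channel; activations are post-ReLU
    feature maps over a data sample, shape (N, C, H, W)."""
    return (activations == 0).float().mean(dim=(0, 2, 3))

def weakest_channels(activations: torch.Tensor, keep_ratio: float = 0.7):
    """Return the indices of the channels with the highest APoZ (assumed keep ratio)."""
    scores = apoz(activations)
    n_keep = int(keep_ratio * scores.numel())
    return torch.argsort(scores)[n_keep:]       # candidates for pruning and fine-tuning
```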
4) PRUNING SMALL FILTERS AND FILTER SKETCHING
Li et al. (2017) extend the idea of magnitude pruning of weights to filters by proposing the removal of filters that have the smallest absolute sum among the filters in a layer [76]. That is, if the filters for producing the jth feature map are W_{j,i} ∈ R^{k×k} and m is the number of input feature maps, then the magnitude of the jth filter is defined by:

$s_j = \sum_{i=1}^{m} \| W_{j,i} \|_1$   (6)

Once the s_j are computed, a proportion of the smallest filters, together with their associated feature maps and the filters in the next layer, are removed. After a layer is pruned, the network is fine-tuned, and pruning is continued layer by layer.
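The criterion in equation (6) reduces to a single tensor reduction per layer. The sketch below shows one possible PyTorch form; the fixed prune ratio is an assumption, whereas in [76] the ratio is tuned per layer via a sensitivity analysis.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def smallest_filters(conv: nn.Conv2d, prune_ratio: float = 0.3):
    """Score each output filter by the sum of its absolute weights (equation (6))
    and return the indices of the smallest ones."""
    w = conv.weight                             # shape (C_out, C_in, k, k)
    s = w.abs().sum(dim=(1, 2, 3))              # s_j = sum_i ||W_{j,i}||_1
    n_prune = int(prune_ratio * s.numel())
    # Removing these filters also removes their feature maps and the matching
    # input channels of the next layer, followed by fine-tuning.
    return torch.argsort(s)[:n_prune]
```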
To test this approach, they carry out experiments on VGG16 and ResNet56 & 110 on CIFAR10 and ResNet34 on ImageNet. By analyzing the sensitivity of the layers through experimentation, they determine appropriate pruning ratios for each layer that would not compromise accuracy significantly. Overall, for VGG16, they are able to prune the parameters by 64%. A significant proportion of this pruning is in layers 8 to 13, which consist of the smaller filters (2 × 2 and 4 × 4), which they notice can be pruned by 50% without reducing accuracy. The level of pruning for the other networks is more modest, with the best pruning rate for ResNet-56 and ResNet110 on CIFAR10 being 3.7% and 32.4% respectively, and for ResNet-34 on ImageNet being 10.8%.

They also compare their approach with the variance-based method described above and conclude that use of the above measure over filters performs at least as well but without the additional need to compute the feature maps via samples of the data.

A more recent method, proposed by Lin et al. (2020), known as filter sketch, also aims to reduce the number of filters without the need to sample examples [85]. The key idea in filter sketching is to minimize the difference between the co-variances of the original set of filters and the reduced set. Although this can be done using optimization methods, filter sketch utilises a greedy algorithm known as Frequent Direction [86] which is more efficient.

Lin et al. [85] evaluate the filter sketch method on GoogleNet, ResNet56 and ResNet110 using the CIFAR10 data, and on ResNet50 with the ImageNet data. The results show that it performs well relative to the method for pruning small filters and a method that uses optimization to prune channels (described below in Section III-C7) in terms of reducing the number of parameters without a significant loss in accuracy.

5) PRUNING FILTERS BASED ON GEOMETRIC MEDIAN
He et al. (2019) point out that pruning based on the magnitude of filters assumes that there are some small filters and that the spread of magnitude is wide enough to adequately distinguish those filters that contribute from those that do not contribute [87].

So, for example, if most of the weights are small, one could end up removing a significant number of filters, and if most of the filters have large values, no filters would be removed, even though there may be filters that are relatively small. Hence, they propose a method based on the view that the geometric median of the filters shares most of the information common in the other filters and hence a filter that is close to it can be covered by the other filters if deleted. Computing the geometric median can be time-consuming, so they approximate its computation by assuming that one of the filters will be the geometric median. Their pruning strategy is to prune and fine-tune repeatedly using a fixed pruning factor for all layers.

They carry out an evaluation with respect to several methods including pruning small filters [76], ThiNet [88], Soft filter pruning [89], and NISP [90]. These methods are evaluated on ResNets trained on the CIFAR10 and ImageNet data, with pruning rates of 30 and 40 percent. In general, the drop in accuracy is similar across the different methods, though there is a significant reduction in FLOPS when using the geometric median method on ResNet-50 (53.5%) compared to the other methods (e.g., ThiNet 36.7%, Soft filter pruning 41%, NISP 44%).
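A hedged sketch of this criterion is shown below, using the approximation described in the text: the filter that minimises the total distance to all other filters stands in for the geometric median, and the filters closest to it are treated as redundant. The prune ratio is an assumed setting.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def filters_near_geometric_median(conv: nn.Conv2d, prune_ratio: float = 0.3):
    w = conv.weight.flatten(1)                 # (C_out, C_in*k*k): one row per filter
    dist = torch.cdist(w, w)                   # pairwise Euclidean distances between filters
    median_idx = dist.sum(dim=1).argmin()      # filter standing in for the geometric median
    n_prune = int(prune_ratio * w.shape[0])
    order = torch.argsort(dist[median_idx])    # filters closest to the median are most redundant
    return order[1:n_prune + 1]                # skip the median itself
```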
6) ThiNet AND AOFP
Luo et al. (2017) formulate the pruning task as an optimization problem and propose a system, ThiNet, in which the objective is to find a subset of input channels that can best approximate the output feature maps [88]. The channels not in the subset and their corresponding filters can then be removed. Solving the optimization problem is computationally challenging, so ThiNet uses a greedy algorithm that finds a channel that contributes the least, adds it to the list to be removed, and repeats the process with the remaining channels until the number of channels selected equals the number to be pruned. Once a subset of filters to be retained is identified, their weights are obtained by using least squares to find the filters W that minimize [88]:

$\sum_{i=1}^{m} \left( Y_i - W^T \cdot X_i \right)^2$   (7)

where Y_i are the m sampled points in the output channels and X_i their corresponding input channels.
Xi their corresponding input channels. X
They evaluate their approach in two sets of experiments. Y = Xi WiT (8)
In the first, they adapt VGG16, replacing the fully connected i=1
layers by global average pooling (GAP) layers, apply it to They define the task as one to optimize:
the UCSD-Birds data and then prune it using ThiNet, APoZ
and the small filters method. Their results show there is less c 2
1 X
degradation in accuracy with ThiNet than ApoZ, which in arg min Y− βi Xi WiT
β,W 2
turn, is better than the small filters method. i=1 F
In their second set of experiments, they utilise VGG16 and subject to kβk0 ≤ p (9)
ResNet50 trained on the ImageNet data. For VGG16, their
procedure involves pruning a layer and then minor fine-tuning where p indicates the number of channels retained and βi ∈
with one epoch of training with an additional epoch at the end {0, 1} indicates the retention or removal of a channel.
of each group of convolutional layers and a further 12 epochs In contrast to ThiNet, which adopts a greedy heuristic to
of fine-tuning after the final layer. With the use of GAP, solve this optimization problem, He et al. (2017) relax the
ThiNet, reduces the number of parameters by about 94% at problem from L0 to L1 regularization and utilise LASSO

63288 VOLUME 10, 2022


S. Vadera, S. Ameen: Methods for Pruning Deep Neural Networks

In contrast to ThiNet, which adopts a greedy heuristic to solve this optimization problem, He et al. (2017) relax the problem from L0 to L1 regularization and utilise LASSO regression to solve [94]:

$\arg\min_{\beta, W} \frac{1}{2} \left\| Y - \sum_{i=1}^{c} \beta_i X_i W_i^T \right\|_F^2 + \lambda \|\beta\|_1 \quad \text{subject to } \|\beta\|_0 \le p$   (10)

Following the selection of the channels, they utilise least squares to obtain the revised weights in a manner similar to the approach adopted in ThiNet.
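Since the relaxed problem in equation (10) is a standard LASSO, it can be sketched with an off-the-shelf solver. In the snippet below, each column of `contribs` holds X_i W_i^T evaluated at sampled output positions and y is the corresponding full output; the regularisation strength alpha and the selection tolerance are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_channel_selection(contribs: np.ndarray, y: np.ndarray, alpha: float = 1e-3):
    """Select channels by solving the L1-relaxed problem; beta_i near zero => prune channel i."""
    lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
    lasso.fit(contribs, y)
    beta = lasso.coef_
    keep = np.flatnonzero(np.abs(beta) > 1e-6)
    return keep    # the kept filters are then refined with least squares, as in ThiNet
```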
They carry out empirical evaluations on VGG16, ResNet50 and a version of the Xception network, trained on the CIFAR10 and ImageNet data. They also explore the extent to which the pruned models can be used for transfer learning by using them for the PASCAL VOC 2007 object detection task.

In their first set of experiments, they evaluate their method on single layers of VGG16 trained on CIFAR10 without any fine-tuning, and show that their algorithm maintains Top-5 accuracy better than the method of pruning small filters. They also include results from a naïve method that selects the first k feature maps and show that, for some layers (e.g. conv3_3 in VGG16), this sometimes performs better than the method of pruning small filters, highlighting a potential weakness of magnitude-based pruning.

In a second set of experiments, with VGG16 on CIFAR10, they apply their method on the full network, using bespoke pruning ratios for the layers and fine-tuning to achieve 2, 4 and 5 fold improvements in run-time, but resulting in drops of Top-5 accuracy of 0%, 1%, and 1.7% respectively. In comparison, the method for pruning small filters results in larger drops of 0.8%, 8.6% and 14.6%.

Their experiments on ResNet50 adopt bespoke pruning rates per layer, retaining 70% of layers that are very sensitive to pruning, and 30% of the less sensitive layers. The Top-5 accuracy results on ImageNet show a two-fold improvement in run-time at the expense of a 1.4% drop in accuracy compared to a baseline accuracy of 92.2%, while the results on the Xception network show a drop of 1% in accuracy from a baseline of 92.8%.

The experiment on using a pruned version of a VGG16 model on the PASCAL VOC 2007 object detection benchmark task results in a 2-fold increase in speed with a 0.4% drop in average precision.

IV. PRUNING BASED ON SIMILARITY AND CLUSTERING
Given that neural networks can be over-parametrised, it is plausible that there could be duplicate weights or filters that perform similar functions and can be removed without impacting accuracy [19], [20], [98]–[100].

RoyChoudry et al. (2017) explore this hypothesis by using the inner product of two filters (or weight matrices) as a measure of similarity [19]. Their pruning algorithm involves grouping filters that are similar and then replacing each group of filters by their mean filter. They carry out experiments with both a multilayer perceptron (MLP) and a CNN for the CIFAR10 data. The MLP has three layers: the first two are fully connected layers and the third is a softmax layer with 10 nodes representing the classes for CIFAR10. The CNN has two convolution layers, each followed by a ReLU and a 2 × 2 max pooling layer. The convolutional layers are followed by two fully connected layers to perform the classification. In both cases, the first layer is varied with 100, 500 and 1000 units (nodes or filters) to explore the effects of increasing over-parametrisation. Their main finding is that there is a much greater propensity for similar weights/filters to occur in MLPs than in CNNs. As a consequence, there is a greater opportunity for using similarity as a basis for pruning MLPs than for pruning CNNs. Nevertheless, their results suggest that a similarity based pruning algorithm is better at retaining accuracy than using the small filters method.

Ayinde et al. (2019) also develop a method that uses clustering to identify similar filters [78]. They too adopt the inner product as a measure of similarity, but use an agglomerative hierarchical clustering method to group similar filters and replace the filters by randomly selecting one filter from each cluster. They carry out various experiments with VGG16 on CIFAR10 and ResNet34 on ImageNet. For the trial on VGG16 with the CIFAR10 data, they show that, once an optimal value for the threshold for similarity is determined, their method achieves both a better pruning rate and accuracy than other methods, including pruning of small filters, Network Slimming [92] (a method that uses regularization to identify weak channels), and try-and-learn [93] (a method that uses sensitivity analysis).
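A possible sketch of this kind of similarity-based pruning, using SciPy's agglomerative (hierarchical) clustering, is shown below. The cosine distance (closely related to the normalised inner product) and the distance threshold are assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_similar_filters(weights: np.ndarray, threshold: float = 0.15):
    """weights: array of shape (C_out, C_in, k, k); returns the filters to keep,
    one representative per cluster of similar filters."""
    flat = weights.reshape(weights.shape[0], -1)
    dist = pdist(flat, metric="cosine")                     # pairwise filter dissimilarity
    clusters = fcluster(linkage(dist, method="average"),
                        t=threshold, criterion="distance")  # agglomerative grouping
    keep = [int(np.flatnonzero(clusters == c)[0]) for c in np.unique(clusters)]
    return sorted(keep)                                     # remaining filters are candidates for removal
```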
V. SENSITIVITY ANALYSIS METHODS
The primary goal of pruning is to remove weights, filters and channels that have least effect on the accuracy of a model. The magnitude and similarity based methods described above address this goal implicitly by using properties of weights, filters and channels that can affect accuracy. In contrast, this section presents methods that use sensitivity analysis to model the effect of perturbing and removing weights, filters and channels on the loss function.

Section V-A describes methods that assess the importance of channels and Sections V-B to V-D present the development of a line of research that approximates the effect of perturbing the weights on the loss function using the Taylor series, from the earliest work which developed methods for MLPs to the more recent research on methods for pruning CNNs.

A. PRUNING BY ASSESSING THE IMPORTANCE OF NODES AND CHANNELS
Skeletonization, a method proposed by Mozer & Smolensky (1988), was one of the earliest approaches to pruning neural networks [9]. To calculate the effect of removing nodes, Skeletonization introduced the notion of attentional strength to denote the importance of nodes when computing activations.

Given the attentional strengths of the nodes, α_i, the output y_j from node j is defined by:

$y_j = f\left( \sum_i w_{ji} \alpha_i y_i \right)$   (11)

where f is assumed to be the sigmoid function. The importance of a node, ρ_i, is then defined in terms of the difference in loss when α_i is set to zero and when it is set to one, and can be approximated by the derivative of the loss with respect to the attentional strength α_i:

$\rho_i = L_{\alpha_i=0} - L_{\alpha_i=1} \approx -\left. \frac{\partial L}{\partial \alpha_i} \right|_{\alpha_i=1}$   (12)

Through experimentation, they found that a linear loss worked better than the quadratic loss because the difference between the outputs and targets was small following training. In addition, they noticed that the ∂L(t)/∂α_i were not stable with time, so they used a weighted average measure to compute the importance ρ̂_i:

$\hat{\rho}_i(t+1) = 0.8\,\hat{\rho}_i(t) + 0.2\,\frac{\partial L(t)}{\partial \alpha_i}$   (13)

Mozer & Smolensky [9] present a number of small but very interesting experiments. These include generating examples where the output is correlated to four inputs, A, B, C, and D, with full correlation on A and reducing to no correlation with D. They provide this as input to a network with one hidden node and, following training, they observe that the weights from the inputs to the hidden node follow the correlations, although the relevance measure only shows input node A as important, providing some reassurance that the measure is different from the weights. In another example, they develop a network to model a 4-bit multiplexor, which has 4 bits as input and two bits to control which of the 4 bits is output. They try two network configurations: in the first, they utilise 4 hidden nodes and in the second they utilise 8 hidden nodes and use skeletonization to reduce its size to 4 hidden nodes. When limiting training to 1000 epochs, they find that starting with 4 hidden nodes initially results in failure to converge in 17% of the cases, while beginning with 8 hidden layer nodes followed by skeletonization converges in all the cases and also retains accuracy. This appears to be one of the first demonstrations that, to begin with, it may be necessary to overparameterize a network in order to find winning lotteries.

This idea of assessing the importance of nodes has been extended to channels by two methods, namely Network Slimming [92] and Sparse Structure Selection (SSS) [55], that learn a measure of importance as part of the training process. Both utilise a parameter γ for each channel (analogous to the attentional strength) which scales the output of a channel. Given a loss function L, the new loss L' is defined with an additional regularization term over the scaling factors γ:

$L' = L + \lambda \sum_{\gamma} g(\gamma)$   (14)

where the function g is selected as the L1 norm to reduce γ towards zero (as in Lasso regression).

The two methods differ in the way they implement the training process aimed at minimizing L', with Network Slimming taking advantage of the batch normalization layers that are sometimes present following convolutional layers, while SSS implements a more general process that does not assume the presence of batch normalization layers and allows use of scaling factors for blocks (such as residual and inception blocks) that can enable reduction of the depth of a network.

Huang and Wang (2018) experiment with SSS on the CIFAR-10, CIFAR-100, and ImageNet data on VGG16 and ResNet [55]. For CIFAR10, SSS is able to reduce the number of parameters in VGG16 by 30% without loss of accuracy. For ResNet-164, it is able to achieve a 2.5 times speedup at the cost of a 2% loss in accuracy for CIFAR-10 and CIFAR-100. For VGG16 on ImageNet, SSS is able to reduce the FLOPs by about 75%, though parameter reduction is minimal, which is consistent with other methods given the large number of parameters in the fully connected layers in VGG16. On ResNet50, SSS achieves a 15% reduction in FLOPs at a cost of a 0.1% reduction in Top-1 accuracy.
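In the Network Slimming instantiation of equation (14), the batch normalization scale factors play the role of γ, so the penalty can be added to the training loss in a few lines. The sketch below is a hedged illustration; the value of λ is an assumption.

```python
import torch
import torch.nn as nn

def scaling_factor_penalty(model: nn.Module, lam: float = 1e-4) -> torch.Tensor:
    """L1 penalty on the per-channel batch-norm scale factors (the gammas of equation (14))."""
    terms = [m.weight.abs().sum() for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
    return lam * torch.stack(terms).sum()

# Assumed use inside a training step:
#   loss = criterion(model(x), y) + scaling_factor_penalty(model)
#   loss.backward()
# After training, channels whose gamma has been driven close to zero are pruned
# and the slimmed network is fine-tuned.
```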
B. PRUNING WEIGHTS WITH OBD AND OBS
Several studies utilise the Taylor series to approximate the effect of weight perturbations on the loss function [22], [23], [101]. Given the change in weights ΔW, a Taylor series approximation of the change in loss ΔL can be stated as [102]:

$\Delta L = \frac{\partial L}{\partial W}^T \Delta W + \frac{1}{2} \Delta W^T H \Delta W + O\left(\|\Delta W\|^3\right)$   (15)

where H is a Hessian matrix whose elements are the second order derivatives of the loss with respect to the weights:

$H_{ij} = \frac{\partial^2 L}{\partial w_i \partial w_j}$   (16)

Most methods that adopt this approximation assume that the third order term is negligible. In Optimal Brain Damage (OBD), LeCun et al. (1990) also assume that the first order term can be ignored, given that the network will have been trained to achieve a local minimum, resulting in a simplified quadratic approximation [22]:

$\Delta L = \frac{1}{2} \Delta W^T H \Delta W$   (17)

Given the large number of weights, computing the Hessian is computationally expensive, so they also assume that the change in loss can be approximated by the diagonal elements of the Hessian, resulting in the following measure of the saliency s_k of a weight w_k:

$s_k = \frac{H_{kk} w_k^2}{2}$   (18)

where the second order derivatives H_kk are computed in a manner similar to the way the gradient is computed in backpropagation.
Given a loss function L, the new loss L0 is defined with an sk = (18)
additional regularization term over the scaling factors γ : 2
X where the second order derivatives, Hkk are computed in
L0 = L + λ g(γ ) (14) a manner similar to the way the gradient is computed in
u backpropagation.
where the function g is selected as the L1 norm to reduce γ Hassibi et al. (1993) argue that ignoring the non-diagonal
towards zero (as in Lasso regression). elements of a Hessian is a strong assumption, and propose

63290 VOLUME 10, 2022


S. Vadera, S. Ameen: Methods for Pruning Deep Neural Networks

an alternative pruning method, called Optimal Brain Surgeon using the measure, pruning it, and then fine-tuning the net-
(OBS), that aims to take account of all the elements of a work before repeating the process until a stopping condition,
Hessian [23], [24]. that takes account of the need to reduce the number of FLOPs
Using a unit vector, em , to denote the selection of the mth while maintaining accuracy, is met. Their experiments reveal
weight as the one to be pruned, OBS reformulates pruning as several interesting findings:
a constraint-based optimization task: 1) From experiments on VGG16 and AlexNet on the

1 T
 UCSD-Birds and Oxford-Flowers data, they show that
min δw · H · δw the features maps selected by their criteria correlate
δw 2
significantly more closely to those selected by an oracle
subject to eTm .δw + δwm = 0 (19)
method than OBD and APoZ. On the ImageNet data,
Formulating this with a Lagrangian multiplier, λ, the task is they find that OBD correlates best when AlexNet is
to minimize: used.
2) In experiments on transfer learning, where they fine-
1 T  
δw · H · δw + λ eTm · δw + δwm (20) tune VGG16 on the UCSD-Birds data, they present
2 results showing that their method performs better than
By taking derivatives and utilizing the above constraint, they APoZ and OBD as the number of parameters pruned
show the saliency, sk of weight wk can be computed using: increases. In an experiment in which AlexNet is fine-
tuned for the Oxford Flowers data, they show that both
1 w2k
sk =   (21) their method and OBD perform better than APoZ.
2 H −1 k,k 3) In a striking example of the potential benefits of
pruning, they demonstrate their method on a network
They show that on the XOR problem, modelled using a
for recognizing hand gestures that requires over 37
MLP network with 2 inputs, 2 hidden layer nodes and one
GFLOPs for a single inference but only requires 3
output, OBS is better at detecting the correct weights to delete
GFLOPs after pruning, all be it with a 2.6% reduction
than OBD or magnitude pruning. They also show that OBS
in accuracy.
is able to significantly reduce the number of weights required
for neural networks trained on the Monk problems [103] and In a follow up publication, Molchanov et al. (2019)
for NetTalk [104], one of the classical applications of neural acknowledge some limitations of the above approach, namely
networks, it is able to reduce the number of weights required that assuming that all layers have the same importance does
from 18000 to 1560. not work for skip connections (used in the ResNet architec-
ture) and that assessing the impact of changes in feature maps
C. PRUNING FEATURE MAPS WITH FIRST-ORDER TAYLOR leads to increases in memory requirements [106]. They there-
APPROXIMATIONS fore propose an alternative formulation, also using a Taylor
The methods described in Section V-B focus on the effect of series approximation, but based on estimating the squared
removing weights in a fully connected network. Molchanov loss due to the removal of the mth parameter:
et al. (2016) introduce a method that uses the Taylor series to  2
approximate what happens if a feature map is removed [105]. 1
(1Lm )2 = gm wm + wm Hm W (23)
In contrast to OBD and OBS, which assume that the first order 2
term can be ignored, they adopt a first order approximation,
ignoring the higher order terms, primarily on grounds of where gm is the first order gradient and Hm is the mth row
computational complexity. Using a first order approxima- of the Hessian matrix. The measure of importance of a filter
tion seems odd given the convincing argument for ignoring is then obtained by summing the contributions due to each
these terms; however they argue that although the first order parameter in a filter.
gradient tends to zero, the expected value of the change in The pruning algorithm employed proceeds as follows.
loss is proportional to the variance, which is not zero and In each epoch, they utilise a fixed number of mini-batches
is a measure of the stability as a local solution is reached. to estimate the importance of each filter and then, based on
Given a feature map with N elements Yi,j , the first order their importance, a predefined number of filters is removed.
approximation using the Taylor series leads to the following The network is then fine-tuned, and the process repeated until
measure of the absolute change in loss [105]: a pruning goal, such as the desired number of filters or a limit
for an acceptable drop in accuracy is reached.
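The following sketch (our illustration, assuming PyTorch; the layer, data and loss are placeholders rather than the authors' implementation) shows how the measure in equation (22) can be estimated for the output channels of a single convolutional layer, followed by the layer-wise L2 normalization:

import torch

def taylor_channel_importance(feature_map, feature_grad):
    # Equation (22) per output channel: |(1/N) * sum(dL/dY * Y)|, with the
    # average taken over spatial positions and then over the mini-batch.
    contrib = (feature_map * feature_grad).mean(dim=(2, 3))
    importance = contrib.abs().mean(dim=0)
    return importance / (importance.norm(p=2) + 1e-8)    # layer-wise L2 normalization

conv = torch.nn.Conv2d(3, 8, 3, padding=1)
x = torch.randn(4, 3, 32, 32)
y = conv(x)
y.retain_grad()                                           # keep dL/dY for the estimate
loss = y.pow(2).mean()                                    # stand-in for the task loss
loss.backward()
scores = taylor_channel_importance(y.detach(), y.grad)    # one importance score per channel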
The pruning process they adopt involves selecting a feature map using the measure, pruning it, and then fine-tuning the network before repeating the process until a stopping condition, which takes account of the need to reduce the number of FLOPs while maintaining accuracy, is met. Their experiments reveal several interesting findings:
1) From experiments on VGG16 and AlexNet on the UCSD-Birds and Oxford-Flowers data, they show that the feature maps selected by their criteria correlate significantly more closely to those selected by an oracle method than OBD and APoZ. On the ImageNet data, they find that OBD correlates best when AlexNet is used.
2) In experiments on transfer learning, where they fine-tune VGG16 on the UCSD-Birds data, they present results showing that their method performs better than APoZ and OBD as the number of parameters pruned increases. In an experiment in which AlexNet is fine-tuned for the Oxford Flowers data, they show that both their method and OBD perform better than APoZ.
3) In a striking example of the potential benefits of pruning, they demonstrate their method on a network for recognizing hand gestures that requires over 37 GFLOPs for a single inference but only 3 GFLOPs after pruning, albeit with a 2.6% reduction in accuracy.

In a follow up publication, Molchanov et al. (2019) acknowledge some limitations of the above approach, namely that assuming that all layers have the same importance does not work for skip connections (used in the ResNet architecture) and that assessing the impact of changes in feature maps leads to increases in memory requirements [106]. They therefore propose an alternative formulation, also using a Taylor series approximation, but based on estimating the squared loss due to the removal of the mth parameter:

  (ΔL_m)² = ( g_m w_m + ½ w_m H_m W )²    (23)

where g_m is the first order gradient and H_m is the mth row of the Hessian matrix. The measure of importance of a filter is then obtained by summing the contributions due to each parameter in a filter.

The pruning algorithm employed proceeds as follows. In each epoch, they utilise a fixed number of mini-batches to estimate the importance of each filter and then, based on their importance, a predefined number of filters is removed. The network is then fine-tuned, and the process is repeated until a pruning goal, such as the desired number of filters or a limit for an acceptable drop in accuracy, is reached.

They carry out initial experiments on versions of LeNet and ResNet on the CIFAR10 data, using both the second and first order approximations (in equation 23) and, given that the results from both correlate well with an oracle method, they utilise the first-order measure, which is significantly more efficient to compute.
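The loop can be sketched as below (again only an illustration: it uses just the first-order part of equation (23), accumulates gradients over a few mini-batches, zeroes the selected filters rather than physically removing them, and omits the fine-tuning step):

import torch
import torch.nn as nn

def filter_importance(conv):
    # First-order contribution (g_m * w_m)^2 of each parameter, summed per filter.
    g = conv.weight.grad
    return (g * conv.weight).pow(2).sum(dim=(1, 2, 3))

def prune_lowest_filters(conv, n_remove):
    # Zero out the n_remove filters with the smallest estimated importance.
    idx = torch.argsort(filter_importance(conv))[:n_remove]
    with torch.no_grad():
        conv.weight[idx] = 0.0
        if conv.bias is not None:
            conv.bias[idx] = 0.0

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
criterion = nn.CrossEntropyLoss()
for _ in range(4):                                    # a fixed number of mini-batches
    x, t = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
    criterion(model(x), t).backward()                 # gradients accumulate across batches
prune_lowest_filters(model[0], n_remove=4)            # then fine-tune and repeat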


In experiments with versions of ResNet, VGG, and DenseNet on the ImageNet data, they consider the effect of using the measure of importance at points before and after the batch normalization layers, and conclude that the latter option results in greater correlation with an oracle method. The results from their method show that it works especially well on pruning ResNet-50 and ResNet-34, outperforming the results from ThiNet and NISP. The reported results for other networks are also impressive, with their method able to prune 76% of the parameters in VGG with a 0.19% loss in accuracy and able to reduce the number of parameters in DenseNet by 43% at the expense of a 0.29% reduction in accuracy.

D. PRUNING FEATURE MAPS WITH SECOND-ORDER TAYLOR APPROXIMATIONS
The first order methods described above assume minimal interaction across channels and filters. This section summarizes recent pruning methods that aim to take account of the effect of the potential dependencies amongst the channels and filters.

In a method called EigenDamage, which also utilises the Taylor series approximation, Wang et al. (2019) revisit the assumptions made by OBD and OBS when approximating the Hessian [101]. To motivate their method, they begin by illustrating that although OBS is better than OBD when pruning one weight at a time, it is not necessarily superior when pruning multiple weights at a time. This is primarily because OBS does not correctly model the effect of removing multiple weights, especially when they are correlated. To avoid this problem, they utilise a Fisher Information Matrix to approximate the Hessian and then they utilise a method, proposed by Grosse & Martens [107], to represent a Fisher Matrix by a Kronecker Factored Eigenbasis (KFE). This reparameterization allows pruning to be done in a new space in which the Fisher Matrix is approximately diagonal. Pruning can thus be done by first mapping the weights to a KFE space in which they are approximately independent, and then mapping the results back to the original space.

EigenDamage is evaluated on VGG and ResNet on the CIFAR10, CIFAR100 and Tiny-ImageNet data. Experiments are carried out with one-shot pruning, where fine-tuning is performed at the end, and with iterative pruning, in which fine-tuning is performed after each cycle. In both cases, the results show that EigenDamage outperforms adapted versions of OBD, OBS and Network Slimming.

Peng et al. (2019) also utilise a Taylor series approximation to develop a Collaborative Channel Pruning (CCP) method that is based on a measure of the impact of a combination of channels [108]. Given a mask β, where β_i = 1 indicates the retention of a channel and β_i = 0 indicates a channel to be pruned, they formulate the task as one of finding the β_i that minimize the loss L:

  L(β, W) = L(W) + Σ_{i=1}^{co} (β_i − 1) g_iᵀ w_i + ½ Σ_{i=1,j=1}^{co} (β_i − 1)(β_j − 1) w_iᵀ H_{i,j} w_j    (24)

where g_i are the first order derivatives of the loss with respect to the weights in the ith output channel, H_{i,j} are Hessians, and co denotes the number of output channels.

By setting u_i = g_iᵀ w_i and s_{i,j} = ½ w_iᵀ H_{i,j} w_j, the above equation can be written as the following 0-1 quadratic optimization problem [108]:

  min_{β} Σ_{i=1}^{co} u_i (β_i − 1) + Σ_{i=1,j=1}^{co} s_{i,j} (β_i − 1)(β_j − 1)
  subject to:  ‖β‖₀ = p  and  β_i ∈ {0, 1}    (25)

where p denotes the number of channels to be retained in a layer. They note that the gradients g_i, and hence u_i, can be computed in linear time. However, given the complexity of computing the Hessian matrices, they derive first order approximations for the loss functions, which they adopt when computing s_{i,j}. To solve the quadratic optimization problem, they relax the constraint to β_i ∈ [0, 1] and use a quadratic programming method to find the β_i, which are used to select the top p channels to retain. They apply the optimization process on each layer to obtain the masks β_i, use these to prune, and then perform fine-tuning at the end.

An empirical evaluation of CCP is carried out by pruning the ResNet models trained on the CIFAR10 and ImageNet data, and the results are compared to several methods including: pruning small filters, ThiNet, optimizing channel pruning, Soft Filter pruning [89], NISP [90] and AutoML [109]. For CIFAR10, the experiments are carried out with pruning rates of 35% and 40%, and in each case CCP has a smaller drop in accuracy (0.04% and 0.08% respectively) than the other methods, with the exception of the method for pruning small filters, which results in a small improvement in accuracy (0.02%). However, the pruning small filters method has a much lower reduction in FLOPs (27.6%) in comparison to CCP (52.6%). The results for ImageNet show that, for similar reductions in FLOPs, CCP has less of a drop in accuracy than the other methods.

It is worth noting that, like EigenDamage, CCP is able to obtain good results without the need for an iterative process that uses fine-tuning after pruning each layer.
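A simplified stand-in for the relaxed optimization in equation (25) is sketched below; it replaces the quadratic programming solver used by CCP with projected gradient descent and assumes the linear terms u and the pairwise terms S (the matrix of s_{i,j}) have already been estimated:

import numpy as np

def ccp_select_channels(u, S, p, steps=500, lr=0.01):
    # Relaxed version of equation (25): minimize u.(beta-1) + (beta-1)^T S (beta-1)
    # with beta in [0, 1], then keep the p channels with the largest beta.
    beta = np.full(u.size, 0.5)                    # start with an undecided mask
    for _ in range(steps):
        d = beta - 1.0
        grad = u + (S + S.T) @ d                   # gradient of the relaxed objective
        beta = np.clip(beta - lr * grad, 0.0, 1.0)
    mask = np.zeros(u.size, dtype=int)
    mask[np.argsort(beta)[-p:]] = 1
    return mask

rng = np.random.default_rng(2)
u = rng.normal(size=8)                             # u_i = g_i^T w_i, estimated per channel
S = rng.normal(scale=0.1, size=(8, 8))             # s_{i,j}, estimated per channel pair
mask = ccp_select_channels(u, S, p=5)              # keep 5 of the 8 channels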


VI. COMPARISON OF PUBLISHED RESULTS
As the above sections describe, previous studies of pruning report results on varying data sets, architectures and methods that have evolved with time, making comparison of results across the different studies difficult. The survey provides a resource in the form of a pivot table that can be used by the community to explore the reported performance of over 50 methods on different architectures and data.4 Table 2 shows how many times each combination of data and architecture has been used, indicating the wide variety of comparisons possible.

4 This resource is available from https://1drv.ms/x/s!ArCIJ6nceQY3-BWxGMfJRbmhyUK_?e=WZUyhm
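The exact layout of the spreadsheet is not reproduced here, but the sketch below (with placeholder method names and values, assuming pandas) illustrates the kind of pivot the resource supports, tabulating reported FLOP reductions by method, architecture and data set:

import pandas as pd

records = pd.DataFrame([
    {"method": "MethodA", "architecture": "ResNet56", "dataset": "CIFAR10",
     "top1_drop_pct": 0.1, "flops_reduction_pct": 50.0},
    {"method": "MethodB", "architecture": "VGG16", "dataset": "CIFAR10",
     "top1_drop_pct": -0.2, "flops_reduction_pct": 34.0},
    {"method": "MethodA", "architecture": "AlexNet", "dataset": "ImageNet",
     "top1_drop_pct": 0.5, "flops_reduction_pct": 41.0},
])

pivot = records.pivot_table(index="method",
                            columns=["architecture", "dataset"],
                            values="flops_reduction_pct")
print(pivot)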
TABLE 2. Number of reported results for a given architecture and data set.

To illustrate the use of the resource, we use it to compare the reported results on two combinations of architecture and data for which there are a significant number of comparisons across different pruning methods, namely: (i) AlexNet and ResNet50 on ImageNet and (ii) ResNet56 and VGG16 on CIFAR10. Fig. 4 shows the results reported in terms of the drop in Top-1 accuracy, and percent reduction in FLOPs or parameters, where the labels used for the pruning methods are from the primary sources, with suffixes reflecting the variations in pruning methodology used. The main observations are that:
1) For AlexNet on ImageNet, AOFP-B2 achieves a 41% reduction in FLOPs with a 0.46% increase in accuracy, and Dyn Surg [65], SSR-L [56] and NeST [41] achieve over 93% reduction in parameters without loss in accuracy. Other methods that compromise accuracy do not necessarily result in a greater reduction in parameters.
2) In comparison to AlexNet, it is harder to prune ResNet50 on ImageNet, although AOFP-C1 achieves a 33% reduction in FLOPs without affecting accuracy. As accuracy is compromised, there are methods that show significant reductions in parameters. These include KSE and ThiNet, with reductions in parameters of 78% and 95%, and a decline of 0.64% and 0.84% in accuracy, respectively.
3) When pruning ResNet56 on the CIFAR10 data, the methods KSE and SFP-NFT show reductions in FLOPs of 60% and 28% without compromising accuracy. For VGG16 on CIFAR10, AOFP, PF_EC and NetSlimming result in a 75%, 63%, and 51% reduction in FLOPs respectively without reductions in accuracy. For both networks, it appears to be difficult to gain further reductions (beyond KSE and AOFP) even when compromising accuracy.
4) The charts show that several methods are able to reduce FLOPs and parameters without compromising accuracy and aid generalizability (e.g., AOFP, SFP, NetSlimming), though compromising accuracy a little can sometimes lead to more significant reductions in FLOPs and parameters.
5) When looking at results within methods, it is possible to confirm our expectation that compromising accuracy can result in greater reductions in parameters and FLOPs (e.g., see results for Filter Sketch and FPGM for ResNet50 in Fig. 4). However, this trade-off is not evident when considering results across different pruning methods.
6) Although AOFP does perform well in retaining accuracy for three of the four cases, in general, the performance of the methods varies depending on the architecture and the data set.


FIGURE 4. Results of pruning AlexNet and ResNet50 on ImageNet (left column), and ResNet56 and VGG16 on the CIFAR10 data (right column). Charts show the percent reduction in parameters where available (blue bars, left axis) and FLOPs (orange bars, left axis), and reduction in baseline Top-1 accuracy (grey line, right axis). The labels used for the methods are from the primary sources, with suffixes reflecting the variations in pruning rates.

VII. CONCLUSION AND FUTURE WORK
This paper has presented a survey of methods for pruning neural networks, focusing on methods based on magnitude pruning, use of similarity and clustering, and methods based on sensitivity analysis.

Magnitude based pruning methods have developed from removal of small weights in MLPs to methods for pruning filters and channels which lead to substantial reductions in the size of deep networks. The range of methods developed includes: (i) those that are data dependent and use examples to assess the relevance of output channels, (ii) methods that are independent of data, which assess the contributions of filters and channels directly, and (iii) methods that utilise optimization algorithms to find filters that approximate channels.

Methods based on sensitivity analysis are the most transparent in that they are based on approximating the loss due to changes to a network. The development of methods based on a Taylor series approximation represents the primary line of research in this category of methods. Different studies have adopted different assumptions in order to make the computation of the Taylor approximation feasible. In one of the first studies, the OBD method assumed a diagonal Hessian matrix, ignoring both first order gradients and second order non-diagonal gradients. This was followed by OBS, a method that aimed to take account of non-diagonal elements of the Hessian but has been shown to struggle when pruning multiple weights that are correlated. The EigenDamage method aims to take better account of correlations by approximating the Hessian with a Fisher Information Matrix and using a reparameterization to a new space in which the weights are approximately independent. In an alternative approach, the Collaborative Channel Pruning (CCP) method formulates the pruning task as a quadratic programming problem. Molchanov et al. [105] develop a method based on a first-order approximation, arguing that the variance in the loss, as training approaches a local solution, is an indicator of stability and provides a good measure of the importance of filters. In contrast to most of the other methods that adopt layer by layer pruning with fine tuning after each layer, both EigenDamage and CCP show that it is possible to obtain good results with one-shot pruning followed by fine-tuning. These three recent methods all show good results on large scale networks and data sets, though direct empirical comparisons between them have yet to be published. The survey also found two alternatives to the use of Taylor series approximations: a method that aims to learn which filters to prune [93] and a method based on the use of multi-armed bandits [118], both of which have the potential to explore new avenues of research on pruning methods.

The survey reveals a number of positive results about the Lottery Hypothesis: lotteries appear to perform well in transfer learning, and lotteries exist for tasks such as NLP and for architectures such as LSTMs.


Lotteries even seem to be independent of the type of optimizer used during training. Much of the current research on lotteries is based on deep networks, but it is interesting that one of the earliest papers in the field demonstrates the need to overparameterize a small feedforward network for modelling a 4-bit multiplexor. Thus, it might prove fruitful to explore the properties of lotteries on smaller problems as well as the larger networks of today. The existence of good lotteries does appear to depend on the fine-tuning process adopted, and an interesting observation, that challenges some of the empirical studies reported, is that even random pruning can achieve good results following fine-tuning [119], so further studies of how the remaining weights compensate for those that are removed could result in new insights. Although studies on lotteries provide valuable insight, further research on specialist hardware and libraries is needed for methods that prune individual weights to become practical [120].

The survey found least research on methods that use similarity and clustering to develop pruning methods. A method that utilised a cosine similarity measure concluded that it was more suitable for MLPs than CNNs, while a method that utilises agglomerative clustering of filters results in up to a 3-fold reduction on ResNet when it is applied to ImageNet. These results suggest there is merit in developing a more theoretical understanding of the functional equivalence of different classes of deep networks, analogous to the studies on equivalence of MLPs [121].

Given the different approaches to pruning, some may be complementary, and there is some evidence that combining them might result in further compression of networks. For example, He et al. [94] present results showing that combining their method based on the use of Lasso regression with factorization results in additional gains, and Han et al. [98] use a pipeline of magnitude pruning, clustering and Huffman coding to increase the level of compression that can be achieved.

One of the challenges in making sense of the empirical evaluations reported in the papers surveyed is that, as new deep learning architectures have developed and as new methods have been published, the comparisons carried out have evolved. The survey has therefore collated the published results of over 50 methods for a variety of data sets and architectures, which is available as a resource for other researchers. Section VI uses this resource to present the first comprehensive comparison of published results across different pruning methods for different architectures. The comparison of published results shows that significant reductions can be obtained for AlexNet, ResNet and VGG, though there is no single method that is best, and that it is harder to prune ResNet than the other architectures. One can hypothesize that its use of skip connections makes it more optimal, though this is something that needs exploring. Likewise, given that different methods seem best for different architectures, it is worth studying and developing methods for specific architectures. The data also reveals that there are limited evaluations on other networks, such as InceptionNet, DenseNet, SegNet and FCN32, and datasets such as CIFAR100, Flowers102 and CUB200-2011 (see Table 2). A comprehensive independent evaluation of the methods that includes consideration of the issues raised by the Lottery hypothesis across a wider range of data and architectures would be a useful advance in the field. In conclusion, this survey has presented the key research directions in pruning neural networks by summarizing how the field has progressed from the early algorithms that focused on small fully connected networks to the much larger deep neural networks of today. The survey has aimed to highlight the motivations and insights identified in the papers, and provides a resource for comparison of the reported results, architectures and data sets used in several studies, which we hope will be useful to researchers in the field.

APPENDIX A
SUMMARY OF DATA SETS USED IN COMPARING PRUNING METHODS
MNIST [61]: The MNIST (Modified National Institute of Standards and Technology) data set consists of handwritten 28 × 28 images of digits. It has 60,000 examples of training data and 10,000 examples for the test set.
PASCAL VOC [198]: The PASCAL VOC data sets have formed the basis of annual competitions from 2005 to 2012. The VOC 2007 data annotates objects in 20 classes and consists of 9,963 images and 24,640 annotated objects. The VOC 2012 data, which consists of 11530 images, is annotated with 27450 regions of interest and 6929 segmentations.
CamVid [199]: CamVid (Cambridge-driving Labelled Video Database) is a data set with videos captured from an automobile. In total, over 10 minutes of video is provided, together with over 700 images from the videos that have been labelled. Each pixel of an image is labelled to indicate whether it is part of an object in one of 32 semantic classes.
Oxford-Flowers [200]: The Oxford-Flowers data consists of 102 classes of common flowers in the UK. It provides 2040 training images and 6129 images for testing.
LFW [201]: The LFW (Labelled Faces in the Wild) data set is one of the largest and most widely used data sets to evaluate face recognition algorithms. It includes 250 × 250 pixel images of over 5.7K individuals, with over 13K images in total.
CIFAR-10 & 100 [202]: The CIFAR-10 (Canadian Institute for Advanced Research) data set is a collection of 32 × 32 colour images in 10 different classes. The data set splits into two sets: 50,000 images for training and 10,000 for testing. CIFAR-100 is similar to CIFAR-10 but has 100 classes, where each class has 500 training images and 100 test images.


ImageNet [203]: ImageNet contains millions of images organized using the WordNet hierarchy. It has over 14M images classified in over 21K groups and has provided the data sets for the ImageNet Large Scale Visual Recognition Challenges (ILSVRC) held since 2010. It is one of the most widely used data sets in benchmarking deep learning models and methods for pruning. A smaller subset known as TinyImageNet is sometimes used and is also available (https://tiny-imagenet.herokuapp.com). It consists of 200 classes with 500 training, 50 validation and 50 testing images per class.
SVHN [204]: The SVHN (Street View House Number) data set is a collection of 600K, 32 × 32 images of house numbers in Google Street View images. The data set provides 73,257 images for training and 26,032 for testing.
UCSD-Birds [205]: The UCSD-Birds data set provides 11788 images of birds, labelled as one of 200 different species. The data is split into training and testing sets of 5994 and 5794 images respectively.
Places365 [206]: Places365 is a data set with 8 million 200 × 200 pixel images of scenes labeled with one of 434 categories, such as bridge, kitchen, boxing ring, etc. It provides 50 images per class for validation, and the test set consists of 500 images per class.
CASIA-WebFace [207]: CASIA-WebFace is a data set that was created for evaluating face recognition systems. It provides over 494K images of over 10K individuals.
WMT'14 En2De [208]: WMT'14 En2De is one of the benchmark language data sets provided for a task set at the Workshop on Statistical Machine Translation held in 2014. This data set consists of 4.5M English-German pairs of sentences.
FashionMNIST [209]: FashionMNIST is an alternative to the MNIST data set with 28 × 28 images of fashion products classified in 10 categories. Like MNIST, there are 60,000 images for training and 10,000 images for testing.

APPENDIX B
SUMMARY OF NOTATION
• In general, we use X to denote input channels, W to denote weights of filters and Y to denote output channels.
• Y_{i,j} is used to denote the output feature map obtained by applying a filter W_{j,i} on input channels X_i.
• w_i, w_j, w_{j,i} are used to denote individual weights.
• β is used to denote a binary mask, where β_i = 1 indicates that a feature map or filter should be retained and β_i = 0 indicates that it should be removed.
• L is used to denote a loss function.
• L0, L1, L2 denote norms, with L0 counting non-zero values, L1 being the sum of absolute values, and L2 being the square root of the sum of squares (Euclidean distance).
• ‖W‖_n is used to indicate the use of a norm in an equation, with the subscript n indicating the specific norm.
• ‖W‖_F, known as the Frobenius norm, is sometimes used to denote the application of the Euclidean distance to the elements of a matrix.

REFERENCES
[1] S. Kuutti, R. Bowden, Y. C. Jin, P. Barber, and S. Fallah, ‘‘A survey of deep learning applications to autonomous vehicle control,’’ IEEE Trans. Intell. Transp. Syst., vol. 22, no. 2, pp. 712–733, Feb. 2020.
[2] S. M. McKinney, M. Sieniek, V. Godbole, N. Antropova, H. Ashrafian, T. Back, M. Chesus, C. GC, A. Darzi, M. Etemadi, F. Garcia-Vicente, F. Gilbert, M. Halling-Brown, D. Hassabis, and S. Jansen, ‘‘International evaluation of an AI system for breast cancer screening,’’ Nature, vol. 577, no. 7788, pp. 89–94, Jan. 2020.
[3] G. Hinton, D. Y. Deng, G. Dahl, A.-R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, ‘‘Deep neural networks for acoustic modeling in speech recognition,’’ IEEE Signal Process. Mag., vol. 29, pp. 82–97, 2012.
[4] D. W. Otter, J. R. Medina, and J. K. Kalita, ‘‘A survey of the usages of deep learning for natural language processing,’’ IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 2, pp. 604–624, Feb. 2021.
[5] S. Pouyanfar, S. Sadiq, Y. Yan, H. Tian, Y. Tao, M. P. Reyes, M.-L. Shyu, S.-C. Chen, and S. S. Iyengar, ‘‘A survey on deep learning: Algorithms, techniques, and applications,’’ ACM Comput. Surv., vol. 51, no. 5, p. 36, 2019.
[6] T. J. Sejnowski, ‘‘The unreasonable effectiveness of deep learning in artificial intelligence,’’ Proc. Nat. Acad. Sci. USA, vol. 117, no. 48, pp. 30033–30038, Dec. 2020.
[7] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for image recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[8] K. Simonyan and A. Zisserman, ‘‘Very deep convolutional networks for large-scale image recognition,’’ 2014, arXiv:1409.1556.
[9] M. C. Mozer and P. Smolensky, ‘‘Skeletonization: A technique for trimming the fat from a network via relevance assessment,’’ in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 1988, pp. 107–115.
[10] J. K. Kruschke, ‘‘Creating local and distributed bottlenecks in hidden layers of back-propagation networks,’’ in Proc. Connectionist Models Summer School, 1988, pp. 120–126.
[11] R. Reed, ‘‘Pruning algorithms—A survey,’’ IEEE Trans. Neural Netw., vol. 4, no. 5, pp. 740–747, 1993.
[12] Y. Chauvin, ‘‘A back-propagation algorithm with optimal use of hidden units,’’ in Proc. Adv. Neural Inf. Process. Syst., 1988, pp. 519–526.
[13] D. E. Weigend, ‘‘Back-propagation, weight-elimination and time series prediction,’’ in Proc. Connectionist Models Summer School, 1990, pp. 105–116.
[14] A. S. Weigend, D. E. Rumelhart, and B. A. Huberman, ‘‘Generalization by weight-elimination applied to currency exchange rate prediction,’’ in Proc. Seattle Int. Joint Conf. Neural Netw. (IJCNN), Nov. 1991, pp. 837–841.
[15] Z. Zhou, W. Zhou, R. Hong, and H. Li, ‘‘Online filter weakening and pruning for efficient convnets,’’ in Proc. IEEE Int. Conf. Multimedia Expo. (ICME), Jul. 2018, pp. 1–6.
[16] A. M. Chen, H.-M. Lu, and R. Hecht-Nielsen, ‘‘On the geometry of feedforward neural network error surfaces,’’ Neural Comput., vol. 5, no. 6, pp. 910–927, Nov. 1993.
[17] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, ‘‘EIE: Efficient inference engine on compressed deep neural network,’’ 2016, arXiv:1602.01528.
[18] L. Li, J. Zhu, and M.-T. Sun, ‘‘Deep learning based method for pruning deep neural networks,’’ in Proc. IEEE Int. Conf. Multimedia Expo. Workshops (ICMEW), Jul. 2019, pp. 312–317.
[19] A. RoyChowdhury, P. Sharma, E. Learned-Miller, and A. Roy, ‘‘Reducing duplicate filters in deep neural networks,’’ in Proc. NIPS Workshop Deep Learning: Bridging Theory Pract., 2017, pp. 1–7.
[20] H. J. Sussmann, ‘‘Uniqueness of the weights for minimal feedforward nets with a given input-output map,’’ Neural Netw., vol. 5, no. 4, pp. 589–593, Jul. 1992.
[21] Z. Zhou, W. Zhou, H. Li, and R. Hong, ‘‘Online filter clustering and pruning for efficient convnets,’’ in Proc. 25th IEEE Int. Conf. Image Process. (ICIP), Oct. 2018, pp. 11–15.
[22] Y. LeCun, J. S. Denker, S. A. Solla, R. E. Howard, and L. D. Jackel, ‘‘Optimal brain damage,’’ in Proc. Neural Inf. Process. Syst., vol. 89, D. S. Touretzky, Ed. San Mateo, CA, USA: Morgan Kaufmann, 1990, pp. 506–598.
[23] B. Hassibi, D. G. Stork, and G. J. Wolff, ‘‘Optimal brain surgeon and general network pruning,’’ in Proc. IEEE Int. Conf. Neural Netw., Mar. 1993, pp. 293–299.
[24] B. Hassibi, D. G. Stork, G. Wolff, and T. Watanabe, ‘‘Optimal brain surgeon: Extensions and performance comparison,’’ in Proc. Neural Inf. Process. Syst., 1993, pp. 263–279.

[25] J. P. Cohen, H. Z. Lo, and W. Ding, ‘‘RandomOut: Using a convolutional [49] P. K. Gadosey, Y. Li, and P. T. Yamak, ‘‘On pruned, quantized and
gradient norm to rescue convolutional filters,’’ 2016, arXiv:1602.05931. compact CNN architectures for vision applications: An empirical study,’’
[26] N. Lee, T. Ajanthan, and P. H. S. Torr, ‘‘SNIP: Single-shot network in Proc. Int. Conf. Artif. Intell., Inf. Process. Cloud Comput. (AIIPCC),
pruning based on connection sensitivity,’’ 2018, arXiv:1810.02340. 2019, pp. 1–8.
[27] S. Lin, R. Ji, Y. Li, Y. Wu, F. Huang, and B. Zhang, ‘‘Accelerating [50] V. Lebedev and V. Lempitsky, ‘‘Speeding-up convolutional neural net-
convolutional networks via global & dynamic filter pruning,’’ in Proc. works: A survey,’’ Bull. Polish Acad. Sci. Tech. Sci., vol. 66, no. 6,
27th Int. Joint Conf. Artif. Intell., Jul. 2018, pp. 2425–2432. pp. 799–810, 2018.
[28] C. Bucilua, R. Caruana, and A. Niculescu-Mizil, ‘‘Model compression,’’ [51] T. Elsken, J. H. Metzen, and F. Hutter, ‘‘Neural architecture search: A
in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, survey,’’ J. Mach. Learn. Res., vol. 20, no. 55, pp. 1–21, 2019.
pp. 535–541. ACM, 2006. [52] G. Menghani, ‘‘Efficient deep learning: A survey on making deep learning
[29] G. Hinton, O. Vinyals, and J. Dean, ‘‘Distilling the knowledge in a neural models smaller, faster, and better,’’ 2021, arXiv:2106.08962.
network,’’ 2015, arXiv:1503.02531. [53] M. A. Arbib, The Handbook of Brain Theory and Neural Networks.
[30] G. Urban, J. K. Geras, S. E. Kahou, S. Aslan, R. C. Wang, A. Mohamed, Cambridge, MA, USA: MIT Press, 2003.
M. Philipose, and M. Richardson, ‘‘Do deep convolutional nets really [54] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge,
need to be deep and convolutional,’’ in Proc. Int. Conf. Learn. Represent., MA, USA: MIT Press, 2016.
2017, pp. 1–13. [55] Z. Huang and N. Wang, ‘‘Data-driven sparse structure selection for deep
[31] L. Zhang, Z. Tan, J. Song, J. Chen, C. Bao, and K. Ma, ‘‘SCAN: A scal- neural networks,’’ in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018,
able neural networks framework towards compact and efficient models,’’ pp. 304–320.
2019, arXiv:1906.03951. [56] S. Lin, ‘‘Toward compact convnets via structure-sparsity regularized
[32] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramab- filter pruning,’’ IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 2,
hadran, ‘‘Low-rank matrix factorization for deep neural network training pp. 574–588, Feb. 2019.
with high-dimensional output targets,’’ in Proc. IEEE Int. Conf. Acoust., [57] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and
Speech Signal Process., May 2013, pp. 6655–6659. R. Salakhutdinov, ‘‘Dropout: A simple way to prevent neural networks
[33] M. Jaderberg, A. Vedaldi, and A. Zisserman, ‘‘Speeding up convolutional from overfitting,’’ J. Mach. Learn. Res., vol. 15, no. 56, pp. 1929–1958,
neural networks with low rank expansions,’’ 2014, arXiv:1405.3866. 2014.
[34] S. Jung, C. Son, S. Lee, J. Son, J.-J. Han, Y. Kwak, S. J. Hwang, and [58] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, ‘‘Learning structured
C. Choi, ‘‘Learning to quantize deep networks by optimizing quantization sparsity in deep neural networks,’’ in Proc. 30th Int. Conf. Neural Inf.,
intervals with task loss,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern 2016, pp. 2082–2090.
[59] C. Zhao, B. Ni, J. Zhang, Q. Zhao, W. Zhang, and Q. Tian, ‘‘Variational
Recognit. (CVPR), Jun. 2019, pp. 4350–4359.
[35] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, ‘‘Incremental network convolutional neural network pruning,’’ in Proc. IEEE/CVF Conf. Com-
quantization: Towards lossless CNNs with low-precision weights,’’ 2017, put. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 2780–2789.
[60] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard,
arXiv:1702.03044.
W. Hubbard, and L. D. Jackel, ‘‘Backpropagation applied to handwritten
[36] Y. Zhao, X. Gao, D. Bates, R. Mullins, and C.-Z. Xu, ‘‘Focused quanti-
zip code recognition,’’ Neural Comput., vol. 1, no. 4, pp. 541–551, 1989.
zation for sparse CNNs,’’ in Proc. Adv. Neural Inf. Process. Syst., 2019,
[61] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, ‘‘Gradient-based learn-
pp. 5585–5594.
ing applied to document recognition,’’ Proc. IEEE, vol. 86, no. 11,
[37] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio,
pp. 2278–2324, Nov. 1998.
‘‘Binarized neural networks: Training deep neural networks with weights [62] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ‘‘ImageNet classification
and activations constrained to +1 or -1,’’ 2016, arXiv:1602.02830. with deep convolutional neural networks,’’ in Proc. Adv. Neural Inf.
[38] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, ‘‘Com-
Process. Syst., 2012, pp. 1097–1105.
pressing neural networks with the hashing trick,’’ in Proc. Int. Conf. [63] A. Khan, A. Sohail, U. Zahoora, and A. S. Qureshi, ‘‘A survey of the
Mach. Learn., 2015, pp. 2285–2294. recent architectures of deep convolutional neural networks,’’ Artif. Intell.
[39] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, Rev., vol. 53, no. 8, pp. 5455–5516, 2019.
and D. Kalenichenko, ‘‘Quantization and training of neural networks for [64] S. Han, J. Pool, J. Tran, and W. J. Dally, ‘‘Learning both weights and
efficient integer-arithmetic-only inference,’’ in Proc. IEEE/CVF Conf. connections for efficient neural networks,’’ 2015, arXiv:1506.02626.
Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 2704–2713. [65] Y. Guo, A. Yao, and Y. Chen, ‘‘Dynamic network surgery for effi-
[40] B. Baker, O. Gupta, N. Naik, and R. Raskar, ‘‘Designing neural network cient DNNs,’’ in Proc. 30th Int. Conf. Neural Inf. Process. Syst., 2016,
architectures using reinforcement learning,’’ in Proc. Int. Conf. Learn. pp. 1387–1395.
Represent., Nov. 2017, arXiv:1611.02167. [66] J. Frankle and M. Carbin, ‘‘The lottery ticket hypothesis: Finding
[41] X. Dai, H. Yin, and N. K. Jha, ‘‘NeST: A neural network synthesis tool sparse, trainable neural networks,’’ in Proc. Int. Conf. Learn. Represent.,
based on a grow-and-prune paradigm,’’ IEEE Trans. Comput., vol. 68, New Orleans, LA, USA, 2019, arXiv:1803.03635.
no. 10, pp. 1487–1497, Oct. 2019. [67] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, ‘‘Rethinking the value
[42] X. Li, Y. Zhou, Z. Pan, and J. Feng, ‘‘Partial order pruning: For of network pruning,’’ in Proc. 7th Int. Conf. Learn. Represent., (ICLR),
best speed/accuracy trade-off in neural architecture search,’’ in Proc. New Orleans, LA, USA, May 2019, arXiv:1810.05270.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, [68] J. Frankle, G. K. Dziugaite, D. M. Roy, and M. Carbin, ‘‘Stabilizing the
pp. 9145–9153. lottery ticket hypothesis,’’ 2019, arXiv:1903.01611.
[43] Z. Liu, H. Mu, X. Zhang, Z. Guo, X. Yang, K.-T. Cheng, and J. Sun, [69] S. Ari Morcos, H. Yu, M. Paganini, and Y. Tian, ‘‘One ticket to win
‘‘MetaPruning: Meta learning for automatic neural network channel prun- them all: Generalizing lottery ticket initializations across datasets and
ing,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, optimizers,’’ in Proc. Neural Inf. Process. Syst. (NeuralPS), 2019, pp.
pp. 3295–3304. 4932–4942.
[44] M. Lin, R. Ji, Y. Zhang, B. Zhang, Y. Wu, and Y. Tian, ‘‘Channel pruning [70] N. Hubens, M. Mancas, M. Decombas, M. Preda, T. Zaharia, B. Gosselin,
via automatic structure search,’’ in Proc. 29th Int. Joint Conf. Artif. Intell., and T. Dutoit, ‘‘An experimental study of the impact of pre-training on the
Jul. 2020, pp. 673–679. pruning of a convolutional neural network,’’ in Proc. 3rd Int. Conf. Appl.
[45] Z. Zhong, J. Yan, W. Wu, J. Shao, and C.-L. Liu, ‘‘Practical block- Intell. Syst., Jan. 2020, pp. 1–6.
wise neural network architecture generation,’’ in Proc. IEEE/CVF Conf. [71] H. Yu, S. Edunov, Y. Tian, and A. S. Morcos, ‘‘Playing the lottery with
Comput. Vis. Pattern Recognit., Jun. 2018, pp. 2423–2432. rewards and multiple languages: Lottery tickets in RL and NLP,’’ 2019,
[46] B. Zoph and V. Q. Le, ‘‘Neural architecture search with reinforcement arXiv:1906.02768.
learning,’’ in Proc. Int. Conf. Learn. Represent., 2017. [72] S. Merity, C. Xiong, J. Bradbury, and R. Socher, ‘‘Pointer sentinel mixture
[47] J. Chung and T. Shin, ‘‘Simplifying deep neural networks for neuromor- models,’’ in Proc. Int. Conf. Learn. Represent., 2017, arXiv:1609.07843.
phic architectures,’’ in Proc. 53rd Annu. Design Autom. Conf., Jun. 2016, [73] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, N. Aidan
pp. 1–6. Gomez, U. Kaiser, and I. Polosukhin, ‘‘Attention is all you need,’’ in Proc.
[48] K. Goetschalckx, B. Moons, P. Wambacq, and M. Verhelst, ‘‘Efficiently 31st Int. Conf. Neural Inf. Process. Syst. (NIPS), Red Hook, NY, USA:
combining SVD, pruning, clustering and retraining for enhanced neural Curran Associates, 2017, pp. 6000–6010.
network compression,’’ in Proc. 2nd Int. Workshop Embedded Mobile [74] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman,
Deep Learn., Jun. 2018, pp. 1–6. J. Tang, and W. Zaremba, ‘‘OpenAI gym,’’ 2016, arXiv:1606.01540.


[75] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, ‘‘The arcade [101] C. Wang, R. Grosse, S. Fidler, and G. Zhang, ‘‘EigenDamage: Structured
learning environment: An evaluation platform for general agents,’’ pruning in the kronecker-factored eigenbasis,’’ in Proc. 36th Int. Conf.
J. Artif. Intell. Res., vol. 47, pp. 253–279, Jun. 2015. Mach. Learn., 2019, pp. 6566–6575.
[76] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, ‘‘Pruning filters [102] C. M. Bishop, Pattern Recognition and Machine Learning (Information
for efficient convnets,’’ in Proc. 5th Int. Conf. Learn. Represent., (ICLR), Science and Statistics). New York, NY, USA: Springer, 2006.
Toulon, France, Apr. 2017, arXiv:1608.08710. [103] S. Thrun, ‘‘The MONK’s problems—A performance comparison of dif-
[77] M. Denil, B. Shakibi, L. Dinh, and N. D. Freitas, ‘‘Predicting param- ferent learning algorithms,’’ Carnegie Mellon Univ., Pittsburgh, PA, USA,
eters in deep learning,’’ in Proc. Adv. Neural Inf. Process. Syst., 2013, Tech. Rep. CMU-CS-91-197, 1991.
pp. 2148–2156. [104] T. J. Sejnowski and C. R. Rosenberg, ‘‘Parallel networks that learn
[78] B. O. Ayinde, T. Inanc, and J. M. Zurada, ‘‘Redundant feature pruning for to pronounce English text,’’ Complex Syst., vol. 1, pp. 145–168,
accelerated inference in deep neural networks,’’ Neural Netw., vol. 118, Feb. 1987.
pp. 148–158, Oct. 2019. [105] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, ‘‘Pruning
[79] A. Polyak and L. Wolf, ‘‘Channel-level acceleration of deep face repre- convolutional neural networks for resource efficient inference,’’ 2016,
sentations,’’ IEEE Access, vol. 3, pp. 2163–2175, 2015. arXiv:1611.06440.
[80] M. Lin, Q. Chen, and S. Yan, ‘‘Network in a network,’’ in Proc. 2nd Int. [106] P. Molchanov, A. Mallya, S. Tyree, I. Frosio, and J. Kautz, ‘‘Importance
Conf. Learn. Represent. (ICLR), 2014, arXiv:1312.4400. estimation for neural network pruning,’’ in Proc. IEEE/CVF Conf. Com-
[81] X. Zhang, J. Zou, K. He, and J. Sun, ‘‘Accelerating very deep
put. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 11264–11272.
convolutional networks for classification and detection,’’ IEEE
[107] R. Grosse and J. Martens, ‘‘A Kronecker-factored approximate Fisher
Trans. Pattern Anal. Mach. Intell., vol. 38, no. 10, pp. 1943–1955,
matrix for convolution layers,’’ in Proc. Int. Conf. Mach. Learn.,
Oct. 2015.
Feb. 2016, pp. 573–582.
[82] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio,
[108] H. Peng, J. Wu, S. Chen, and J. Huang, ‘‘Collaborative channel prun-
‘‘FitNets: Hints for thin deep nets,’’ 2014, arXiv:1412.6550.
[83] J.-H. Luo and J. Wu, ‘‘An entropy-based pruning method for CNN com- ing for deep networks,’’ in Proc. Int. Conf. Mach. Learn., 2019,
pression,’’ 2017, arXiv:1706.05791. pp. 5113–5122.
[84] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang, ‘‘Network trimming: A data- [109] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, ‘‘AMC: Automl for
driven neuron pruning approach towards efficient deep architectures,’’ model compression and acceleration on mobile devices,’’ in Proc. Eur.
2016, arXiv:1607.03250. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 784–800.
[85] M. Lin, L. Cao, S. Li, Q. Ye, Y. Tian, J. Liu, Q. Tian, and R. Ji, ‘‘Filter [110] C. Jiang, G. Li, C. Qian, and K. Tang, ‘‘Efficient DNN neuron pruning by
sketch for network pruning,’’ 2020, arXiv:2001.08514. minimizing layer-wise nonlinear reconstruction error,’’ in Proc. 27th Int.
[86] E. Liberty, ‘‘Simple and deterministic matrix sketching,’’ in Proc. 19th Joint Conf. Artif. Intell., Jul. 2018, pp. 2298–2304.
ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2013, [111] X. Dong, S. Chen, and S. J. Pan, ‘‘Learning to prune deep neural networks
pp. 581–588. via layer-wise optimal brain surgeon,’’ in Proc. 31st Int. Conf. Neural Inf.
[87] Y. He, P. Liu, Z. Wang, Z. Hu, and Y. Yang, ‘‘Filter pruning via geometric Process. Syst. (NIPS), Red Hook, NY, USA: Curran Associates, 2017,
median for deep convolutional neural networks acceleration,’’ in Proc. pp. 4860–4874.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, [112] Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang,
pp. 4340–4349. and J. Zhu, ‘‘Discrimination-aware channel pruning for deep neural net-
[88] J.-H. Luo, J. Wu, and W. Lin, ‘‘ThiNet: A filter level pruning method for works,’’ in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 875–886.
deep neural network compression,’’ in Proc. IEEE Int. Conf. Comput. Vis. [113] K. Xu, X. Wang, Q. Jia, J. An, and D. Wang, ‘‘Globally soft filter
(ICCV), Oct. 2017, pp. 5058–5066. pruning for efficient convolutional neural networks,’’ Tech. Rep., 2018.
[89] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang, ‘‘Soft filter pruning for [Online]. Available: https://paperswithcode.com/paper/globally-soft-
accelerating deep convolutional neural networks,’’ in Proc. 27th Int. Joint filter-pruning-for-efficient
Conf. Artif. Intell., Jul. 2018, pp. 2234–2240. [114] K. Neklyudov, D. Molchanov, A. Ashukha, and D. Vetrov, ‘‘Structured
[90] R. Yu, A. Li, C.-F. Chen, J.-H. Lai, V. I. Morariu, X. Han, M. Gao, Bayesian pruning via log-normal multiplicative noise,’’ in Proc. 31st Int.
C.-Y. Lin, and L. S. Davis, ‘‘NISP: Pruning networks using neuron Conf. Neural Inf. Process. Syst. (NIPS), Red Hook, NY, USA: Curran
importance score propagation,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Associates, 2017, pp. 6778–6787.
Pattern Recognit., Jun. 2018, pp. 9194–9203. [115] R. Bao, X. Yuan, Z. Chen, and R. Ma, ‘‘Cross-entropy pruning for
[91] X. Ding, G. Ding, Y. Guo, J. Han, and C. Yan, ‘‘Approximated Ora- compressing convolutional neural networks,’’ Neural Comput., vol. 30,
cle filter pruning for destructive CNN width optimization,’’ 2019, no. 11, pp. 3128–3149, Nov. 2018.
arXiv:1905.04748. [116] S. Srinivas and R. V. Babu, ‘‘Data-free parameter pruning for deep neural
[92] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, ‘‘Learning efficient networks,’’ 2015, arXiv:1507.06149.
convolutional networks through network slimming,’’ in Proc. IEEE Int. [117] Y. Sun, X. Wang, and X. Tang, ‘‘Sparsifying neural network connections
Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2755–2763. for face recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.
[93] Q. Huang, K. Zhou, S. You, and U. Neumann, ‘‘Learning to prune filters (CVPR), Jun. 2016, pp. 4856–4864.
in convolutional neural networks,’’ in Proc. IEEE Winter Conf. Appl. [118] S. Ameen and S. Vadera, ‘‘Pruning neural networks using multi-armed
Comput. Vis. (WACV), Mar. 2018, pp. 709–718. bandits,’’ Comput. J., vol. 63, no. 7, pp. 1099–1108, Jul. 2020.
[94] Y. He, X. Zhang, and J. Sun, ‘‘Channel pruning for accelerating very
[119] D. Mittal, S. Bhardwaj, M. M. Khapra, and B. Ravindran, ‘‘Studying the
deep neural networks,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV),
plasticity in deep convolutional neural networks using random pruning,’’
Oct. 2017, pp. 1398–1406.
Mach. Vis. Appl., vol. 30, no. 2, pp. 203–216, Mar. 2019.
[95] H. Wang, Q. Zhang, Y. Wang, and H. Hu, ‘‘Structured probabilis-
[120] E. Elsen, M. Dukhan, T. Gale, and K. Simonyan, ‘‘Fast sparse ConvNets,’’
tic pruning for convolutional neural network acceleration,’’ 2017,
in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR),
arXiv:1709.06994.
[96] J.-H. Luo and J. Wu, ‘‘AutoPruner: An end-to-end trainable filter pruning Jun. 2020, pp. 14629–14638.
method for efficient deep model inference,’’ 2018, arXiv:1805.08941. [121] V. Kårková and P. C. Kainen, ‘‘Functionally equivalent feedfor-
[97] J. Ye, X. Lu, Z. Lin, and J. Z. Wang, ‘‘Rethinking the smaller-norm-less- ward neural networks,’’ Neural Comput., vol. 6, no. 3, pp. 543–558,
informative assumption in channel pruning of convolution layers,’’ 2018, May 1994.
arXiv:1802.00124. [122] S. J. Hanson and L. Y. Pratt, ‘‘Comparing biases for minimal network
[98] S. Han, H. Mao, and W. J. Dally, ‘‘A deep neural network compres- construction with back-propagation,’’ in Proc. Adv. Neural Inf. Process.
sion pipeline: Pruning, quantization, Huffman encoding,’’ in Proc. ICLR, Syst., 1989, pp. 177–185.
2016, arXiv:1510.00149v5. [123] A. Graves, ‘‘Practical variational inference for neural networks,’’ in Proc.
[99] S. Son, S. Nah, and K. Lee, ‘‘Clustering convolutional kernels to com- Adv. Neural Inf. Process. Syst., 2011, pp. 2348–2356.
press deep neural networks,’’ in Proc. Eur. Conf. Comput. Vis. (ECCV), [124] A. Aghasi, A. Abdi, N. Nguyen, and J. Romberg, ‘‘Net-trim: Convex
Sep. 2018, pp. 225–240. pruning of deep neural networks with performance guarantee,’’ in Proc.
[100] Y. Li, S. Lin, B. Zhang, J. Liu, D. Doermann, Y. Wu, F. Huang, and 31st Int. Conf. Neural Inf. Process. Syst., Red Hook, NY, USA: Curran
R. Ji, ‘‘Exploiting kernel sparsity and entropy for interpretable CNN Associates, 2017, pp. 3180–3189.
compression,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. [125] M. Zhu and S. Gupta, ‘‘To prune, or not to prune: Exploring the efficacy
(CVPR), Jun. 2019, pp. 2800–2809. of pruning for model compression,’’ 2017, arXiv:1710.01878.


[126] C. Chen, F. Tung, N. Vedula, and G. Mori, ‘‘Constraint-aware deep [150] M. Lin, R. Ji, B. Chen, F. Chao, J. Liu, W. Zeng, Y. Tian, and Q. Tian,
neural network compression,’’ in Proc. Eur. Conf. Comput. Vis. (ECCV), ‘‘Training compact CNNs for image classification using dynamic-coded
Sep. 2018, pp. 400–415. filter fusion,’’ 2021, arXiv:2107.06916.
[127] D. Lee, S. Kang, and K. Choi, ‘‘ComPEND: Computation prun- [151] B. Hassibi and D. G. Stork, ‘‘Second order derivatives for net-
ing through early negative detection for ReLU in a deep neural work pruning: Optimal brain surgeon,’’ in Proc. Adv. Neural Inf.
network accelerator,’’ in Proc. Int. Conf. Supercomput., Jun. 2018, Process. Syst., San Mateo, CA, USA: Morgan Kaufmann, 1992,
pp. 139–148. pp. 164–171.
[128] G. Li, C. Qian, C. Jiang, X. Lu, and K. Tang, ‘‘Optimization based layer- [152] J. Xu and W. C. Daniel Ho, ‘‘A node pruning algorithm based on optimal
wise magnitude-based pruning for DNN compression,’’ in Proc. 27th Int. brain surgeon for feedforward neural networks,’’ in Proc. 3rd Int. Conf.
Joint Conf. Artif. Intell., Jul. 2018, pp. 2383–2389. Adv. Neural Netw. (ISNN), vol. 1. Berlin, Germany: Springer, 2006,
[129] C. Liu and Q. Liu, ‘‘Improvement of pruning method for convolution neu- pp. 524–529.
ral network compression,’’ in Proc. 2nd Int. Conf. Deep Learn. Technol. [153] C. Endisch, C. Hackl, and D. Schröder, ‘‘Optimal brain surgeon for
(ICDLT), 2018, pp. 57–60. general dynamic neural networks,’’ in Proc. Aritficial Intell. 13th Por-
[130] Z. Qin, F. Yu, C. Liu, and X. Chen, ‘‘Demystifying neural network filter tuguese Conf. Prog. Artif. Intell. (EPIA), Berlin, Germany: Springer,
pruning,’’ 2018, arXiv:1811.02639. 2007, pp. 15–28.
[131] R. Yazdani, M. Riera, J.-M. Arnau, and A. Gonzalez, ‘‘The dark side [154] C. Endisch, P. Stolze, P. Endisch, C. Hackl, and R. Kennel, ‘‘Levenberg–
of DNN pruning,’’ in Proc. ACM/IEEE 45th Annu. Int. Symp. Comput. Marquardt-based OBS algorithm using adaptive pruning interval for sys-
Archit. (ISCA), Jun. 2018, pp. 790–801. tem identification with dynamic neural networks,’’ in Proc. IEEE Int.
[132] T. Zhang, S. Ye, K. Zhang, J. Tang, W. Wen, M. Fardad, and Y. Wang, Conf. Syst., Man Cybern., Oct. 2009, pp. 3402–3408.
‘‘A systematic DNN weight pruning framework using alternating direc- [155] S. Ameen, ‘‘Optimizing deep learning networks using multi-armed ban-
tion method of multipliers,’’ in Proc. Eur. Conf. Comput. Vis. (ECCV), dits,’’ Ph.D. thesis, School Comput., Sci. Eng., Univ. Salford, Greater
Sep. 2018, pp. 184–199. Manchester, U.K., 2017.
[133] X. Ding, G. DIng, X. Zhou, Y. Guo, J. Han, and J. Liu, ‘‘Global sparse [156] S. Anwar, K. Hwang, and W. Sung, ‘‘Structured pruning of deep convolu-
momentum SGD for pruning very deep neural networks,’’ in Proc. Adv. tional neural networks,’’ J. Emerg. Technol. Comput. Syst., vol. 13, no. 3,
Neural Inf. Process. Syst., 2019, pp. 6379–6391. pp. 1–18, Feb. 2017.
[134] T. Dettmers and L. Zettlemoyer, ‘‘Sparse networks from scratch: Faster [157] J. Guo and M. Potkonjak, ‘‘Pruning filters and classes: Towards on-
training without losing performance,’’ 2019, arXiv:1907.04840. device customization of convolutional neural networks,’’ in Proc. 1st Int.
[135] S. Gui, H. N. Wang, H. Yang, C. Yu, Z. Wang, and J. Liu, ‘‘Model com- Workshop Deep Learn. Mobile Syst. Appl. (EMDL), 2017, pp. 13–17.
pression with adversarial robustness: A unified optimization framework,’’ [158] M. A. Carreira-Perpinan and Y. Idelbayev, ‘‘‘Learning-compression’
in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 1283–1294. algorithms for neural net pruning,’’ in Proc. IEEE/CVF Conf. Comput.
[136] K. Helwegen, J. Widdicombe, L. Geiger, Z. Liu, K.-T. Cheng, and Vis. Pattern Recognit., Jun. 2018, pp. 8532–8541.
R. Nusselder, ‘‘Latent weights do not exist: Rethinking binarized neural [159] L. N. Huynh, Y. Lee, and R. K. Balan, ‘‘D-pruner: Filter-based pruning
network optimization,’’ 2019, arXiv:1906.02107. method for deep convolutional neural network,’’ in Proc. 2nd Int. Work-
[137] L. Hou, J. Zhu, J. Kwok, F. Gao, T. Qin, and T.-Y. Liu, ‘‘Normalization shop Embedded Mobile Deep Learn., Jun. 2018, pp. 7–12.
helps training of quantized LSTM,’’ in Proc. Adv. Neural Inf. Process. [160] S. Chen, L. Lin, Z. Zhang, and M. Gen, ‘‘Evolutionary NetArchitecture
Syst., 2019, pp. 7344–7354. search for deep neural networks pruning,’’ in Proc. 2nd Int. Conf. Algo-
[138] K. Lee, H. Kim, H. Lee, and D. Shin, ‘‘Flexible group-level pruning rithms, Comput. Artif. Intell., Dec. 2019, pp. 189–196.
of deep neural networks for fast inference on mobile CPUs: Work-in- [161] W. Deng, X. Zhang, F. Liang, and G. Lin, ‘‘An adaptive empirical
progress,’’ in Proc. Int. Conf. Compliers, Archit. Synth. Embedded Syst. Bayesian method for sparse deep learning,’’ in Proc. Adv. Neural Inf.
Companion (CASES), 2019, pp. 1–2. Process. Syst., 2019, pp. 5564–5574.
[139] J. Li, Q. Qi, J. Wang, C. Ge, Y. Li, Z. Yue, and H. Sun, ‘‘OICSR: Out- [162] S. Jin, S. Di, X. Liang, J. Tian, D. Tao, and F. Cappello, ‘‘DeepSZ:
in-channel sparsity regularization for compact deep neural networks,’’ A novel framework to compress deep neural networks by using error-
in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), bounded lossy compression,’’ in Proc. 28th Int. Symp. High-Perform.
Jun. 2019, pp. 7046–7055. Parallel Distrib. Comput., Jun. 2019, pp. 159–170.
[140] Z. Liu, H. Tang, Y. Lin, and S. Han, ‘‘Point-voxel CNN for efficient [163] H. Li, N. Liu, X. Ma, S. Lin, S. Ye, T. Zhang, X. Lin, W. Xu, and Y. Wang,
3D deep learning,’’ in Proc. Adv. Neural Inf. Process. Syst., 2019, ‘‘ADMM-based weight pruning for real-time deep learning accelera-
pp. 963–973. tion on mobile devices,’’ in Proc. Great Lakes Symp. VLSI, May 2019,
[141] J. Song, Y. Chen, X. Wang, C. Shen, and M. Song, ‘‘Deep model trans- pp. 501–506.
ferability from attribution maps,’’ in Proc. Adv. Neural Inf. Process. Syst., [164] Z. Qin, F. Yu, C. Liu, and X. Chen, ‘‘CAPTOR: A class adaptive filter
2019, pp. 6179–6189. pruning framework for convolutional neural networks in mobile applica-
[142] Y. Xu, Y. Wang, J. Zeng, K. Han, X. U. Chunjing, D. Tao, and C. Xu, tions,’’ in Proc. 24th Asia South Pacific Design Autom. Conf., Jan. 2019,
‘‘Positive-unlabeled compression on the cloud,’’ in Proc. Adv. Neural Inf. pp. 444–449.
Process. Syst., 2019, pp. 2561–2570. [165] X. Xiao, Z. Wang, and S. Rajasekaran, ‘‘AutoPrune: Automatic network
[143] H. Zhou, J. Lan, R. Liu, and J. Yosinski, ‘‘Deconstructing lottery tickets: pruning by regularizing auxiliary parameters,’’ in Proc. Adv. Neural Inf.
Zeros, signs, and the supermask,’’ 2019, arXiv:1905.01067. Process. Syst., 2019, pp. 13681–13691.
[144] Y. Zhu, C. Li, B. Luo, J. Tang, and X. Wang, ‘‘Dense feature aggregation [166] Z. Bao, J. Liu, and W. Zhang, ‘‘Using distillation to improve network
and pruning for RGBT tracking,’’ in Proc. 27th ACM Int. Conf. Multime- performance after pruning and quantization,’’ in Proc. 2nd Int. Conf.
dia, Oct. 2019, pp. 465–472. Mach. Learn. Mach. Intell., Sep. 2019, pp. 3–6.
[145] T. Kim, D. Ahn, and J.-J. Kim, ‘‘V-LSTM: An efficient LSTM accelerator [167] C. Lemaire, A. Achkar, and P.-M. Jodoin, ‘‘Structured pruning of neural
using fixed nonzero-ratio viterbi-based pruning,’’ in Proc. ACM/SIGDA networks with budget-aware regularization,’’ in Proc. IEEE/CVF Conf.
Int. Symp. Field-Program. Gate Arrays, Feb. 2020, p. 326. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 9108–9116.
[146] Q. Li, C. Li, and H. Chen, ‘‘Incremental filter pruning via random walk [168] X. Dong and Y. Yang, ‘‘Network pruning via transformable architecture
for accelerating deep convolutional neural networks,’’ in Proc. 13th Int. search,’’ 2019, arXiv:1905.09717.
Conf. Web Search Data Mining, Jan. 2020, pp. 358–366. [169] S. Kundu and S. Sundaresan, ‘‘AttentionLite: Towards efficient self-
[147] W. Niu, X. Ma, S. Lin, S. Wang, X. Qian, X. Lin, Y. Wang, and B. Ren, attention models for vision,’’ 2020, arXiv:2101.05216.
‘‘PatDNN: Achieving real-time DNN execution on mobile devices with [170] P. Kaliamoorthi, A. Siddhant, E. Li, and M. Johnson, ‘‘Distilling large
pattern-based weight pruning,’’ in Proc. 25th Int. Conf. Architectural language models into tiny and effective students using pQRNN,’’ 2021,
Support Program. Lang. Operating Syst., Mar. 2020, pp. 907–922. arXiv:2101.08890.
[148] A. Dubey, M. Chatterjee, and N. Ahuja, ‘‘Coreset-based neural network [171] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky,
compression,’’ in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, ‘‘Speeding-up convolutional neural networks using fine-tuned CP-
pp. 454–470. decomposition,’’ 2014, arXiv:1412.6553.
[149] X. Ding, G. Ding, Y. Guo, and J. Han, ‘‘Centripetal SGD for pruning [172] S. Lin, R. Ji, C. Chen, D. Tao, and J. Luo, ‘‘Holistic CNN com-
very deep convolutional networks with complicated structure,’’ in Proc. pression via low-rank decomposition with knowledge transfer,’’ IEEE
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, Trans. Pattern Anal. Mach. Intell., vol. 41, no. 12, pp. 2889–2905,
pp. 4943–4953. Dec. 2019.


SUNIL VADERA received the Ph.D. degree in computer science from The University of Manchester, in 1992. He is currently a Professor of computer science at the University of Salford, U.K., where he has served in many leadership roles, including as the Dean and the Head of the School of Computing, Science and Engineering, from 2011 to 2019. He currently leads the AI Foundry Project at the University of Salford, which supports SMEs to develop innovative products and services using AI. His main research interests include pruning deep networks and applications of AI. He is a fellow of the British Computer Society, a Chartered Engineer (C.Eng.), and a Chartered IT Professional (CITP). In 2014, he was awarded the U.K. BDO Best Indian Scientist and Engineer in recognition of his contributions to computing, science, and engineering.

SALEM AMEEN received the B.Eng. degree from the Department of Electrical and Electronic Engineering, University of Seventh April, Libya, in 1999, the M.Tech. degree in computer science and engineering from the Jaypee Institute of Information Technology, India, in 2009, and the Ph.D. degree in computer science from the University of Salford, in 2018. His main research interests include machine learning, deep learning, multi-armed bandits, image mining, and time series forecasting.