Recent Advances on Neural Network Pruning at Initialization
Huan Wang1, Can Qin1, Yue Bai1, Yulun Zhang2 and Yun Fu1
1 Northeastern University    2 ETH Zürich
{wang.huan, qin.ca, bai.yue}@northeastern.edu, [email protected], [email protected]
Abstract

Neural network pruning typically removes connections or neurons from a pretrained, converged model, while a new pruning paradigm, pruning at initialization (PaI), attempts to prune a randomly initialized network. This paper offers the first survey concentrated on this emerging pruning fashion. We first introduce a generic formulation of neural network pruning, followed by the major classic pruning topics. Then, as the main body of this paper, a thorough and structured literature review of PaI methods is presented, consisting of two major tracks (sparse training and sparse selection). Finally, we summarize the surge of PaI compared to PaT and discuss the open problems. Apart from the dedicated literature review, this paper also offers a code base for easy sanity-checking and benchmarking of different PaI methods.

Pruning paradigm                    Weight src    Mask src      Train sparse net?
Pruning after training              Convg. net    Convg. net    Yes (mostly)
PaI: Sparse training (LTH)          Init. net     Convg. net    Yes
PaI: Sparse training (SNIP)         Init. net     Init. net     Yes
PaI: Sparse selection (Hidden)      Init. net     Init. net     No

Table 1: Comparison between pruning at initialization (PaI) and pruning after training (PaT). As seen, weight source is the axis that differentiates PaI methods from their traditional counterparts: PaI methods inherit weights from a randomly initialized (Init.) network instead of a converged (Convg.) network. PaI methods can be further classified into two major categories: sparse training (the picked sparse network will be trained) and sparse selection (the picked sparse network will not be trained). In the parentheses is the representative method of each genre: LTH [Frankle and Carbin, 2019], SNIP [Lee et al., 2019], Hidden [Ramanujan et al., 2020].

1 Introduction

Network architecture design is a central topic when developing neural networks for various artificial intelligence tasks, especially in deep learning [LeCun et al., 2015; Schmidhuber, 2015]. The wisdom from learning theory suggests that the best generalization comes from a good trade-off between sample size and model complexity [Kearns et al., 1994; Vapnik, 2013], which implies using neural networks of proper size. However, it is non-trivial to know in practice what size fits properly. Over-parameterized networks are thus preferred due to their abundant expressivity when tackling complex real-world problems. In the deep learning era, this rule of thumb is even more pronounced, not only because we are handling more complex problems, but also because over-parameterized networks are observed to be easier to optimize (with proper regularization) [Simonyan and Zisserman, 2015; He et al., 2016] and possibly lead to better generalization [Soltanolkotabi et al., 2018; Allen-Zhu et al., 2019; Zou et al., 2020] than their compact counterparts.

However, over-parameterization brings costs in either the testing or the training phase, such as excessive model footprint, slow inference speed, extra model transportation and energy consumption. As a remedy, neural network pruning is proposed to remove unnecessary connections or neurons in a neural network without seriously compromising the performance. A typical pruning pipeline comprises 3 steps [Reed, 1993]: (1) pretraining a (redundant) dense model ⇒ (2) pruning the dense model to a sparse one ⇒ (3) finetuning the sparse model to regain performance. Namely, pruning is considered a post-processing solution to fix the side effect brought by pretraining the dense model. This post-processing paradigm of pruning has been practiced for more than 30 years and is well covered by many surveys, e.g., a relatively outdated survey [Reed, 1993], recent surveys of pruning alone [Gale et al., 2019; Blalock et al., 2020; Hoefler et al., 2021], or surveys treating pruning as a sub-topic under the umbrella of model compression and acceleration [Sze et al., 2017; Cheng et al., 2018a; Cheng et al., 2018b; Deng et al., 2020].

However, a newly surging pruning paradigm, pruning at initialization (PaI), is absent from these surveys (we are aware that PaI is discussed in one very recent survey [Hoefler et al., 2021], yet merely with one subsection; our paper aims to offer a more thorough and concentrated coverage). Unlike pruning after training (PaT) (see Tab. 1), PaI prunes a randomly initialized dense network instead of a pretrained one. PaI aims to train (or purely select) a sparse network out of a randomly initialized dense network to achieve (close to) full accuracy (the accuracy reached by the dense network). This had been believed unpromising, as plenty of prior works had observed that training a sparse network from scratch underperforms PaT, until recently the lottery ticket hypothesis (LTH) [Frankle and Carbin, 2019] and SNIP [Lee et al., 2019] successfully found sparse networks that can be trained from scratch to full accuracy.
They open new doors to efficient sparse network training with possibly less cost.

This paper aims to present a comprehensive coverage of this emerging pruning paradigm, discussing its historical origin, the status quo, and possible future directions. The rest of this paper is organized as follows. First, Sec. 2 introduces the background of network pruning, consisting of a generic formulation of pruning and the classic topics in PaT. Next, Sec. 3 presents the thorough coverage of PaI methods. Then, Sec. 4 summarizes the PaI fashion and discusses the open problems. Finally, Sec. 5 concludes this paper.

2 Background of Neural Network Pruning

2.1 A Generic Formulation of Pruning

The stochastic gradient descent (SGD) learner of a neural network parameterized by w produces a model sequence which finally converges to a model with the desired performance:

    w^(0), w^(1), ..., w^(k), ..., w^(K),        (1)

where K denotes the total number of training iterations. In PaT, we only need the last model checkpoint (k = K) as the base model for pruning. However, the new pruning paradigm asks for models at different iterations, e.g., LTH needs the model at iteration 0 [Frankle and Carbin, 2019] (or an early iteration [Frankle et al., 2020a]; we consider this also as 0 for simplicity). To accommodate these new cases, we allow a pruning algorithm to have access to the whole model sequence (which we can easily cache during model training). Then, pruning can be defined as a function which takes the model sequence as input (along with the training dataset D) and outputs a pruned model w*.

We need two pieces of information to specify a neural network exactly: its topology and the associated parameter values. (1) In the case of a sparse network, the sparse topology is typically defined by a mask tensor (denoted by m, with the same shape as w), which can be obtained via a function, m = f_1(w^(k_1); D); that is, f_1 can utilize the available data to help decide the masks (e.g., in regularization-based pruning methods, such as [Wang et al., 2021]). (2) Meanwhile, the weights in the pruned model can be adjusted from their original values to mitigate the incurred damage [Hassibi and Stork, 1993; Wang et al., 2019a; Wang et al., 2019c; Wang et al., 2021]. This step can be modeled as another function f_2(w^(k_2); D). Together, pruning can be formulated as

    w' = f_1(w^(k_1); D) ⊙ f_2(w^(k_2); D),
    w* = f_3(w'; D),        (2)

where ⊙ represents the Hadamard (element-wise) product; f_3 models the finetuning process (which is omitted in the sparse selection PaI methods). Note that the input models for f_1(·) and f_2(·) can be from different iterations. By this definition, the pruning paradigms in Tab. 1 can be specified as follows:
• Pruning after training (PaT): k_1 = K, k_2 = K;
• Sparse training: k_1 = K (LTH) or 0 (SNIP), k_2 = 0, and f_2 = I (identity function), f_3 ≠ I;
• Sparse selection: k_1 = k_2 = 0, and f_2 = f_3 = I.
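To make this formulation concrete, here is a minimal PyTorch-style sketch of Eq. (2), assuming the checkpoints have been cached as a list of models; the helper names (make_mask, adjust_weights, finetune, standing for f_1, f_2, f_3) are our own illustrative choices, not part of any released code base.

```python
import copy
import torch

def generic_prune(model_seq, dataset, make_mask, adjust_weights, finetune, k1, k2):
    """Sketch of Eq. (2): w' = f1(w^(k1); D) ⊙ f2(w^(k2); D), w* = f3(w'; D).

    model_seq:      cached checkpoints [w^(0), ..., w^(K)]
    make_mask:      f1, returns {param name: 0/1 mask tensor}
    adjust_weights: f2, returns {param name: weight tensor}; identity for LTH/SNIP
    finetune:       f3, trains the masked model; identity for sparse selection
    """
    masks = make_mask(model_seq[k1], dataset)          # decide the sparse topology
    values = adjust_weights(model_seq[k2], dataset)    # decide the surviving values
    pruned = copy.deepcopy(model_seq[k2])
    with torch.no_grad():
        for name, p in pruned.named_parameters():
            # Hadamard product; tensors without a mask entry stay dense
            p.copy_(values[name] * masks.get(name, torch.ones_like(p)))
    return finetune(pruned, dataset)

# The paradigms in Tab. 1 then differ only in their arguments, e.g.:
keep_values = lambda model, dataset: {n: p.detach() for n, p in model.named_parameters()}
no_train = lambda model, dataset: model
# PaT:              k1 = K, k2 = K
# Sparse training:  k1 = K (LTH) or 0 (SNIP), k2 = 0, adjust_weights = keep_values
# Sparse selection: k1 = k2 = 0, adjust_weights = keep_values, finetune = no_train
```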
2.2 Classic Topics in Pruning

There are mainly four critical questions we need to ask when pruning a specific model: what to prune, how many (connections or neurons) to prune, which ones to prune exactly, and how to schedule the pruning process, corresponding to the four classic topics in pruning below. We only briefly explain them here as they are not our main focus. For a more comprehensive coverage, we refer the readers to [Hoefler et al., 2021].

(1) Sparsity structure. Weights can be pruned in patterns. The shape of the pattern, named sparsity structure, decides the basic pruning element of a pruning algorithm. The smallest structure, of course, is a single weight element, i.e., no structure at all. This kind of pruning is called unstructured pruning. Pruning patterns larger than a single weight element can be called structured pruning in general. Although there are different levels of granularity when defining structured pruning (see [Mao et al., 2017]), structured pruning typically means filter or channel pruning in the literature.

Practically, unstructured pruning is mainly favored for model size compression, while structured pruning is more favored for acceleration due to its hardware-friendly sparsity structure. Acceleration is more imperative than compression for modern deep networks, so most PaT works currently focus on structured pruning. We will see that, differently, most PaI methods focus on unstructured pruning instead.

(2) Pruning ratio. Pruning ratios indicate how many weights to remove. In general, there are two ways to determine pruning ratios. (i) The first is to pre-define them; namely, we know exactly how many parameters will be pruned before the algorithm actually runs. This scheme can be further specified into two sub-schemes: one is to set a global pruning ratio (i.e., how many weights will be pruned for the whole network); the other is to set layer-wise pruning ratios. (ii) The second is to decide the pruning ratio by other means. This way mostly appears in the regularization-based pruning methods, which remove weights by driving them towards zero via penalty terms. A larger regularization factor typically leads to more sparsity, i.e., a larger pruning ratio. However, how to set a proper factor to achieve the desired sparsity usually demands heavy tuning. Several methods have been proposed to improve this [Wang et al., 2019c; Wang et al., 2021], where the pruning ratios are usually pre-specified. Recent years have also seen works that automatically search for the optimal layer-wise pruning ratios [He et al., 2018]. No consensus has been reached on which is better.

(3) Pruning criterion. The pruning criterion decides which weights to remove given a sparsity budget. It is considered one of the most critical problems in network pruning and thus has received the most research attention so far. The simplest criterion is weight magnitude (or equivalently, the L1-norm for a tensor) [Han et al., 2015; Li et al., 2017]. Because of its simplicity, it is the most prevailing criterion in PaT now. We will see it also being widely used in PaI.

Albeit much exploration in this topic, no criteria have proved to be significantly better than the others (actually, simple magnitude pruning has been argued to be (one of) the SOTA [Gale et al., 2019]). Since this topic is already well discussed in prior surveys of PaT, we will not cover it in depth here. One point worth mentioning is that the core idea underpinning most pruning criteria is to select the weights whose absence induces the least loss change. This idea and its variants have been followed for a long time in PaT [LeCun et al., 1990; Hassibi and Stork, 1993; Wang et al., 2019a; Molchanov et al., 2017; Molchanov et al., 2019], and it continues in PaI, as we will see.
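As a concrete instance of the magnitude criterion discussed in (3), the sketch below implements one-shot, global, unstructured magnitude pruning: all weights are ranked by |w| across layers and the smallest fraction is masked out. It is a minimal illustration of the idea (returning masks in the format of f_1 above), not the code of any particular paper.

```python
import torch

def global_magnitude_masks(model, prune_ratio):
    """Return {param name: 0/1 mask} keeping the globally largest-|w| weights.
    Only weight matrices/filters (dim > 1) are scored; biases/BN stay dense."""
    scores = {n: p.detach().abs() for n, p in model.named_parameters() if p.dim() > 1}
    flat = torch.cat([s.flatten() for s in scores.values()])
    num_prune = int(prune_ratio * flat.numel())
    if num_prune == 0:
        return {n: torch.ones_like(s) for n, s in scores.items()}
    threshold = torch.kthvalue(flat, num_prune).values   # largest magnitude to remove
    return {n: (s > threshold).float() for n, s in scores.items()}
```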
(4) Pruning schedule. After the three aspects above are determined, we finally need to specify the schedule of pruning. There are three typical choices [Wang et al., 2019b]. (i) One-shot: network sparsity (defined by the ratio of zeroed weights in a network) goes from 0 to a target number in a single step, then finetune. (ii) Progressive: network sparsity goes from 0 to a target number gradually, typically along with network training; then finetune. (iii) Iterative: network sparsity goes from 0 to an intermediate target number, then finetune; this process is repeated until the target sparsity is achieved. Note that both (ii) and (iii) are characterized by pruning interleaved with network training, so there is no fundamental boundary between the two; some works thus use the terms interchangeably. One consensus is that progressive and iterative pruning outperform the one-shot counterpart when pruning the same number of weights because they allow more time for the network to adapt. In PaI, LTH employs iterative pruning, which increases the training cost significantly. PaI works featured by pre-selected masks (e.g., [Wang et al., 2020]) are thereby motivated to resolve this problem.

[Figure 1 (method tree; recoverable labels only): PaI → sparse training, covering static masks (post-selected masks: LTH'19, with Extensions split into Technical: OneTicket'19, LTH+'20, EB'20, RL&NLP'20, BERT'20, Lifelong'20, Pretraining'21, GNN'21, PrAC'21, E-LTH'21, Multi-Prize'21, and Theoretical: EarlyPhase'20, GradFlow'20, Manifold'21, plus Sanity-checks: RandomTickets'20, MissMark'21, Correlation'21, ReallyWin?'21) and dynamic masks (DeepR'18, SET'18, DSR'19, SNFS'19, RigL'20); and sparse selection.]

Figure 1: Overview of pruning at initialization (PaI) approaches, classified into two general groups: sparse training and sparse selection. For readability, references are omitted in this figure; only the paper abbreviations are shown (right beside each abbreviation is the year the paper appeared). Please see Sec. 3 for detailed introductions. Due to limited length, this paper only outlines the primary methods; see the full collection at https://github.com/mingsun-tse/awesome-pruning-at-initialization.

3 Pruning at Initialization (PaI)

3.1 Overview: History Sketch

Debut: LTH and SNIP. The major motivation of PaI vs. PaT is to achieve cheaper and simpler pruning: pruning a network at initialization prior to training "eliminates the need for both pretraining and the complex pruning schedule" (as quoted from SNIP [Lee et al., 2019]). Specifically, two works lead this surge, LTH [Frankle and Carbin, 2019] and SNIP. Importantly, both make a similar claim: non-trivially sparse networks can be trained to full accuracy in isolation, which was believed barely feasible before. Differently, LTH selects masks from a pretrained model while SNIP selects masks from the initialized model, i.e., post-selected vs. pre-selected masks, as shown in Fig. 1.

Follow-ups of LTH and SNIP. (1) There are two major lines of LTH follow-ups. Acknowledging the efficacy of LTH, one line expands its universe, e.g., scaling LTH to larger datasets and models (e.g., ResNet50 on ImageNet), validating it on non-vision domains (e.g., natural language processing), and proposing theoretical foundations. The other line takes a decent grain of salt about the efficacy of LTH and keeps sanity-checking it. (2) At the same time, for the direction led by SNIP, more methods come out focusing on better pruning criteria.

Dynamic masks. The masks in both LTH and SNIP are static, namely, they are fixed during training. Some researchers (such as [Evci et al., 2020a]) conjecture that dynamic and adaptive masks during training may be better, thus introducing another group of methods featured by dynamic masks. Together, the static- and dynamic-mask methods complete the realm of sparse training.

Sparse selection. When some researchers attempt to understand LTH through empirical studies, they discover an interesting phenomenon: [Zhou et al., 2019] surprisingly find that the random network selected by the winning ticket in LTH actually has non-trivial accuracy without any training. This discovery brings us the strong LTH: a sparse subnet picked from a dense network can achieve full accuracy even without further training, and it opens a new direction named sparse selection.

The current PaI universe is primarily made up of the above two categories, sparse training and sparse selection. A method tree overview of PaI is presented in Fig. 1. Next, we elaborate on the major approaches in each genre at length.

3.2 Sparse Training

Static masks: post-selected. Sparse training methods of this group are pioneered by LTH [Frankle and Carbin, 2019]. The pruning pipeline in LTH has three steps: first, a randomly initialized network is trained to convergence; second, magnitude pruning is employed to obtain the masks, which define the topology of the subnet; third, the masks are applied to the initial network to obtain a subnet, which is then trained to convergence. This process can be repeated iteratively (i.e., iterative magnitude pruning, or IMP). The authors surprisingly find the subnet can achieve accuracy comparable (or occasionally even better) to the dense network. The lottery ticket hypothesis is thus proposed: "dense, randomly-initialized, feed-forward networks contain subnetworks (winning tickets) that—when
trained in isolation—reach test accuracy comparable to the original network in a similar number of iterations."
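The IMP pipeline just described can be summarized in a few lines. The sketch below is a simplified, layer-wise rendition under our own assumptions (a `train` helper that keeps masked weights at zero, a fixed per-round prune ratio, no learning-rate rewinding), so it illustrates the structure of LTH's procedure rather than reproducing the original implementation.

```python
import copy
import torch

def iterative_magnitude_pruning(model, dataset, train, rounds=5, ratio_per_round=0.2):
    """LTH-style IMP: each round trains the current subnet, prunes a fraction of the
    remaining weights by magnitude (layer-wise here, for simplicity), then rewinds
    the surviving weights to their initial values w^(0)."""
    init_state = copy.deepcopy(model.state_dict())            # the "ticket" values
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if p.dim() > 1}
    for _ in range(rounds):
        train(model, dataset, masks)                          # assumed to zero out masked weights
        for n, p in model.named_parameters():
            if n in masks:
                alive = p.detach().abs()[masks[n].bool()]     # magnitudes of surviving weights
                threshold = torch.quantile(alive, ratio_per_round)
                masks[n] = masks[n] * (p.detach().abs() > threshold).float()
        model.load_state_dict(init_state)                     # rewind to initialization
    train(model, dataset, masks)                              # final training of the found subnet
    return model, masks
```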
Follow-up works of LTH mainly fall into two groups. One group acknowledges the legitimacy of LTH and attempts to seek a broader application or understanding of it. The other group proposes several sanity-check ablations of LTH, taking a decent grain of salt with its validity. These two groups are summarized as extensions and sanity-checks in Fig. 1.

(1) Extensions. The original LTH is only validated on small datasets (MNIST and CIFAR-10). [Frankle et al., 2020a] (LTH+) later generalize LTH to ResNet50 on ImageNet by applying the masks not to the initial weights but to the weights after a few epochs of training. [Morcos et al., 2019] (OneTicket) discover that winning tickets can generalize across a variety of datasets and optimizers within the natural images domain; besides, winning tickets generated with larger datasets consistently transfer better than those generated with smaller datasets, suggesting that winning tickets contain inductive biases generic to neural networks. [Yu et al., 2020] (RL&NLP) find that the lottery ticket phenomenon is also present on RL and NLP tasks and use it to train compressed Transformers to high performance. [Chen et al., 2020a] validate the lottery ticket phenomenon on BERT subnetworks at 40% to 90% sparsity. [Chen et al., 2020b] (Lifelong) introduce two pruning methods, bottom-up and top-down, to find winning tickets in the lifelong learning scenario. [Chen et al., 2021a] (Pretraining) locate matching subnetworks at 59.04% to 96.48% sparsity that transfer universally to multiple downstream tasks with no performance degradation for supervised and self-supervised pre-training. [Chen et al., 2021b] (GNN) present a unified graph neural network (GNN) sparsification framework that simultaneously prunes the graph adjacency matrix and the model weights, generalizing LTH to GNNs for the first time. The iterative train-prune-retrain cycles on the full training set in LTH can be very expensive. To resolve this, [Zhang et al., 2021b] introduce the Pruning Aware Critical (PrAC) set, which is a subset of the training data. The PrAC set takes only 35.32% to 78.19% of the full set (CIFAR10, CIFAR100, Tiny ImageNet) with similar effect, saving up to 60% to 90% of training iterations; moreover, they find the PrAC set can generalize across different network architectures. Similarly, E-LTH [Chen et al., 2021c] also attempts to find winning tickets generic to different networks without the expensive IMP process; they achieve this by transforming winning tickets found in one network to another deeper or shallower one from the same family. [You et al., 2020] find that winning tickets can be identified at the very early training stage (thus they term the tickets early-bird, or EB, tickets) via low-cost training schemes (e.g., early stopping and low-precision training) at large learning rates, achieving up to 4.7× energy saving while maintaining the performance. [Diffenderfer and Kailkhura, 2021] (Multi-Prize) find winning tickets on binary neural networks for the first time.

Besides the technical extensions of LTH above, researchers also try to understand LTH more theoretically. [Frankle et al., 2020b] (EarlyPhase) analyze the early phase of deep network training. [Evci et al., 2020b] (GradFlow) present a gradient flow perspective to explain why LTH happens. [Zhang et al., 2021a] (Manifold) verify the validity of LTH by leveraging dynamical systems theory and inertial manifold theory. More works on the LTH theories study a stronger version of LTH, i.e., sparse selection; we thus defer them to Sec. 3.3. In short, extensions of LTH mainly focus on LTH+X (i.e., generalizing LTH to other tasks or learning settings, etc.), identifying cheaper tickets, and understanding LTH better.

(2) Sanity-checks. The validation of LTH seriously hinges on the experimental setting, which has actually been controversial since the very beginning of LTH. In the same venue where LTH was published, another pruning paper [Liu et al., 2019] actually draws a rather different conclusion from LTH. They argue that a subnetwork with random initialization (vs. the winning tickets picked by IMP in LTH) can be trained from scratch to match pruning a pretrained model. Besides, [Gale et al., 2019] also report that they cannot reproduce LTH. The reasons, noted by [Liu et al., 2019], may reside in the experimental settings: [Liu et al., 2019] focus on filter pruning and use momentum SGD with a large initial learning rate (LR) (0.1), while [Frankle and Carbin, 2019] tackle unstructured pruning and "mostly uses Adam [Kingma and Ba, 2015] with much lower learning rates". This debate continues on. [Su et al., 2020] (RandomTickets) notice that randomly changing the preserved weights in each layer, while keeping the layer-wise pruning ratios, does not affect the final performance; the resulting initializations are thus named random tickets. [Frankle et al., 2021] (MissMark) later report a similar observation. This property "suggests broader challenges with the underlying pruning heuristics, the desire to prune at initialization, or both" [Frankle et al., 2021]. Meanwhile, [Liu et al., 2021] (Correlation) find that there is a strong correlation between the initialized weights and the final weights in LTH when the LR is not sufficiently large; thereby they argue that "the existence of winning property is correlated with an insufficient DNN pretraining, and is unlikely to occur for a well-trained DNN". This resonates with the conjecture by [Liu et al., 2019] above that the learning rate is a critical factor making their results seemingly contradict LTH. [Ma et al., 2021] (ReallyWin?) follow up on this direction and present more concrete evidence to clarify whether the winning ticket exists across the major DNN architectures and/or applications.

Static masks: pre-selected. SNIP [Lee et al., 2019] pioneers the direction featured by pre-selected masks. SNIP proposes a pruning criterion named connection sensitivity to select weights based on a straightforward idea of loss preservation, i.e., removing the weights whose absence leads to the least loss change. After initialization, each weight can be assigned a score with the above pruning criterion; then the bottom-p fraction (p is the desired pruning ratio) of parameters is removed for subsequent training. Later, [Wang et al., 2020] argue that it is the training dynamics rather than the loss value itself that matters more at the beginning of training. Therefore, they propose gradient signal preservation (GraSP), based on the Hessian, in contrast to the previous loss preservation. Concurrently with GraSP [Wang et al., 2020], [Lee et al., 2020] seek to explain the feasibility of SNIP through the lens of signal propagation. They empirically find that pruning damages the dynamical isometry [Saxe et al., 2014] of neural networks and propose a data-independent initialization, approximated isometry (AI), which is an extension of exact isometry to sparse networks.
Method                    Pruning criterion
Skeletonization (1989)    −∇_w L ⊙ w
OBD (1990)                diag(H) ⊙ w ⊙ w
Taylor-FO (2019)          (∇_w L ⊙ w)^2
--------------------------------------------
SNIP (2019)               |∇_w L ⊙ w|
GraSP (2020)              −(H ∇_w L) ⊙ w
SynFlow (2020)            (∂R/∂w) ⊙ w,  R = 1^T (Π_{l=1}^{L} |w^[l]|) 1

Table 2: Summary of pruning criteria in static-mask sparse training methods. Above the dashed line are PaT methods. L denotes the loss function; H represents the Hessian; 1 is the all-ones vector; l denotes the l-th layer of all L layers. Skeletonization [Mozer and Smolensky, 1989]. Taylor-FO [Molchanov et al., 2019].

Method         Pruning criterion    Growing criterion
DNS (2016)     |w|                  |w| (pruned weights also updated)
--------------------------------------------------------------------
DeepR (2018)   stochastic           random
SET (2018)     |w|                  random
DSR (2019)     |w|                  random
SNFS (2020)    |w|                  momentum of ∇_w L
RigL (2020)    |w|                  |∇_w L|

Table 3: Summary of pruning and growing criteria in dynamic-mask sparse training methods. Above the dashed line are the PaT methods. DNS [Guo et al., 2016]; see the other references in Sec. 3.2.

Later, SynFlow [Tanaka et al., 2020] proposes a new data-independent criterion, which is a variant of magnitude pruning, yet takes into account the interaction of different layers. Meanwhile, SupSup [Wortsman et al., 2020] proposes a training method to select different masks (supermasks) for thousands of tasks in continual learning, from a single fixed random base network. Another very recent static-mask work is DLTH (dual LTH) [Bai et al., 2022]: in LTH, the initial weights are given and the problem is to find the matching masks, while in DLTH, the masks are given and the problem is to find the matching initial weights. Specifically, they employ a growing L2 regularization [Wang et al., 2021] technique to transform the original random weights to the desired condition.

To summarize, similar to PaT, the primary research attention in this line of PaI works also lies in the pruning criterion. Actually, many PaI works propose pretty similar criterion formulas to those in PaT (see Tab. 2).
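To illustrate how the criteria in Tab. 2 are typically used, the sketch below computes SNIP-style connection-sensitivity scores |∇_w L ⊙ w| on a single mini-batch at initialization and keeps the top-scoring fraction of weights globally; swapping the score line (e.g., for plain |w|) yields other criteria in the table. This is our own minimal rendition, not the released code of any of these papers.

```python
import torch

def snip_style_masks(model, loss_fn, batch, keep_ratio):
    """Score each weight by |dL/dw * w| on one batch; keep the top `keep_ratio` globally."""
    inputs, targets = batch
    params = [p for p in model.parameters() if p.dim() > 1]   # prunable weight tensors
    loss = loss_fn(model(inputs), targets)
    grads = torch.autograd.grad(loss, params)                 # ∇_w L at initialization
    scores = [(g * p).abs() for g, p in zip(grads, params)]   # connection sensitivity
    flat = torch.cat([s.flatten() for s in scores])
    k = max(1, int(keep_ratio * flat.numel()))
    threshold = torch.topk(flat, k).values[-1]                # k-th largest score
    return [(s >= threshold).float() for s in scores]
```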
Dynamic masks. Another sparse training track allows the masks to change during training. DeepR [Bellec et al., 2018] employs dynamic sparse parameterization with stochastic parameter updates for training. SET [Mocanu et al., 2018] adopts magnitude pruning and random growth for dynamic network topology optimization. DSR [Mostafa and Wang, 2019] adaptively allocates layer-wise sparsity ratios with no need for manual setting and shows that training-time structural exploration is necessary for the best generalization. SNFS [Dettmers and Zettlemoyer, 2019] proposes to employ the momentum of gradients to bring back pruned weights. RigL [Evci et al., 2020a] also uses magnitudes for pruning, yet employs the absolute gradients for weight growing.

The essential idea of dynamic masks is to enable zeroed weights to rejoin the training. Of note, this idea is not newly brought up in PaI; it has been explored in pruning after training, too. In terms of the criteria, notably, because the masks need to be re-evaluated frequently during training in dynamic-mask methods, the pruning and growing criteria cannot be costly. Therefore, all these criteria are based on magnitudes or gradients (see Tab. 3), which are readily available during SGD training.
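A single mask-update step in this family can be sketched as below, loosely following the RigL recipe in Tab. 3 (drop the smallest-magnitude active weights, then grow the same number of inactive connections with the largest gradient magnitude). The update fraction, its decay schedule, and the per-layer bookkeeping of the original methods are omitted; this is an assumption-laden sketch, not a reference implementation.

```python
import torch

def prune_grow_step(weight, grad, mask, update_fraction=0.3):
    """One RigL-style prune/grow update on a single layer; overall sparsity stays constant."""
    n_update = int(update_fraction * mask.sum().item())       # connections to swap this step
    if n_update == 0:
        return mask
    # Drop: among active weights, deactivate those with the smallest magnitude.
    drop_scores = torch.where(mask.bool(), weight.abs(), torch.full_like(weight, float("inf")))
    drop_idx = torch.topk(drop_scores.view(-1), n_update, largest=False).indices
    mask.view(-1)[drop_idx] = 0.0
    # Grow: among inactive weights, activate those with the largest gradient magnitude.
    grow_scores = torch.where(mask.bool(), torch.full_like(grad, -float("inf")), grad.abs())
    grow_idx = torch.topk(grow_scores.view(-1), n_update).indices
    mask.view(-1)[grow_idx] = 1.0
    with torch.no_grad():
        weight.view(-1)[grow_idx] = 0.0                        # newly grown weights start at zero
    return mask
```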
3.3 Sparse Selection

Sparse training methods still need to optimize the values of the subnet after picking it out of the dense model. Some other works have found another optimization scheme: instead of optimizing the weight values, optimize the network topology. Namely, when a network is randomly initialized, all the values of each connection are fixed; the goal is to find a subnet from the dense network without further training the subnet. This line of work is first pioneered by [Zhou et al., 2019] (Deconstruct), where they try to understand the mysteries of LTH. They find the subnet picked by LTH can already achieve non-trivial accuracies (without training). This implies that, although the original full network is randomly initialized, the chosen subnet is not really random; the subnet-picking process itself serves as a kind of training. Therefore, [Zhou et al., 2019] propose the notion of supermasks (or masking as training), along with an algorithm to optimize the masks in order to find better supermasks.

The method in [Zhou et al., 2019] is only evaluated on small-scale datasets (MNIST and CIFAR). Later, [Ramanujan et al., 2020] (Hidden) introduce a trainable score for each weight and update the score to minimize the loss function. The trainable scores are eventually used to decide the selected subnetwork topology. The method achieves strong performance for the first time; e.g., they manage to pick a subnet out of a random Wide ResNet50 that is smaller than ResNet34 while delivering better top-1 accuracy than the trained ResNet34 on ImageNet. This discovery is summarized as a stronger version of LTH, the strong LTH: "within a sufficiently over-parameterized neural network with random weights (e.g. at initialization), there exists a subnetwork that achieves competitive accuracy" [Ramanujan et al., 2020].

The above are the breakthroughs in terms of technical algorithms. In terms of theoretical progress, [Malach et al., 2020] (Proving) propose the theoretical basis of the strong LTH, suggesting that "pruning a randomly initialized network is as strong as optimizing the value of the weights". Less favorably, they make assumptions about the norms of the inputs and of the weights. Later, [Orseau et al., 2020; Pensia et al., 2020] (Logarithmic) remove the aforementioned limiting assumptions while providing significantly tighter bounds: "the over-parameterized network only needs a logarithmic factor (in all variables but depth) number of neurons per weight of the target subnetwork". [Diffenderfer and Kailkhura, 2021] show that scaled binary networks can be pruned to approximate any target function, where the optimization is polynomial in the width, depth, and approximation error, akin to [Malach et al., 2020]. [Sreenivasan et al., 2022] further demonstrate that a logarithmically over-parameterized binary network is enough to approximate any target function, similar to the advance of [Orseau et al., 2020; Pensia et al., 2020] over [Malach et al., 2020].
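The score-based selection idea of [Ramanujan et al., 2020] discussed above can be sketched as a layer whose random weights stay frozen while a score per weight is trained; the forward pass uses the top-k scores as a binary mask, and gradients reach the scores straight through. This is a simplified illustration of the general idea under our own simplifications (global top-k, magnitude of scores), not the official edge-popup implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMask(torch.autograd.Function):
    """Binary mask built from the top-k scores; straight-through gradient to the scores."""
    @staticmethod
    def forward(ctx, scores, keep_ratio):
        k = max(1, int(keep_ratio * scores.numel()))
        mask = torch.zeros_like(scores)
        mask.view(-1)[torch.topk(scores.view(-1), k).indices] = 1.0
        return mask

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None          # straight-through estimator; no grad for keep_ratio

class SubnetLinear(nn.Module):
    """Linear layer with frozen random weights; only the per-weight scores are trained."""
    def __init__(self, in_features, out_features, keep_ratio=0.5):
        super().__init__()
        self.weight = nn.Parameter(0.01 * torch.randn(out_features, in_features),
                                   requires_grad=False)       # fixed random values
        self.scores = nn.Parameter(torch.rand(out_features, in_features))
        self.keep_ratio = keep_ratio

    def forward(self, x):
        mask = TopKMask.apply(self.scores.abs(), self.keep_ratio)
        return F.linear(x, self.weight * mask)                 # subnet selected, weights untouched
```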
4 Summary and Open Problems

4.1 Summary of Pruning at Initialization

As easily seen, the biggest paradigm shift from PaT to PaI is to inherit the weights of the sparse network from a randomly initialized model, just as the term pruning at initialization suggests. The primary goal is to achieve efficient training, in contrast to the efficient inference targeted by PaT. This section discusses the similarities and major differences between PaI and the traditional fashion at a big-picture level.

Sparse training: much overlap between PaI and PaT. Not surprisingly, there is considerable overlap between PaI and PaT. (1) For LTH-like works, the adopted pruning scheme is mainly (iterative) magnitude pruning, which has been developed for more than thirty years and is widely considered the simplest pruning method. (2) For pre-selected-mask methods, the proposed pruning criteria for PaI are reminiscent of those proposed in PaT (Tab. 2). Therefore, we may continue seeing the wisdom in PaT being transferred to PaI.

In terms of the four primary aspects of pruning (Sec. 2.2), PaI does not bring much change to the pruning ratio, criterion, and schedule. However, for sparsity structure, it does have a significant impact. First, structured pruning (or filter pruning) itself is not very interesting in the context of PaI because, given the layer-wise sparsity ratios, filter pruning at initialization reduces to training a slimmer dense network from scratch; the major technical problem then is how to initialize a slimmer network properly, which has been well explored. This is probably why most PaI works focus on unstructured pruning rather than structured pruning. Second, the well-known LTH is proposed for unstructured pruning. As far as we know, to date, no work has validated the hypothesis with filter pruning; the reason behind this is still elusive.

Sparse selection: significant advance of PaI vs. PaT. The common belief behind PaT is that the remaining parameters possess knowledge learned by the original redundant model. Inheriting the knowledge is better than starting over, which has also been empirically justified by many works and thus has become common practice for more than thirty years. In stark contrast, PaI inherits the parameter values right from the beginning. It is tempting to ask: at such an early phase of neural network training, do the weights really possess enough knowledge (to provide base models for subsequent pruning)? Or, more fundamentally, is neural network training essentially about learning knowledge from nothing, or about revealing the knowledge the model already has? PaT implies the former, while PaI (especially sparse selection [Ramanujan et al., 2020; Malach et al., 2020]) suggests the latter. This, we believe, is (arguably) the most profound advance PaI has brought about. The understanding of this question will not only bring us practical benefits, but also, more importantly, a deeper theoretical understanding of deep neural networks.

4.2 Open Problems

Same problems as pruning after training. Given the overlap between PaI and PaT, nearly all the open problems in PaT also apply to PaI, centering on the four classic topics in pruning: how to find better sparsity structures that are hardware friendly and meanwhile optimization friendly, how to find better pruning criteria, how to allocate the sparsity more properly across different layers, and how to schedule the sparsification process more wisely, in the hopes of higher performance given the same sparsity budget. We do not reiterate these topics here considering they have been well treated in existing pruning surveys (e.g., [Hoefler et al., 2021]).

Under-performance of PaI. The idea of PaI is intriguing; however, in terms of practical performance, PaI methods still underperform PaT methods by an obvious margin. E.g., according to the experiments in [Wang et al., 2020], with VGG19 and ResNet32 networks on CIFAR-10/100, both SNIP and GraSP are consistently outperformed across different sparsity levels by two traditional pruning methods (OBD [LeCun et al., 1990] and MLPrune [Zeng and Urtasun, 2019]), which are not even close to the state of the art. [Frankle et al., 2021] also report a similar observation. In this sense, there is still a long road ahead before we can really "save resources at training time" [Wang et al., 2020].

Under-development of sparse libraries. Despite the promising potential of sparse training, the practical acceleration benefit to date has not been satisfactory. E.g., SNFS [Dettmers and Zettlemoyer, 2019] claims "up to 5.61x faster training", yet in practice, due to the under-development of sparse matrix multiplication libraries, this benefit cannot be materialized at present. To our best knowledge, few works have reported wall-time speedup in sparse training. The development of sparse training libraries thus can be a worthy future cause. In this regard, notably, dynamic-mask methods pose even more severe issues than static-mask methods, as the constantly changing masks make it harder for the hardware to gain acceleration. If the training is not really getting faster, this may fundamentally undermine the motivation of sparse training, which is actually the main force of PaI now.

To sum up, the major challenge facing PaI is to deliver the practical training speedup it promises without (seriously) compromising performance. This hinges on a more profound comprehension of the performance gap between sparse training and PaT, as well as the advance of sparse matrix libraries.

Code base. As discussed, PaI works (especially LTH-like ones) severely hinge on empirical studies. However, it is non-trivial in deep learning to tune hyper-parameters. In this paper, we thereby also offer a code base (https://github.com/mingsun-tse/smile-pruning) with systematic support of popular networks, datasets, and logging tools, in the hopes of helping researchers and practitioners focus more on the insights and core methods instead of the tedious engineering details.

5 Conclusion

This paper presents the first survey concentrated on neural network pruning at initialization (PaI). The road map of PaI vs. its pruning-after-training counterpart is studied with a thorough and structured literature review. We close by outlining the open problems and offering a code base towards easy sanity-checking and benchmarking of different PaI methods.

Acknowledgments. We thank Jonathan Frankle, Alex Renda, and Michael Carbin from MIT for their very helpful suggestions on our work.
References

[Allen-Zhu et al., 2019] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. In NeurIPS, 2019.

[Bai et al., 2022] Yue Bai, Huan Wang, Zhiqiang Tao, Kunpeng Li, and Yun Fu. Dual lottery ticket hypothesis. In ICLR, 2022.

[Bellec et al., 2018] Guillaume Bellec, David Kappel, Wolfgang Maass, and Robert Legenstein. Deep rewiring: Training very sparse deep networks. In ICLR, 2018.

[Blalock et al., 2020] Davis Blalock, Jose Javier Gonzalez, Jonathan Frankle, and John V Guttag. What is the state of neural network pruning? In SysML, 2020.

[Chen et al., 2020a] Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, and Michael Carbin. The lottery ticket hypothesis for pre-trained BERT networks. In NeurIPS, 2020.

[Chen et al., 2020b] Tianlong Chen, Zhenyu Zhang, Sijia Liu, Shiyu Chang, and Zhangyang Wang. Long live the lottery: The existence of winning tickets in lifelong learning. In ICLR, 2020.

[Chen et al., 2021a] Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Michael Carbin, and Zhangyang Wang. The lottery tickets hypothesis for supervised and self-supervised pre-training in computer vision models. In CVPR, 2021.

[Chen et al., 2021b] Tianlong Chen, Yongduo Sui, Xuxi Chen, Aston Zhang, and Zhangyang Wang. A unified lottery ticket hypothesis for graph neural networks. In ICML, 2021.

[Chen et al., 2021c] Xiaohan Chen, Yu Cheng, Shuohang Wang, Zhe Gan, Jingjing Liu, and Zhangyang Wang. The elastic lottery ticket hypothesis. In NeurIPS, 2021.

[Cheng et al., 2018a] Jian Cheng, Pei-song Wang, Gang Li, Qing-hao Hu, and Han-qing Lu. Recent advances in efficient computation of deep convolutional neural networks. Frontiers of Information Technology & Electronic Engineering, 19(1):64–77, 2018.

[Cheng et al., 2018b] Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. Model compression and acceleration for deep neural networks: The principles, progress, and challenges. IEEE Signal Processing Magazine, 35(1):126–136, 2018.

[Deng et al., 2020] Lei Deng, Guoqi Li, Song Han, Luping Shi, and Yuan Xie. Model compression and hardware acceleration for neural networks: A comprehensive survey. Proceedings of the IEEE, 108(4):485–532, 2020.

[Dettmers and Zettlemoyer, 2019] Tim Dettmers and Luke Zettlemoyer. Sparse networks from scratch: Faster training without losing performance. arXiv preprint arXiv:1907.04840, 2019.

[Diffenderfer and Kailkhura, 2021] James Diffenderfer and Bhavya Kailkhura. Multi-prize lottery ticket hypothesis: Finding accurate binary neural networks by pruning a randomly weighted network. In ICLR, 2021.

[Evci et al., 2020a] Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. Rigging the lottery: Making all tickets winners. In ICML, 2020.

[Evci et al., 2020b] Utku Evci, Yani A Ioannou, Cem Keskin, and Yann Dauphin. Gradient flow in sparse neural networks and how lottery tickets win. arXiv preprint arXiv:2010.03533, 2020.

[Frankle and Carbin, 2019] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2019.

[Frankle et al., 2020a] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. In ICML, 2020.

[Frankle et al., 2020b] Jonathan Frankle, David J Schwab, and Ari S Morcos. The early phase of neural network training. In ICLR, 2020.

[Frankle et al., 2021] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M Roy, and Michael Carbin. Pruning neural networks at initialization: Why are we missing the mark? In ICLR, 2021.

[Gale et al., 2019] Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574, 2019.

[Guo et al., 2016] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. In NeurIPS, 2016.

[Han et al., 2015] Song Han, Jeff Pool, John Tran, and William J Dally. Learning both weights and connections for efficient neural network. In NeurIPS, 2015.

[Hassibi and Stork, 1993] B. Hassibi and D. G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In NeurIPS, 1993.

[He et al., 2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.

[He et al., 2018] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. AMC: AutoML for model compression and acceleration on mobile devices. In ECCV, 2018.

[Hoefler et al., 2021] Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. JMLR, 22(241):1–124, 2021.

[Kearns et al., 1994] Michael J Kearns, Umesh Virkumar Vazirani, and Umesh Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.

[Kingma and Ba, 2015] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

[LeCun et al., 1990] Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain damage. In NeurIPS, 1990.

[LeCun et al., 2015] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.

[Lee et al., 2019] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. SNIP: Single-shot network pruning based on connection sensitivity. In ICLR, 2019.

[Lee et al., 2020] Namhoon Lee, Thalaiyasingam Ajanthan, Stephen Gould, and Philip HS Torr. A signal propagation perspective for pruning neural networks at initialization. In ICLR, 2020.

[Li et al., 2017] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. In ICLR, 2017.

[Liu et al., 2019] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In ICLR, 2019.

[Liu et al., 2021] Ning Liu, Geng Yuan, Zhengping Che, Xuan Shen, Xiaolong Ma, Qing Jin, Jian Ren, Jian Tang, Sijia Liu, and Yanzhi Wang. Lottery ticket implies accuracy degradation, is it a desirable phenomenon? In ICML, 2021.

[Ma et al., 2021] Xiaolong Ma, Geng Yuan, Xuan Shen, Tianlong Chen, Xuxi Chen, Xiaohan Chen, Ning Liu, Minghai Qin, Sijia Liu, et al. Sanity checks for lottery tickets: Does your winning ticket really win the jackpot? In NeurIPS, 2021.

[Malach et al., 2020] Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, and Ohad Shamir. Proving the lottery ticket hypothesis: Pruning is all you need. In ICML, 2020.

[Mao et al., 2017] Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, and William J Dally. Exploring the granularity of sparsity in convolutional neural networks. In CVPR Workshop, 2017.

[Mocanu et al., 2018] Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H Nguyen, Madeleine Gibescu, and Antonio Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9(1):1–12, 2018.

[Molchanov et al., 2017] P. Molchanov, S. Tyree, and T. Karras. Pruning convolutional neural networks for resource efficient inference. In ICLR, 2017.

[Molchanov et al., 2019] Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. Importance estimation for neural network pruning. In CVPR, 2019.

[Morcos et al., 2019] Ari Morcos, Haonan Yu, Michela Paganini, and Yuandong Tian. One ticket to win them all: Generalizing lottery ticket initializations across datasets and optimizers. In NeurIPS, 2019.

[Mostafa and Wang, 2019] Hesham Mostafa and Xin Wang. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In ICML, 2019.

[Mozer and Smolensky, 1989] Michael C Mozer and Paul Smolensky. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In NeurIPS, 1989.

[Orseau et al., 2020] Laurent Orseau, Marcus Hutter, and Omar Rivasplata. Logarithmic pruning is all you need. In NeurIPS, 2020.

[Pensia et al., 2020] Ankit Pensia, Shashank Rajput, Alliot Nagle, Harit Vishwakarma, and Dimitris Papailiopoulos. Optimal lottery tickets via subset sum: Logarithmic over-parameterization is sufficient. In NeurIPS, 2020.

[Ramanujan et al., 2020] Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. What's hidden in a randomly weighted neural network? In CVPR, 2020.

[Reed, 1993] R. Reed. Pruning algorithms – a survey. IEEE Transactions on Neural Networks, 4(5):740–747, 1993.

[Saxe et al., 2014] Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In ICLR, 2014.

[Schmidhuber, 2015] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.

[Simonyan and Zisserman, 2015] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

[Soltanolkotabi et al., 2018] Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. TIT, 65(2):742–769, 2018.

[Sreenivasan et al., 2022] Kartik Sreenivasan, Shashank Rajput, Jy-yong Sohn, and Dimitris Papailiopoulos. Finding everything within random binary networks. In AISTATS, 2022.

[Su et al., 2020] Jingtong Su, Yihang Chen, Tianle Cai, Tianhao Wu, Ruiqi Gao, Liwei Wang, and Jason D Lee. Sanity-checking pruning methods: Random tickets can win the jackpot. In NeurIPS, 2020.

[Sze et al., 2017] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S Emer. Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE, 105(12):2295–2329, 2017.

[Tanaka et al., 2020] Hidenori Tanaka, Daniel Kunin, Daniel L Yamins, and Surya Ganguli. Pruning neural networks without any data by iteratively conserving synaptic flow. In NeurIPS, 2020.

[Vapnik, 2013] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.

[Wang et al., 2019a] Chaoqi Wang, Roger Grosse, Sanja Fidler, and Guodong Zhang. EigenDamage: Structured pruning in the Kronecker-factored eigenbasis. In ICML, 2019.

[Wang et al., 2019b] Huan Wang, Xinyi Hu, Qiming Zhang, Yuehai Wang, Lu Yu, and Haoji Hu. Structured pruning for efficient convolutional neural networks via incremental regularization. IEEE Journal of Selected Topics in Signal Processing, 14(4):775–788, 2019.

[Wang et al., 2019c] Huan Wang, Qiming Zhang, Yuehai Wang, Lu Yu, and Haoji Hu. Structured pruning for efficient convnets via incremental regularization. In IJCNN, 2019.

[Wang et al., 2020] Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by preserving gradient flow. In ICLR, 2020.

[Wang et al., 2021] Huan Wang, Can Qin, Yulun Zhang, and Yun Fu. Neural pruning via growing regularization. In ICLR, 2021.

[Wortsman et al., 2020] Mitchell Wortsman, Vivek Ramanujan, Rosanne Liu, Aniruddha Kembhavi, Mohammad Rastegari, Jason Yosinski, and Ali Farhadi. Supermasks in superposition. In NeurIPS, 2020.

[You et al., 2020] Haoran You, Chaojian Li, Pengfei Xu, Yonggan Fu, Yue Wang, Xiaohan Chen, Richard G Baraniuk, Zhangyang Wang, and Yingyan Lin. Drawing early-bird tickets: Toward more efficient training of deep networks. In ICLR, 2020.

[Yu et al., 2020] Haonan Yu, Sergey Edunov, Yuandong Tian, and Ari S Morcos. Playing the lottery with rewards and multiple languages: Lottery tickets in RL and NLP. In ICLR, 2020.

[Zeng and Urtasun, 2019] Wenyuan Zeng and Raquel Urtasun. MLPrune: Multi-layer pruning for automated neural network compression. 2019.

[Zhang et al., 2021a] Zeru Zhang, Jiayin Jin, Zijie Zhang, Yang Zhou, Xin Zhao, Jiaxiang Ren, Ji Liu, Lingfei Wu, Ruoming Jin, and Dejing Dou. Validating the lottery ticket hypothesis with inertial manifold theory. In NeurIPS, 2021.

[Zhang et al., 2021b] Zhenyu Zhang, Xuxi Chen, Tianlong Chen, and Zhangyang Wang. Efficient lottery ticket finding: Less data is more. In ICML, 2021.

[Zhou et al., 2019] Hattie Zhou, Janice Lan, Rosanne Liu, and Jason Yosinski. Deconstructing lottery tickets: Zeros, signs, and the supermask. In NeurIPS, 2019.

[Zou et al., 2020] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Gradient descent optimizes over-parameterized deep ReLU networks. Machine Learning, 109(3):467–492, 2020.