Training Large Language Models Efficiently With Sparsity and Dataflow
ABSTRACT
Large foundation language models have shown their versatility in being adapted to a wide variety of downstream tasks, such as text generation, sentiment analysis, and semantic search. However, training such large foundation models is a non-trivial exercise that requires a significant amount of compute power and expertise from machine learning and systems experts. As models get larger, these demands only increase. Sparsity is a promising technique for relieving the compute requirements of training. However, sparsity introduces new challenges in training the sparse model to the same quality as its dense counterpart. Furthermore, sparsity lowers operation intensity and introduces irregular memory access patterns that make it challenging to utilize compute resources efficiently. This paper demonstrates an end-to-end training flow on a large language model - a 13 billion parameter GPT - using sparsity and dataflow. The dataflow execution model and architecture enable efficient on-chip irregular memory accesses as well as native kernel fusion and pipelined parallelism, which help recover device utilization. We show that we can successfully train GPT 13B to the same quality as the dense GPT 13B model, while achieving an end-to-end speedup of 4.5x over a dense A100 baseline.
1 INTRODUCTION
Foundation models (Bommasani et al. (2021)) in the natural language processing (NLP) (e.g., BERT, GPT) and computer vision (e.g., ViT, DALL-E) domains have accelerated the deployment of machine learning systems in research and commercial settings. Their key characteristics of self-supervision and adaptation allow a myriad of applications to be built to solve specific problems such as text generation, sentiment analysis, image segmentation, and image detection. With the intention of extracting more capabilities out of these models and training on large corpora of data, researchers have proposed increasing parameter counts by orders of magnitude (Brown et al. (2020b); Wang et al. (2022); Raffel et al. (2019)).
Due to power and physical constraints, the underlying hardware to train such massive models does not scale proportionally with model parameters (Thompson et al. (2020); Leiserson et al. (2020)). Consequently, a number of techniques, such as network restructuring (Dong et al. (2017); Wu et al. (2020)), network pruning (Blalock et al. (2020)), network quantization (Hubara et al. (2016)), low-rank decomposition (Mao et al. (2020)), knowledge distillation (Sanh et al. (2019)), and model sparsity (Liu et al. (2021); Mocanu et al. (2017)), have been explored to handle this computational challenge. Various kinds of sparse techniques have been proposed (Mocanu et al. (2021); Raihan & Aamodt (2020)) to reduce computational intensity and to mimic the sparse connectivity of neurons in the human brain (LeCun et al. (1989); Azevedo et al. (2009)).
As sparsity techniques continue to evolve and become mainstream in training and inference applications, an entirely new set of challenges is posed to the underlying hardware architecture (Dave et al. (2020)). Rather than coping with the computational challenge of ML through mere increases in TFLOPs and memory bandwidth, sparse computations demand flexibility, programmability, and efficiency from the next generation of hardware due to the wide range of possible patterns and training flows (Kitaev et al. (2020); Chen et al. (2021a); Han et al. (2016)). A well-balanced system should be able to
effectively handle a generally compute-intensive dense deployment of a model, a memory-intensive highly sparse deployment of a model, and variations in between. A successful deployment of sparse techniques on a friendly architecture can help mitigate current roadblocks, such as immense power consumption, high machine cost, and long training times.
With the expansion of machine learning and artificial intelligence applications and their intrinsic characteristics, a number of computational frameworks have been suggested over time. Some examples include the Google TPU (Jouppi et al. (2017)), Cambricon (Zhou et al. (2018); Zhang et al. (2016)), NVIDIA A100 (Nvidia (2020)), Cerebras CS-2 (Fricker (2022)), Graphcore IPU (Jia et al. (2019)), and SambaNova RDU (Prabhakar et al. (2022)), in addition to traditional CPU-based architectures. While there have been a few attempts to evaluate and compare these hardware and software systems (Emani et al. (2022); DBL (2019)), the full scope of their capabilities, especially in terms of handling the full range of sparse and dense applications, remains unknown. Many of these frameworks also remain proprietary and unavailable for generic study in the public domain.
Although attractive, sparse techniques come with their own set of challenges beyond architectural compatibility. There is a huge spectrum of variables, such as structured (Wen et al. (2016)), semi-structured (Zhou et al. (2021a)), and unstructured sparsity (Han et al. (2015); Guo et al. (2016)), the percentage of sparsity (Sanh et al. (2020)), weight/activation sparsity (Raihan & Aamodt (2020)), and the training schedule (Han et al. (2016)), that impact the accuracy of a given model compared to a dense-only baseline. It is difficult and time-consuming to determine these decision variables to achieve state-of-the-art (SOTA) metrics on a given model. Large language models, such as the 13B-parameter GPT, are pervasive foundation models in the NLP domain that can support a variety of language applications. In this paper, we use this model to showcase a successful inclusion of sparsity in an end-to-end training flow that achieves comparable accuracy metrics. In the process, we make the following key contributions:
• A systematic analysis of the interplay between sparsity, fusion, and dataflow capabilities.
• An evaluation of sparse GPT 13B on the SambaNova RDU demonstrating speedups over A100.
• Loss, zero-shot, and few-shot analyses of the sparse 13B GPT model compared to its dense baseline.
The rest of this paper is organized as follows. Section 2 provides a brief background on sparse pretraining and dataflow architectures. Section 3 quantifies the advantages of sparsity and dataflow. Section 4 describes our methodology for training a 13 billion parameter GPT model using sparsity. Section 5 evaluates the methodology, and Section 6 concludes.
2 BACKGROUND
Sparse training has gained significant interest in recent times. Different sparse training methods have emerged in which sparse weights are maintained during training. Work in this domain includes exploration of various pruning (or growth) criteria such as weight magnitude and sign, random selection, and gradient magnitude (Zhou et al. (2021b); Liu et al. (2022); Huang et al. (2022); Chen et al. (2021b)). This work does not focus on developing a new method for sparse pretraining. Instead, we follow the S2D methodology developed in Chen et al. (2021b). This work differs from previous work in the scale of the model being explored (13 billion parameters), the hardware used for the exploration (a dataflow architecture), and its implications. Prior work has focused on training on more traditional hardware - TPUs, GPUs, or CPUs - which differs significantly from more efficient and faster dataflow hardware.
Hardware dataflow accelerator architectures have recently emerged as a promising design choice to
keep up with the ever-increasing compute demands of large language models. Dataflow accelerators
are typically composed of programmable compute and memory units placed in a programmable in-
terconnect fabric. Unlike a conventional architecture that executes programs as instruction streams
Figure 1: RDU Dataflow Architecture. PCUs are the programmable compute elements, PMUs form the on-chip memory system, and S represents the programmable interconnect. DDR and other IO are accessed via AGs and CUs.
with a global program counter, dataflow architecture components are often statically configured.
Dataflow architectures avoid the power and area overheads of traditional instruction management,
and enable automatic kernel fusion with pipelining without manually writing fused kernels. Input
compute graphs are lowered by dataflow compilers into a graph of primitive compute and memory
units, which then gets placed and routed on the available physical units on the target dataflow hard-
ware. Numerous types of dataflow architectures have been proposed in both industry and academia over the past decade (Prabhakar et al. (2017); Liu et al. (2019)). These architectures explore different design points in the granularity of compute, the on-chip memory system, and the flexibility of the interconnect.
In this paper, we study the impact of sparsity in the context of SambaNova’s Reconfigurable
Dataflow Unit (RDU) Prabhakar & Jairath (2021); Prabhakar et al. (2022). Figure 1 describes the
high-level architecture of the SambaNova RDU. The RDU is organized as multiple tiles of compute,
memory, and interconnect components. Within each tile, PCUs (orange) are the programmable com-
pute elements that contain multiple pipeline stages of SIMD ALUs. PMUs (blue) are distributed
software-managed on-chip scratchpad memories with programmable address generation and tensor
transformations. PCUs and PMUs are connected to each other to form a software pipeline via the
programmable interconnect switches S (yellow). DDR and other IO are accessed via AGs and CUs (gray).
3 SPARSITY AND DATAFLOW
In this section, we describe and quantify two execution models, Kernel-By-Kernel (KBK) and Dataflow (DF), with respect to weight sparsity. In KBK execution, operators in a compute graph are executed one at a time. Intermediate results between two operators are exchanged through off-chip memory. All available compute and memory resources can be used to execute each operation fully in parallel. In DF execution, multiple operators are connected together to form a dataflow pipeline. Intermediate results between operators are exchanged through on-chip double buffers. Consequently, the available compute and memory resources need to be shared between all the operators in the pipeline.
Figure 2: Kernel-By-Kernel (KBK) vs. Dataflow (DF) execution for a simple example with GEMM followed by GELU followed by GEMM. White arrows represent traffic to off-chip memory, gray arrows represent on-chip traffic. Blue boxes represent on-chip memory, white boxes represent on-chip compute resources. We use matrix dimensions from GPT 13B in this example. Each edge / box is labeled with the tensor dimensions being accessed.
Figure 2 pictorially shows a simple compute graph involving the FFN block in a GPT encoder. This block consists of two GEMM computations with a GELU operation between them. We use tensor dimensions from a 13 billion parameter GPT as an example for the remainder of this section. Figure 2(a) shows KBK execution while Figure 2(b) shows DF execution. Arrows show both off-chip (white) and on-chip (gray) traffic. The edges are labeled with the tensor dimensions being transferred. Blue boxes show tensors in on-chip memory in DF execution.
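For concreteness, the PyTorch sketch below spells out this GEMM - GELU - GEMM block with the GPT 13B dimensions from Figure 2 (2048 tokens, hidden size 5120, FFN size 20480). The function and variable names are illustrative rather than taken from our implementation; in KBK execution each intermediate tensor below would round-trip through off-chip memory, while DF keeps them in on-chip double buffers.

```python
import torch
import torch.nn.functional as F

# GPT 13B FFN dimensions from the Figure 2 example (names are illustrative).
# Weights would be stored in BF16 on the accelerator; plain float32 here for simplicity.
SEQ, HIDDEN, FFN = 2048, 5120, 20480

def ffn_block(a0, w0, w1):
    a1 = a0 @ w0      # GEMM 0: (2048 x 5120) @ (5120 x 20480) -> (2048 x 20480)
    a1 = F.gelu(a1)   # GELU: elementwise over the 2048 x 20480 intermediate
    a2 = a1 @ w1      # GEMM 1: (2048 x 20480) @ (20480 x 5120) -> (2048 x 5120)
    return a2

a0 = torch.randn(SEQ, HIDDEN)
w0 = torch.randn(HIDDEN, FFN)
w1 = torch.randn(FFN, HIDDEN)
print(ffn_block(a0, w0, w1).shape)  # torch.Size([2048, 5120])
```

On a dataflow architecture, this entire block is lowered into a single spatial pipeline, so the GELU never materializes its input or output off chip.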
Table 1 quantifies the impact of weight sparsity on KBK vs. DF. We study three parameters: off-chip memory bandwidth (BW), on-chip memory capacity (M), and achievable speedup (X) for various sparsity levels (S). For both KBK and DF, we assume a compute capability of 300 TFLOPS. BW shows the minimum bandwidth required to fully utilize the available compute capability (100% of TFLOPS). While we recognize that 100% utilization is impractical in general, this analysis identifies and quantifies fundamental bottlenecks that are unavoidable even with perfect execution of sparse GEMMs. M shows the total memory capacity required by DF to construct the dataflow pipeline. X shows the achievable speedup when the latencies of all operations are factored in.
From Table 1, we can see that the BW requirement for KBK increases in proportion to sparsity. Furthermore, the bandwidth requirements for KBK can be an order of magnitude higher than for DF. This observation has important system-level implications: an accelerator architecture built around KBK execution requires significantly higher memory bandwidth than DF to fully exploit sparsity. The high BW requirement stems from the fact that sparsity reduces total compute, which makes GEMMs execute faster, and that all intermediate results are stored to and loaded back from off-chip memory, resulting in more memory traffic. The DF model is able to capture data locality between operations on-chip, and hence does not incur the same penalty. DF execution fully utilizes compute TFLOPS by exploiting pipeline parallelism between operators. DF gets the effect of kernel fusion natively by enabling the construction of such operator pipelines in software. The impact of operator fusion has also been studied in previous literature (Zhang et al. (2022)). The low BW requirements of DF enable such accelerators to be built with dense, lower-bandwidth off-chip memory technologies like DDR with TBs of capacity. The larger capacity enables more efficient mapping of large foundation models without complex sharding requirements.
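The bandwidth argument can be made concrete with a back-of-the-envelope model. The sketch below is our own illustration, not necessarily the exact model behind Table 1: it assumes BF16 operands (2 bytes per element), a 300 TFLOPS device, weight traffic shrinking in proportion to (1 - sparsity) with index overhead ignored, KBK spilling every intermediate to off-chip memory, and DF keeping weights and intermediates on chip.

```python
# Minimum off-chip bandwidth needed to keep a 300 TFLOPS device fully busy on
# the GEMM - GELU - GEMM graph of Figure 2 (illustrative model only).
SEQ, D, D_FF = 2048, 5120, 20480      # GPT 13B dimensions from Figure 2
BYTES, PEAK_FLOPS = 2, 300e12         # BF16, 300 TFLOPS

def min_bandwidth_gbs(sparsity):
    gemm_flops = 2 * 2 * SEQ * D * D_FF * (1 - sparsity)  # two sparse GEMMs
    busy_time = gemm_flops / PEAK_FLOPS                    # time at 100% utilization
    weights = 2 * D * D_FF * (1 - sparsity)                # W0 and W1 nonzeros
    # KBK: A0 in, A1 out+in, GELU output out+in, A2 out, plus both weight tensors.
    kbk_elems = SEQ * (2 * D + 4 * D_FF) + weights
    # DF: only the graph input A0 and output A2 cross the chip boundary.
    df_elems = 2 * SEQ * D
    return (kbk_elems * BYTES / busy_time / 1e9,
            df_elems * BYTES / busy_time / 1e9)

for s in (0.0, 0.75, 0.875):
    kbk_bw, df_bw = min_bandwidth_gbs(s)
    print(f"sparsity {s:.3f}: KBK needs ~{kbk_bw:.0f} GB/s, DF ~{df_bw:.0f} GB/s")
```

Even this simplified model shows the KBK requirement growing roughly with 1/(1 - sparsity), since the activation traffic stays constant while the GEMM time shrinks, whereas the DF requirement stays an order of magnitude lower.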
Table 1: Kernel-By-Kernel (KBK) and Dataflow (DF) execution impact for a sequence of GEMM - GELU - GEMM operations. The table shows off-chip memory bandwidth and on-chip memory capacity requirements for various weight sparsity levels.
S: Sparsity | BW: Bandwidth (GB/s) - KBK, DF | M: On-Chip Memory (MB) - DF | X: Speedup vs. Dense - Ideal, KBK, DF
Table 1 also shows that DF needs comparatively larger on-chip memory capacity. Note that this example effectively serves as an upper bound for M, as this study excludes commonly used techniques like tiling that reduce the on-chip memory requirement of DF. The blue boxes in Figure 2 show that the on-chip memory is used to hold weights and intermediate results between pipeline stages. Furthermore, these buffers need to be double-buffered to decouple the producing stage from the consuming stage. Because sparsity reduces the size of the weights, M goes down as sparsity increases.
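As a rough illustration of how M scales (again our own estimate rather than the exact accounting behind Table 1), the sketch below counts the resident weights plus the double-buffered stage buffers of Figure 2(b), assuming BF16 storage, no tiling, and no sparse-index metadata.

```python
# Rough upper-bound estimate of DF on-chip memory for the Figure 2(b) pipeline
# (illustrative; assumes BF16, no tiling, no sparse-index metadata).
SEQ, D, D_FF, BYTES = 2048, 5120, 20480, 2

def df_on_chip_mb(sparsity):
    weights = 2 * D * D_FF * (1 - sparsity)    # W0 and W1 nonzeros kept resident
    stage_buffers = SEQ * (D + D_FF + D)       # A0, A1, A2 buffers from Figure 2(b)
    return (weights + 2 * stage_buffers) * BYTES / 1e6   # x2 for double buffering

for s in (0.0, 0.75, 0.875):
    print(f"sparsity {s:.3f}: ~{df_on_chip_mb(s):.0f} MB on-chip")
```

In this model the weight term shrinks with sparsity while the activation buffers do not, matching the trend described for M above.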
Finally, Table 1 also compares achievable speedups for KBK vs. DF. The speedup calculation assumes that KBK has a peak off-chip bandwidth of 2 TB/s, and that both KBK and DF can run sparse GEMMs at full efficiency. As sparsity increases, sparse GEMMs get proportionally faster. However, note that the output of each sparse GEMM is still a dense tensor. Due to Amdahl's law, the total speedup is limited by the memory-bound GELU, which is not helped by sparsity. Even if GELU is able to fully utilize all 2 TB/s of bandwidth on KBK, we see that the total speedup scales to only 12.9x on KBK. In contrast, DF speedup can scale almost linearly with sparsity. This is because, while DF loses some TFLOPS to compute GELU, it is able to parallelize GELU proportionally to match the sparse GEMM throughput. This is enabled by larger on-chip bandwidth and a flexible compute and memory system that allows processing multiple streams of data from A1 in Figure 2(b).
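The Amdahl effect can be sketched in the same style. The code below is a simplified illustration and will not reproduce the exact Table 1 numbers: it assumes 300 TFLOPS, a 2 TB/s KBK off-chip bandwidth, BF16, sparse GEMMs running at full efficiency, a dense memory-bound GELU on KBK, and GELU fully hidden by the pipeline on DF.

```python
# Amdahl-style speedup model for weight sparsity on the GEMM - GELU - GEMM graph
# (simplified illustration; assumptions listed in the text above).
SEQ, D, D_FF, BYTES = 2048, 5120, 20480, 2
PEAK_FLOPS, KBK_BW = 300e12, 2e12          # 300 TFLOPS compute, 2 TB/s for KBK

def gemm_time(sparsity):
    return 2 * 2 * SEQ * D * D_FF * (1 - sparsity) / PEAK_FLOPS

def kbk_time(sparsity):
    gelu_time = 2 * SEQ * D_FF * BYTES / KBK_BW   # read + write the dense 2048 x 20480 tensor
    return gemm_time(sparsity) + gelu_time        # GELU stays dense and memory bound

def df_time(sparsity):
    return gemm_time(sparsity)                    # GELU hidden by pipeline parallelism

for s in (0.75, 0.875, 0.9375):
    print(f"sparsity {s:.4f}: "
          f"KBK ~{kbk_time(0.0) / kbk_time(s):.1f}x, "
          f"DF ~{df_time(0.0) / df_time(s):.1f}x vs. dense")
```

The fixed GELU term caps the KBK speedup as sparsity grows, while the DF speedup tracks 1/(1 - sparsity) almost linearly.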
In summary, the benefits of sparsity at the compute graph level can vary widely between KBK and DF. KBK is more sensitive to the available off-chip bandwidth and to other memory-bound operations in the model like GELU. DF requires a large on-chip memory capacity, but can sustain higher overall utilization with an order of magnitude less off-chip bandwidth. The next section describes how these insights are utilized to train a 13 billion parameter GPT model on the SambaNova Reconfigurable Dataflow Unit (RDU).
4 TRAINING METHODOLOGY
We train a 13 billion parameter GPT model from scratch on the C4 dataset, using the recipe from Brown et al. (2020a). We train the dense version of the model with a batch size of 1024 and a sequence length of 2048 for 150,000 steps. We use a learning rate of 3e-5 with a warm-up of 3,000 steps. We use a similar recipe for the sparse version of the model: we train the sparse model for some number of steps, then "densify" it and train the dense version for the remaining steps (S2D) (Chen et al. (2021b)). We vary the number of sparse steps and choose the one that leads to iso-accuracy. During the sparse phase, we experiment with sparsity levels of 75-87.5%. We train the model end-to-end on the SambaNova hardware.
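A minimal sketch of this S2D schedule is shown below. It assumes a HuggingFace-style causal LM that returns a loss, a loader that yields (inputs, labels) batches, a fixed random mask per weight matrix, and re-application of the mask after each optimizer step during the sparse phase; the mask criterion and all names are illustrative and do not reflect the exact recipe of Chen et al. (2021b) or the flow used in our experiments.

```python
import torch

def make_masks(model, sparsity=0.875):
    # One fixed random mask per 2-D weight matrix (illustrative criterion only).
    return {name: (torch.rand_like(p) > sparsity).to(p.dtype)
            for name, p in model.named_parameters() if p.dim() == 2}

def apply_masks(model, masks):
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name])                 # keep pruned weights at zero

def train_s2d(model, data_loader, optimizer, total_steps, sparse_steps, sparsity=0.875):
    masks = make_masks(model, sparsity)
    apply_masks(model, masks)
    for step, (inputs, labels) in zip(range(total_steps), data_loader):
        loss = model(inputs, labels=labels).loss    # HuggingFace-style causal LM
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if step < sparse_steps:
            apply_masks(model, masks)               # sparse phase: enforce the mask
        # After sparse_steps, masking stops and the pruned weights grow back ("densify").
```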
5 EVALUATION
We evaluate the dense and S2D versions of the 13 billion parameter GPT model by comparing the loss on a held-out C4 evaluation dataset and zero-shot accuracy on LAMBADA (Paperno et al. (2016)), HellaSwag (Zellers et al. (2019)), TriviaQA (Joshi et al. (2017)), OpenBookQA (Mihaylov et al. (2018)), PiQA (Bisk et al. (2020)), RTE (Wang et al. (2019)), Winogrande (ai2 (2019)), COPA (Gordon et al. (2012)), and ANLI R1 (Williams et al. (2022)). Results can be viewed in Figure 3 and Tables 2 and 3. For both zero-shot accuracy and loss on the held-out dataset, the S2D version achieves the same
accuracy as the dense version. Overall, our pipeline on RDU hardware gives us a speedup of 4.5x over A100.
6 CONCLUSION
In this paper, we describe a method to train large language models efficiently with sparsity and dataflow. We describe the locality and pipelined parallelism advantages of dataflow execution over kernel-by-kernel execution at various sparsity levels, and quantify the performance benefits achievable with dataflow execution. We then describe a flow using S2D to train a 13 billion parameter GPT model from scratch on the C4 dataset, and evaluate it on a variety of downstream tasks on the SambaNova RDU. We show that our S2D version achieves the same accuracy as the dense model while achieving an end-to-end speedup of 4.5x over the dense model on A100.
REFERENCES
Performance and power evaluation of AI accelerators for training deep learning models. CoRR,
abs/1909.06842, 2019. URL http://arxiv.org/abs/1909.06842. Withdrawn.
Frederico A. C. Azevedo, Ludmila R. B. Carvalho, Lea T. Grinberg, José Marcelo Farfel, Renata
E. L. Ferretti, Renata E. P. Leite, Wilson Jacob Filho, Roberto Lent, and Suzana Herculano-
Houzel. Equal numbers of neuronal and nonneuronal cells make the human brain an isometrically
scaled-up primate brain. The Journal of Comparative Neurology, 513(5):532–541, 2009.
Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning
about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial
Intelligence, 2020.
Davis W. Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John V. Guttag. What is the
state of neural network pruning? CoRR, abs/2003.03033, 2020. URL https://arxiv.org/
abs/2003.03033.
Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx,
Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson,
Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri S. Chatterji, Annie S. Chen, Kathleen
Creel, Jared Quincy Davis, Dorottya Demszky, Chris Donahue, Moussa Doumbouya, Esin Dur-
mus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor
Gale, Lauren Gillespie, Karan Goel, Noah D. Goodman, Shelby Grossman, Neel Guha, Tatsunori
Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang,
Thomas Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keel-
ing, Fereshte Khani, Omar Khattab, Pang Wei Koh, Mark S. Krass, Ranjay Krishna, Rohith Ku-
ditipudi, and et al. On the opportunities and risks of foundation models. CoRR, abs/2108.07258,
2021. URL https://arxiv.org/abs/2108.07258.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel
Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler,
Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray,
Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever,
and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato,
R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems,
volume 33, pp. 1877–1901. Curran Associates, Inc., 2020a. URL https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhari-
wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal,
Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M.
Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz
Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec
Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. CoRR,
abs/2005.14165, 2020b. URL https://arxiv.org/abs/2005.14165.
Beidi Chen, Tri Dao, Kaizhao Liang, Jiaming Yang, Zhao Song, Atri Rudra, and Christopher
Ré. Pixelated butterfly: Simple and efficient sparse training for neural network models. CoRR,
abs/2112.00029, 2021a. URL https://arxiv.org/abs/2112.00029.
Beidi Chen, Tri Dao, Kaizhao Liang, Jiaming Yang, Zhao Song, Atri Rudra, and Christopher Re.
Pixelated butterfly: Simple and efficient sparse training for neural network models. arXiv preprint
arXiv:2112.00029, 2021b.
Shail Dave, Riyadh Baghdadi, Tony Nowatzki, Sasikanth Avancha, Aviral Shrivastava, and Baoxin
Li. Hardware acceleration of sparse and irregular tensor computations of ML models: A sur-
vey and insights. CoRR, abs/2007.00864, 2020. URL https://arxiv.org/abs/2007.
00864.
Xuanyi Dong, Junshi Huang, Yi Yang, and Shuicheng Yan. More is less: A more complicated
network with less inference complexity. CoRR, abs/1703.08651, 2017. URL http://arxiv.
org/abs/1703.08651.
Murali Emani, Zhen Xie, Siddhisanket Raskar, Varuni Sastry, William Arnold, Bruce Wilson,
Rajeev Thakur, Venkatram Vishwanath, Zhengchun Liu, Michael E. Papka, Cindy Orozco Bo-
horquez, Rick Weisner, Karen Li, Yongning Sheng, Yun Du, Jian Zhang, Alexander Tsyplikhin,
Gurdaman Khaira, Jeremy Fowers, Ramakrishnan Sivakumar, Victoria Godsoe, Adrian Macias,
Chetan Tekur, and Matthew Boyd. A comprehensive evaluation of novel ai accelerators for
deep learning workloads. In 2022 IEEE/ACM International Workshop on Performance Model-
ing, Benchmarking and Simulation of High Performance Computer Systems (PMBS), pp. 13–25,
2022. doi: 10.1109/PMBS56514.2022.00007.
Jean-Philippe Fricker. The cerebras cs-2: Designing an ai accelerator around the world’s largest
2.6 trillion transistor chip. In Proceedings of the 2022 International Symposium on Physical
Design, ISPD ’22, pp. 71, New York, NY, USA, 2022. Association for Computing Machin-
ery. ISBN 9781450392105. doi: 10.1145/3505170.3511036. URL https://doi.org/10.
1145/3505170.3511036.
Andrew Gordon, Zornitsa Kozareva, and Melissa Roemmele. SemEval-2012 task 7: Choice of plau-
sible alternatives: An evaluation of commonsense causal reasoning. In *SEM 2012: The First
Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main
conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop
on Semantic Evaluation (SemEval 2012), pp. 394–398, Montréal, Canada, 7-8 June 2012. Asso-
ciation for Computational Linguistics. URL https://aclanthology.org/S12-1052.
Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. CoRR,
abs/1608.04493, 2016. URL http://arxiv.org/abs/1608.04493.
Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for
efficient neural networks. CoRR, abs/1506.02626, 2015. URL http://arxiv.org/abs/
1506.02626.
Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Shijian Tang, Erich Elsen, Bryan Catanzaro, John
Tran, and William J. Dally. DSD: regularizing deep neural networks with dense-sparse-dense
training flow. CoRR, abs/1607.04381, 2016. URL http://arxiv.org/abs/1607.04381.
Shaoyi Huang, Bowen Lei, Dongkuan Xu, Hongwu Peng, Yue Sun, Mimi Xie, and Caiwen Ding.
Dynamic sparse training via balancing the exploration-exploitation trade-off. arXiv preprint
arXiv:2211.16667, 2022.
Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Bi-
narized neural networks. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Gar-
nett (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Asso-
ciates, Inc., 2016. URL https://proceedings.neurips.cc/paper/2016/file/
d8330f857a17c53d217014ee776bfd50-Paper.pdf.
Zhe Jia, Blake Tillman, Marco Maggioni, and Daniele Paolo Scarpazza. Dissecting the graph-
core IPU architecture via microbenchmarking. CoRR, abs/1912.03413, 2019. URL http:
//arxiv.org/abs/1912.03413.
Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly
supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meet-
ing of the Association for Computational Linguistics, Vancouver, Canada, July 2017. Association
for Computational Linguistics.
Norman P. Jouppi, Cliff Young, Nishant Patil, David A. Patterson, Gaurav Agrawal, Raminder Ba-
jwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford
Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir
Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug
Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexan-
der Kaplan, Harshit Khaitan, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James
Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adri-
ana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni,
Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross,
Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter,
Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick
Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. In-
datacenter performance analysis of a tensor processing unit. CoRR, abs/1704.04760, 2017. URL
http://arxiv.org/abs/1704.04760.
Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. CoRR,
abs/2001.04451, 2020. URL https://arxiv.org/abs/2001.04451.
Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. In D. Touret-
zky (ed.), Advances in Neural Information Processing Systems, volume 2. Morgan-
Kaufmann, 1989. URL https://proceedings.neurips.cc/paper/1989/file/
6c9882bbac1c7093bd25041881277658-Paper.pdf.
Charles E. Leiserson, Neil C. Thompson, Joel S. Emer, Bradley C. Kuszmaul, Butler Lampson,
Daniel Sanchez, and Tao B. Schardl. There’s plenty of room at the top. Science, 368(6495):1–7,
June 2020. URL https://www.microsoft.com/en-us/research/publication/
theres-plenty-of-room-at-the-top/.
Chuang Liu, Xueqi Ma, Yinbing Zhan, Liang Ding, Dapeng Tao, Bo Du, Wenbin Hu, and Danilo
Mandic. Comprehensive graph gradual pruning for sparse training in graph neural networks.
arXiv preprint arXiv:2207.08629, 2022.
Leibo Liu, Jianfeng Zhu, Zhaoshi Li, Yanan Lu, Yangdong Deng, Jie Han, Shouyi Yin, and Shaojun
Wei. A survey of coarse-grained reconfigurable architecture and design: Taxonomy, challenges,
and applications. ACM Comput. Surv., 52(6), oct 2019. ISSN 0360-0300. doi: 10.1145/3357375.
URL https://doi.org/10.1145/3357375.
Shiwei Liu, Decebal Constantin Mocanu, Yulong Pei, and Mykola Pechenizkiy. Selfish sparse RNN
training. CoRR, abs/2101.09048, 2021. URL https://arxiv.org/abs/2101.09048.
Yihuan Mao, Yujing Wang, Chufan Wu, Chen Zhang, Yang Wang, Quanlu Zhang, Yaming Yang,
Yunhai Tong, and Jing Bai. LadaBERT: Lightweight adaptation of BERT through hybrid
model compression. In Proceedings of the 28th International Conference on Computational
Linguistics, pp. 3225–3234, Barcelona, Spain (Online), December 2020. International Com-
mittee on Computational Linguistics. doi: 10.18653/v1/2020.coling-main.287. URL https:
//aclanthology.org/2020.coling-main.287.
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct
electricity? a new dataset for open book question answering. In EMNLP, 2018.
Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H. Nguyen, Madeleine Gibescu,
and Antonio Liotta. Evolutionary training of sparse artificial neural networks: A network science
perspective. CoRR, abs/1707.04780, 2017. URL http://arxiv.org/abs/1707.04780.
Decebal Constantin Mocanu, Elena Mocanu, Tiago Pinto, Selima Curci, Phuong H. Nguyen,
Madeleine Gibescu, Damien Ernst, and Zita A. Vale. Sparse training theory for scalable and
efficient agents. CoRR, abs/2103.01636, 2021. URL https://arxiv.org/abs/2103.
01636.
Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi,
Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernandez. The LAMBADA
dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th An-
nual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.
1525–1534, Berlin, Germany, August 2016. Association for Computational Linguistics. URL
http://www.aclweb.org/anthology/P16-1144.
Raghu Prabhakar and Sumti Jairath. Sambanova sn10 rdu: Accelerating software 2.0 with dataflow.
In 2021 IEEE Hot Chips 33 Symposium (HCS), pp. 1–37, 2021. doi: 10.1109/HCS52781.2021.
9567250.
Raghu Prabhakar, Yaqi Zhang, David Koeplinger, Matt Feldman, Tian Zhao, Stefan Hadjis, Ar-
davan Pedram, Christos Kozyrakis, and Kunle Olukotun. Plasticine: A reconfigurable ar-
chitecture for parallel patterns. In Proceedings of the 44th Annual International Symposium
on Computer Architecture, ISCA ’17, pp. 389–402, New York, NY, USA, 2017. Association
for Computing Machinery. ISBN 9781450348928. doi: 10.1145/3079856.3080256. URL
https://doi.org/10.1145/3079856.3080256.
Raghu Prabhakar, Sumti Jairath, and Jinuk Luke Shin. Sambanova sn10 rdu: A 7nm dataflow archi-
tecture to accelerate software 2.0. In 2022 IEEE International Solid- State Circuits Conference
(ISSCC), volume 65, pp. 350–352, 2022. doi: 10.1109/ISSCC42614.2022.9731612.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text
transformer. CoRR, abs/1910.10683, 2019. URL http://arxiv.org/abs/1910.10683.
Md Aamir Raihan and Tor M. Aamodt. Sparse weight activation training. CoRR, abs/2001.01969,
2020. URL http://arxiv.org/abs/2001.01969.
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version
of BERT: smaller, faster, cheaper and lighter. CoRR, abs/1910.01108, 2019. URL http://
arxiv.org/abs/1910.01108.
Victor Sanh, Thomas Wolf, and Alexander M. Rush. Movement pruning: Adaptive sparsity by fine-
tuning. CoRR, abs/2005.07683, 2020. URL https://arxiv.org/abs/2005.07683.
Neil C. Thompson, Kristjan H. Greenewald, Keeheon Lee, and Gabriel F. Manso. The computational
limits of deep learning. CoRR, abs/2007.05558, 2020. URL https://arxiv.org/abs/
2007.05558.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman.
GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of ICLR, 2019.
Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal,
Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. Image as a foreign language:
Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442, 2022.
Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity
in deep neural networks. CoRR, abs/1608.03665, 2016. URL http://arxiv.org/abs/
1608.03665.
Adina Williams, Tristan Thrush, and Douwe Kiela. ANLIzing the adversarial natural language inference dataset. 2022.
Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, and Song Han. Lite transformer with long-short range
attention. CoRR, abs/2004.11886, 2020. URL https://arxiv.org/abs/2004.11886.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a ma-
chine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association
for Computational Linguistics, pp. 4791–4800, Florence, Italy, July 2019. Association for Com-
putational Linguistics. doi: 10.18653/v1/P19-1472. URL https://aclanthology.org/
P19-1472.
Dan Zhang, Safeen Huda, Ebrahim Songhori, Kartik Prabhu, Quoc Le, Anna Goldie, and Azalia
Mirhoseini. A full-stack search technique for domain optimized deep learning accelerators. In
Proceedings of the 27th ACM International Conference on Architectural Support for Program-
ming Languages and Operating Systems, ASPLOS ’22, pp. 27–42, New York, NY, USA, 2022.
Association for Computing Machinery. ISBN 9781450392051. doi: 10.1145/3503222.3507767.
URL https://doi.org/10.1145/3503222.3507767.
Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen,
and Yunji Chen. Cambricon-x: An accelerator for sparse neural networks. In 2016 49th Annual
IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–12, 2016. doi: 10.
1109/MICRO.2016.7783723.
Aojun Zhou, Yukun Ma, Junnan Zhu, Jianbo Liu, Zhijie Zhang, Kun Yuan, Wenxiu Sun, and Hong-
sheng Li. Learning N: M fine-grained structured sparse neural networks from scratch. CoRR,
abs/2102.04010, 2021a. URL https://arxiv.org/abs/2102.04010.
Xiao Zhou, Weizhong Zhang, Zonghao Chen, Shizhe Diao, and Tong Zhang. Efficient neural
network training via forward and backward propagation sparsification. CoRR, abs/2111.05685,
2021b. URL https://arxiv.org/abs/2111.05685.
Xuda Zhou, Zidong Du, Qi Guo, Shaoli Liu, Chengsi Liu, Chao Wang, Xuehai Zhou, Ling Li,
Tianshi Chen, and Yunji Chen. Cambricon-s: Addressing irregularity in sparse neural networks
through a cooperative software/hardware approach. In 2018 51st Annual IEEE/ACM International
Symposium on Microarchitecture (MICRO), pp. 15–28, 2018. doi: 10.1109/MICRO.2018.00011.