0% found this document useful (0 votes)
56 views

A Survey of Small Language Models

Uploaded by

469134492
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
56 views

A Survey of Small Language Models

Uploaded by

469134492
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 20

A Survey of Small Language Models

Chien Van Nguyen1∗, Xuan Shen2∗ , Ryan Aponte3∗ , Yu Xia4 , Samyadeep Basu5 ,
Zhengmian Hu5 , Jian Chen6 , Mihir Parmar7 , Sasidhar Kunapuli, Joe Barrow8 ,
Junda Wu4 , Ashish Singh9 , Yu Wang1 , Jiuxiang Gu8 , Franck Dernoncourt8 ,
Nesreen K. Ahmed10 , Nedim Lipka8 , Ruiyi Zhang8 , Xiang Chen8 , Tong Yu8 ,
Sungchul Kim8 , Hanieh Deilamsalehy8 , Namyong Park11 , Mike Rimer, Zhehao Zhang12 ,
Huanrui Yang13 , Ryan A. Rossi8 , Thien Huu Nguyen1
1
University of Oregon, 2 Northeastern University, 3 Carnegie Mellon University
4
University of California, San Diego, 5 University of Maryland, College Park
6
State University of New York at Buffalo, 7 Arizona State University
8
Adobe Research, 9 University of Massachusetts Amherst, 10 Intel AI Research
11
Meta AI, 12 Dartmouth College, 13 University of Arizona
arXiv:2410.20011v1 [cs.CL] 25 Oct 2024

Abstract the accuracy and/or adaptability of large language


models, while being subject to some constraint(s),
Small Language Models (SLMs) have become
such as training or inference hardware, data avail-
increasingly important due to their efficiency
and performance to perform various language ability, bandwidth, or generation time. Improving
tasks with minimal computational resources, model performance relative to these constraints can
making them ideal for various settings includ- then improve downstream goals such as privacy,
ing on-device, mobile, edge devices, among cost, or the ability to run on consumer devices.
many others. In this article, we present a com- The inherent difficulty of a survey of small lan-
prehensive survey on SLMs, focusing on their guage models is that the definitions of “small” and
architectures, training techniques, and model
“large” are a function of both context and time. GPT-
compression techniques.
2, a “large language model” in 2019 at 1.5B param-
We propose a novel taxonomy for categorizing
eters, is smaller than many “small” language mod-
the methods used to optimize SLMs, includ-
ing model compression, pruning, and quanti-
els covered in this survey. However, although the
zation techniques. We summarize the bench- scale changes, the goals of training small language
mark datasets that are useful for benchmarking models remain relatively stable.
SLMs along with the evaluation metrics com- In this survey, we explore the architectures, train-
monly used. Additionally, we highlight key ing, and model compression techniques that enable
open challenges that remain to be addressed. the building and inferencing of SLMs. In addi-
Our survey aims to serve as a valuable resource tion, we summarize the benchmark datasets and
for researchers and practitioners interested in evaluation metrics commonly used in evaluating
developing and deploying small yet efficient SLM performance. To do this, we propose a novel
language models.
taxonomy for organizing the methods along two
1 Introduction axes:
• the techniques used in pre-processing (model
Although large language models (LLMs) have architecture), training, and post-processing
demonstrated impressive performance on a wide (model compression) SLMs; and
array of benchmarks and real-world situations, • the constraints the technique is attempting to
their success comes at significant cost. LLMs are optimize for, e.g. inference compute, training
resource-intensive to train and run, requiring signif- time, speed, etc.
icant compute and data. This often means that they An overview of these axes can be found in Table 1
are run on centralized and specialized hardware for (techniques) and Table 2 (constraints).
both training and inference. It is important to note that progress on any one
As a response to these challenges, there has of these goals does not necessarily imply progress
been a growing interest in small language mod- on the others. In fact, there are often trade-offs
els (SLMs). Small language models aim to retain between them. For instance, memory-efficient

*The authors contributed equally to this work. training methods like quantization-aware training
(Dettmers et al., 2022a, 2024) are often slower than 2.1 Lightweight Architectures
their full-precision counterparts. However, by us-
Lightweight language model architectures are
ing mixed precision to represent the weights and
designed to achieve efficient performance with
gradients, they allow training or finetuning using
fewer parameters and reduced computational over-
less memory. Finally, although there have been
head, which is ideal for deployment on resource-
several recent surveys on LLMs and their learn-
constrained devices such as mobile phones, edge
ing methods (Rogers et al., 2020; Min et al., 2021;
devices, and embedded systems. Representative
Zhu et al., 2023; Shen et al., 2023), to the best of
lightweight models often follow the encoder-only
our knowledge, this is the first survey focused on
and decoder-only architectures.
SLMs.
Lightweight encoder-only architectures are
Organization of the Survey. This survey is struc- mostly optimized versions of BERT (Devlin et al.,
tured into three main sections, each covering a key 2019). For example, MobileBERT (Sun et al.,
aspect of optimizing SLMs. Section 2 focuses on 2020) introduces an inverted-bottleneck structure
model architectures, including lightweight designs, to maintain a balance between self-attention and
efficient self-attention approximations, and neu- feed-forward networks, achieving a 4.3x size re-
ral architecture search to efficiently build smaller duction and a 5.5x speedup compared to the base
models. Section 3 covers efficient pre-training version of BERT. DistilBERT (Sanh, 2019) and
and fine-tuning techniques to enhance performance TinyBERT (Jiao et al., 2019) achieve more than 96
for SLMs while managing resource constraints. Lightweight decoder-only architectures follow
Section 4 explores model compression techniques, the structure of autoregressive language models
such as pruning, quantization, and knowledge dis- such as the GPT (Radford et al., 2018, 2019) and
tillation, which reduce model size and latency with- LLaMA series (Touvron et al., 2023b). These
out sacrificing significant accuracy. Section 5 intro- models emphasize knowledge distillation, mem-
duces an overview of benchmark datasets and eval- ory overhead optimization, parameter sharing, em-
uation metrics, providing a comprehensive frame- bedding sharing to enhance efficiency and scal-
work for assessing the effectiveness of these meth- ability. BabyLLaMA (Timiryasov and Tastet,
ods. Section 6 discusses the applications that are 2023a) and BabyLLaMA-2 (Tastet and Timiryasov,
enabled by SLMs, organized by constraints. Fi- 2024) distill knowledge from multiple teachers into
nally, a discussion of open challenges for SMLs is a 58M-parameter model and a 345M-parameter
presented in Section 7. model respectively, demonstrating that distillation
Summary of Main Contributions. The key con- can exceed teacher models’ performance partic-
tributions of this work are as follows: ularly under data-constrained conditions. TinyL-
• A comprehensive survey of existing work on LaMA (Zhang et al., 2024), with only 1.1B pa-
small language models for practitioners. We rameters, achieves high efficiency by optimiz-
also survey the problem settings, evaluation ing memory overhead, e.g., via FlashAttention
metrics, and datasets used in the literature. (Dao et al., 2022), while maintaining competi-
• We introduce a few intuitive taxonomies for tive performance for various downstream tasks.
SLMs and survey existing work using these MobilLLaMA (Thawakar et al., 2024) applies a
taxonomies. parameter-sharing scheme that reduces both pre-
• We identify important applications, open prob- training and deployment costs, introducing a 0.5B-
lems, and challenges of SLMs for future work parameter model for resource-constrained devices.
to address. MobileLLM (Liu et al., 2024e) further introduces
embedding-sharing and grouped-query attention
mechanisms with block-wise weight sharing to re-
2 Model Architectures
duce latency.
This section discusses the architectural designs
for developing SLMs. Specifically, we cover 2.2 Efficient Self-Attention Approximations
lightweight architectures (Section 2.1), Deploying large language models can be challeng-
efficient self-attention approximations (Sec- ing due to the substantial number of parameters in
tion 2.2), and neural architecture search (Section the self-attention layers, as well as the computa-
2.3). tional cost associated with self-attention. In this
Inference Runtime
Training Compute

Storage Space
Dataset Size

Memory

Latency
Technique General Mechanism
Lightweight Models (Sec. 2.1) ✓ ✓ ✓ ✓
Model Architectures (Sec. 2)
Efficient Self-Attention (Sec. 2.2) ✓ ✓ ✓ ✓
Neural Arch. Search (Sec. 2.3) ✓ ✓ ✓
Pre-training (Sec. 3.1) ✓ ✓ ✓ ✓ ✓
Training Techniques (Sec. 3)
Finetuning (Sec. 3.2) ✓ ✓
Pruning (Sec. 4.1) ✓ ✓ ✓ ✓
Model Compression (Sec. 4) Quantization (Sec. 4.2) ✓ ✓ ✓ ✓
Knowledge Distillation (Sec. 4.3) ✓

Table 1: General techniques used for optimizing small language models, categorized by type of model optimization
and most central constraints they address.

section, we discuss strategies towards decreasing This ongoing trend towards efficient sequence mod-
this computational cost which can ultimately be eling architectures aims to maintain the expressive-
useful in creating small language models. ness of attention-based models while significantly
Reformer (Kitaev et al., 2020) improves the reducing computational complexity.
complexity of the self-attention from O(N 2 ) to We also note some previous work for process-
O(N log N ) by replacing the dot product attention ing long documents with encoder-only architec-
with one which uses locality-sensitivity hashing. tures. Longformer (Beltagy et al., 2020) uses
Roy et al. (2021) use a sparse routing module based a combination of local windowed attention and
on an online k-means clustering, which reduces the task-specific global attention which scales linearly
complexity of the attention computation. with input length, thus being memory efficient.
Wang et al. (2020a) approximates the self-attention
To reduce the computational quadratic com- mechanism using a low-rank matrix which re-
plexity of the self-attention layer from O(N 2 ) duces the complexity to O(N ). Both these works
to O(N ), several works, including (Wang et al., show that empirically transformers with linear self-
2020a; Katharopoulos et al., 2020; Xiong et al., attention matches the performance of the original
2021; Beltagy et al., 2020), propose linear atten- self-attention mechanism across a variety of down-
tion mechanisms. In particular, (Katharopoulos stream tasks. In a similar vein, Xiong et al. (2021)
et al., 2020) express self-attention as a linear dot- use the popular Nystrom method (Nyström, 1930)
product of kernel feature maps, thus reducing the for approximating the self-attention operation with
quadratic complexity. The authors further show strong empirical performances when compared to
that transformers with this linear attention mech- traditional transformers.
anism can be viewed as a recurrent neural net-
work which enables faster inference. Building
2.3 Neural Architecture Search Techniques
on these foundations, recent advancements have
led to more advanced architectures. Notable ex- This section discusses automated methods to dis-
amples include Mamba (Gu and Dao, 2023; Dao cover the most efficient model architectures for
and Gu, 2024), which introduces a selective state specific tasks and hardware constraints.
space model with input-dependent transitions, and Previous research has primarily concentrated
RWKV (Peng et al., 2023), which combines ele- on Neural Architecture Search (NAS) for vision
ments of transformers and RNNs with a linear at- tasks (Tan and Le, 2019; Zoph and Le, 2016; Wu
tention mechanism. These models not only achieve et al., 2019; Guo et al., 2020) and BERT mod-
linear time and space complexity but also demon- els (Xu et al., 2021; Jawahar et al., 2023; Ganesan
strate competitive performance across various tasks. et al., 2021), as these models have comparatively
fewer parameters, which reduces the cost of the proaches to LLMs, we will focus on efficient tech-
search process for efficient architectures. How- niques to facilitate the general learning scenarios
ever, LLMs with over a billion parameters present with limited resources for SLMs.
a significant challenge in searching for smaller,
more efficient models. Their massive scale makes 3.1 Pre-training Techniques
the search process computationally intensive and
costly. Recently, MobileLLM (Liu et al., 2024e) Mixed precision training is a crucial technique
investigates the impact of model depth (i.e., num- for enhancing pre-training efficiency of SLMs and
ber of layers) and width (i.e., number of heads) on LLMs. This approach leverages low-precision rep-
performance, effectively conducting a targeted ar- resentations for forward and backward propagation
chitecture search within a smaller parameter range while maintaining high-precision weights for up-
for language models with millions of parameters. dates. For instance, (Micikevicius et al., 2018)
Meanwhile, Shen et al. (2024c) reduce the search introduced Automatic Mixed Precision (AMP),
space by exploring an appropriate initialization for which initially keeps a master copy of weights in
the search, which helps expedite the convergence 32-bit floating-point (FP32) precision while per-
of the search process. forming arithmetic operations in 16-bit floating-
point (FP16) precision. However, recent work (Rae
2.4 Small Multi-modal Models et al., 2021) has observed accuracy losses due to
its limited numerical range. To address this issue,
Recent large multi-modal models (LMMs) have (Burgess et al., 2019) propose Brain Floating Point
achieved comparable or superior performance to (BFLOAT16), offering a greater dynamic range
their predecessors while significantly reducing the with more exponent bits than FP16. BFLOAT16
number of parameters. Notable examples include has demonstrated superior training performance
the LLaVA-Next (Liu et al., 2024a), Idefics2 (Lau- and representation accuracy compared to FP16.
rençon et al., 2024), and InternVL2 (Chen et al., Modern GPU architectures have further advanced
2023) series. This progress is partly driven by more mixed-precision capabilities through specialized
efficient, smaller language models like Gemma Tensor Cores. For instance, while earlier genera-
(Team et al., 2024), phi-3-mini (Abdin et al., 2024), tions supported FP16 and BFLOAT16, NVIDIA’s
and emphasizes the critical role of curated datasets. latest Hopper architecture introduces support for
Additionally, there has been a concerted effort 8-bit floating-point (FP8) precision (Luo et al.), en-
to reduce the size of the vision encoder during abling even greater computational efficiency for
multi-modal fusion. InternVL2, for example, lever- large-scale language models.
ages outputs from intermediate layers of large vi-
Complementing these mixed precision ap-
sual encoders while discarding the later blocks.
proaches, various optimization and stability tech-
Smaller models, such as PaliGemma (Beyer et al.,
niques are employed to prevent model collapse
2024) and Mini-Gemini (Li et al., 2024c), adopt
and further enhance training efficiency for SLMs
lightweight vision encoders. Monolithic multi-
and LLMs. While Adam (Diederik, 2014) and
modal models take this further by completely elimi-
AdamW (Loshchilov and Hutter, 2019) optimizers
nating the visual encoder, instead using lightweight
are commonly used, memory-efficient variants like
architectures to generate visual tokens. For exam-
Adafactor (Shazeer and Stern, 2018) and Sophia
ple, Chameleon (Team, 2024a) employs a VQ-VAE
(Liu et al., 2024b) have been introduced to improve
model to encode and decode images into discrete
training speed and efficiency. To further stabilize
tokens, while Mono-InternVL (Luo et al., 2024a)
training, gradient clipping (Zhang et al., 2020) is
uses an MLP to generate visual tokens for image
widely used to prevent exploding gradients. Addi-
patches, incorporating a modality-specific feed-
tionally, careful initialization strategies can provide
forward network, termed multi-modal Mixture-of-
a good starting point for model training. These
Experts, to differentiate between modalities.
combined techniques aim to achieve optimal train-
ing efficiency, maintain numerical stability, and
3 Training Techniques
produce more robust and capable language models.
This section reviews the key training techniques To address the computational demands of the
used for language model pretraining and fine- pre-training stage, language models are typically
tuning. While SLMs involve similar training ap- pre-trained across multiple machine nodes, lever-
aging distributed computing resources efficiently. complexity. Reflection-tuning (Li et al., 2023a,
Several system-level optimization techniques have 2024a) enhances data quality and instruction-
been developed to this end. Zero Redundancy Data response consistency for instruction tuning by re-
Parallelism (ZeRO) (Rajbhandari et al., 2020) of- fining both instructions and responses using GPT-
fers three progressive stages of optimization, each 4 based on predefined criteria. FANNO (Zhu
partitioning more training states across devices: et al., 2024) augments instructions and generates
ZeRO-1 partitions optimizer states, ZeRO-2 adds responses by incorporating external knowledge
gradient partitioning, and ZeRO-3 further partitions sources through retrieval-augmented generation.
model parameters. PyTorch’s Fully Sharded Data LLM2LLM (Lee et al., 2024b) generates more hard
Parallel (FSDP) (Zhao et al., 2023b) implements samples based on model prediction on training data
similar concepts. These parallelism techniques en- during training.
able training with larger batch sizes, significantly Data augmentation is also effective for synthe-
improving efficiency and scalability for SLMs and sizing new data when training data is limited, such
LLMs. as for low-resource languages (Whitehouse et al.,
2023), medical and clinical applications (Chinta-
3.2 Fine-tuning Techniques gunta et al., 2021), and privacy-sensitive data (Song
Fine-tuning on smaller, task-specific datasets al- et al., 2024), enabling models to generalize better
lows LLMs to leverage the knowledge gained dur- and perform more robustly in constrained settings.
ing pre-training, enabling them to excel in special-
ized tasks or domains. Fine-tuning techniques are 4 Model Compression Techniques
designed to address challenges like limited com-
Model compression techniques focus on reducing
puting resources, data quality, availability, and ro-
the size and complexity of large pre-trained lan-
bustness, ensuring efficient adaptation to new tasks
guage models while maintaining their performance.
without extensive retraining.
As a result, these methods are a key approach to
3.2.1 Parameter-Efficient Fine-Tuning deriving SLMs from LLMs. In this section, we pro-
Parameter-Efficient Fine-Tuning (PEFT) updates pose a taxonomy for model compression that cate-
a small subset of parameters or adds lightweight gorizes such techniques by whether they perform
modules, keeping most of the pre-trained model’s pruning (Section 4.1), quantization (Section 4.2),
parameters fixed. This approach reduces compu- or knowledge distillation (Section 4.3).
tational costs during SLM fine-tuning, preserves
4.1 Pruning Techniques
the model’s knowledge, reduces overfitting, and
improves flexibility. LoRA uses low-rank decom- Weight pruning is a model optimization technique
position (Hu et al., 2021), Prompt Tuning (Lester that reduces the number of parameters to enhance
et al., 2021) inserts learnable prompts into inputs, computational efficiency and lower memory usage,
and Llama-Adapter (Zhang et al., 2023b; Gao et al., all while maintaining performance levels. We dif-
2023) adds prompts to LLaMA’s attention blocks. ferentiate between two major approaches for prun-
Dynamic Adapters (Kong et al., 2024; Feng et al., ing: unstructured pruning and structured pruning.
2024; Gou et al., 2023; Liu et al., 2023b; Luo et al., Unstructured pruning removes less significant
2024b) automatically combine multiple adapters as individual weights, offering fine-grained control
a mixture-of-experts model to enable multi-tasking and flexibility in reducing model size. For ex-
and prevent forgetting (Han et al., 2024; Yang et al., ample, to perform irregular pruning on large lan-
2024). guage models, SparseGPT (Frantar and Alistarh,
2023) reformulates the pruning task as a sparse
3.2.2 Data Augmentation regression problem, optimizing both the remain-
Data augmentation increases the complexity, di- ing and pruned weights using a layer-wise ap-
versity and quality of training data, leading to im- proximate regression solver. SparseGPT can ef-
proved generalization and performance on down- ficiently handle large-scale models like OPT-175B
stream tasks. AugGPT (Dai et al., 2023) rephrases and BLOOM-176B. Additionally, (Boža, 2024) in-
training samples using ChatGPT. Evol-Instruct (Xu tegrates the ADMM (Boyd et al., 2011) algorithm
et al., 2023) uses multistep revisions to generate for weight updates to further mitigate pruning er-
diverse, open-domain instructions with increased rors. Wanda (Sun et al., 2023) incorporates both
weights and activations into consideration during Dettmers et al., 2022b; Kim et al., 2023; Xiao et al.,
pruning process, and eliminates the need of weight 2023; Yao et al., 2022; Lin et al., 2024; Liu et al.,
updates. The n:m pruning strategy (Zhou et al., 2023d, 2024d, 2023c; Shao et al., 2023) that quan-
2021) brings unstructured pruning to model accel- tize both weights and activations are increasingly
eration by pruning exactly n weights out of every being adopted for LLMs. AWQ (Lin et al., 2024)
m, balancing pruning flexibility and computational and ZeroQuant (Yao et al., 2022) take activation
efficiency for significant speedups. NVIDIA’s Ten- into account to assess the importance of weights,
sorRT leverages such sparse patterns to optimize enabling more effective optimization for weight
memory access and reduce computational loads, quantization. In addition, for K/V Cache Quanti-
accelerating inference on GPUs, particularly hard- zation (Hooper et al., 2024; Liu et al., 2024f; Yue
ware like the A100. Notably, unstructured pruning et al., 2024), Key-Value cache is specifically quan-
often results in sparse matrices requiring special- tized for enabling efficient long-sequence length
ized hardware or algorithms to maximize computa- inference.
tional benefits (Frantar and Alistarh, 2023). Another challenge of activation quantization lies
Structured pruning (Wang et al., 2020b; San- in the outliers that fall outside the typical activa-
tacroce et al., 2023; Ma et al., 2023; Tao et al., tion distribution. SmoothQuant (Xiao et al., 2023)
2023; Xia et al., 2024; Kurtić et al., 2024) aims to smoothes activation outliers by migrating quanti-
compress LLMs while maintaining performance zation difficulty from activations to weights. Spin-
by removing groups of parameters in a structured Quant (Liu et al., 2024d) introduces rotation ma-
manner, which enables more efficient hardware im- trices to transform outliers into a new space. Re-
plementation. A major direction in this approach cently, quantization-aware training (QAT) methods,
concerns the sparsity of neurons in the model. For such as LLM-QAT (Liu et al., 2023d) and Edge-
instance, Li et al. (2023b) observes prevalent spar- QAT (Shen et al., 2024b), have gained attention
sity in feed-forward networks. Liu et al. (2023e) due to the strong performance. Both methods adopt
proposes using small neural networks for dynamic distillation with float16 models to recover the quan-
pruning based on input, termed “contextual spar- tizationi error. We also note recent work (Shen
sity”. Mirzadeh et al. (2024) change the activation et al., 2024a,b; Zeng et al., 2024) that implements
functions in pre-trained models to ReLU and fine- the quantized LLMs on mobile devices and FPGAs
tune to improve activation sparsity. to demonstrate the effectiveness and efficiency of
Recent work has also addressed the redundancy the weight and activation quantization for LLMs.
in the Transformer architecture to achieve reduc-
tion of GPU memory usage and speed enhance- 4.3 Knowledge Distillation Techniques
ment (Michel et al., 2019; Voita et al., 2019; Ge
In its classical form, knowledge distillation (Hinton
et al., 2024). For example, Sajjad et al. (2023);
et al., 2015) involves training an efficient model,
Xia et al. (2022) investigates the layer redundancy
known as the “student,” to replicate the behavior
for effective structured pruning. We also highlight
of a larger, more complex model, referred to as
input-dependent pruning methods, such as contex-
the “teacher.” In this section, we particularly fo-
tual sparsity (Liu et al., 2023e) and FastGen (Ge
cus on distillation strategies from one or multiple
et al., 2024), which should be considered along
white-box teacher language model to a target stu-
with the challenges of efficient implementation for
dent language model.
optimizing computation and memory. Appendix A
provides further discussion of pruning techniques. Babyllama (Timiryasov and Tastet, 2023b) is
among the first to develop a compact 58M param-
eter language model using a Llama model as the
4.2 Quantization
teacher. A key finding of this work is that distil-
Quantization is widely adopted to compress LLMs lation from a robust teacher can outperform tra-
with vast parameter counts. The GPTQ (Frantar ditional pre-training on the same dataset. In a
et al., 2022) focuses on layer-wise weight-only similar vein, (Gu et al., 2024) introduce mod-
quantization, using inverse Hessian matrices to ifications in the distillation loss, which enables
minimize the reconstruction error. To fully lever- the student models to generate better quality re-
age the benefits of fast integer matrix multiplica- sponses with improved calibration and lower ex-
tion, more quantization methods (Liu et al., 2023a; posure bias. Sequence-level distillation loss can
also be improved by using a generalized version 5.1 Datasets
of f-divergences as shown in (Wen et al., 2023).
The datasets commonly used for pre-training and
Liang et al. (2023) extend layer-wise distillation
evaluating SLMs across various settings are out-
strategies for language models by using task-aware
lined in Table 2. These datasets provide diverse
filters which distill only the task specific knowl-
contextual examples that enable models to general-
edge from the teacher. Recent works (Wan et al.,
ize effectively across different learning settings.
2024a,b) show that multiple language models can
be fused as a teacher towards distilling knowledge Efficient Inference This setting requires mod-
into small language models by strategically merg- els to generate output as quickly as possible, with
ing their output probability distributions. minimal latency and high throughput. Evaluation
One of the issues in knowledge distillation for datasets for this setting often focus on tasks that
language models is that the distillation strategies require fast response times, such as question an-
are primarily effective when (1) the teacher and the swering, text classification, and natural language
student language model share the same tokenizer understanding. To this end, some of the exam-
and (2) the teacher’s pre-training data is available. ple evaluation datasets for this setting can include
Boizard et al. (2024) addresses this issue by intro- SuperGLUE (Sarlin et al., 2020), SQuAD (Ra-
ducing an universal logit distillation loss inspired jpurkar et al., 2016), TriviaQA (Joshi et al., 2017),
from the optimal transport literature. Often distil- CoQA (Reddy et al., 2019), Natural Questions
lation is also combined with pruning techniques (NQ) (Kwiatkowski et al., 2019), and many more
towards creating smaller language models. For ex- (Chang et al., 2024) that cover various tasks that
ample, (Sreenivas et al., 2024; Muralidharan et al., require faster response time.
2024) show that an iterative step of pruning a large
language model followed by retraining with distil- Privacy-preserving Privacy-preserving datasets
lation losses, can enable strong smaller models. play an important role in enabling the development
of SLMs while safeguarding sensitive information.
Recent advancements have explored methods be-
Datasets such as PrivacyGLUE (Shankar et al.,
yond traditional label distillation by incorporating
2023) apply differential privacy techniques to com-
additional supervision during the distillation pro-
mon tasks such as sentiment analysis. Anonymized
cess to create smaller language models. Hsieh et al.
datasets such as MIMIC (Johnson et al., 2020) and
(2023) find that using “rationales” as an additional
n2c2 datasets1 contain de-identified clinical notes
source of supervision during distillation makes it
for medical tasks, protecting personal health in-
more sample-efficient. Moreover, the authors find
formation. Additionally, federated datasets such
that the distilled model outperforms large-language
as LEAF2 allow data to remain distributed across
models on commonly used NLI, Commonsense QA
devices, supporting privacy by design through fed-
and arithmetic reasoning benchmarks. In a similar
erated learning frameworks.
vein, (Dai et al., 2024; Magister et al., 2023; Ho
et al., 2023; Fu et al., 2023) distill the reasoning TinyML and On-device In these settings, the
chain from a larger language model to a smaller focus is on deploying SLMs in highly resource-
language model along with the label information. constrained environments. Frameworks such as
Such distilled models have been shown to possess TinyBERT (Jiao et al., 2020) and OpenOrca (Lian
improved arithmetic, multi-step math, symbolic et al., 2023) play a pivotal role by enabling the train-
and commonsense reasoning abilities. ing and evaluation of SLMs on curated datasets
tailored for such environments. TinyBERT, a dis-
5 Evaluation tilled version of BERT, is optimized for both size
and speed, making it suitable for on-device applica-
Table 2 presents different evaluation settings along tions with minimal latency requirements. Similarly,
with their corresponding datasets and metrics for subsets like OpenOrca provide useful datasets that
SLMs. In this section, we examine how differ- balance performance and resource constraints, sup-
ent datasets and evaluation metrics are specifically porting the development of small, efficient models
designed to assess SLMs. These evaluation com- 1
https://portal.dbmi.hms.harvard.edu/
ponents are organized according to the constraints projects/n2c2-nlp/
2
they address for SLMs. https://github.com/TalwalkarLab/leaf
Setting Constraints Datasets Metrics
Efficient Inference Latency SuperGLUE (Sarlin et al., 2020), SQuAD (Ra- Inference Time (Narayanan et al., 2023), Throughput
jpurkar et al., 2016), TriviaQA (Joshi et al., 2017), (Arora et al., 2024)
CoQA (Reddy et al., 2019), Natural Questions (NQ)
(Kwiatkowski et al., 2019)
On-device/Mobile Memory TinyBERT (Jiao et al., 2020) and OpenOrca (Lian Peak Memory Usage (Lee et al., 2024a), Memory
et al., 2023) Footprint, Compression Ratio (Cao et al., 2024)
Privacy-Preserving Privacy PrivacyGLUE (Shankar et al., 2023), MIMIC (John- Privacy Budget (Yu et al., 2024), Noise Level
son et al., 2020) (Havrilla et al., 2024)

Energy-Efficient AI Energy Optimiza- - Energy Efficiency Ratio (Stojkovic et al., 2024b),


tion Thermal Efficiency, Idle Power Consumption (Patel
et al., 2024)

Table 2: Overview of Settings, Constraints, and Metrics.

that can be deployed on low-power devices without Energy Optimization The energy efficiency ra-
sacrificing accuracy. tio (Stojkovic et al., 2024b) evaluates the energy
used relative to the model’s overall performance,
5.2 Metrics providing insights into how energy-intensive an
The key metrics for evaluating SLMs across dif- SLM is in practice. Other metrics, such as ther-
ferent settings are presented in Table 2. The eval- mal efficiency and idle power consumption (Patel
uation metrics are organized based on the specific et al., 2024), measure the energy consumed when
constraints. the model is either actively processing tasks or
idle, which is crucial for long-term deployment in
Latency Two key metrics to evaluate latency energy-constrained environments like embedded
are inference time (Narayanan et al., 2023) and systems or mobile devices.
throughput (Arora et al., 2024). Inference time
measures how quickly a model can process input 6 Applications
and generate an output, which is crucial for user-
In this section, we consider applications of SLMs,
facing applications that require immediate feed-
that is, specific use-cases like translation and auto-
back. Throughput, on the other hand, evaluates the
completion.
number of tokens or samples a model can process
in a given period, making it especially relevant for 6.1 Real-Time Interaction
large-scale tasks or time-sensitive applications.
GPT-4o, released in May 2024, processes text, vi-
Memory When deploying models in memory- sion, and audio input end-to-end and is faster than
constrained environments, memory efficiency be- GPT-4 Turbo (OpenAI, 2024b). The demonstration
comes a primary consideration. Metrics such as involved responses in the style of human conver-
peak memory usage (Lee et al., 2024a) capture the sation. LLaMA-Omni combine a speech encoder,
highest amount of memory the model consumes adaptor, LLM, and streaming decoder to enable
during inference. Similarly, memory footprint and real-time interaction with speech input based on
compression ratio (Cao et al., 2024) are used to LLaMA-3-8B-Instruct (Fang et al., 2024). Emo-
measure how compact a model is and the efficiency tionally Omni-present Voice Assistant, or EMOVA,
of the compression techniques applied, enabling apply LLaMA-3.1-8B as an end-to-end speech
models to operate within memory constraints with- model that can generate poems and describe images
out sacrificing performance. at the user’s request. Google Deepmind’s Project
Astra uses Gemini to process audio and video infor-
Privacy Privacy budget (Yu et al., 2024), a mea- mation from a smartphone or glasses and respond
sure rooted in differential privacy, quantifies the to respond to queries like mathematics problems
model’s ability to protect sensitive information dur- and memorize object sequences (Deepmind, 2024).
ing both training and inference. Alongside this,
noise level (Havrilla et al., 2024) measures the 6.2 Content Generation and Processing
trade-off between privacy and accuracy by assess- LLMR uses LLMs in mixed reality to generate
ing how much noise is added to ensure privacy and modify 3D scenes. It combines language mod-
while maintaining the model’s performance. els used in several roles - a Scene Analyzer GPT
Inference Runtime

Comm. Overhead
Storage Space
Memory

Latency
Category Application Need for SLM Application
Chatbots Real-time response needed, lightweight ✓ ✓ ✓ ✓
Real-Time Interaction Voice Interfaces Low latency required for real-time ✓ ✓ ✓
Translation Real-time translation with low-resources ✓ ✓ ✓ ✓
Text Summarization Faster inference, minimal resource use ✓ ✓ ✓ ✓

Content Generation Sentiment Analysis Efficient analysis in low-resource envir. ✓ ✓ ✓ ✓

& Processing Text Classification Low latency, on-the-fly processing ✓ ✓ ✓ ✓


NLP for Search Low latency for real-time search ✓ ✓ ✓
Autocompletion Fast prediction with low memory ✓ ✓ ✓ ✓

Table 3: Taxonomy of Applications of Small Language Models.

to summarize objects and give further details like ogy usable without an internet connection.
color, Skill Library GPT to determine what is re- Mixture-of-Experts can reduce inference cost
quired to fufill a user’s request, Builder GPT to by using a gating network to use only a subset of
generate code for the request, and Inspector GPT layers during inference time (Shazeer et al., 2017).
to evaluate its code (Torre et al., 2024). Dream- Google’s GLaM uses mixture-of-experts (Du et al.,
CodeVR assists users in editing an application in 2022) but is a 1.2T parameter model. EdgeMoE ex-
the Unity engine through code generation (Giunchi tend misture-of-experts to edge computing using an
et al., 2024; Juliani et al., 2020). This permits users Nvidia Jetson TX2 and Raspberry Pi 4B, with the
to edit VR applications without requiring extensive latter device being CPU-only (Sarkar et al., 2023).
programming knowledge. Based on experimental findings that most weights
contribute little to the final computation, the au-
6.3 Edge Inference and Privacy thors compress weights and predict the relevant
On-device LLMs maintain usability even when experts in advance.
MobileLLM improve on various chat benchmarks
7 Open Problems
and performs comparably with LLaMA-2-7B in
API calling (Liu et al., 2024e). Apple Intelli- In this section, we discuss open problems and high-
gence applies an 3B parameter model to perform light important areas for future work. Hallucination
on-device inference for a broad range of tasks, and bias are a concern shared by SLMs and LLMs
such as text and notification summarization, im- (Section 7.1 and 7.2). In Section 7.3, we discuss
age and emoji generation, and code completion the increased demand of energy efficiency during
for XCode (Gunter et al., 2024; Research, 2024). inference. Finally, we examine the privacy risks of
On-device inference reduces latency as measured SLMs in Section 7.4.
by the time to first generated token (Hu et al.,
2024; Gerganov). HuatuoGPT is a domain-adapted 7.1 Hallucination
LLM for medical dialogue and BioMistral is an A pervasive problem with LLMs is hallucination,
LLM tailored for biomedical work (Zhang et al., defined as content that is nonsensical or untruth-
2023a; Labrak et al., 2024). Applications related ful in relation to certain sources (OpenAI, 2024a).
to medicine may need to adhere to stringent pri- OpenAI (2024a) propose that as users rely more
vacy regulations and represent a promising area for on models, the harm caused by hallucinations may
future work. TalkBack with GeminiNano assists be increased. Hallucination can be classified into
blind and low vision people by describing and cap- two types: factuality and faithfulness (relevance).
tioning images and runs on Android devices (Team, With hallucination of factuality, the generation is
2024b). On-device inference makes this technol- inconsistent with verifiable facts. In faithfulness
hallucination, generation lacks relevance to user used at inference time, and the user query. Query
queries (Huang et al., 2023). HallusionBench, a privacy is especially important in SLMs.
benchmark for image-context reasoning in vision-
Training Data Li et al. (2024b) address training
language models, found that larger sizes reduced
and system prompt leaking. The authors find that
hallucinations (Guan et al., 2024). Analysis of the
the risk of training data leakage increased faster
AMBER hallucination benchmark find that the type
than their measure of utility for the model series
of hallucination varies as parameter count changes
Pythia (Biderman et al., 2023). They also find that
in Minigpt-4 (Wang et al., 2024). However, find
data towards the end of pre-training is easier to
that bias increases with parameter count for the
extract, with attention layers as a possible cause.
LLaMA series of models (Zhao et al., 2023a). Fu-
ture work may need to consider not only how total System Prompt Liu et al. (2024c) describe unau-
hallucinations change in SLMs, but also the type thorized retrieval of the system prompt as prompt
and severity may be influenced by model size. leaking and use of the prompt for unintended pur-
poses as prompt abuse. They give the example of
7.2 Biases
getting a prompt designed to rephrase user queries
Language models have been found to reproduce to generate code, leading to unexpected cost using
biases present in training data (Brown et al., 2020; Pear AI3 .
OpenAI, 2024a; Touvron et al., 2023a).
Inference-time Data Unlike with the leakage of
Measuring Bias Methods for measuring bias training data and the system prompt, this primarily
such as Bias Benchmark for Question Answer- impacts the end-users of a model. In June 2024,
ing (BBQ) (Parrish et al., 2022), RealToxici- Apple announced the application of language mod-
tyPrompts (Gehman et al., 2020), and Crowd- els to the digital assistant Siri (Research, 2024). In
sourced Stereotype Pairs benchmark (CrowS- the context of digital assistants, SLMs may need to
Pairs) (Nangia et al., 2020). interface with user data like location history or pro-
Influence of Parameter Count (Touvron et al., tected health information. If such data were used to
2023a) find that larger LLaMA models exhibit in- train or protect a model from misuse, users might
creased measured bias on RealToxicityPrompts. face externalities. Existing literature is limited.
(Zhao et al., 2023a) replicate this with Stere-
oSet (Nadeem et al., 2021) and their metric GPT-
8 Conclusion
BIAS, which uses GPT-4 to classify responses as Given the growing importance of SLMs due to their
biased or unbiased. For comparable model sizes, efficiency and applicability across a wide range of
LLaMA-2 had less measured bias than the previous devices and environments, this paper has surveyed
generation (Touvron et al., 2023c). SLMs including model architectures, training tech-
niques, and model compression techniques for op-
7.3 Inference-time Energy Use
timizing SLMs. We also introduced an intuitive
Energy efficiency is a high priority for SLMs, espe- taxonomy of evaluation metrics for SLMs and sum-
cially when used on battery-powered devices. Hu- marize various settings and applications where they
som et al. (2024) find that architecture significantly are important. Furthermore, we summarized the
influences power consumption using the MELODI training and benchmark datasets that have been
benchmar. CPU-only inference was found to be used for SLMs. Finally, we highlighted the funda-
generally less efficient than on GPU and that lap- mental challenges and open problems that remain
tops require more energy for inference. The au- to be addressed. We hope this survey serves as a
thors find response token length to be the most valuable resource for both researchers and practi-
effective predictor of energy usage, suggesting that tioners. driving the next advancements in small yet
more concise responses can help to extend battery powerful language models.
life. Stojkovic et al. (2024a) find that energy usage
can be reduced by about 20 9 Limitations
7.4 Data Privacy While SLMs present a broad array of benefits, risks
Privacy concerns can be broadly classified into and limitations must also be considered. Hallucina-
3
three categories: training data, the system prompt https://www.parea.ai
tion (discussed in Section 7.1) and reinforcement Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie
of societal biases (discussed in Section 7.2) are Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda
widely recognized risks of large language models.
Askell, Sandhini Agarwal, Ariel Herbert-Voss,
While research has been performed to measure and Gretchen Krueger, Tom Henighan, Rewon Child,
reduce these behaviors, they have yet to be fully Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu,
mitigated. Utama et al. (2020) introduce a frame- Clemens Winter, Christopher Hesse, Mark Chen, Eric
work to reduce self-bias without the specific bias Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess,
Jack Clark, Christopher Berner, Sam McCandlish,
known at test time. Such methods may become Alec Radford, Ilya Sutskever, and Dario Amodei.
more effective with general increases in model ca- 2020. Language models are few-shot learners.
pability. However, risks specific to groups from
which researchers are not primarily drawn may re- Neil Burgess, Jelena Milanovic, Nigel Stephens, Kon-
stantinos Monachopoulos, and David Mansell. 2019.
main unrecognized. Bfloat16 processing for neural networks. In 2019
IEEE 26th Symposium on Computer Arithmetic
(ARITH), pages 88–91. IEEE.
References
Zhiwei Cao, Qian Cao, Yu Lu, Ningxin Peng, Luyang
Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Huang, Shanbo Cheng, and Jinsong Su. 2024. Retain-
Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, ing key information under high compression ratios:
Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harki- Query-guided compressor for llms. In Proceedings
rat Behl, et al. 2024. Phi-3 technical report: A highly of the 62nd Annual Meeting of the Association for
capable language model locally on your phone. arXiv Computational Linguistics (Volume 1: Long Papers),
preprint arXiv:2404.14219. pages 12685–12695, Bangkok, Thailand. Association
for Computational Linguistics.
Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman
Timalsina, Sinan Kaplan, Megan Leszczynski, Isys Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu,
Johnson, Vishal Subbiah, Azalia Mirhoseini, James Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi,
Zou, and Christopher Ré. 2024. Simple linear atten- Cunxiang Wang, Yidong Wang, et al. 2024. A sur-
tion language models balance the recall-throughput vey on evaluation of large language models. ACM
tradeoff. Transactions on Intelligent Systems and Technology,
15(3):1–45.
Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020.
Longformer: The long-document transformer. CoRR, Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su,
abs/2004.05150. Guo Chen, Sen Xing, Muyan Zhong, Qinglong
Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo,
Lucas Beyer, Andreas Steiner, André Susano Pinto, Tong Lu, Yu Qiao, and Jifeng Dai. 2023. Internvl:
Alexander Kolesnikov, Xiao Wang, Daniel Salz, Scaling up vision foundation models and aligning
Maxim Neumann, Ibrahim Alabdulmohsin, Michael for generic visual-linguistic tasks. arXiv preprint
Tschannen, Emanuele Bugliarello, et al. 2024. arXiv:2312.14238.
Paligemma: A versatile 3b vlm for transfer. arXiv
preprint arXiv:2407.07726. Bharath Chintagunta, Namit Katariya, Xavier Amatri-
ain, and Anitha Kannan. 2021. Medically aware
Stella Biderman, Hailey Schoelkopf, Quentin Gregory gpt-3 as a data generator for medical dialogue sum-
Anthony, Herbie Bradley, Kyle O’Brien, Eric Hal- marization. In Machine Learning for Healthcare
lahan, Mohammad Aflah Khan, Shivanshu Purohit, Conference, pages 354–372. PMLR.
USVSN Sai Prashanth, Edward Raff, et al. 2023.
Pythia: A suite for analyzing large language mod- Chengwei Dai, Kun Li, Wei Zhou, and Songlin Hu.
els across training and scaling. In International 2024. Beyond imitation: Learning key reasoning
Conference on Machine Learning, pages 2397–2430. steps from dual chain-of-thoughts in reasoning distil-
PMLR. lation.

Nicolas Boizard, Kevin El Haddad, Céline Hudelot, Haixing Dai, Zhengliang Liu, Wenxiong Liao, Xiaoke
and Pierre Colombo. 2024. Towards cross-tokenizer Huang, Yihan Cao, Zihao Wu, Lin Zhao, Shaochen
distillation: the universal logit distillation loss for Xu, Wei Liu, Ninghao Liu, et al. 2023. Auggpt:
llms. Leveraging chatgpt for text data augmentation. arXiv
preprint arXiv:2302.13007.
Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato,
and Jonathan Eckstein. 2011. [link]. Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and
Christopher Ré. 2022. Flashattention: Fast and
Vladimír Boža. 2024. Fast and optimal weight update memory-efficient exact attention with io-awareness.
for pruned large language models. arXiv preprint Advances in Neural Information Processing Systems,
arXiv:2401.02938. 35:16344–16359.
Tri Dao and Albert Gu. 2024. Transformers are SSMs: Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and
Generalized models and efficient algorithms through Dan Alistarh. 2022. Gptq: Accurate post-training
structured state space duality. In Proceedings of the quantization for generative pre-trained transformers.
41st International Conference on Machine Learning, arXiv preprint arXiv:2210.17323.
volume 235 of Proceedings of Machine Learning
Research, pages 10041–10071. PMLR. Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and
Tushar Khot. 2023. Specializing smaller language
Google Deepmind. 2024. Project astra a universal ai models towards multi-step reasoning.
agent that is helpful in everyday life.
Vinod Ganesan, Gowtham Ramesh, and Pratyush Ku-
Tim Dettmers, Mike Lewis, Younes Belkada, and Luke mar. 2021. Supershaper: Task-agnostic super pre-
Zettlemoyer. 2022a. Gpt3. int8 (): 8-bit matrix mul- training of bert models with variable hidden dimen-
tiplication for transformers at scale. Advances in sions. arXiv preprint arXiv:2110.04711.
Neural Information Processing Systems, 35:30318–
30332. Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie
Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui
Tim Dettmers, Mike Lewis, Younes Belkada, and Luke He, Xiangyu Yue, Hongsheng Li, and Yu Qiao. 2023.
Zettlemoyer. 2022b. Llm. int8 (): 8-bit matrix mul- Llama-adapter v2: Parameter-efficient visual instruc-
tiplication for transformers at scale. arXiv preprint tion model. arXiv preprint arXiv:2304.15010.
arXiv:2208.07339.
Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang,
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Jiawei Han, and Jianfeng Gao. 2024. Model tells you
Luke Zettlemoyer. 2024. Qlora: Efficient finetuning what to discard: Adaptive KV cache compression for
of quantized llms. Advances in Neural Information LLMs. In The Twelfth International Conference on
Processing Systems, 36. Learning Representations.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Samuel Gehman, Suchin Gururangan, Maarten Sap,
Kristina Toutanova. 2019. BERT: Pre-training of
Yejin Choi, and Noah A. Smith. 2020. Realtoxic-
deep bidirectional transformers for language under-
ityprompts: Evaluating neural toxic degeneration in
standing. In Proceedings of the 2019 Conference
language models.
of the North American Chapter of the Association
for Computational Linguistics: Human Language
Georgi Gerganov. llama.cpp.
Technologies, Volume 1 (Long and Short Papers).

P Kingma Diederik. 2014. Adam: A method for stochas- Daniele Giunchi, Nels Numan, Elia Gatti, and Anthony
tic optimization. (No Title). Steed. 2024. DreamCodeVR: Towards Democratiz-
ing Behavior Design in Virtual Reality with Speech-
Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Driven Programming. In 2024 IEEE Conference Vir-
Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, tual Reality and 3D User Interfaces (VR), Orlando,
Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret USA. IEEE.
Zoph, Liam Fedus, Maarten P Bosma, Zongwei Zhou,
Tao Wang, Emma Wang, Kellie Webster, Marie Pel- Yunhao Gou, Zhili Liu, Kai Chen, Lanqing Hong, Hang
lat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Xu, Aoxue Li, Dit-Yan Yeung, James T Kwok, and
Duke, Lucas Dixon, Kun Zhang, Quoc Le, Yonghui Yu Zhang. 2023. Mixture of cluster-conditional lora
Wu, Zhifeng Chen, and Claire Cui. 2022. GLaM: experts for vision-language instruction tuning. arXiv
Efficient scaling of language models with mixture- preprint arXiv:2312.12379.
of-experts. In Proceedings of the 39th International
Conference on Machine Learning, volume 162 of Albert Gu and Tri Dao. 2023. Mamba: Linear-time
Proceedings of Machine Learning Research, pages sequence modeling with selective state spaces. arXiv
5547–5569. PMLR. preprint arXiv:2312.00752.

Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. 2024.
Shaolei Zhang, and Yang Feng. 2024. Llama-omni: Minillm: Knowledge distillation of large language
Seamless speech interaction with large language mod- models.
els.
Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian,
Wenfeng Feng, Chuzhan Hao, Yuewei Zhang, Yu Han, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen,
and Hao Wang. 2024. Mixture-of-loras: An efficient Furong Huang, Yaser Yacoob, Dinesh Manocha, and
multitask tuning for large language models. arXiv Tianyi Zhou. 2024. Hallusionbench: An advanced
preprint arXiv:2403.03432. diagnostic suite for entangled language hallucination
and visual illusion in large vision-language models.
Elias Frantar and Dan Alistarh. 2023. Sparsegpt: Mas-
sive language models can be accurately pruned in Tom Gunter, Zirui Wang, Chong Wang, Ruoming Pang,
one-shot. In International Conference on Machine Andy Narayanan, Aonan Zhang, et al. 2024. Apple
Learning, pages 10323–10337. PMLR.
intelligence foundation language models.
Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. 2020. Single path one-shot neural architecture search with uniform sampling. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI 16, pages 544–560. Springer.
Jiayi Han, Liang Du, Hongwei Du, Xiangguo Zhou, Yiwen Wu, Weibo Zheng, and Donghong Han. 2024. Slim: Let llm learn more and forget less with soft lora and identity mixture. arXiv preprint arXiv:2410.07739.
Alex Havrilla, Yilun Du, Chuanyang Zheng, Phillip Isola, and Joshua B. Tenenbaum. 2024. Understanding the effect of noise in llm training data with algorithmic chains of thought.
Geoffrey Hinton, Oriol Vinyals, Jeff Dean, et al. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2(7).
Namgyu Ho, Laura Schmid, and Se-Young Yun. 2023. Large language models are reasoning teachers.
Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. 2024. Kvquant: Towards 10 million context length llm inference with kv cache quantization. arXiv preprint arXiv:2401.18079.
Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. 2023. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes.
Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, and Yizhou Shan. 2024. Inference without interference: Disaggregate llm inference for mixed downstream workloads.
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2023. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.
Erik Johannes Husom, Arda Goknil, Lwin Khin Shar, and Sagar Sen. 2024. The price of prompting: Profiling energy use in large language models inference.
Ganesh Jawahar, Haichuan Yang, Yunyang Xiong, Zechun Liu, Dilin Wang, Fei Sun, Meng Li, Aasish Pappu, Barlas Oguz, Muhammad Abdul-Mageed, et al. 2023. Mixture-of-supernets: Improving weight-sharing supernet training with architecture-routed mixture-of-experts. arXiv preprint arXiv:2306.04845.
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. Tinybert: Distilling bert for natural language understanding. arXiv preprint arXiv:1909.10351.
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2020. Tinybert: Distilling bert for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4163–4174.
Alistair Johnson, Lucas Bulgarelli, Tom Pollard, Steven Horng, Leo Anthony Celi, and Roger Mark. 2020. Mimic-iv. PhysioNet. Available online at: https://physionet.org/content/mimiciv/1.0/ (accessed August 23, 2021), pages 49–55.
Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.
Arthur Juliani, Vincent-Pierre Berges, Ervin Teng, Andrew Cohen, Jonathan Harper, Chris Elion, Chris Goy, Yuan Gao, Hunter Henry, Marwan Mattar, and Danny Lange. 2020. Unity: A general platform for intelligent agents.
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. 2020. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR.
Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W Mahoney, and Kurt Keutzer. 2023. Squeezellm: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629.
Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451.
Rui Kong, Qiyang Li, Xinyu Fang, Qingtian Feng, Qingfeng He, Yazhu Dong, Weijun Wang, Yuanchun Li, Linghe Kong, and Yunxin Liu. 2024. Lora-switch: Boosting the efficiency of dynamic llm adapters via system-algorithm co-design. arXiv preprint arXiv:2405.17741.
Eldar Kurtić, Elias Frantar, and Dan Alistarh. 2024. Ziplm: Inference-aware structured pruning of language models. Advances in Neural Information Processing Systems, 36.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466.
Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre-Antoine Gourraud, Mickael Rouvier, and Richard Dufour. 2024. Biomistral: A collection of open-source pretrained large language models for medical domains.
Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. 2024. What matters when building vision-language models? arXiv preprint arXiv:2405.02246.
Jaewook Lee, Yoel Park, and Seulki Lee. 2024a. Designing extremely memory-efficient cnns for on-device vision tasks.
Nicholas Lee, Thanakul Wattanawong, Sehoon Kim, Karttikeya Mangalam, Sheng Shen, Gopala Anumanchipali, Michael W Mahoney, Kurt Keutzer, and Amir Gholami. 2024b. Llm2llm: Boosting llms with novel iterative data enhancement. arXiv preprint arXiv:2403.15042.
Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.
Ming Li, Lichang Chen, Jiuhai Chen, Shwai He, Jiuxiang Gu, and Tianyi Zhou. 2024a. Selective reflection-tuning: Student-selected data recycling for llm instruction-tuning. arXiv preprint arXiv:2402.10110.
Ming Li, Lichang Chen, Jiuhai Chen, Shwai He, Heng Huang, Jiuxiang Gu, and Tianyi Zhou. 2023a. Reflection-tuning: Data recycling improves llm instruction-tuning. arXiv preprint arXiv:2310.11716.
Qinbin Li, Junyuan Hong, Chulin Xie, Jeffrey Tan, Rachel Xin, Junyi Hou, Xavier Yin, Zhun Wang, Dan Hendrycks, Zhangyang Wang, Bo Li, Bingsheng He, and Dawn Song. 2024b. Llm-pbe: Assessing data privacy in large language models.
Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. 2024c. Mini-gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814.
Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J. Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, and Sanjiv Kumar. 2023b. The lazy neuron phenomenon: On emergence of activation sparsity in transformers. In The Eleventh International Conference on Learning Representations.
Wing Lian, Bleys Goodson, Eugene Pentland, Austin Cook, Chanvichet Vong, and "Teknium". 2023. Openorca: An open dataset of gpt augmented flan reasoning traces. https://huggingface.co/Open-Orca/OpenOrca.
Chen Liang, Simiao Zuo, Qingru Zhang, Pengcheng He, Weizhu Chen, and Tuo Zhao. 2023. Less is more: Task-aware layer-wise distillation for language model compression.
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems, 6:87–100.
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. 2024a. Llava-next: Improved reasoning, ocr, and world knowledge.
Hong Liu, Zhiyuan Li, David Leo Wright Hall, Percy Liang, and Tengyu Ma. 2024b. Sophia: A scalable stochastic second-order optimizer for language model pre-training. In The Twelfth International Conference on Learning Representations.
Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, and Bohan Zhuang. 2023a. Qllm: Accurate and efficient low-bitwidth quantization for large language models. arXiv preprint arXiv:2310.08041.
Qidong Liu, Xian Wu, Xiangyu Zhao, Yuanshao Zhu, Derong Xu, Feng Tian, and Yefeng Zheng. 2023b. Moelora: An moe-based parameter efficient fine-tuning method for multi-task medical applications. arXiv preprint arXiv:2310.18339.
Shih-yang Liu, Zechun Liu, Xijie Huang, Pingcheng Dong, and Kwang-Ting Cheng. 2023c. Llm-fp4: 4-bit floating-point quantized transformers. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 592–605.
Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. 2024c. Prompt injection attack against llm-integrated applications.
Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. 2023d. Llm-qat: Data-free quantization aware training for large language models. arXiv preprint arXiv:2305.17888.
Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. 2024d. Spinquant: Llm quantization with learned rotations. arXiv preprint arXiv:2405.16406.
Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, et al. 2024e. MobileLLM: Optimizing sub-billion parameter language models for on-device use cases. arXiv preprint arXiv:2402.14905.
Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, et al. 2023e. Deja vu: Contextual sparsity for efficient llms at inference time. In International Conference on Machine Learning, pages 22137–22176. PMLR.
Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. 2024f. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750.
Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.
Gen Luo, Xue Yang, Wenhan Dou, Zhaokai Wang, Jifeng Dai, Yu Qiao, and Xizhou Zhu. 2024a. Mono-internvl: Pushing the boundaries of monolithic multimodal large language models with endogenous visual pre-training. arXiv preprint arXiv:2410.08202.
Tongxu Luo, Jiahe Lei, Fangyu Lei, Weihao Liu, Shizhu He, Jun Zhao, and Kang Liu. 2024b. Moelora: Contrastive learning guided mixture of experts on parameter-efficient fine-tuning for large language models. arXiv preprint arXiv:2402.12851.
Weile Luo, Ruibo Fan, Zeyu Li, Dayou Du, Qiang Wang, and Xiaowen Chu. 2024. Benchmarking and dissecting the nvidia hopper gpu architecture. arXiv preprint arXiv:2402.13499.
Xinyin Ma, Gongfan Fang, and Xinchao Wang. 2023. Llm-pruner: On the structural pruning of large language models. Advances in neural information processing systems, 36:21702–21720.
Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adamek, Eric Malmi, and Aliaksei Severyn. 2023. Teaching small language models to reason.
Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? Advances in neural information processing systems, 32.
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. 2018. Mixed precision training. In International Conference on Learning Representations.
Bonan Min, Hayley Ross, Elior Sulem, Amir Pouran Ben Veyseh, Thien Huu Nguyen, Oscar Sainz, Eneko Agirre, Ilana Heinz, and Dan Roth. 2021. Recent advances in natural language processing via large pre-trained language models: A survey. ACM Computing Surveys, 56:1–40.
Seyed Iman Mirzadeh, Keivan Alizadeh-Vahid, Sachin Mehta, Carlo C del Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, and Mehrdad Farajtabar. 2024. ReLU strikes back: Exploiting activation sparsity in large language models. In The Twelfth International Conference on Learning Representations.
Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, and Pavlo Molchanov. 2024. Compact language models via pruning and knowledge distillation.
Moin Nadeem, Anna Bethke, and Siva Reddy. 2021. StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5356–5371, Online. Association for Computational Linguistics.
Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R. Bowman. 2020. CrowS-pairs: A challenge dataset for measuring social biases in masked language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1953–1967, Online. Association for Computational Linguistics.
Deepak Narayanan, Keshav Santhanam, Peter Henderson, Rishi Bommasani, Tony Lee, and Percy Liang. 2023. Cheaply evaluating inference efficiency metrics for autoregressive transformer apis.
E. J. Nyström. 1930. Über Die Praktische Auflösung von Integralgleichungen mit Anwendungen auf Randwertaufgaben. Acta Mathematica, 54:185–204.
OpenAI. 2024a. Gpt-4 technical report.
OpenAI. 2024b. Hello gpt-4o.
Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel R. Bowman. 2022. Bbq: A hand-built bias benchmark for question answering.
Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Brijesh Warrier, Nithish Mahalingam, and Ricardo Bianchini. 2024. Characterizing power management opportunities for llms in the cloud. In ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, New York, NY, USA. Association for Computing Machinery.
Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Leon Derczynski, Xingjian Du, Matteo Grella, Kranthi Gv, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartłomiej Koptyra, Hayden Lau, Jiaju Lin, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Guangyu Song, Xiangru Tang, Johan Wind, Stanisław Woźniak, Zhenyuan Zhang, Qinghua Zhou, Jian Zhu, and Rui-Jie Zhu. 2023. RWKV: Reinventing RNNs for the transformer era. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14048–14077, Singapore. Association for Computational Linguistics.
Lau, Jiaju Lin, Krishna Sri Ipsit Mantri, Ferdinand Rishov Sarkar, Hanxue Liang, Zhiwen Fan, Zhangyang
Mom, Atsushi Saito, Guangyu Song, Xiangru Tang, Wang, and Cong Hao. 2023. Edge-moe: Memory-
Johan Wind, Stanisław Woźniak, Zhenyuan Zhang, efficient multi-task vision transformer architecture
Qinghua Zhou, Jian Zhu, and Rui-Jie Zhu. 2023. with task-level sparsity via mixture-of-experts.
RWKV: Reinventing RNNs for the transformer era.
In Findings of the Association for Computational Paul-Edouard Sarlin, Daniel DeTone, Tomasz Mal-
Linguistics: EMNLP 2023, pages 14048–14077, Sin- isiewicz, and Andrew Rabinovich. 2020. Superglue:
gapore. Association for Computational Linguistics. Learning feature matching with graph neural net-
works. In Proceedings of the IEEE/CVF conference
Alec Radford, Karthik Narasimhan, Tim Salimans, and on computer vision and pattern recognition, pages
Ilya Sutskever. 2018. Improving language under- 4938–4947.
standing by generative pre-training. OpenAI blog.
Atreya Shankar, Andreas Waldis, Christof Bless, Maria
Alec Radford, Jeff Wu, Rewon Child, David Luan,
Andueza Rodriguez, and Luca Mazzola. 2023. Pri-
Dario Amodei, and Ilya Sutskever. 2019. Language
vacyglue: A benchmark dataset for general language
models are unsupervised multitask learners.
understanding in privacy policies. Applied Sciences,
Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie 13(6):3701.
Millican, Jordan Hoffmann, Francis Song, John
Aslanides, Sarah Henderson, Roman Ring, Susan- Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng
nah Young, et al. 2021. Scaling language models: Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng
Methods, analysis & insights from training gopher. Gao, Yu Qiao, and Ping Luo. 2023. Omniquant:
arXiv preprint arXiv:2112.11446. Omnidirectionally calibrated quantization for large
language models. arXiv preprint arXiv:2308.13137.
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase,
and Yuxiong He. 2020. Zero: Memory optimizations Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz,
toward training trillion parameter models. In SC20: Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff
International Conference for High Performance Com- Dean. 2017. Outrageously large neural networks:
puting, Networking, Storage and Analysis, pages 1– The sparsely-gated mixture-of-experts layer.
16. IEEE.
Noam Shazeer and Mitchell Stern. 2018. Adafactor:
Pranav Rajpurkar, Jian Zhang, Konstantin Liu, and Adaptive learning rates with sublinear memory cost.
Percy Liang. 2016. Squad: 100,000+ questions for In International Conference on Machine Learning,
machine comprehension of text. pages 4596–4604. PMLR.
Siva Reddy, Danqi Chen, and Christopher D Manning.
Tianhao Shen, Renren Jin, Yufei Huang, Chuang Liu,
2019. Coqa: A conversational question answering
Weilong Dong, Zishan Guo, Xinwei Wu, Yan Liu,
challenge. Transactions of the Association for Com-
and Deyi Xiong. 2023. Large language model align-
putational Linguistics, 7:249–266.
ment: A survey. ArXiv, abs/2309.15025.
Apple Machine Learning Research. 2024. Introducing
apple’s on-device and server foundation models. Xuan Shen, Peiyan Dong, Lei Lu, Zhenglun Kong,
Zhengang Li, Ming Lin, Chao Wu, and Yanzhi Wang.
Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2024a. Agile-quant: Activation-guided quantization
2020. A primer in BERTology: What we know about for faster inference of llms on the edge. In Proceed-
how BERT works. Transactions of the Association ings of the AAAI Conference on Artificial Intelligence,
for Computational Linguistics, 8:842–866. pages 18944–18951.
Aurko Roy, Mohammad Saffar, Ashish Vaswani, and Xuan Shen, Zhenglun Kong, Changdi Yang, Zhaoyang
David Grangier. 2021. Efficient content-based sparse Han, Lei Lu, Peiyan Dong, Cheng Lyu, Chih-hsiang
attention with routing transformers. Transactions of Li, Xuehang Guo, Zhihao Shu, et al. 2024b. Edgeqat:
the Association for Computational Linguistics, 9:53– Entropy and distribution guided quantization-aware
68. training for the acceleration of lightweight llms on
the edge. arXiv preprint arXiv:2402.10787.
Hassan Sajjad, Fahim Dalvi, Nadir Durrani, and Preslav
Nakov. 2023. On the effect of dropping layers of
Xuan Shen, Pu Zhao, Yifan Gong, Zhenglun Kong,
pre-trained transformer models. Computer Speech &
Zheng Zhan, Yushu Wu, Ming Lin, Chao Wu,
Language, 77:101429.
Xue Lin, and Yanzhi Wang. 2024c. Search for
V Sanh. 2019. Distilbert, a distilled version of bert: efficient large language models. arXiv preprint
Smaller, faster, cheaper and lighter. arXiv preprint arXiv:2409.17372.
arXiv:1910.01108.
Yiping Song, Juhua Zhang, Zhiliang Tian, Yuxin Yang,
Michael Santacroce, Zixin Wen, Yelong Shen, and Minlie Huang, and Dongsheng Li. 2024. Llm-based
Yuanzhi Li. 2023. What matters in the structured privacy data augmentation guided by knowledge dis-
pruning of generative language models? arXiv tillation with a distribution tutor for medical text clas-
preprint arXiv:2302.03773. sification. arXiv preprint arXiv:2402.16515.
Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, and Pavlo Molchanov. 2024. Llm pruning and distillation in practice: The minitron approach.
Jovan Stojkovic, Esha Choukse, Chaojie Zhang, Inigo Goiri, and Josep Torrellas. 2024a. Towards greener llms: Bringing energy-efficiency to the forefront of llm inference.
Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, and Esha Choukse. 2024b. Dynamollm: Designing llm inference clusters for performance and energy efficiency.
Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. 2023. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695.
Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. 2020. MobileBERT: a compact task-agnostic BERT for resource-limited devices. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2158–2170, Online. Association for Computational Linguistics.
Mingxing Tan and Quoc Le. 2019. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR.
Chaofan Tao, Lu Hou, Haoli Bai, Jiansheng Wei, Xin Jiang, Qun Liu, Ping Luo, and Ngai Wong. 2023. Structured pruning for efficient generative pre-trained language models. In Findings of the Association for Computational Linguistics: ACL 2023, pages 10880–10895.
Jean-Loup Tastet and Inar Timiryasov. 2024. Babyllama-2: Ensemble-distilled models consistently outperform teachers with limited data. arXiv preprint arXiv:2409.17312.
Chameleon Team. 2024a. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818.
Gemini Team. 2024b. Gemini: A family of highly capable multimodal models.
Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. 2024. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295.
Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M Anwer, Michael Felsberg, Tim Baldwin, Eric P Xing, and Fahad Shahbaz Khan. 2024. Mobillama: Towards accurate and lightweight fully transparent gpt. arXiv preprint arXiv:2402.16840.
Inar Timiryasov and Jean-Loup Tastet. 2023a. Baby llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty. arXiv preprint arXiv:2308.02019.
Inar Timiryasov and Jean-Loup Tastet. 2023b. Baby llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty.
Fernanda De La Torre, Cathy Mengying Fang, Han Huang, Andrzej Banburski-Fahey, Judith Amores Fernandez, and Jaron Lanier. 2024. Llmr: Real-time prompting of interactive worlds using large language models.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023b. Llama: Open and efficient foundation language models.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023c. Llama 2: Open foundation and fine-tuned chat models.
Prasetya Ajie Utama, Nafise Sadat Moosavi, and Iryna Gurevych. 2020. Towards debiasing NLU models from unknown biases. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7597–7610, Online. Association for Computational Linguistics.
Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy. Association for Computational Linguistics.
Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, and Shuming Shi. 2024a. Knowledge fusion of large language models.
Fanqi Wan, Longguang Zhong, Ziyi Yang, Ruijun Chen, and Xiaojun Quan. 2024b. Fusechat: Knowledge fusion of chat models.
Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, and Jitao Sang. 2024. Amber: An llm-free multi-dimensional benchmark for mllms hallucination evaluation.
Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. 2020a. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768.
Ziheng Wang, Jeremy Wohlwend, and Tao Lei. 2020b. Structured pruning of large language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6151–6162, Online. Association for Computational Linguistics.
Yuqiao Wen, Zichao Li, Wenyu Du, and Lili Mou. 2023. f-divergence minimization for sequence-level knowledge distillation.
Chenxi Whitehouse, Monojit Choudhury, and Alham Fikri Aji. 2023. Llm-powered data augmentation for enhanced cross-lingual performance. arXiv preprint arXiv:2305.14288.
Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. 2019. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10734–10742.
Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. 2024. Sheared LLaMA: Accelerating language model pre-training via structured pruning. In The Twelfth International Conference on Learning Representations.
Mengzhou Xia, Zexuan Zhong, and Danqi Chen. 2022. Structured pruning learns compact and accurate models. arXiv preprint arXiv:2204.00408.
Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pages 38087–38099. PMLR.
Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. 2021. Nyströmformer: A nyström-based algorithm for approximating self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 14138–14148.
Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. 2023. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244.
Jin Xu, Xu Tan, Renqian Luo, Kaitao Song, Jian Li, Tao Qin, and Tie-Yan Liu. 2021. Nas-bert: task-agnostic and adaptive-size bert compression with neural architecture search. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 1933–1943.
Shu Yang, Muhammad Asif Ali, Cheng-Long Wang, Lijie Hu, and Di Wang. 2024. Moral: Moe augmented lora for llms' lifelong learning. arXiv preprint arXiv:2402.11260.
Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. 2022. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers. Advances in Neural Information Processing Systems, 35:27168–27183.
Da Yu, Peter Kairouz, Sewoong Oh, and Zheng Xu. 2024. Privacy-preserving instructions for aligning large language models.
Yuxuan Yue, Zhihang Yuan, Haojie Duanmu, Sifan Zhou, Jianlong Wu, and Liqiang Nie. 2024. Wkvquant: Quantizing weight and key/value cache for large language models gains more. arXiv preprint arXiv:2402.12065.
Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang, et al. 2024. Flightllm: Efficient large language model inference with a complete mapping flow on fpgas. In Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 223–234.
Bohang Zhang, Jikai Jin, Cong Fang, and Liwei Wang. 2020. Improved analysis of clipping algorithms for non-convex optimization. Advances in Neural Information Processing Systems, 33:15511–15521.
Hongbo Zhang, Junying Chen, Feng Jiang, Fei Yu, Zhihong Chen, Jianquan Li, Guiming Chen, Xiangbo Wu, Zhiyi Zhang, Qingying Xiao, Xiang Wan, Benyou Wang, and Haizhou Li. 2023a. Huatuogpt, towards taming language models to be a doctor. arXiv preprint arXiv:2305.15075.
Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. 2024. Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385.
Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, and Yu Qiao. 2023b. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199.
Jiaxu Zhao, Meng Fang, Shirui Pan, Wenpeng Yin, and
Mykola Pechenizkiy. 2023a. Gptbias: A comprehen-
sive framework for evaluating bias in large language
models.
Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo,
Chien-Chin Huang, Min Xu, Less Wright, Hamid
Shojanazeri, Myle Ott, Sam Shleifer, et al. 2023b.
Pytorch fsdp: experiences on scaling fully sharded
data parallel. arXiv preprint arXiv:2304.11277.
Aojun Zhou, Yukun Ma, Junnan Zhu, Jianbo Liu, Zhi-
jie Zhang, Kun Yuan, Wenxiu Sun, and Hongsheng
Li. 2021. Learning n:m fine-grained structured
sparse neural networks from scratch. arXiv preprint
arXiv:2102.04010.
He Zhu, Junyou Su, Tianle Lun, Yicheng Tao, Wenjia
Zhang, Zipei Fan, and Guanhua Chen. 2024. Fanno:
Augmenting high-quality instruction data with open-
sourced llms only. arXiv preprint arXiv:2408.01323.
Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping
Wang. 2023. A survey on model compression for
large language models. ArXiv, abs/2308.07633.
Barret Zoph and Quoc V Le. 2016. Neural architecture
search with reinforcement learning. arXiv preprint
arXiv:1611.01578.
A Further Discussion on Pruning Techniques
For unstructured pruning of SLMs, we further note that Wanda (Sun et al., 2023) takes both weights and activations into account during the pruning process and eliminates the need for weight updates. In addition, the n:m pruning strategy (Zhou et al., 2021) brings unstructured pruning to practical model acceleration by pruning exactly n weights out of every m, balancing pruning flexibility against computational efficiency to achieve significant speedups. NVIDIA's TensorRT leverages such sparse patterns to optimize memory access and reduce computational load, accelerating inference on GPUs, particularly on hardware such as the A100. The n:m sparse pattern can also be applied in edge AI applications on the NVIDIA Jetson Nano to improve power efficiency and reduce model size. Finally, unstructured pruning often results in sparse matrices that require specialized hardware or algorithms to realize the computational benefits (Frantar and Alistarh, 2023). A minimal sketch combining these two ideas is given below.
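As an illustration only, not code from the cited works, the following NumPy sketch combines a Wanda-style importance score (weight magnitude scaled by the norm of the corresponding input activations, computed without any weight updates) with an n:m mask that keeps the n highest-scoring weights in every group of m consecutive weights, as in 2:4 sparsity. All function names, tensor shapes, and the calibration batch are illustrative assumptions.

import numpy as np

def wanda_scores(W, X):
    # W: (out_features, in_features) weight matrix of a linear layer.
    # X: (num_tokens, in_features) calibration activations.
    # Wanda-style importance: |W_ij| * ||X_:,j||_2, no weight updates needed.
    act_norm = np.linalg.norm(X, axis=0)           # (in_features,)
    return np.abs(W) * act_norm[None, :]

def nm_prune(W, scores, n=2, m=4):
    # Zero out all but the n highest-scoring weights in every group of m
    # consecutive weights along the input dimension (e.g., 2:4 sparsity).
    out_f, in_f = W.shape
    assert in_f % m == 0, "input dimension must be divisible by m"
    groups = scores.reshape(out_f, in_f // m, m)
    keep = np.argsort(groups, axis=-1)[..., -n:]   # indices of the n largest per group
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    return W * mask.reshape(out_f, in_f)

# Toy usage on a random layer with a small calibration batch.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
X = rng.normal(size=(128, 16))
W_pruned = nm_prune(W, wanda_scores(W, X), n=2, m=4)
assert (W_pruned.reshape(8, 4, 4) != 0).sum(axis=-1).max() <= 2

In practice, a deployment stack that supports structured 2:4 sparsity (such as the TensorRT path mentioned above) would consume the resulting sparsity pattern directly rather than the dense masked matrix shown here.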