Large Language Models: A Survey

Abstract—Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks since the release of ChatGPT in November 2022. LLMs' ability of general-purpose language understanding and generation is acquired by training billions of model parameters on massive amounts of text data, as predicted by scaling laws [1], [2]. The research area of LLMs, while very recent, is evolving rapidly in many different ways. In this paper, we review some of the most prominent LLMs, including three popular LLM families (GPT, LLaMA, PaLM), and discuss their characteristics, contributions and limitations. We also give an overview of techniques developed to build and augment LLMs. We then survey popular datasets prepared for LLM training, fine-tuning, and evaluation, review widely used LLM evaluation metrics, and compare the performance of several popular LLMs on a set of representative benchmarks. Finally, we conclude the paper by discussing open challenges and future research directions.
I. INTRODUCTION

Language modeling is a long-standing research topic, dating back to the 1950s with Shannon's application of information theory to human language, where he measured how well simple n-gram language models predict or compress natural language text [3]. Since then, statistical language modeling became fundamental to many natural language understanding and generation tasks, ranging from speech recognition and machine translation to information retrieval [4], [5], [6].

The recent advances on transformer-based large language models (LLMs), pre-trained on Web-scale text corpora, have significantly extended the capabilities of language models. For example, OpenAI's ChatGPT and GPT-4 can be used not only for natural language processing, but also as general task solvers that power Microsoft's Co-Pilot systems, for instance; they can follow human instructions for complex new tasks, performing multi-step reasoning when needed. LLMs are thus becoming the basic building block for the development of general-purpose AI agents or artificial general intelligence (AGI).

As the field of LLMs is moving fast, with new findings, models and techniques being published in a matter of months or weeks [7], [8], [9], [10], [11], AI researchers and practitioners often find it challenging to figure out the best recipes to build LLM-powered AI systems for their tasks. This paper gives a timely survey of the recent advances on LLMs. We hope this survey will prove a valuable and accessible resource for students, researchers and developers.

LLMs are large-scale, pre-trained, statistical language models based on neural networks. The recent success of LLMs is an accumulation of decades of research and development of language models, which can be categorized into four waves that have different starting points and velocities: statistical language models, neural language models, pre-trained language models, and LLMs.

Statistical language models (SLMs) view text as a sequence of words, and estimate the probability of text as the product of their word probabilities. The dominant form of SLMs are Markov chain models known as n-gram models, which compute the probability of a word conditioned on its immediately preceding n − 1 words. Since word probabilities are estimated using word and n-gram counts collected from text corpora, the model needs to deal with data sparsity (i.e., assigning zero probabilities to unseen words or n-grams) by using smoothing, where some probability mass of the model is reserved for unseen n-grams [12]. N-gram models are widely used in many NLP systems. However, these models are incomplete in that they cannot fully capture the diversity and variability of natural language due to data sparsity.

Early neural language models (NLMs) [13], [14], [15], [16] deal with data sparsity by mapping words to low-dimensional continuous vectors (embedding vectors) and predicting the next word based on the aggregation of the embedding vectors of its preceding words using neural networks. The embedding vectors learned by NLMs define a hidden space where the semantic similarity between vectors can be readily computed as their distance. This opens the door to computing the semantic similarity of any two inputs regardless of their forms (e.g., queries vs. documents in Web search [17], [18], sentences in different languages in machine translation [19], [20]) or modalities (e.g., image and text in image captioning [21], [22]). Early NLMs are task-specific models, in that they are trained on task-specific data and their learned hidden space is task-specific.

Pre-trained language models (PLMs), unlike early NLMs, are task-agnostic. This generality also extends to the learned hidden embedding space. The training and inference of PLMs follow the pre-training and fine-tuning paradigm, where language models with recurrent neural networks [23] or transformers [24], [25], [26] are pre-trained on Web-scale unlabeled text corpora for general tasks such as word prediction, and then fine-tuned to specific tasks using small amounts of (labeled) task-specific data. Recent surveys on PLMs include [8], [27], [28].

Large language models (LLMs) mainly refer to transformer-based neural language models¹ that contain tens to hundreds of billions of parameters, which are pre-trained on massive text data, such as PaLM [31], LLaMA [32], and GPT-4 [33], as summarized in Table III.

¹ Recently, several very promising non-transformer LLMs have been proposed, such as the LLMs based on structured state space models [29], [30]. See Section VII for more details.
Fig. 1: An overview of LLM capabilities, including summarization, multi-choice QA, simplification, reasoning, multilinguality, completion, self-improvement, step-by-step solving, tool utilization and planning, function calling, instruction following, in-context learning, world knowledge, task decomposition, knowledge-base utilization, coding, interacting with users, and virtual/physical acting.
Compared to PLMs, LLMs are not only much larger in model size, but also exhibit stronger language understanding and generation abilities, and, more importantly, emergent abilities that are not present in smaller-scale language models. As illustrated in Fig. 1, these emergent abilities include (1) in-context learning, where LLMs learn a new task from a small set of examples presented in the prompt at inference time, (2) instruction following, where LLMs, after instruction tuning, can follow the instructions for new types of tasks without using explicit examples, and (3) multi-step reasoning, where LLMs can solve a complex task by breaking down that task into intermediate reasoning steps, as demonstrated in the chain-of-thought prompt [34]. LLMs can also be augmented by using external knowledge and tools [35], [36] so that they can effectively interact with users and environments [37], and continually improve themselves using feedback data collected through interactions (e.g., via reinforcement learning from human feedback (RLHF)).

Through advanced usage and augmentation techniques, LLMs can be deployed as so-called AI agents: artificial entities that sense their environment, make decisions, and take actions. Previous research has focused on developing agents for specific tasks and domains. The emergent abilities demonstrated by LLMs make it possible to build general-purpose AI agents based on LLMs. While LLMs are trained to produce responses in static settings, AI agents need to take actions to interact with dynamic environments. Therefore, LLM-based agents often need to augment LLMs to, e.g., obtain updated information from external knowledge bases, verify whether a system action produces the expected result, and cope with situations when things do not go as expected. We discuss LLM-based agents in detail in Section IV.

In the rest of this paper, Section II presents an overview of the state of the art of LLMs, focusing on three LLM families (GPT, LLaMA and PaLM) and other representative models. Section III discusses how LLMs are built. Section IV discusses how LLMs are used and augmented for real-world applications. Sections V and VI review popular datasets and benchmarks for evaluating LLMs, and summarize the reported LLM evaluation results. Finally, Section VII concludes the paper by summarizing the challenges and future research directions.

II. LARGE LANGUAGE MODELS

In this section we start with a review of early pre-trained neural language models, as they are the basis of LLMs, and then focus our discussion on three families of LLMs: GPT, LLaMA, and PaLM. Table I provides an overview of some of these models and their characteristics.

A. Early Pre-trained Neural Language Models

Language modeling using neural networks was pioneered by [38], [39], [40]. Bengio et al. [13] developed one of the first neural language models (NLMs) that are comparable to n-gram models. Then, [14] successfully applied NLMs to machine translation. The release of RNNLM (an open-source NLM toolkit) by Mikolov [41], [42] helped significantly popularize NLMs. Afterwards, NLMs based on recurrent neural networks (RNNs) and their variants, such as long short-term memory (LSTM) [19] and gated recurrent units (GRU) [20], were widely used for many natural language applications including machine translation, text generation and text classification [43].

The invention of the Transformer architecture [44] marks another milestone in the development of NLMs. By applying self-attention to compute, in parallel, an "attention score" for every word in a sentence or document that models the influence each word has on the others, Transformers allow for much more parallelization than RNNs, which makes it possible to efficiently pre-train very big language models on large amounts of data on GPUs. These pre-trained language models (PLMs) can be fine-tuned for many downstream tasks.
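As a concrete illustration of the self-attention computation described above, here is a minimal NumPy sketch of scaled dot-product self-attention for a single head; the matrices and dimensions are illustrative and omit multi-head projections, masking, and other details of the full Transformer.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one head.
    X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: (d_model, d_k) projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (seq_len, seq_len) attention scores
    weights = softmax(scores, axis=-1)     # each row: how much a token attends to every token
    return weights @ V                     # contextualized representations

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (5, 8)
```

Because every row of the attention matrix is computed independently, all positions can be processed in parallel, which is the property that enables efficient pre-training at scale.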
Fig.: Paper structure. Open challenges and future directions covered later in the paper include LLM limitations, cost-effective training/inference, adaptation and compression, new post-attention architectural paradigms, multi-modal models, and security and ethical/responsible AI.
We group early popular Transformer-based PLMs, based on their neural architectures, into three main categories: encoder-only, decoder-only, and encoder-decoder models. Comprehensive surveys of early PLMs are provided in [43], [28].

1) Encoder-only PLMs: As the name suggests, encoder-only models consist only of an encoder network. These models were originally developed for language understanding tasks, such as text classification, where the model needs to predict a class label for an input text. Representative encoder-only models include BERT and its variants, e.g., RoBERTa, ALBERT, DeBERTa, XLM, XLNet, and UNILM, as described below.
TABLE I: High-level Overview of Popular Language Models

Type | Model Name | #Parameters | Release | Base Model | Open Source | #Tokens | Training dataset
Encoder-Only | BERT | 110M, 340M | 2018 | - | ✓ | 137B | BooksCorpus, English Wikipedia
Encoder-Only | RoBERTa | 355M | 2019 | - | ✓ | 2.2T | BooksCorpus, English Wikipedia, CC-NEWS, STORIES (a subset of Common Crawl), Reddit
Encoder-Only | ALBERT | 12M, 18M, 60M, 235M | 2019 | - | ✓ | 137B | BooksCorpus, English Wikipedia
Encoder-Only | DeBERTa | - | 2020 | - | ✓ | - | BooksCorpus, English Wikipedia, STORIES, Reddit content
Encoder-Only | XLNet | 110M, 340M | 2019 | - | ✓ | 32.89B | BooksCorpus, English Wikipedia, Giga5, Common Crawl, ClueWeb 2012-B
Decoder-Only | GPT-1 | 120M | 2018 | - | ✓ | 1.3B | BooksCorpus
Decoder-Only | GPT-2 | 1.5B | 2019 | - | ✓ | 10B | Reddit outbound
Encoder-Decoder | T5 (Base) | 223M | 2019 | - | ✓ | 156B | Common Crawl
Encoder-Decoder | mT5 (Base) | 300M | 2020 | - | ✓ | - | New Common Crawl-based dataset in 101 languages (mC4)
Encoder-Decoder | BART (Base) | 139M | 2019 | - | ✓ | - | Corrupting text
GPT Family | GPT-3 | 125M, 350M, 760M, 1.3B, 2.7B, 6.7B, 13B, 175B | 2020 | - | × | 300B | Common Crawl (filtered), WebText2, Books1, Books2, Wikipedia
GPT Family | CODEX | 12B | 2021 | GPT | ✓ | - | Public GitHub software repositories
GPT Family | WebGPT | 760M, 13B, 175B | 2021 | GPT-3 | × | - | ELI5
GPT Family | GPT-4 | 1.76T | 2023 | - | × | 13T | -
LLaMA Family | LLaMA1 | 7B, 13B, 33B, 65B | 2023 | - | ✓ | 1T, 1.4T | Online sources
LLaMA Family | LLaMA2 | 7B, 13B, 34B, 70B | 2023 | - | ✓ | 2T | Online sources
LLaMA Family | Alpaca | 7B | 2023 | LLaMA1 | ✓ | - | GPT-3.5
LLaMA Family | Vicuna-13B | 13B | 2023 | LLaMA1 | ✓ | - | GPT-3.5
LLaMA Family | Koala | 13B | 2023 | LLaMA | ✓ | - | Dialogue data
LLaMA Family | Mistral-7B | 7.3B | 2023 | - | ✓ | - | -
LLaMA Family | Code Llama | 34B | 2023 | LLaMA2 | ✓ | 500B | Publicly available code
LLaMA Family | LongLLaMA | 3B, 7B | 2023 | OpenLLaMA | ✓ | 1T | -
LLaMA Family | LLaMA-Pro-8B | 8.3B | 2024 | LLaMA2-7B | ✓ | 80B | Code and math corpora
LLaMA Family | TinyLlama-1.1B | 1.1B | 2024 | LLaMA1.1B | ✓ | 3T | SlimPajama, Starcoderdata
PaLM Family | PaLM | 8B, 62B, 540B | 2022 | - | × | 780B | Web documents, books, Wikipedia, conversations, GitHub code
PaLM Family | U-PaLM | 8B, 62B, 540B | 2022 | - | × | 1.3B | Web documents, books, Wikipedia, conversations, GitHub code
PaLM Family | PaLM-2 | 340B | 2023 | - | ✓ | 3.6T | Web documents, books, code, mathematics, conversational data
PaLM Family | Med-PaLM | 540B | 2022 | PaLM | × | 780B | HealthSearchQA, MedicationQA, LiveQA
PaLM Family | Med-PaLM 2 | - | 2023 | PaLM 2 | × | - | MedQA, MedMCQA, HealthSearchQA, LiveQA, MedicationQA
Other Popular LLMs | FLAN | 137B | 2021 | LaMDA-PT | ✓ | - | Web documents, code, dialog data, Wikipedia
Other Popular LLMs | Gopher | 280B | 2021 | - | × | 300B | MassiveText
Other Popular LLMs | ERNIE 4.0 | 10B | 2023 | - | × | 4TB | Chinese text
Other Popular LLMs | Retro | 7.5B | 2021 | - | × | 600B | MassiveText
Other Popular LLMs | LaMDA | 137B | 2022 | - | × | 168B | Public dialog data and web documents
Other Popular LLMs | Chinchilla | 70B | 2022 | - | × | 1.4T | MassiveText
Other Popular LLMs | Galactica-120B | 120B | 2022 | - | - | 450B | -
Other Popular LLMs | CodeGen | 16.1B | 2022 | - | ✓ | - | THE PILE, BIGQUERY, BIGPYTHON
Other Popular LLMs | BLOOM | 176B | 2022 | - | ✓ | 366B | ROOTS
Other Popular LLMs | Zephyr | 7.24B | 2023 | Mistral-7B | ✓ | 800B | Synthetic data
Other Popular LLMs | Grok-0 | 33B | 2023 | - | × | - | Online source
Other Popular LLMs | ORCA-2 | 13B | 2023 | LLaMA2 | - | 2001B | -
Other Popular LLMs | StarCoder | 15.5B | 2023 | - | ✓ | 35B | GitHub
Other Popular LLMs | MPT | 7B | 2023 | - | ✓ | 1T | RedPajama, mC4, S2ORC, Common Crawl
Other Popular LLMs | Mixtral-8x7B | 46.7B | 2023 | - | ✓ | - | Instruction dataset
Other Popular LLMs | Falcon 180B | 180B | 2023 | - | ✓ | 3.5T | RefinedWeb
Other Popular LLMs | Gemini | 1.8B, 3.25B | 2023 | - | ✓ | - | Web documents, books, code, image data, audio data, video data
Other Popular LLMs | DeepSeek-Coder | 1.3B, 6.7B, 33B | 2024 | - | ✓ | 2T | GitHub's Markdown and StackExchange
Other Popular LLMs | DocLLM | 1B, 7B | 2024 | - | × | 2T | IIT-CDIP Test Collection 1.0, DocBank
BERT (Bidirectional Encoder Representations from Transformers) [24] is one of the most widely used encoder-only language models. BERT consists of three modules: (1) an embedding module that converts input text into a sequence of embedding vectors, (2) a stack of Transformer encoders that converts the embedding vectors into contextual representation vectors, and (3) a fully connected layer that converts the representation vectors (at the final layer) into one-hot vectors. BERT is pre-trained using two objectives: masked language modeling (MLM) and next sentence prediction. The pre-trained BERT model can be fine-tuned by adding a classifier layer for many language understanding tasks, ranging from text classification and question answering to language inference. A high-level overview of the BERT framework is shown in Fig. 3. As BERT significantly improved the state of the art on a wide range of language understanding tasks when it was published, the AI community was inspired to develop many similar encoder-only language models based on BERT.

Fig. 3: Overall pre-training and fine-tuning procedures for BERT. Courtesy of [24].

RoBERTa [25] significantly improves the robustness of BERT using a set of model design choices and training strategies, such as modifying a few key hyperparameters, removing the next-sentence pre-training objective, and training with much larger mini-batches and learning rates. ALBERT [45] uses two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT: (1) splitting the embedding matrix into two smaller matrices, and (2) using repeating layers split among groups. DeBERTa (Decoding-enhanced BERT with disentangled attention) [26] improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions.
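To illustrate the masked-language-modeling objective described above, the following is a small sketch (not from the original BERT paper) that uses the Hugging Face transformers library, assuming it and the bert-base-uncased checkpoint are available, to fill in a masked token.

```python
from transformers import pipeline

# Fill-mask pipeline: the model predicts the token hidden behind [MASK],
# which is exactly the masked language modeling objective BERT is pre-trained with.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

predictions = fill_mask("The capital of France is [MASK].")
for p in predictions[:3]:
    # Each prediction contains the filled-in token and the model's probability for it.
    print(f"{p['token_str']:>10s}  {p['score']:.3f}")
```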
Fig. 11: GPT-4 performance on academic and professional exams, compared with GPT-3.5. Courtesy of [33].

Fig. 13: Relative response quality of Vicuna and a few other well-known models, as assessed by GPT-4. Courtesy of the Vicuna team.
In July 2023, Meta, in partnership with Microsoft, released the LLaMA-2 collection [61], which includes both foundation language models and chat models fine-tuned for dialog, known as LLaMA-2 Chat. The LLaMA-2 Chat models were reported to outperform other open-source models on many public benchmarks. Fig. 12 shows the training process of LLaMA-2 Chat. The process begins with pre-training LLaMA-2 using publicly available online data. Then, an initial version of LLaMA-2 Chat is built via supervised fine-tuning. Subsequently, the model is iteratively refined using RLHF, rejection sampling, and proximal policy optimization. In the RLHF stage, the accumulation of human feedback for revising the reward model is crucial to prevent the reward model from being changed too much, which could hurt the stability of LLaMA model training.

Fig. 12: Training of LLaMA-2 Chat. Courtesy of [61].

Alpaca [62] is fine-tuned from the LLaMA-7B model using 52K instruction-following demonstrations generated in the style of self-instruct using GPT-3.5 (text-davinci-003). Alpaca is very cost-effective to train, which makes it especially attractive for academic research. On the self-instruct evaluation set, Alpaca performs similarly to GPT-3.5, despite being much smaller.

The Vicuna team has developed a 13B chat model, Vicuna-13B, by fine-tuning LLaMA on user-shared conversations collected from ShareGPT.

Like Alpaca and Vicuna, the Guanaco models [63] are also fine-tuned LLaMA models using instruction-following data. But the fine-tuning is done very efficiently using QLoRA, such that fine-tuning a 65B-parameter model can be done on a single 48GB GPU. QLoRA back-propagates gradients through a frozen, 4-bit quantized pre-trained language model into Low-Rank Adapters (LoRA). The best Guanaco model outperforms all previously released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of fine-tuning on a single GPU.
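As a rough illustration of the QLoRA recipe described for Guanaco — training low-rank adapters on top of a frozen, 4-bit quantized base model — here is a hedged sketch using the Hugging Face transformers, peft, and bitsandbytes libraries; the model name, target modules, and hyperparameters are placeholders, not the values used by Guanaco, and a GPU with bitsandbytes support is assumed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Llama-2-7b-hf"  # placeholder; Guanaco used LLaMA checkpoints

# Load the base model in 4-bit NF4 precision; its weights stay frozen.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Attach small trainable low-rank adapters (LoRA) to the attention projections.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # illustrative choice of layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable

# The adapted model can now be fine-tuned on instruction-following data with a standard Trainer.
```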
Koala [64] is yet another instruction-following language model built on LLaMA, but with a specific focus on interaction data that include user inputs and responses generated by highly capable closed-source chat models such as ChatGPT. The Koala-13B model performs competitively with state-of-the-art chat models according to human evaluation based on real-world user prompts.

Mistral-7B [65] is a 7B-parameter language model engineered for superior performance and efficiency. Mistral-7B outperforms the best open-source 13B model (LLaMA-2-13B) across all evaluated benchmarks, and the best open-source 34B model (LLaMA-34B) in reasoning, mathematics, and code generation. This model leverages grouped-query attention for faster inference, coupled with sliding-window attention to effectively handle sequences of arbitrary length with a reduced inference cost.

The LLaMA family is growing rapidly, as more instruction-following models have been built on LLaMA or LLaMA-2, including Code LLaMA [66], Gorilla [67], Giraffe [68], Vigogne [69], Tulu 65B [70], Long LLaMA [71], and Stable Beluga2 [72], just to name a few.

3) The PaLM Family: The PaLM (Pathways Language Model) family is developed by Google. The first PaLM model [31] was announced in April 2022 and remained private until March 2023. It is a 540B-parameter transformer-based LLM. The model is pre-trained on a high-quality text corpus consisting of 780 billion tokens that comprises a wide range of natural language tasks and use cases. PaLM is pre-trained
on 6144 TPU v4 chips using the Pathways system, which enables highly efficient training across multiple TPU Pods. PaLM demonstrates continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. PaLM-540B not only outperforms state-of-the-art fine-tuned models on a suite of multi-step reasoning tasks, but is also on par with humans on the recently released BIG-bench benchmark.

The U-PaLM models of 8B, 62B, and 540B scales are continually trained on PaLM with UL2R, a method of continuing to train LLMs for a few steps with UL2's mixture-of-denoisers objective [73]. An approximately 2x computational savings rate is reported.

U-PaLM is later instruction-finetuned as Flan-PaLM [74]. Compared to other instruction-finetuning work mentioned above, Flan-PaLM's finetuning is performed using a much larger number of tasks, larger model sizes, and chain-of-thought data. As a result, Flan-PaLM substantially outperforms previous instruction-following models. For instance, Flan-PaLM-540B, which is instruction-finetuned on 1.8K tasks, outperforms PaLM-540B by a large margin (+9.4% on average). The finetuning data comprises 473 datasets, 146 task categories, and 1,836 total tasks, as illustrated in Fig. 14.

Med-PaLM 2 [77] scored up to 86.5% on the MedQA dataset (i.e., a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries), improving upon Med-PaLM by over 19% and setting a new state of the art.

C. Other Representative LLMs

In addition to the models discussed in the previous subsections, there are other popular LLMs which do not belong to those three model families, yet they have achieved great performance and have pushed the LLM field forward. We briefly describe these LLMs in this subsection.

FLAN: In [78], Wei et al. explored a simple method for improving the zero-shot learning abilities of language models. They showed that instruction tuning language models on a collection of datasets described via instructions substantially improves zero-shot performance on unseen tasks. They take a 137B-parameter pretrained language model and instruction-tune it on over 60 NLP datasets verbalized via natural language instruction templates. They call this instruction-tuned model FLAN. Fig. 15 provides a comparison of instruction tuning with the pretrain–finetune and prompting paradigms.
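To make the idea of verbalizing datasets via natural-language instruction templates (as in FLAN) concrete, here is a small illustrative sketch; the template wording and example record are invented for illustration and are not FLAN's actual templates.

```python
# Turn a labeled NLI example into several instruction-following training examples
# by rendering it with different natural-language templates (FLAN-style verbalization).
example = {
    "premise": "A man is playing a guitar on stage.",
    "hypothesis": "A person is performing music.",
    "label": "entailment",
}

templates = [
    "Premise: {premise}\nHypothesis: {hypothesis}\nDoes the premise entail the hypothesis? Answer: {label}",
    "Read the following and decide if the second sentence follows from the first.\n"
    "1) {premise}\n2) {hypothesis}\nAnswer: {label}",
]

instruction_examples = [t.format(**example) for t in templates]
for text in instruction_examples:
    print(text, end="\n---\n")
```

Repeating this over many datasets and templates yields a mixture of instruction-formatted examples on which the model is then fine-tuned.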
Positional encodings are incorporated to fuse information about the relative or absolute position of the tokens in the sequence.

2) Encoder-Only: For this family, at each stage, the attention layers can access all the words in the initial sentence. The pre-training of these models usually consists of somehow corrupting a given sentence (for instance, by masking random words in it) and tasking the model with finding or reconstructing the initial sentence. Encoder models are great for tasks requiring an understanding of the full sequence, such as sentence classification, named entity recognition, and extractive question answering. One prominent encoder-only model is BERT (Bidirectional Encoder Representations from Transformers), proposed in [24].

3) Decoder-Only: For these models, at each stage, for any word, the attention layers can only access the words positioned before it in the sentence. These models are also sometimes called auto-regressive models. The pre-training of these models is usually formulated as predicting the next word (or token) in the sequence. Decoder-only models are best suited for tasks involving text generation. GPT models are a prominent example of this category.

4) Encoder-Decoder: These models use both an encoder and a decoder, and are sometimes called sequence-to-sequence models. At each stage, the attention layers of the encoder can access all the words in the initial sentence, whereas the attention layers of the decoder only access the words positioned before a given word in the input. These models are usually pre-trained using the objectives of encoder or decoder models, but usually involve something a bit more complex. For instance, some models are pre-trained by replacing random spans of text (that can contain several words) with a single special mask word, and the objective is then to predict the text that this mask word replaces. Encoder-decoder models are best suited for tasks about generating new sentences conditioned on a given input, such as summarization, translation, or generative question answering.

B. Data Cleaning

Data quality is crucial to the performance of language models trained on it. Data cleaning techniques such as filtering and deduplication are shown to have a big impact on model performance.

As an example, for Falcon40B [124], Penedo et al. showed that properly filtered and deduplicated web data alone can lead to powerful models, even significantly outperforming state-of-the-art models trained on The Pile. Despite extensive filtering, they were able to obtain five trillion tokens from CommonCrawl. They also released an extract of 600 billion tokens from their REFINEDWEB dataset, and 1.3/7.5B-parameter language models trained on it. Fig. 27 shows the refinement process of CommonCrawl data in this work.

1) Data Filtering: Data filtering aims to enhance the quality of training data and the effectiveness of the trained LLMs. Common data filtering techniques include:

Removing Noise: refers to eliminating irrelevant or noisy data that might impact the model's ability to generalize well. As an example, one can think of removing false information from the training data, to lower the chance of the model generating false responses.
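As a minimal illustration of the filtering and deduplication steps discussed in this Data Cleaning subsection, here is a toy sketch (not the RefinedWeb pipeline) that drops exact duplicates by hashing normalized text and applies a couple of simple heuristic quality filters; the thresholds are arbitrary placeholders.

```python
import hashlib
import re

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies hash identically.
    return re.sub(r"\s+", " ", text.strip().lower())

def clean_corpus(docs, min_words=5, max_symbol_ratio=0.3):
    seen = set()
    for doc in docs:
        norm = normalize(doc)
        words = norm.split()
        # Heuristic filters: drop very short docs and docs dominated by non-alphanumeric symbols.
        if len(words) < min_words:
            continue
        symbol_ratio = sum(not c.isalnum() and not c.isspace() for c in norm) / max(len(norm), 1)
        if symbol_ratio > max_symbol_ratio:
            continue
        # Exact deduplication via content hashing.
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield doc

docs = ["The  quick brown fox jumps over the lazy dog.",
        "the quick brown fox jumps over the lazy dog.",   # near-identical duplicate
        "$$$ ### !!!", "Too short"]
print(list(clean_corpus(docs)))  # keeps only the first document
```

Production pipelines such as the one used for RefinedWeb additionally perform fuzzy (near-duplicate) deduplication and language/quality classification at much larger scale.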
Fig.: How LLMs are built — the major steps include data cleaning (data filtering: removing noise, handling outliers, addressing imbalances, text preprocessing; deduplication), tokenization (BytePairEncoding, WordPieceEncoding, SentencePieceEncoding), choice of LLM architecture (encoder-only, decoder-only, encoder-decoder), fine-tuning and instruction tuning (supervised fine-tuning, general fine-tuning, multi-turn instructions, instruction following), alignment (supervised learning, reinforcement learning from human feedback, direct preference optimization, Kahneman-Tversky optimization), decoding strategies (greedy search, beam search, top-k sampling, top-p sampling), and cost-effective training/inference, adaptation and compression (optimized training such as the Zero Redundancy Optimizer, Receptance Weighted Key Value, low-rank adaptation, knowledge distillation, quantization).
Fig.: (c) Rotary Positional Embedding [127]; (d) Relative Positional Bias [128].
\mathrm{softmax}(x_i) = \frac{e^{x_i/T}}{\sum_j e^{x_j/T}} \qquad (3)
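Equation (3) above is the temperature-scaled softmax; in LLM decoding it is typically applied to the next-token logits before sampling. The following is a small illustrative sketch (with made-up logits) of how temperature, top-k, and top-p (nucleus) sampling interact with it.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Sample a token index from logits using Eq. (3) with optional top-k / top-p truncation."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / max(temperature, 1e-8)  # temperature scaling
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                   # softmax of Eq. (3)

    order = np.argsort(probs)[::-1]                        # tokens sorted by probability
    if top_k is not None:
        order = order[:top_k]                              # keep only the k most likely tokens
    if top_p is not None:
        cumulative = np.cumsum(probs[order])
        order = order[: np.searchsorted(cumulative, top_p) + 1]  # smallest nucleus with mass >= top_p
    kept = probs[order] / probs[order].sum()               # renormalize over the kept tokens
    return int(rng.choice(order, p=kept))

logits = [2.0, 1.0, 0.5, -1.0]        # made-up scores over a 4-token vocabulary
print(sample_next_token(logits, temperature=0.7, top_p=0.9))
```

Lower temperatures sharpen the distribution toward greedy decoding, while top-k and top-p restrict sampling to the most plausible tokens.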
I. Cost-Effective Training/Inference/Adaptation/Compression
In this part, we review some of the popular approaches
used for more cost-friendly (and compute-friendly) training
and usage of LLMs.
1) Optimized Training: There are many frameworks developed for optimized training of LLMs; here we introduce some of the prominent ones.

ZeRO: In [140], Rajbhandari et al. developed a novel solution, the Zero Redundancy Optimizer (ZeRO), to optimize memory usage, vastly improving the training speed of LLMs while increasing the model size that can be efficiently trained.

Fig. 33: Time complexity comparison of RWKV with different Transformers. Here T denotes the sequence length, d the feature dimension, and c is MEGA's chunk size of quadratic attention. Courtesy of [141].

Knowledge distillation is also referred to as an approach to distill the knowledge of not just a single model but in fact multiple models into a smaller one. Creating smaller models by this approach yields model sizes that can be used even on edge devices. Fig. 35 illustrates a general setup of this training scheme.
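As a sketch of the knowledge-distillation training scheme described above, the following PyTorch-style snippet (an illustration, not the exact setup in Fig. 35) computes a common distillation loss: a KL-divergence term between temperature-softened teacher and student distributions combined with the ordinary cross-entropy on the labels.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Combine soft-target matching (teacher -> student) with the hard-label loss."""
    # Soften both distributions with temperature T, then match them with KL divergence.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    # Standard cross-entropy against the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy example: batch of 4 examples, 10-class output.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)          # in practice: frozen teacher model outputs
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
print(loss.item())
```

Distilling from multiple teachers can be done by averaging or ensembling their soft targets before computing the same loss.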
Fig.: Taxonomy of hallucination quantification approaches — automated metrics (statistical metrics; model-based metrics such as IE-based, QA-based, and NLI-based metrics) and human judgment (scoring and comparative analysis).

Fig.: How LLMs are used and augmented — A) LLM limitations; B) using LLMs via prompt engineering: 1) Chain of Thought (zero-shot CoT, manual CoT), 2) Tree of Thought, 3) Self-Consistency, 4) Reflection, 5) Expert Prompting, 6) Chains, 7) Rails (topical, fact-checking, jailbreaking), 8) Automatic Prompt Engineering (prompt generation, prompt scoring, refinement and iteration); and D) LLM agents.
Despite advances in automated metrics, human judgment remains a vital piece. It typically involves two methodologies:

1) Scoring: Human evaluators rate the level of hallucination within a predefined scale.

2) Comparative Analysis: Evaluators compare generated content against baseline or ground-truth references, adding an essential layer of subjective assessment.

FactScore [155] is a recent example of a metric that can be used both for human and model-based evaluation. The metric breaks an LLM generation into "atomic facts". The final score is computed as the sum of the accuracy of each atomic fact, giving each of them equal weight. Accuracy is a binary number that simply states whether the atomic fact is supported by the source. The authors implement different automation strategies that use LLMs to estimate this metric.
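A schematic of the FactScore-style computation just described is sketched below; the fact extraction and verification steps are left as stubs (the FactScore authors automate them with retrieval plus an LLM judge), and the example facts and knowledge source are invented.

```python
from typing import Callable, List

def factscore(generation: str,
              extract_facts: Callable[[str], List[str]],
              is_supported: Callable[[str], bool]) -> float:
    """Equal-weight average of binary support labels over the atomic facts of a generation."""
    facts = extract_facts(generation)
    if not facts:
        return 0.0
    return sum(is_supported(f) for f in facts) / len(facts)

# Stub implementations for illustration only.
def toy_extract_facts(text: str) -> List[str]:
    # Real systems use an LLM to split a generation into atomic facts.
    return [s.strip() for s in text.split(".") if s.strip()]

knowledge_source = {"Paris is the capital of France"}
def toy_is_supported(fact: str) -> bool:
    # Real systems retrieve evidence and verify the fact against the source.
    return fact in knowledge_source

text = "Paris is the capital of France. Paris has a population of two billion"
print(factscore(text, toy_extract_facts, toy_is_supported))  # 0.5: one of two facts supported
```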
Finally, mitigating hallucinations in LLMs is a multifaceted challenge, requiring tailored strategies to suit various applications. Those include:

• Product Design and User Interaction Strategies, such as use-case design, structuring the input/output, or providing mechanisms for user feedback.

• Data Management and Continuous Improvement. Maintaining and analyzing a tracking set of hallucinations is essential for ongoing model improvement.

• Prompt Engineering and Metaprompt Design. Many of the advanced prompt techniques described in IV-B, such as Retrieval Augmented Generation, directly address hallucination risks.

• Model Selection and Configuration for Hallucination Mitigation. For example, larger models with lower temperature settings usually perform better. Also, techniques such as RLHF or domain-specific fine-tuning can mitigate hallucination risks.

B. Using LLMs: Prompt Design and Engineering

A prompt in generative AI models is the textual input provided by users to guide the model's output. It can range from simple questions to detailed descriptions or specific tasks. Prompts generally consist of instructions, questions, input data, and examples. In practice, to elicit a desired response from an AI model, a prompt must contain either
instructions or questions, with other elements being optional. Advanced prompts involve more complex structures, such as "chain of thought" prompting, where the model is guided to follow a logical reasoning process to arrive at an answer.

Prompt engineering is a rapidly evolving discipline that shapes the interactions and outputs of LLMs and other generative AI models. The essence of prompt engineering lies in crafting the optimal prompt to achieve a specific goal with a generative model. This process is not only about instructing the model but also involves some understanding of the model's capabilities and limitations, and the context within which it operates.

Prompt engineering transcends the mere construction of prompts; it requires a blend of domain knowledge, understanding of the AI model, and a methodical approach to tailor prompts for different contexts. This might involve creating templates that can be programmatically modified based on a given dataset or context. For example, generating personalized responses based on user data might use a template that is dynamically filled with relevant user information.

Furthermore, prompt engineering is an iterative and exploratory process, akin to traditional machine learning practices such as model evaluation or hyperparameter tuning. The rapid growth of this field suggests its potential to revolutionize certain aspects of machine learning, moving beyond traditional methods like feature or architecture engineering. On the other hand, traditional engineering practices such as version control and regression testing need to be adapted to this new paradigm, just like they were adapted to other machine learning approaches [156].

In the following paragraphs we detail some of the most interesting and popular prompt engineering approaches.

1) Chain of Thought (CoT): The Chain of Thought (CoT) technique, initially described in the paper "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" [34] by Google researchers, represents a pivotal advancement in prompt engineering for Large Language Models (LLMs). This approach hinges on the understanding that LLMs, while proficient in token prediction, are not inherently designed for explicit reasoning. CoT addresses this by guiding the model through essential reasoning steps.

CoT is based on making the implicit reasoning process of LLMs explicit. By outlining the steps required for reasoning, the model is directed closer to a logical and reasoned output, especially in scenarios demanding more than simple information retrieval or pattern recognition.

CoT prompting manifests in two primary forms:

1) Zero-Shot CoT: This form involves instructing the LLM to "think step by step", prompting it to deconstruct the problem and articulate each stage of reasoning.

2) Manual CoT: A more complex variant, it requires providing step-by-step reasoning examples as templates for the model. While yielding more effective results, it poses challenges in scalability and maintenance.

Manual CoT is more effective than zero-shot CoT. However, the effectiveness of this example-based CoT depends on the choice of diverse examples, and constructing prompts with such examples of step-by-step reasoning by hand is hard and error-prone. That is where automatic CoT [157] comes into play.
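To illustrate the two CoT forms above, here is a minimal sketch of how zero-shot and manual (few-shot) CoT prompts can be assembled as plain strings; the arithmetic example and wording are invented, and the send_to_llm call is a placeholder for whatever inference API is being used.

```python
question = "A library had 120 books, lent out 45, and received 30 new ones. How many books does it have now?"

# Zero-shot CoT: simply append a step-by-step trigger phrase to the question.
zero_shot_cot = f"Q: {question}\nA: Let's think step by step."

# Manual (few-shot) CoT: prepend hand-written worked examples with explicit reasoning.
manual_cot = (
    "Q: Tom has 3 boxes with 4 apples each. He eats 2 apples. How many are left?\n"
    "A: 3 boxes x 4 apples = 12 apples. After eating 2, 12 - 2 = 10. The answer is 10.\n\n"
    f"Q: {question}\nA:"
)

def send_to_llm(prompt: str) -> str:
    # Placeholder: call your model/provider of choice here.
    raise NotImplementedError

print(zero_shot_cot)
print(manual_cot)
```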
2) Tree of Thought (ToT): The Tree of Thought (ToT) [158] prompting technique is inspired by the concept of considering various alternative solutions or thought processes before converging on the most plausible one. ToT is based on the idea of branching out into multiple "thought trees", where each branch represents a different line of reasoning. This method allows the LLM to explore various possibilities and hypotheses, much like human cognitive processes where multiple scenarios are considered before determining the most likely one.

A critical aspect of ToT is the evaluation of these reasoning paths. As the LLM generates different branches of thought, each is assessed for its validity and relevance to the query. This process involves real-time analysis and comparison of the branches, leading to a selection of the most coherent and logical outcome.

ToT is particularly useful in complex problem-solving scenarios where a single line of reasoning might not suffice. It allows LLMs to mimic a more human-like problem-solving approach, considering a range of possibilities before arriving at a conclusion. This technique enhances the model's ability to handle ambiguity, complexity, and nuanced tasks, making it a valuable tool in advanced AI applications.

3) Self-Consistency: Self-Consistency [159] utilizes an ensemble-based method, where the LLM is prompted to generate multiple responses to the same query. The consistency among these responses serves as an indicator of their accuracy and reliability.

The Self-Consistency approach is grounded in the principle that if an LLM generates multiple similar responses to the same prompt, it is more likely that the response is accurate. This method involves asking the LLM to tackle a query multiple times, each time analyzing the responses for consistency. This technique is especially useful in scenarios where factual accuracy and precision are paramount.

The consistency of responses can be measured using various methods. One common approach is to analyze the overlap in the content of the responses. Other methods may include comparing the semantic similarity of responses or employing more sophisticated techniques like BERTScore or n-gram overlap. These measures help in quantifying the level of agreement among the responses generated by the LLM.

Self-Consistency has significant applications in fields where the veracity of information is critical. It is particularly relevant in scenarios like fact-checking, where ensuring the accuracy of information provided by AI models is essential. By employing this technique, prompt engineers can enhance the trustworthiness of LLMs, making them more reliable for tasks that require high levels of factual accuracy.
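A minimal sketch of the Self-Consistency idea is shown below: sample several reasoning paths for the same query and take a majority vote over the extracted final answers. The sample_answer function is a placeholder for an actual LLM call with a non-zero sampling temperature, and the canned responses are invented.

```python
import re
from collections import Counter

def extract_final_answer(response: str) -> str:
    """Pull the last number out of a reasoning chain; real systems use stricter parsing."""
    numbers = re.findall(r"-?\d+\.?\d*", response)
    return numbers[-1] if numbers else response.strip()

def self_consistent_answer(prompt: str, sample_answer, n_samples: int = 5) -> str:
    """Majority vote over answers extracted from several sampled reasoning paths."""
    answers = [extract_final_answer(sample_answer(prompt)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Illustrative stand-in for sampled LLM outputs (in practice these come from the model).
canned = iter(["12 - 2 = 10, so 10", "The answer is 10", "3*4=12; 12-2=11"])
result = self_consistent_answer("Tom has 3 boxes of 4 apples and eats 2. How many left?",
                                sample_answer=lambda p: next(canned), n_samples=3)
print(result)  # "10" wins the vote over the inconsistent "11"
```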
4) Reflection: Reflection [160] involves prompting LLMs to assess and potentially revise their own outputs based on
reasoning about the correctness and coherence of their responses. The concept of Reflection centers on the ability of LLMs to engage in a form of self-evaluation. After generating an initial response, the model is prompted to reflect on its own output, considering factors like factual accuracy, logical consistency, and relevance. This introspective process can lead to the generation of revised or improved responses.

A key aspect of Reflection is the LLM's capacity for self-editing. By evaluating its initial response, the model can identify potential errors or areas of improvement. This iterative process of generation, reflection, and revision enables the LLM to refine its output, enhancing the overall quality and reliability of its responses.

5) Expert Prompting: Expert Prompting [161] enhances the capabilities of Large Language Models (LLMs) by simulating the responses of experts in various fields. This method involves prompting the LLM to assume the role of an expert and respond accordingly, providing high-quality, informed answers. A key strategy within Expert Prompting is the multi-expert approach. The LLM is prompted to consider responses from multiple expert perspectives, which are then synthesized to form a comprehensive and well-rounded answer. This technique not only enhances the depth of the response but also incorporates a range of viewpoints, reflecting a more holistic understanding of the subject matter.

6) Chains: Chains refer to the method of linking multiple components in a sequence to handle complex tasks with Large Language Models (LLMs). This approach involves creating a series of interconnected steps or processes, each contributing to the final outcome. The concept of Chains is based on the idea of constructing a workflow where different stages or components are sequentially arranged. Each component in a Chain performs a specific function, and the output of one serves as the input for the next. This end-to-end arrangement allows for more complex and nuanced processing, as each stage can be tailored to handle a specific aspect of the task. Chains can vary in complexity and structure, depending on the requirements. In "PromptChainer: Chaining Large Language Model Prompts through Visual Programming" [162], the authors not only describe the main challenges in designing chains, but also describe a visual tool to support those tasks.

7) Rails: Rails in advanced prompt engineering refer to a method of guiding and controlling the output of Large Language Models (LLMs) through predefined rules or templates. This approach is designed to ensure that the model's responses adhere to certain standards or criteria, enhancing the relevance, safety, and accuracy of the output. The concept of Rails involves setting up a framework or a set of guidelines that the LLM must follow while generating responses. These guidelines are typically defined using a modeling language or templates known as Canonical Forms, which standardize the way natural language sentences are structured and delivered.

Rails can be designed for various purposes, depending on the specific needs of the application:

• Topical Rails: Ensure that the LLM sticks to a particular topic or domain.

• Fact-Checking Rails: Aimed at minimizing the generation of false or misleading information.

• Jailbreaking Rails: Prevent the LLM from generating responses that attempt to bypass its own operational constraints or guidelines.

8) Automatic Prompt Engineering (APE): Automatic Prompt Engineering (APE) [163] focuses on automating the process of prompt creation for Large Language Models (LLMs). APE seeks to streamline and optimize the prompt design process, leveraging the capabilities of LLMs themselves to generate and evaluate prompts. APE involves using LLMs in a self-referential manner, where the model is employed to generate, score, and refine prompts. This recursive use of LLMs enables the creation of high-quality prompts that are more likely to elicit the desired response or outcome.

The methodology of APE can be broken down into several key steps:

• Prompt Generation: The LLM generates a range of potential prompts based on a given task or objective.

• Prompt Scoring: Each generated prompt is then evaluated for its effectiveness, often using criteria like clarity, specificity, and likelihood of eliciting the desired response.

• Refinement and Iteration: Based on these evaluations, prompts can be refined and iterated upon, further enhancing their quality and effectiveness.

C. Augmenting LLMs through external knowledge - RAG

One of the main limitations of pre-trained LLMs is their lack of up-to-date knowledge or access to private or use-case-specific information. This is where retrieval augmented generation (RAG) comes into the picture [164]. RAG, illustrated in Fig. 37, involves extracting a query from the input prompt and using that query to retrieve relevant information from an external knowledge source (e.g., a search engine or a knowledge graph, see Fig. 38). The relevant information is then added to the original prompt and fed to the LLM in order for the model to generate the final response. A RAG system includes three important components: Retrieval, Generation, and Augmentation [165].
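A minimal sketch of the retrieve-augment-generate loop just described is shown below. It uses a toy keyword-overlap retriever over an in-memory document list; the documents are invented and generate_with_llm is a placeholder for an actual model call, so this illustrates the three RAG components rather than a production pipeline.

```python
from typing import List

documents = [
    "The Eiffel Tower was completed in 1889 and is located in Paris.",
    "The Great Wall of China is over 13,000 miles long.",
    "Mount Everest is the highest mountain above sea level.",
]

def retrieve(query: str, docs: List[str], k: int = 2) -> List[str]:
    """Toy retriever: rank documents by keyword overlap with the query."""
    q_terms = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q_terms & set(d.lower().split())), reverse=True)
    return scored[:k]

def augment(query: str, passages: List[str]) -> str:
    """Augmentation: prepend the retrieved passages as context for the model."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only the context below.\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"

def generate_with_llm(prompt: str) -> str:
    # Placeholder for the generation step (call your LLM of choice here).
    return "(model response)"

query = "When was the Eiffel Tower completed?"
prompt = augment(query, retrieve(query, documents))
print(prompt)
print(generate_with_llm(prompt))
```

Real systems typically replace the keyword retriever with dense embeddings and a vector index, but the overall flow is the same.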
a) RAG-aware prompting techniques: Because of the importance of RAG for building advanced LLM systems, several RAG-aware prompting techniques have been developed recently. One such technique is Forward-looking Active Retrieval Augmented Generation (FLARE).

Forward-looking Active Retrieval Augmented Generation (FLARE) [168] enhances the capabilities of Large Language Models (LLMs) by iteratively combining prediction and information retrieval. FLARE represents an evolution in the use of retrieval-augmented generation, aimed at improving the accuracy and relevance of LLM responses.

FLARE involves an iterative process where the LLM actively predicts upcoming content and uses these predictions as queries to retrieve relevant information. This method contrasts with traditional retrieval-augmented models that typically retrieve information once and then proceed with generation. In FLARE, this process is dynamic and ongoing throughout the generation phase: each sentence or segment generated by the LLM is evaluated for confidence, and if the confidence falls below a threshold, the retrieved information is used to regenerate that segment before generation continues.
Fig. 37: An example of synthesizing RAG with LLMs for a question answering application [166].
their associated documents from both Wikipedia and online.

• RACE [186] is suited for the reading comprehension task. This dataset is based on English tests completed by Chinese students from middle school and high school, aged 12 to 18, and it contains roughly 28,000 texts and 100,000 questions rigorously prepared by human specialists, primarily English instructors. This dataset contains a wide range of subjects that were purposefully chosen to assess students' comprehension and reasoning abilities. This dataset is available in three subgroups: RACE-M, RACE-H, and RACE. RACE-M refers to the middle school examinations, whereas RACE-H denotes the high school tests. Finally, RACE is the synthesis of RACE-M and RACE-H.

• SQuAD [187] stands for "Stanford Question Answering Dataset" and is a crowdsourced reading comprehension dataset based on Wikipedia articles. It has approximately 100,000 question-answer pairs connected to more than 500 articles. The answers to these questions are typically text fragments or spans taken from the corresponding reading passages. The questions may be unanswerable in some cases. The dataset is divided into three sets: an 80% training set, a 10% development set, and a 10% hidden test set.

Fig. 42: Datasets licensed under different licenses.
• BoolQ [188] is a yes/no question-answering dataset where the goal is a reading comprehension task. BoolQ includes 15,942 examples. Each example is a triplet that includes a question, a relevant paragraph, and the solution. Although the main intuition behind this dataset is reading comprehension, it can also be used for reasoning, natural language inference, and question-answering tasks.

• MultiRC [189] is another dataset that fits the reading comprehension task. MultiRC contains brief paragraphs as well as multi-sentence questions that can be answered using the information in the paragraph. The paragraphs in this dataset come from a variety of sources, including news, fiction, historical texts, Wikipedia articles, discussions on society and law, elementary school science textbooks, and 9/11 reports. Each question has many response choices, with one or more of them being correct. Answering the questions requires reasoning across several sentences. The MultiRC dataset encompasses around 6,000 multi-sentence questions gathered from over 800 paragraphs. On average, each question offers about two valid answer alternatives out of a total of five.

B. Datasets for Emergent Abilities: ICL, reasoning (CoT), instruction following

This section centers on the benchmarks and datasets employed to evaluate the emergent abilities of LLMs.

• GSM8K [190] is designed to evaluate the model's ability for multi-step mathematical reasoning. GSM8K includes 8.5K linguistically diverse grade school math word problems written by humans. The dataset is split into two sets: a training set with 7.5K problems, and a test set with 1K problems. These problems need 2 to 8 steps to be solved. Solutions are mainly a series of elementary calculations using basic arithmetic operations.

• MATH [191] enables assessing how well models can solve math problems. The MATH dataset has 12,500 problems from high school math competitions. Each problem in the dataset has a step-by-step solution and a final answer enclosed in a box. The problems cover a wide range of topics and have different levels of complexity. There are seven subjects in total. Furthermore, the difficulty of each problem is rated based on the AoPS standards on a scale from '1' to '5'. A '1' indicates the easiest problems in a subject, while '5' represents the most difficult. In terms of formatting, all problems and solutions are presented using LaTeX and the Asymptote vector graphics language.

• HellaSwag [192] is designed to assess commonsense reasoning in LLMs. This benchmark includes 70,000 multiple-choice questions. Each question is derived from one of two domains, ActivityNet or WikiHow, and presents four answer choices regarding what might happen in the following situation. The correct answer provides an actual statement describing the upcoming event, but the three wrong answers are created to confuse machines.
• AI2 Reasoning Challenge (ARC) [193] is used for commonsense reasoning. This benchmark encompasses 7,787 science examination questions. These questions are in English, and most of them are set up in a multiple-choice format. The questions have been divided into two groups: a Challenge Set with 2,590 difficult questions and an Easy Set with 5,197 questions. Each collection has also been pre-divided into Train, Development, and Test subsets.

• PIQA [194] is intended to evaluate language representations on their knowledge of physical commonsense. In this dataset, the focus is on everyday situations with a preference for uncommon solutions. The central task is multiple-choice question answering, where a question (q) is provided along with two potential solutions (s1, s2). Then, the best solution is chosen by either a model or a human. For each question, only one of the solutions is the correct answer.

• SIQA [195] provides a framework for evaluating models' ability for commonsense reasoning about social situations. The SIQA dataset has 38,000 multiple-choice questions designed to assess emotional and social intelligence in everyday circumstances. This dataset covers a wide variety of social scenarios. In SIQA, the potential answers are a mixture of human-selected responses and machine-generated ones that have been filtered through adversarial processes.

• OpenBookQA (OBQA) [196] is a new kind of question-answering dataset where answering its questions requires additional common and commonsense knowledge not contained in the book, as well as rich text comprehension. This dataset includes around 6,000 multiple-choice questions. Each question is linked to one core fact, as well as an additional collection of over 6,000 facts. The questions were developed using a multi-stage crowdsourcing and expert filtering procedure. OpenBookQA questions are difficult because they need multi-hop reasoning with limited background.

• TruthfulQA [197] is designed specifically to evaluate the truthfulness of language models in generating answers to questions. This dataset includes 817 questions, written by the authors, from 38 different categories, including health, law, finance, and politics. These questions are purposefully designed to challenge human responders, as they may contain common misunderstandings that lead to incorrect answers.

• OPT-IML Bench [103] is a comprehensive benchmark for Instruction Meta-Learning. It covers 2,000 NLP tasks from 8 existing benchmarks. The OPT-IML Bench consists of a training set with 17.9M examples, a dev set with 145K samples, and a test set with 321K samples.

C. Datasets for Augmented Abilities: using external knowledge/tools

This section focuses on datasets designed for the augmented abilities of LLMs.

• HotpotQA [198] is designed to provide a diverse and explainable question-answering dataset that necessitates multi-hop reasoning. This dataset is derived from the English Wikipedia. It consists of roughly 113,000 questions. Each question in the dataset comes with two paragraphs, called gold paragraphs, from two Wikipedia articles. Also, there is a list of sentences in those paragraphs that crowdworkers have picked as important for answering the question.

• ToolQA [199] is a question answering benchmark to evaluate LLMs' ability to use external tools for answering questions.

• GPT4Tools serves as an instructional dataset, generated by instructing advanced teachers (such as ChatGPT), with instructions conditioned on visual content and tool descriptions. This process results in the generation of instructions related to the use of tools. There are three versions of this dataset. The first version comprises 71,000 instruction-following data points utilized to fine-tune the GPT4Tools model. The next version consists of manually cleaned instruction data used for validation, covering instructions related to the tools from the first version. The last version is cleaned instruction data used for testing and includes instructions related to some tools that are not present in the first version.

VI. PROMINENT LLMS' PERFORMANCE ON BENCHMARKS

In this section we first provide an overview of some popular metrics used for evaluating the performance of LLMs under different scenarios. We then look at the performance of prominent large language models on some of the popular datasets and benchmarks.

A. Popular Metrics for Evaluating LLMs

Evaluating the performance of generative language models depends on the underlying task they are going to be used for. Tasks that are mostly about selecting a choice out of given ones (such as sentiment analysis) can be treated as simple classification, and their performance can be evaluated using classification metrics. Metrics such as accuracy, precision, recall, F1, etc. are applicable in this case. It is also important to note that the answers generated by the model for specific tasks such as multiple-choice question answering are always either True or False. If the answer is not in the set of options, it can be treated as False as well.

However, some tasks that are purely open-ended text generation cannot be evaluated in the same way as categorization. Different metrics are required for the specific purpose of the evaluation. Code generation is a very different case among open-ended generative evaluations. The generated code must pass the test suite, but on the other hand, it is also important to understand whether a model is capable of generating different correct solutions and how likely a correct one is among its samples; the pass@k metric is commonly used for this purpose.
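As a concrete reference for the pass@k metric listed in Table II, here is a sketch of the unbiased estimator popularized by the Codex evaluation, 1 − C(n−c, k)/C(n, k), where n samples are generated per problem and c of them pass the tests; the per-problem sample counts below are made up.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    randomly drawn samples (out of n generated, c of which are correct) passes."""
    if n - c < k:
        return 1.0  # not enough failing samples to fill a draw of size k
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up evaluation: per problem, 20 samples were generated and `c` passed the tests.
results = [(20, 3), (20, 0), (20, 12)]   # (n, c) per problem
for k in (1, 5, 10):
    score = sum(pass_at_k(n, c, k) for n, c in results) / len(results)
    print(f"pass@{k} = {score:.3f}")
```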
TABLE II: LLM Datasets Overview.

Benchmark Name | Evaluation Metric | Leaderboard | Source | Papers With Code
HumanEval | PASS@k | Link | Link | Link
MBPP | PASS@k, Accuracy | - | Link | Link
APPS | PASS@k, Accuracy | - | Link | Link
WikiSQL | Accuracy | - | Link | Link
CoNaLa | BLEU | - | Link | Link
CodeParrot | PASS@k | - | Link | -
HellaSwag | Accuracy | Link | Link | Link
AI2 Reasoning Challenge (ARC) | Accuracy | Link | Link | Link
BoolQ | Accuracy | - | Link | Link
MultiRC | F1-score, Accuracy | - | Link | Link
CNN/Daily Mail [200] | Accuracy | - | Link | -
SQuAD | F1-score, EM | Link | Link | Link
RACE | Accuracy | - | Link | Link
CNN/Daily Mail [201] | ROUGE | - | Link | Link
Drop | F1-score, EM | Link | Link | Link
QuAC | F1-score, HEQ-Q, HEQ-D | Link | Link | Link
TriviaQA | EM, F1-score, Accuracy | Link | Link | Link
Natural Questions | EM, F1-score, Accuracy | Link | Link | Link
StrategyQA | Accuracy, Recall@10, SARI | Link | Link | Link
CoQA | F1-score | Link | Link | Link
XSum | ROUGE | - | Link | Link
SAMSum | ROUGE | - | - | Link
WikiSum | ROUGE | - | Link | -
DialogSum | ROUGE | - | Link | Link
TruthfulQA | MC1, MC2, % true, % info, BLEURT | Link | Link | Link
MMLU | Accuracy | Link | Link | Link
GSM8K | Accuracy | Link | Link | Link
PIQA | Accuracy | Link | Link | Link
SIQA | Accuracy | Link | Link | Link
OpenBookQA (OBQA) | Accuracy | Link | Link | Link
HotpotQA | EM, F1-score, Joint EM, Joint F1-score | Link | Link | Link
MATH | Accuracy | - | Link | Link
CommonsenseQA | Accuracy | Link | Link | Link
Natural Instructions | ROUGE-L, Human | Link | Link | Link
BIG-bench | Accuracy, Average | - | Link | Link
ToolTalk | Success rate, Precision, Recall, Incorrect action rate, Percent of failing error types | - | Link | Link
MetaTool | Accuracy, Precision, Recall, F1-score | - | Link | Link
GPT4Tools | Successful Rate of Thought, Successful Rate of Action, Successful Rate of Arguments, Success Rate | - | Link | Link
API-Bank | Correctness, ROUGE, Error (API Hallucination, Has Exception, Invalid Input Parameters, False API Call Format, API Call, Miss Input Parameters) | - | Link | Link
Alpaca-CoT | - | - | Link | Link
Such evaluation can also be erroneous, because another model is used to judge. Still, even today, evaluating purely generated content is very hard, and no completely fitting metric has been found; metrics either look for simplistic features such as n-gram or skip-gram overlap, or they are models with unknown accuracy and precision [204].

Generative evaluation metrics are another type of evaluation metric for LLMs that use another LLM to evaluate the answer. However, depending on the task itself, evaluation may or may not be possible in this way. Another dependency that makes generative evaluation error-prone is its reliance on the prompt itself. RAGAS is one good example that incorporates generative evaluation.

Various benchmarks and leaderboards have been proposed to address the most challenging question in the world of large language models: Which one is better? However, no simple answer can address this question. The answer depends on various aspects of large language models. Section V shows the categorical presentation of different tasks and the most important datasets in each category. We will follow the same categorization and provide a comparison based on each category. After providing a comparison for each category, we will provide a broad overview of aggregated performance by averaging the reported performance metrics on different tasks.

Evaluating different LLMs can also be seen from different perspectives. For example, an LLM with a drastically smaller number of parameters is not completely comparable to one with a larger number of parameters. From this perspective, we categorize LLMs into four categories: small (less than or equal to 1 billion parameters), medium (between 1 and 10 billion), large (between 10 and 100 billion), and very large (more than 100 billion). Another classification for LLMs that we use is their primary use case. We consider each LLM to be either a Foundation model (a pretrained language model with no instruction or chat fine-tuning), an Instruction model (a pretrained language model with only instruction fine-tuning), or a Chat model (a pretrained language model with instruction and chat fine-tuning). Apart from the categorization described above, another category is required to distinguish between original models and tuned ones. Original models are those that have been released as a foundation model or a fine-tuned one. Tuned models are those that take an original model and tune it with different datasets or even different training approaches. It is also good to note that original models are usually foundation models that have been fine-tuned on specific datasets or with different approaches. Availability of the model weights, regardless of the license, is another category in our classification. Models whose weights are publicly available (even through request) are noted as Public models, while others are noted as Private. Table III shows all of these definitions and abbreviations used in the rest of the article. Figure 43 illustrates these visually.

According to the provided categorizations, we can categorize and label each notable LLM as shown in Table IV. As can be seen from this table, models categorized as very large are also unavailable.

B. LLMs' Performance on Different Tasks

Commonsense reasoning is one of the important capabilities each model can obtain. This capability denotes the ability of the model to use prior knowledge in combination with reasoning skills. In the case of HellaSwag, for example, finding the continuation of text is challenging because the given text contains a partial part of the story while the given choices as continuations are tricky to select, and without having prior
Fig. 43: Visual overview of the LLM categorization used in this comparison: size (medium: 1B-10B parameters; large: 10B-100B; very large: more than 100B), type (foundation; instruction-tuned, e.g., MPT-7B-instruct; chat-tuned, e.g., MPT-7B-chat), originality (original models not based on any other pretrained model, e.g., LLaMA, vs. tuned ones), and availability (weights publicly released, e.g., LLaMA, vs. private, e.g., GPT-4).
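To make the size buckets concrete, the small helper below maps a parameter count to the size classes defined above; the thresholds are exactly those stated in the text, and the function name is ours.

```python
def size_class(num_params: float) -> str:
    """Map a parameter count to the size classes used in this survey."""
    if num_params <= 1e9:
        return "small"        # at most 1B parameters
    if num_params < 1e10:
        return "medium"       # 1B - 10B
    if num_params < 1e11:
        return "large"        # 10B - 100B
    return "very large"       # more than 100B


print(size_class(7e9))      # -> "medium" (e.g., a 7B model)
print(size_class(1.75e11))  # -> "very large"
```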
According to the provided categorizations, we can categorize and label each notable LLM as shown in Table IV. As can be seen from this table, the models categorized as very large are also the unavailable ones.

B. LLMs' Performance on Different Tasks

Commonsense reasoning is one of the important capabilities a model can acquire. It denotes the ability of the model to use prior knowledge in combination with reasoning skills. In the case of HellaSwag, for example, finding the continuation of a text is challenging because the given text contains only a partial part of a story, the candidate continuations are deliberately tricky to choose among, and without prior knowledge about the world the task cannot be solved. This specific kind of reasoning deserves particular attention because it requires combining previous knowledge with open, text-described scenes or facts. As can be seen from Table V, not just unavailable models but also public ones can achieve good results on these tests.

From the results presented in Table V it is clear that GPT-4 achieves the best result on HellaSwag, while Davinci-003 is the best model on OBQA. It is also worth noting that OBQA results are not reported for all of the models, so Davinci-003 may not actually be the strongest model on OBQA.
VII. CHALLENGES AND FUTURE DIRECTIONS

At the same time, this is still a new and extremely active research area where the pace of innovation is increasing rather than slowing down. As in any other evolving area, though, there are still numerous challenges ahead. Here we briefly mention some of the challenges and main active areas that are known so far.

A. Smaller and more efficient Language Models

This is a survey on large language models, and there has been an initial push towards "larger is better" that has clearly been rewarded with ever larger models like GPT-4 getting better accuracy and performance on benchmarks. However, those large models are costly and inefficient in several dimensions (e.g., high latency). In response to all of this, there is a current research trend towards Small Language Models (SLMs) as a cost-effective alternative to LLMs, particularly for specific tasks that might not require the full generality of larger models. Prominent works in this direction include Phi-1 [207], Phi-1.5 [208], and Phi-2 from Microsoft.

More generally, we should expect many research efforts in this area of how to train smaller and more efficient models. Techniques such as parameter-efficient fine-tuning (PEFT), teacher/student training, and other forms of distillation (see Section III-I) will continue to be used to build smaller models out of larger ones.
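As one concrete example of the teacher/student direction, the classic distillation recipe trains the small model to match the teacher's softened output distribution. The NumPy sketch below shows only the temperature-softened cross-entropy term; a real setup would mix it with the usual hard-label loss and handle batching and optimization.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Soft cross-entropy between teacher and student distributions,
    both softened by temperature T (higher T exposes more of the
    teacher's relative preferences over wrong answers)."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T) + 1e-12)
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures
    return -np.mean(np.sum(p_teacher * log_p_student, axis=-1)) * T**2

# toy batch of next-token logits over a 5-word vocabulary
rng = np.random.default_rng(0)
teacher = rng.normal(size=(3, 5))
student = rng.normal(size=(3, 5))
print(distillation_loss(student, teacher))
```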
B. New Post-attention Architectural Paradigms

Transformer blocks have been a crucial and constant part of most current LLM frameworks, and it is a big question mark how much longer this architecture will remain in vogue, and what the next big architectural breakthrough in the field of deep learning (and NLP) will be. Since AlexNet in 2012, we have seen many architectures go in and out of fashion, including LSTM, GRU, and seq2seq, but Transformers have been the dominant approach since their inception. As described earlier, attention is the main mechanism driving Transformers. More recently, there has been promising research in alternative approaches that are being labelled as post-attention.

An important class of post-attention models is the so-called State Space Models (SSMs). While the notion of state space models has a long history in machine learning, it should be noted that in the context of language models, SSM usually refers to the newer Structured State Space sequence model architecture, or S4 for short (see Gu et al. [29]). Some recent models in this category are Mamba [30], Hyena [209], and StripedHyena [210].
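At their core, these models replace attention over all previous tokens with a linear state-space recurrence of the form x_k = A x_{k-1} + B u_k, y_k = C x_k, which can be unrolled sequentially or, for fixed parameters, computed as a convolution. The toy NumPy sketch below shows only the sequential recurrence; S4 additionally imposes structure on A, and Mamba makes the parameters input-dependent.

```python
import numpy as np

def ssm_scan(A, B, C, u):
    """Run a discrete linear state-space model over a 1-D input sequence u:
    x_k = A x_{k-1} + B u_k,  y_k = C x_k."""
    d_state = A.shape[0]
    x = np.zeros(d_state)
    ys = []
    for u_k in u:                 # sequential scan: no attention over past tokens
        x = A @ x + B * u_k       # state update carries all history in x
        ys.append(C @ x)          # readout
    return np.array(ys)

rng = np.random.default_rng(0)
d_state, seq_len = 4, 10
A = 0.9 * np.eye(d_state)         # toy stable transition matrix
B = rng.normal(size=d_state)
C = rng.normal(size=d_state)
print(ssm_scan(A, B, C, rng.normal(size=seq_len)).shape)  # (10,)
```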
While all of those models are very competitive in terms of performance on leaderboards and efficiency, they also address an important challenge in more traditional attention-based architectures: the lack of support for longer context windows.

Having a good answer to many prompts requires context. For example, the response to "Recommend some good movies for me" requires a lot of context about "me" as well as what movies are available and which ones I have not watched. Context length is especially important for RAG, where large portions of text might be retrieved and injected into the prompt for generation (see Section IV-C).

The longer the context length, the more tokens we can squeeze into the context, and the more information the model has access to, the better its response will be. On the other hand, with a very long context it is hard for the model to remember everything and to process all the information efficiently. Attention-based models are highly inefficient for longer contexts, which is why we should expect more research on mechanisms that enable processing longer contexts and, more generally, on more efficient architectures.
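To illustrate why the context budget matters for RAG pipelines like those discussed in Section IV-C, the sketch below greedily packs retrieved passages into a prompt until an assumed token budget is exhausted. It is a rough illustration only: whitespace splitting stands in for the model's real tokenizer, and all names are ours.

```python
def build_rag_prompt(question, passages, max_tokens=4096):
    """Pack retrieved passages (most relevant first) into the prompt until
    the context budget is used up; whitespace split approximates tokens."""
    header = "Answer the question using only the context below.\n\nContext:\n"
    footer = f"\nQuestion: {question}\nAnswer:"
    budget = max_tokens - len((header + footer).split())
    chosen = []
    for p in passages:            # passages assumed sorted by relevance
        cost = len(p.split())
        if cost > budget:
            break                 # stop once the next passage no longer fits
        chosen.append(p)
        budget -= cost
    return header + "\n\n".join(chosen) + footer

print(build_rag_prompt("Who wrote Hamlet?",
                       ["Hamlet is a tragedy written by William Shakespeare.",
                        "It was written around 1600."]))
```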
That being said, new architectures might not only propose alternatives to the attention mechanism but rather rethink the whole Transformer architecture. As an early example of this, Monarch Mixer [211] proposes a new architecture that uses the same sub-quadratic primitive, Monarch matrices, which achieves high hardware efficiency on GPUs, along both sequence length and model dimension.

On the other end of the spectrum, it is worth mentioning that there are some attention-compatible architectural mechanisms that have recently been gaining steam and proving their value in creating better and more powerful LLMs. Probably the best example of such a mechanism is Mixture of Experts (MoE). MoEs have been around in machine learning for years, even before the deep learning era [212], but they have been gaining popularity since then, particularly in the context of Transformer models and LLMs.

In LLMs, MoEs make it possible to train an extremely large model that is then only partially instantiated during inference: experts are effectively turned off wherever the gating/weighting function assigns them a low weight. As an example, the GLaM model has 1.2 trillion parameters, but during inference only 2 out of its 64 experts are used [84].
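To make the sparse-activation idea concrete, here is a minimal sketch of a top-k gated MoE layer in plain NumPy. It is illustrative only; production MoE implementations typically add load-balancing losses, expert capacity limits, and distributed expert placement, and use full FFN blocks as experts.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class TopKMoE:
    """Toy mixture-of-experts layer: only k of n experts run per token."""
    def __init__(self, d_model, n_experts=8, k=2):
        self.k = k
        # each expert is a simple linear map here; real experts are FFN blocks
        self.experts = [rng.normal(scale=0.02, size=(d_model, d_model))
                        for _ in range(n_experts)]
        self.router = rng.normal(scale=0.02, size=(d_model, n_experts))

    def __call__(self, x):
        gate_logits = x @ self.router               # router score per expert
        topk = np.argsort(gate_logits)[-self.k:]    # indices of the k best experts
        weights = softmax(gate_logits[topk])        # renormalize over the top-k
        # only the selected experts are evaluated; the rest stay idle
        return sum(w * (x @ self.experts[i]) for w, i in zip(weights, topk))

layer = TopKMoE(d_model=16, n_experts=8, k=2)
y = layer(rng.normal(size=16))   # one token's hidden state
print(y.shape)                   # (16,)
```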
MoEs are nowadays an important component of the so-called frontier LLMs (i.e., the most advanced and capable models). GPT-4 itself is rumored to be based on an MoE architecture, and some of the best-performing LLMs, such as Mixtral [117], are basically an MoE version of pre-existing LLMs.

Finally, it is important to note that MoEs can be used as a component of any architecture, regardless of whether it is based on attention or not. In fact, MoEs have also been applied to SSM-based LLMs such as Mamba (see the MoE-Mamba work of Pioro et al., 2024). We should continue to see MoE-driven improvements in the future regardless of the underlying architecture.

C. Multi-modal Models

Future LLMs are expected to be multi-modal and to handle a variety of data types, such as text, images, video, and audio, in a unified manner. This opens up possibilities for more diverse applications in fields like question answering, content generation, creative arts, healthcare, robotics, and beyond. There are already several prominent multi-modal LLMs, including LLaVA [213], LLaVA-Plus [214], GPT-4 [33], Qwen-VL [116], and NExT-GPT [215], and the trend is expected to continue. Evaluation of these models is also a new research topic, especially for conversational generative vision models [216]. Multi-modal LLMs can unlock huge potential in a variety of tasks, and there has already been decent progress in this direction, which needs a dedicated paper to discuss in all its details.

D. Improved LLM Usage and Augmentation techniques

As we described in Section IV, many of the shortcomings and limitations of LLMs, such as hallucination, can be addressed through advanced prompt engineering, use of tools, or other augmentation techniques. We should expect not only continued but accelerated research in this area.

LLM-based systems are already starting to replace machine learning systems that were until recently using other approaches. As a clear example of this, LLMs are now being deployed to better understand people's preferences and interests, and to provide more personalized interactions, whether in customer service, content recommendation, or other applications. This involves better understanding of user preferences, and analyzing their past interactions and using them as context. We will continue to see research in the application and usage of LLMs not only for personalization and recommendation, but for many other application areas that currently use other machine learning techniques.

Finally, another important area of research we expect to gather increased attention is that of LLM-based agents and multi-agent systems [172], [173], [174]. The development of LLM systems with access to external tools and decision-making capabilities is both exciting and challenging. We will see continued research and progress in this important area, which some argue could lead to Artificial General Intelligence (AGI).

E. Security and Ethical/Responsible AI

Ensuring the robustness and security of LLMs against adversarial attacks and other vulnerabilities is a critical area of research [217]. As LLMs are increasingly deployed in real-world applications, they need to be protected from potential threats, to prevent them from being used to manipulate people or spread misinformation.

Addressing ethical concerns and biases in LLMs is another active area of research. Efforts are being made to ensure that LLMs are fair, unbiased, and capable of handling sensitive information responsibly. As LLMs are being used by more and more people on a daily basis, making sure they are unbiased and behave responsibly is crucial.

VIII. CONCLUSION

This paper presents a survey of LLMs developed in the past few years. We first provide an overview of early pre-trained language models (e.g., BERT), then review three popular LLM families (GPT, LLaMA, PaLM) and other representative LLMs. We then survey methods and techniques for building, augmenting, and using LLMs. We review popular LLM datasets and benchmarks, and compare the performance of a set of prominent models on public benchmarks. Finally, we present open challenges and future research directions.

REFERENCES

[1] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, "Scaling laws for neural language models," arXiv preprint arXiv:2001.08361, 2020.
[2] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark et al., "Training compute-optimal large language models," arXiv preprint arXiv:2203.15556, 2022.
[3] C. E. Shannon, "Prediction and entropy of printed english," Bell System Technical Journal, vol. 30, no. 1, pp. 50-64, 1951.
[4] F. Jelinek, Statistical Methods for Speech Recognition. MIT Press, 1998.
[5] C. Manning and H. Schutze, Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[6] C. D. Manning, An Introduction to Information Retrieval. Cambridge University Press, 2009.
[7] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong et al., "A survey of large language models," arXiv preprint arXiv:2303.18223, 2023.
[8] C. Zhou, Q. Li, C. Li, J. Yu, Y. Liu, G. Wang, K. Zhang, C. Ji, Q. Yan, [31] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra,
L. He et al., “A comprehensive survey on pretrained foundation mod- A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al.,
els: A history from bert to chatgpt,” arXiv preprint arXiv:2302.09419, “Palm: Scaling language modeling with pathways,” arXiv preprint
2023. arXiv:2204.02311, 2022.
[9] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre- [32] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux,
train, prompt, and predict: A systematic survey of prompting methods T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama:
in natural language processing,” ACM Computing Surveys, vol. 55, Open and efficient foundation language models,” arXiv preprint
no. 9, pp. 1–35, 2023. arXiv:2302.13971, 2023.
[10] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, [33] OpenAI, “GPT-4 Technical Report,” https://arxiv.org/pdf/2303.
J. Xu, and Z. Sui, “A survey for in-context learning,” arXiv preprint 08774v3.pdf, 2023.
arXiv:2301.00234, 2022. [34] J. Wei, X. Wang, D. Schuurmans, M. Bosma, b. ichter,
[11] J. Huang and K. C.-C. Chang, “Towards reasoning in large language F. Xia, E. Chi, Q. V. Le, and D. Zhou, “Chain-of-thought
models: A survey,” arXiv preprint arXiv:2212.10403, 2022. prompting elicits reasoning in large language models,” in
[12] S. F. Chen and J. Goodman, “An empirical study of smoothing Advances in Neural Information Processing Systems, S. Koyejo,
techniques for language modeling,” Computer Speech & Language, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh,
vol. 13, no. 4, pp. 359–394, 1999. Eds., vol. 35. Curran Associates, Inc., 2022, pp. 24 824–24 837.
[Online]. Available: https://proceedings.neurips.cc/paper files/paper/
[13] Y. Bengio, R. Ducharme, and P. Vincent, “A neural probabilistic
2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf
language model,” Advances in neural information processing systems,
vol. 13, 2000. [35] G. Mialon, R. Dessı̀, M. Lomeli, C. Nalmpantis, R. Pasunuru,
R. Raileanu, B. Rozière, T. Schick, J. Dwivedi-Yu, A. Celikyil-
[14] H. Schwenk, D. Déchelotte, and J.-L. Gauvain, “Continuous space
maz et al., “Augmented language models: a survey,” arXiv preprint
language models for statistical machine translation,” in Proceedings
arXiv:2302.07842, 2023.
of the COLING/ACL 2006 Main Conference Poster Sessions, 2006,
pp. 723–730. [36] B. Peng, M. Galley, P. He, H. Cheng, Y. Xie, Y. Hu, Q. Huang,
L. Liden, Z. Yu, W. Chen, and J. Gao, “Check your facts and try
[15] T. Mikolov, M. Karafiát, L. Burget, J. Cernockỳ, and S. Khudanpur,
again: Improving large language models with external knowledge and
“Recurrent neural network based language model.” in Interspeech,
automated feedback,” arXiv preprint arXiv:2302.12813, 2023.
vol. 2, no. 3. Makuhari, 2010, pp. 1045–1048.
[16] A. Graves, “Generating sequences with recurrent neural networks,” [37] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao,
arXiv preprint arXiv:1308.0850, 2013. “React: Synergizing reasoning and acting in language models,” arXiv
preprint arXiv:2210.03629, 2022.
[17] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck, “Learning
deep structured semantic models for web search using clickthrough [38] D. E. Rumelhart, G. E. Hinton, R. J. Williams et al., “Learning internal
data,” in Proceedings of the 22nd ACM international conference on representations by error propagation,” 1985.
Information & Knowledge Management, 2013, pp. 2333–2338. [39] J. L. Elman, “Finding structure in time,” Cognitive science, vol. 14,
[18] J. Gao, C. Xiong, P. Bennett, and N. Craswell, Neural Approaches to no. 2, pp. 179–211, 1990.
Conversational Information Retrieval. Springer Nature, 2023, vol. 44. [40] M. V. Mahoney, “Fast text compression with neural networks.” in
[19] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning FLAIRS conference, 2000, pp. 230–234.
with neural networks,” Advances in neural information processing [41] T. Mikolov, A. Deoras, D. Povey, L. Burget, and J. Černockỳ, “Strate-
systems, vol. 27, 2014. gies for training large scale neural network language models,” in 2011
[20] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, “On IEEE Workshop on Automatic Speech Recognition & Understanding.
the properties of neural machine translation: Encoder-decoder ap- IEEE, 2011, pp. 196–201.
proaches,” arXiv preprint arXiv:1409.1259, 2014. [42] tmikolov. rnnlm. [Online]. Available: https://www.fit.vutbr.cz/
∼imikolov/rnnlm/
[21] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár,
J. Gao, X. He, M. Mitchell, J. C. Platt et al., “From captions to [43] S. Minaee, N. Kalchbrenner, E. Cambria, N. Nikzad, M. Chenaghlu,
visual concepts and back,” in Proceedings of the IEEE conference and J. Gao, “Deep learning–based text classification: a comprehensive
on computer vision and pattern recognition, 2015, pp. 1473–1482. review,” ACM computing surveys (CSUR), vol. 54, no. 3, pp. 1–40,
[22] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: 2021.
A neural image caption generator,” in Proceedings of the IEEE [44] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.
conference on computer vision and pattern recognition, 2015, pp. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,”
3156–3164. Advances in neural information processing systems, vol. 30, 2017.
[23] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, [45] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut,
and L. Zettlemoyer, “Deep contextualized word representations. corr “Albert: A lite bert for self-supervised learning of language represen-
abs/1802.05365 (2018),” arXiv preprint arXiv:1802.05365, 2018. tations,” arXiv preprint arXiv:1909.11942, 2019.
[24] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training [46] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, “Electra: Pre-
of deep bidirectional transformers for language understanding,” arXiv training text encoders as discriminators rather than generators,” arXiv
preprint arXiv:1810.04805, 2018. preprint arXiv:2003.10555, 2020.
[25] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, [47] G. Lample and A. Conneau, “Cross-lingual language model pretrain-
L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert ing,” arXiv preprint arXiv:1901.07291, 2019.
pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
[48] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and
[26] P. He, X. Liu, J. Gao, and W. Chen, “Deberta: Decoding-enhanced bert Q. V. Le, “Xlnet: Generalized autoregressive pretraining for language
with disentangled attention,” arXiv preprint arXiv:2006.03654, 2020. understanding,” Advances in neural information processing systems,
[27] X. Han, Z. Zhang, N. Ding, Y. Gu, X. Liu, Y. Huo, J. Qiu, Y. Yao, vol. 32, 2019.
A. Zhang, L. Zhang et al., “Pre-trained models: Past, present and [49] L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao,
future,” AI Open, vol. 2, pp. 225–250, 2021. M. Zhou, and H.-W. Hon, “Unified language model pre-training for
[28] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang, “Pre-trained natural language understanding and generation,” Advances in neural
models for natural language processing: A survey,” Science China information processing systems, vol. 32, 2019.
Technological Sciences, vol. 63, no. 10, pp. 1872–1897, 2020. [50] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improv-
[29] A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences with ing language understanding by generative pre-training,” 2018.
structured state spaces,” 2022. [51] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al.,
[30] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with “Language models are unsupervised multitask learners,” OpenAI blog,
selective state spaces,” arXiv preprint arXiv:2312.00752, 2023. vol. 1, no. 8, p. 9, 2019.
[52] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Available: [https://huggingface.co/stabilityai/StableBeluga2](https://
Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning huggingface.co/stabilityai/StableBeluga2)
with a unified text-to-text transformer,” The Journal of Machine [73] Y. Tay, J. Wei, H. W. Chung, V. Q. Tran, D. R. So, S. Shakeri, X. Gar-
Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020. cia, H. S. Zheng, J. Rao, A. Chowdhery et al., “Transcending scaling
[53] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, laws with 0.1% extra compute,” arXiv preprint arXiv:2210.11399,
A. Barua, and C. Raffel, “mt5: A massively multilingual pre-trained 2022.
text-to-text transformer,” arXiv preprint arXiv:2010.11934, 2020. [74] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus,
[54] K. Song, X. Tan, T. Qin, J. Lu, and T.-Y. Liu, “Mass: Masked Y. Li, X. Wang, M. Dehghani, S. Brahma et al., “Scaling instruction-
sequence to sequence pre-training for language generation,” arXiv finetuned language models,” arXiv preprint arXiv:2210.11416, 2022.
preprint arXiv:1905.02450, 2019. [75] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos,
[55] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, S. Shakeri, E. Taropa, P. Bailey, Z. Chen et al., “Palm 2 technical
V. Stoyanov, and L. Zettlemoyer, “Bart: Denoising sequence-to- report,” arXiv preprint arXiv:2305.10403, 2023.
sequence pre-training for natural language generation, translation, and [76] K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung,
comprehension,” arXiv preprint arXiv:1910.13461, 2019. N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl et al., “Large language
[56] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, models encode clinical knowledge,” arXiv preprint arXiv:2212.13138,
A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language mod- 2022.
els are few-shot learners,” Advances in neural information processing [77] K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, L. Hou,
systems, vol. 33, pp. 1877–1901, 2020. K. Clark, S. Pfohl, H. Cole-Lewis, D. Neal et al., “Towards expert-
[57] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Ka- level medical question answering with large language models,” arXiv
plan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., preprint arXiv:2305.09617, 2023.
“Evaluating large language models trained on code,” arXiv preprint [78] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du,
arXiv:2107.03374, 2021. A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot
[58] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, learners,” arXiv preprint arXiv:2109.01652, 2021.
C. Hesse, S. Jain, V. Kosaraju, W. Saunders et al., “Webgpt: Browser- [79] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song,
assisted question-answering with human feedback,” arXiv preprint J. Aslanides, S. Henderson, R. Ring, S. Young et al., “Scaling language
arXiv:2112.09332, 2021. models: Methods, analysis & insights from training gopher,” arXiv
[59] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, preprint arXiv:2112.11446, 2021.
C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language [80] V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai,
models to follow instructions with human feedback,” Advances in A. Chaffin, A. Stiegler, T. L. Scao, A. Raja et al., “Multi-
Neural Information Processing Systems, vol. 35, pp. 27 730–27 744, task prompted training enables zero-shot task generalization,” arXiv
2022. preprint arXiv:2110.08207, 2021.
[60] OpenAI. (2022) Introducing chatgpt. [Online]. Available: https: [81] Y. Sun, S. Wang, S. Feng, S. Ding, C. Pang, J. Shang, J. Liu, X. Chen,
//openai.com/blog/chatgpt Y. Zhao, Y. Lu et al., “Ernie 3.0: Large-scale knowledge enhanced pre-
[61] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, training for language understanding and generation,” arXiv preprint
N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama arXiv:2107.02137, 2021.
2: Open foundation and fine-tuned chat models,” arXiv preprint [82] S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Mil-
arXiv:2307.09288, 2023. lican, G. B. Van Den Driessche, J.-B. Lespiau, B. Damoc, A. Clark
[62] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, et al., “Improving language models by retrieving from trillions of
and T. B. Hashimoto, “Alpaca: A strong, replicable instruction- tokens,” in International conference on machine learning. PMLR,
following model,” Stanford Center for Research on Foundation Mod- 2022, pp. 2206–2240.
els. https://crfm. stanford. edu/2023/03/13/alpaca. html, vol. 3, no. 6, [83] O. Lieber, O. Sharir, B. Lenz, and Y. Shoham, “Jurassic-1: Technical
p. 7, 2023. details and evaluation,” White Paper. AI21 Labs, vol. 1, p. 9, 2021.
[63] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: Ef-
[84] N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun,
ficient finetuning of quantized llms,” arXiv preprint arXiv:2305.14314,
Y. Zhou, A. W. Yu, O. Firat et al., “Glam: Efficient scaling of
2023.
language models with mixture-of-experts,” in International Conference
[64] X. Geng, A. Gudibande, H. Liu, E. Wallace, P. Abbeel, S. Levine, on Machine Learning. PMLR, 2022, pp. 5547–5569.
and D. Song, “Koala: A dialogue model for academic research,” Blog
[85] R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-
post, April, vol. 1, 2023.
T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du et al., “Lamda: Language
[65] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, models for dialog applications,” arXiv preprint arXiv:2201.08239,
D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier et al., 2022.
“Mistral 7b,” arXiv preprint arXiv:2310.06825, 2023.
[86] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen,
[66] B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, C. Dewan, M. Diab, X. Li, X. V. Lin et al., “Opt: Open pre-trained
J. Liu, T. Remez, J. Rapin et al., “Code llama: Open foundation models transformer language models,” arXiv preprint arXiv:2205.01068, 2022.
for code,” arXiv preprint arXiv:2308.12950, 2023.
[87] R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Sar-
[67] S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez, “Gorilla: Large avia, A. Poulton, V. Kerkez, and R. Stojnic, “Galactica: A large
language model connected with massive apis,” 2023. language model for science,” arXiv preprint arXiv:2211.09085, 2022.
[68] A. Pal, D. Karkhanis, M. Roberts, S. Dooley, A. Sundararajan, and [88] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou,
S. Naidu, “Giraffe: Adventures in expanding context lengths in llms,” S. Savarese, and C. Xiong, “Codegen: An open large language
arXiv preprint arXiv:2308.10882, 2023. model for code with multi-turn program synthesis,” arXiv preprint
[69] B. Huang, “Vigogne: French instruction-following and chat models,” arXiv:2203.13474, 2022.
https://github.com/bofenghuang/vigogne, 2023. [89] S. Soltan, S. Ananthakrishnan, J. FitzGerald, R. Gupta, W. Hamza,
[70] Y. Wang, H. Ivison, P. Dasigi, J. Hessel, T. Khot, K. R. Chandu, H. Khan, C. Peris, S. Rawls, A. Rosenbaum, A. Rumshisky et al.,
D. Wadden, K. MacMillan, N. A. Smith, I. Beltagy et al., “How far can “Alexatm 20b: Few-shot learning using a large-scale multilingual
camels go? exploring the state of instruction tuning on open resources,” seq2seq model,” arXiv preprint arXiv:2208.01448, 2022.
arXiv preprint arXiv:2306.04751, 2023. [90] A. Glaese, N. McAleese, M. Trebacz, J. Aslanides, V. Firoiu,
[71] S. Tworkowski, K. Staniszewski, M. Pacek, Y. Wu, H. Michalewski, T. Ewalds, M. Rauh, L. Weidinger, M. Chadwick, P. Thacker et al.,
and P. Miłoś, “Focused transformer: Contrastive training for context “Improving alignment of dialogue agents via targeted human judge-
scaling,” arXiv preprint arXiv:2307.03170, 2023. ments,” arXiv preprint arXiv:2209.14375, 2022.
[72] D. Mahan, R. Carlow, L. Castricato, N. Cooper, [91] A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski,
and C. Laforte, “Stable beluga models.” [Online]. V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo et al.,
“Solving quantitative reasoning problems with language models,” [113] E. Nijkamp, H. Hayashi, C. Xiong, S. Savarese, and Y. Zhou,
Advances in Neural Information Processing Systems, vol. 35, pp. “Codegen2: Lessons for training llms on programming and natural
3843–3857, 2022. languages,” arXiv preprint arXiv:2305.02309, 2023.
[92] Y. Tay, M. Dehghani, V. Q. Tran, X. Garcia, D. Bahri, T. Schuster, [114] L. Tunstall, E. Beeching, N. Lambert, N. Rajani, K. Rasul, Y. Belkada,
H. S. Zheng, N. Houlsby, and D. Metzler, “Unifying language learning S. Huang, L. von Werra, C. Fourrier, N. Habib et al., “Zephyr: Direct
paradigms,” arXiv preprint arXiv:2205.05131, 2022. distillation of lm alignment,” arXiv preprint arXiv:2310.16944, 2023.
[93] T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, [115] X. team. Grok. [Online]. Available: https://grok.x.ai/
R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé et al., “Bloom: A 176b-
[116] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou,
parameter open-access multilingual language model,” arXiv preprint
and J. Zhou, “Qwen-vl: A frontier large vision-language model with
arXiv:2211.05100, 2022.
versatile abilities,” arXiv preprint arXiv:2308.12966, 2023.
[94] A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu,
[117] mixtral. mixtral. [Online]. Available: https://mistral.ai/news/
W. Zheng, X. Xia et al., “Glm-130b: An open bilingual pre-trained
mixtral-of-experts/
model,” arXiv preprint arXiv:2210.02414, 2022.
[95] S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O’Brien, [118] D. Wang, N. Raman, M. Sibue, Z. Ma, P. Babkin, S. Kaur, Y. Pei,
E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff et al., A. Nourbakhsh, and X. Liu, “Docllm: A layout-aware generative
“Pythia: A suite for analyzing large language models across train- language model for multimodal document understanding,” 2023.
ing and scaling,” in International Conference on Machine Learning. [119] D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi,
PMLR, 2023, pp. 2397–2430. Y. Wu, Y. K. Li, F. Luo, Y. Xiong, and W. Liang, “Deepseek-coder:
[96] S. Mukherjee, A. Mitra, G. Jawahar, S. Agarwal, H. Palangi, and When the large language model meets programming – the rise of code
A. Awadallah, “Orca: Progressive learning from complex explanation intelligence,” 2024.
traces of gpt-4,” arXiv preprint arXiv:2306.02707, 2023. [120] F. Wan, X. Huang, D. Cai, X. Quan, W. Bi, and S. Shi, “Knowledge
[97] R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, fusion of large language models,” 2024.
M. Marone, C. Akiki, J. Li, J. Chim et al., “Starcoder: may the source [121] P. Zhang, G. Zeng, T. Wang, and W. Lu, “Tinyllama: An open-source
be with you!” arXiv preprint arXiv:2305.06161, 2023. small language model,” 2024.
[98] S. Huang, L. Dong, W. Wang, Y. Hao, S. Singhal, S. Ma, T. Lv, [122] C. Wu, Y. Gan, Y. Ge, Z. Lu, J. Wang, Y. Feng, P. Luo, and Y. Shan,
L. Cui, O. K. Mohammed, Q. Liu et al., “Language is not all you “Llama pro: Progressive llama with block expansion,” 2024.
need: Aligning perception with language models,” arXiv preprint
[123] X. Amatriain, A. Sankar, J. Bing, P. K. Bodigutla, T. J. Hazen, and
arXiv:2302.14045, 2023.
M. Kazi, “Transformer models: an introduction and catalog,” 2023.
[99] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut,
J. Schalkwyk, A. M. Dai, A. Hauth et al., “Gemini: a family of highly [124] G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, A. Cappelli,
capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023. H. Alobeidli, B. Pannier, E. Almazrouei, and J. Launay, “The refined-
web dataset for falcon llm: outperforming curated corpora with web
[100] W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, data, and web data only,” arXiv preprint arXiv:2306.01116, 2023.
J. Tompson, I. Mordatch, Y. Chebotar et al., “Inner monologue:
Embodied reasoning through planning with language models,” arXiv [125] D. Hernandez, T. Brown, T. Conerly, N. DasSarma, D. Drain, S. El-
preprint arXiv:2207.05608, 2022. Showk, N. Elhage, Z. Hatfield-Dodds, T. Henighan, T. Hume et al.,
“Scaling laws and interpretability of learning from repeated data,”
[101] S. Smith, M. Patwary, B. Norick, P. LeGresley, S. Rajbhandari, arXiv preprint arXiv:2205.10487, 2022.
J. Casper, Z. Liu, S. Prabhumoye, G. Zerveas, V. Korthikanti
et al., “Using deepspeed and megatron to train megatron-turing [126] P. Shaw, J. Uszkoreit, and A. Vaswani, “Self-attention with relative
nlg 530b, a large-scale generative language model,” arXiv preprint position representations,” arXiv preprint arXiv:1803.02155, 2018.
arXiv:2201.11990, 2022. [127] J. Su, Y. Lu, S. Pan, B. Wen, and Y. Liu, “Roformer: En-
[102] I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long- hanced transformer with rotary position embedding,” arXiv preprint
document transformer,” arXiv preprint arXiv:2004.05150, 2020. arXiv:2104.09864, 2021.
[103] S. Iyer, X. V. Lin, R. Pasunuru, T. Mihaylov, D. Simig, P. Yu, K. Shus- [128] O. Press, N. A. Smith, and M. Lewis, “Train short, test long: Attention
ter, T. Wang, Q. Liu, P. S. Koura et al., “Opt-iml: Scaling language with linear biases enables input length extrapolation,” arXiv preprint
model instruction meta learning through the lens of generalization,” arXiv:2108.12409, 2021.
arXiv preprint arXiv:2212.12017, 2022. [129] G. Ke, D. He, and T.-Y. Liu, “Rethinking positional encoding in
[104] Y. Hao, H. Song, L. Dong, S. Huang, Z. Chi, W. Wang, S. Ma, language pre-training,” arXiv preprint arXiv:2006.15595, 2020.
and F. Wei, “Language models are general-purpose interfaces,” arXiv [130] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton,
preprint arXiv:2206.06336, 2022. and J. Dean, “Outrageously large neural networks: The sparsely-gated
[105] Z. Sun, Y. Shen, Q. Zhou, H. Zhang, Z. Chen, D. Cox, Y. Yang, mixture-of-experts layer,” arXiv preprint arXiv:1701.06538, 2017.
and C. Gan, “Principle-driven self-alignment of language mod- [131] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling
els from scratch with minimal human supervision,” arXiv preprint to trillion parameter models with simple and efficient sparsity,” The
arXiv:2305.03047, 2023. Journal of Machine Learning Research, vol. 23, no. 1, pp. 5232–5270,
[106] W. E. team, “Palmyra-base Parameter Autoregressive Language 2022.
Model,” https://dev.writer.com, 2023. [132] R. K. Mahabadi, S. Ruder, M. Dehghani, and J. Henderson,
[107] ——, “Camel-5b instructgpt,” https://dev.writer.com, 2023. “Parameter-efficient multi-task fine-tuning for transformers via shared
[108] Yandex. Yalm. [Online]. Available: https://github.com/yandex/ hypernetworks,” 2021.
YaLM-100B [133] S. Zhang, L. Dong, X. Li, S. Zhang, X. Sun, S. Wang, J. Li, R. Hu,
[109] M. Team et al., “Introducing mpt-7b: a new standard for open-source, T. Zhang, F. Wu, and G. Wang, “Instruction tuning for large language
commercially usable llms,” 2023. models: A survey,” 2023.
[110] A. Mitra, L. D. Corro, S. Mahajan, A. Codas, C. Simoes, S. Agarwal, [134] S. Mishra, D. Khashabi, C. Baral, and H. Hajishirzi, “Cross-task
X. Chen, A. Razdaibiedina, E. Jones, K. Aggarwal, H. Palangi, generalization via natural language crowdsourcing instructions,” arXiv
G. Zheng, C. Rosset, H. Khanpour, and A. Awadallah, “Orca 2: preprint arXiv:2104.08773, 2021.
Teaching small language models how to reason,” 2023. [135] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi,
[111] L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and and H. Hajishirzi, “Self-instruct: Aligning language model with self
G. Neubig, “Pal: Program-aided language models,” in International generated instructions,” arXiv preprint arXiv:2212.10560, 2022.
Conference on Machine Learning. PMLR, 2023, pp. 10 764–10 799. [136] K. Ethayarajh, W. Xu, D. Jurafsky, and D. Kiela. Kto. [Online].
[112] Anthropic. claude. [Online]. Available: https://www.anthropic.com/ Available: https://github.com/ContextualAI/HALOs/blob/main/assets/
news/introducing-claude report.pdf
[137] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and Association for Computational Linguistics, vol. 10, pp. 1066–1083,
D. Amodei, “Deep reinforcement learning from human preferences,” 2022. [Online]. Available: https://aclanthology.org/2022.tacl-1.62
Advances in neural information processing systems, vol. 30, 2017. [154] S. Santhanam, B. Hedayatnia, S. Gella, A. Padmakumar, S. Kim,
[138] H. Lee, S. Phatale, H. Mansoor, K. Lu, T. Mesnard, C. Bishop, V. Car- Y. Liu, and D. Z. Hakkani-Tür, “Rome was built in 1776: A case study
bune, and A. Rastogi, “Rlaif: Scaling reinforcement learning from on factual correctness in knowledge-grounded response generation,”
human feedback with ai feedback,” arXiv preprint arXiv:2309.00267, ArXiv, vol. abs/2110.05456, 2021.
2023. [155] S. Min, K. Krishna, X. Lyu, M. Lewis, W. tau Yih, P. W. Koh, M. Iyyer,
[139] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and L. Zettlemoyer, and H. Hajishirzi, “Factscore: Fine-grained atomic
C. Finn, “Direct preference optimization: Your language model is evaluation of factual precision in long form text generation,” 2023.
secretly a reward model,” arXiv preprint arXiv:2305.18290, 2023. [156] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner,
[140] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, “Zero: Memory V. Chaudhary, and M. Young, “Machine learning: The high interest
optimizations toward training trillion parameter models,” in SC20: In- credit card of technical debt,” in SE4ML: Software Engineering for
ternational Conference for High Performance Computing, Networking, Machine Learning (NIPS 2014 Workshop), 2014.
Storage and Analysis. IEEE, 2020, pp. 1–16. [157] Z. Zhang, A. Zhang, M. Li, and A. Smola, “Automatic chain of thought
[141] B. Peng, E. Alcaide, Q. Anthony, A. Albalak, S. Arcadinho, H. Cao, prompting in large language models,” 2022.
X. Cheng, M. Chung, M. Grella, K. K. GV et al., “Rwkv: Reinventing [158] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and
rnns for the transformer era,” arXiv preprint arXiv:2305.13048, 2023. K. Narasimhan, “Tree of thoughts: Deliberate problem solving with
[142] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, large language models,” 2023.
and W. Chen, “Lora: Low-rank adaptation of large language models,” [159] P. Manakul, A. Liusie, and M. J. F. Gales, “Selfcheckgpt: Zero-
arXiv preprint arXiv:2106.09685, 2021. resource black-box hallucination detection for generative large lan-
[143] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a guage models,” 2023.
neural network,” arXiv preprint arXiv:1503.02531, 2015. [160] N. Shinn, F. Cassano, E. Berman, A. Gopinath, K. Narasimhan,
[144] J. Gou, B. Yu, S. J. Maybank, and D. Tao, “Knowledge distillation: and S. Yao, “Reflexion: Language agents with verbal reinforcement
A survey,” International Journal of Computer Vision, vol. 129, pp. learning,” 2023.
1789–1819, 2021. [161] S. J. Zhang, S. Florin, A. N. Lee, E. Niknafs, A. Marginean, A. Wang,
[145] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. K. Tyser, Z. Chin, Y. Hicke, N. Singh, M. Udell, Y. Kim, T. Buonassisi,
Bang, A. Madotto, and P. Fung, “Survey of hallucination in natural A. Solar-Lezama, and I. Drori, “Exploring the mit mathematics and
language generation,” ACM Comput. Surv., vol. 55, no. 12, mar 2023. eecs curriculum using large language models,” 2023.
[Online]. Available: https://doi.org/10.1145/3571730 [162] T. Wu, E. Jiang, A. Donsbach, J. Gray, A. Molina, M. Terry, and C. J.
[146] N. McKenna, T. Li, L. Cheng, M. J. Hosseini, M. Johnson, and Cai, “Promptchainer: Chaining large language model prompts through
M. Steedman, “Sources of hallucination by large language models on visual programming,” 2022.
inference tasks,” 2023. [163] Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis, H. Chan, and
[147] C.-Y. Lin, “ROUGE: A package for automatic evaluation of J. Ba, “Large language models are human-level prompt engineers,”
summaries,” in Text Summarization Branches Out. Barcelona, Spain: 2023.
Association for Computational Linguistics, Jul. 2004, pp. 74–81. [164] P. S. H. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin,
[Online]. Available: https://aclanthology.org/W04-1013 N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and
[148] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for D. Kiela, “Retrieval-augmented generation for knowledge-intensive
automatic evaluation of machine translation,” in Proceedings of the NLP tasks,” CoRR, vol. abs/2005.11401, 2020. [Online]. Available:
40th Annual Meeting of the Association for Computational Linguistics, https://arxiv.org/abs/2005.11401
P. Isabelle, E. Charniak, and D. Lin, Eds. Philadelphia, Pennsylvania, [165] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, and
USA: Association for Computational Linguistics, Jul. 2002, pp. 311– H. Wang, “Retrieval-augmented generation for large language models:
318. [Online]. Available: https://aclanthology.org/P02-1040 A survey,” arXiv preprint arXiv:2312.10997, 2023.
[149] B. Dhingra, M. Faruqui, A. Parikh, M.-W. Chang, D. Das, and [166] A. W. Services. (Year of publication, e.g., 2023) Question answering
W. Cohen, “Handling divergent reference texts when evaluating using retrieval augmented generation with foundation models in
table-to-text generation,” in Proceedings of the 57th Annual Meeting amazon sagemaker jumpstart. Accessed: Date of access, e.g.,
of the Association for Computational Linguistics, A. Korhonen, December 5, 2023. [Online]. Available: https://shorturl.at/dSV47
D. Traum, and L. Màrquez, Eds. Florence, Italy: Association [167] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, and X. Wu, “Unifying large
for Computational Linguistics, Jul. 2019, pp. 4884–4895. [Online]. language models and knowledge graphs: A roadmap,” arXiv preprint
Available: https://aclanthology.org/P19-1483 arXiv:2306.08302, 2023.
[150] Z. Wang, X. Wang, B. An, D. Yu, and C. Chen, “Towards faithful [168] Z. Jiang, F. F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi-Yu, Y. Yang,
neural table-to-text generation with content-matching constraints,” J. Callan, and G. Neubig, “Active retrieval augmented generation,”
in Proceedings of the 58th Annual Meeting of the Association 2023.
for Computational Linguistics, D. Jurafsky, J. Chai, N. Schluter,
[169] T. Schick, J. Dwivedi-Yu, R. Dessı̀, R. Raileanu, M. Lomeli, L. Zettle-
and J. Tetreault, Eds. Online: Association for Computational
moyer, N. Cancedda, and T. Scialom, “Toolformer: Language models
Linguistics, Jul. 2020, pp. 1072–1086. [Online]. Available: https:
can teach themselves to use tools,” 2023.
//aclanthology.org/2020.acl-main.101
[170] B. Paranjape, S. Lundberg, S. Singh, H. Hajishirzi, L. Zettlemoyer,
[151] H. Song, W.-N. Zhang, J. Hu, and T. Liu, “Generating persona consis-
and M. T. Ribeiro, “Art: Automatic multi-step reasoning and tool-use
tent dialogues by exploiting natural language inference,” Proceedings
for large language models,” 2023.
of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, pp.
8878–8885, Apr. 2020. [171] Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and Y. Zhuang, “Hugginggpt:
Solving ai tasks with chatgpt and its friends in huggingface,” arXiv
[152] O. Honovich, L. Choshen, R. Aharoni, E. Neeman, I. Szpektor, preprint arXiv:2303.17580, 2023.
and O. Abend, “q 2 : Evaluating factual consistency in knowledge-
grounded dialogues via question generation and question answering,” [172] Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang, J. Wang,
in Proceedings of the 2021 Conference on Empirical Methods in S. Jin, E. Zhou et al., “The rise and potential of large language model
Natural Language Processing, M.-F. Moens, X. Huang, L. Specia, based agents: A survey,” arXiv preprint arXiv:2309.07864, 2023.
and S. W.-t. Yih, Eds. Online and Punta Cana, Dominican Republic: [173] L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen,
Association for Computational Linguistics, Nov. 2021, pp. 7856–7870. J. Tang, X. Chen, Y. Lin et al., “A survey on large language model
[Online]. Available: https://aclanthology.org/2021.emnlp-main.619 based autonomous agents,” arXiv preprint arXiv:2308.11432, 2023.
[153] N. Dziri, H. Rashkin, T. Linzen, and D. Reitter, “Evaluating attribution [174] Z. Durante, Q. Huang, N. Wake, R. Gong, J. S. Park, B. Sarkar,
in dialogue systems: The BEGIN benchmark,” Transactions of the R. Taori, Y. Noda, D. Terzopoulos, Y. Choi, K. Ikeuchi, H. Vo, L. Fei-
Fei, and J. Gao, “Agent ai: Surveying the horizons of multimodal CoRR, vol. abs/2110.14168, 2021. [Online]. Available: https:
interaction,” arXiv preprint arXiv:2401.03568, 2024. //arxiv.org/abs/2110.14168
[175] B. Xu, Z. Peng, B. Lei, S. Mukherjee, Y. Liu, and D. Xu, “Rewoo: [191] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang,
Decoupling reasoning from observations for efficient augmented lan- D. Song, and J. Steinhardt, “Measuring mathematical problem solving
guage models,” 2023. with the MATH dataset,” CoRR, vol. abs/2103.03874, 2021. [Online].
[176] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, Available: https://arxiv.org/abs/2103.03874
“React: Synergizing reasoning and acting in language models,” 2023. [192] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi, “Hellaswag:
[177] V. Nair, E. Schumacher, G. Tso, and A. Kannan, “Dera: Enhanc- Can a machine really finish your sentence?” 2019.
ing large language model completions with dialog-enabled resolving [193] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick,
agents,” 2023. and O. Tafjord, “Think you have solved question answering? try
[178] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, arc, the AI2 reasoning challenge,” CoRR, vol. abs/1803.05457, 2018.
C. Wang, Y. Wang, W. Ye, Y. Zhang, Y. Chang, P. S. Yu, Q. Yang, [Online]. Available: http://arxiv.org/abs/1803.05457
and X. Xie, “A survey on evaluation of large language models,” 2023. [194] Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi, “PIQA:
[179] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, reasoning about physical commonsense in natural language,” CoRR,
C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, vol. abs/1911.11641, 2019. [Online]. Available: http://arxiv.org/abs/
L. Jones, M. Kelcey, M.-W. Chang, A. M. Dai, J. Uszkoreit, 1911.11641
Q. Le, and S. Petrov, “Natural questions: A benchmark for [195] M. Sap, H. Rashkin, D. Chen, R. L. Bras, and Y. Choi, “Socialiqa:
question answering research,” Transactions of the Association for Commonsense reasoning about social interactions,” CoRR, vol.
Computational Linguistics, vol. 7, pp. 452–466, 2019. [Online]. abs/1904.09728, 2019. [Online]. Available: http://arxiv.org/abs/1904.
Available: https://aclanthology.org/Q19-1026 09728
[180] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and [196] T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal, “Can a suit of
J. Steinhardt, “Measuring massive multitask language understanding,” armor conduct electricity? A new dataset for open book question
2021. answering,” CoRR, vol. abs/1809.02789, 2018. [Online]. Available:
[181] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, http://arxiv.org/abs/1809.02789
E. Jiang, C. Cai, M. Terry, Q. Le et al., “Program synthesis with large [197] S. Lin, J. Hilton, and O. Evans, “Truthfulqa: Measuring how models
language models,” arXiv preprint arXiv:2108.07732, 2021. mimic human falsehoods,” arXiv preprint arXiv:2109.07958, 2021.
[182] E. Choi, H. He, M. Iyyer, M. Yatskar, W.-t. Yih, Y. Choi, P. Liang, [198] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov,
and L. Zettlemoyer, “QuAC: Question answering in context,” in and C. D. Manning, “Hotpotqa: A dataset for diverse, explainable
Proceedings of the 2018 Conference on Empirical Methods in Natural multi-hop question answering,” CoRR, vol. abs/1809.09600, 2018.
Language Processing, E. Riloff, D. Chiang, J. Hockenmaier, and [Online]. Available: http://arxiv.org/abs/1809.09600
J. Tsujii, Eds. Brussels, Belgium: Association for Computational
Linguistics, Oct.-Nov. 2018, pp. 2174–2184. [Online]. Available: [199] Y. Zhuang, Y. Yu, K. Wang, H. Sun, and C. Zhang, “Toolqa: A
https://aclanthology.org/D18-1241 dataset for llm question answering with external tools,” arXiv preprint
arXiv:2306.13304, 2023.
[183] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo,
C. Burns, S. Puranik, H. He, D. Song, and J. Steinhardt, “Measuring [200] D. Chen, J. Bolton, and C. D. Manning, “A thorough examination
coding challenge competence with apps,” NeurIPS, 2021. of the cnn/daily mail reading comprehension task,” in Association for
Computational Linguistics (ACL), 2016.
[184] V. Zhong, C. Xiong, and R. Socher, “Seq2sql: Generating structured
queries from natural language using reinforcement learning,” arXiv [201] R. Nallapati, B. Zhou, C. Gulcehre, B. Xiang et al., “Abstractive text
preprint arXiv:1709.00103, 2017. summarization using sequence-to-sequence rnns and beyond,” arXiv
preprint arXiv:1602.06023, 2016.
[185] M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer, “TriviaQA:
A large scale distantly supervised challenge dataset for reading [202] Y. Bai and D. Z. Wang, “More than reading comprehension: A survey
comprehension,” in Proceedings of the 55th Annual Meeting of the on datasets and metrics of textual question answering,” arXiv preprint
Association for Computational Linguistics (Volume 1: Long Papers), arXiv:2109.12264, 2021.
R. Barzilay and M.-Y. Kan, Eds. Vancouver, Canada: Association [203] H.-Y. Huang, E. Choi, and W.-t. Yih, “Flowqa: Grasping flow in
for Computational Linguistics, Jul. 2017, pp. 1601–1611. [Online]. history for conversational machine comprehension,” arXiv preprint
Available: https://aclanthology.org/P17-1147 arXiv:1810.06683, 2018.
[186] G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy, “RACE: Large-scale [204] S. Lee, J. Lee, H. Moon, C. Park, J. Seo, S. Eo, S. Koo, and H. Lim, “A
ReAding comprehension dataset from examinations,” in Proceedings survey on evaluation metrics for machine translation,” Mathematics,
of the 2017 Conference on Empirical Methods in Natural Language vol. 11, no. 4, p. 1006, 2023.
Processing, M. Palmer, R. Hwa, and S. Riedel, Eds. Copenhagen,
[205] J. Li, X. Cheng, W. X. Zhao, J.-Y. Nie, and J.-R. Wen, “Halueval:
Denmark: Association for Computational Linguistics, Sep. 2017, pp.
A large-scale hallucination evaluation benchmark for large language
785–794. [Online]. Available: https://aclanthology.org/D17-1082
models,” in Proceedings of the 2023 Conference on Empirical Methods
[187] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “SQuAD: 100,000+ in Natural Language Processing, 2023, pp. 6449–6464.
questions for machine comprehension of text,” in Proceedings of
[206] Simon Mark Hughes, “Hughes hallucination evaluation model
the 2016 Conference on Empirical Methods in Natural Language
(hhem) leaderboard,” 2024, https://huggingface.co/spaces/vectara/
Processing, J. Su, K. Duh, and X. Carreras, Eds. Austin, Texas:
Hallucination-evaluation-leaderboard, Last accessed on 2024-01-21.
Association for Computational Linguistics, Nov. 2016, pp. 2383–2392. [Online]. Available: https://aclanthology.org/D16-1264
[188] C. Clark, K. Lee, M. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova, “Boolq: Exploring the surprising difficulty of natural yes/no questions,” CoRR, vol. abs/1905.10044, 2019. [Online]. Available: http://arxiv.org/abs/1905.10044
[189] D. Khashabi, S. Chaturvedi, M. Roth, S. Upadhyay, and D. Roth, “Looking beyond the surface: a challenge set for reading comprehension over multiple sentences,” in Proceedings of North American Chapter of the Association for Computational Linguistics (NAACL), 2018.
[190] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman, “Training verifiers to solve math word problems,” arXiv preprint arXiv:2110.14168, 2021.
[207] S. Gunasekar, Y. Zhang, J. Aneja, C. C. T. Mendes, A. Del Giorno, S. Gopi, M. Javaheripi, P. Kauffmann, G. de Rosa, O. Saarikivi et al., “Textbooks are all you need,” arXiv preprint arXiv:2306.11644, 2023.
[208] Y. Li, S. Bubeck, R. Eldan, A. Del Giorno, S. Gunasekar, and Y. T. Lee, “Textbooks are all you need ii: phi-1.5 technical report,” arXiv preprint arXiv:2309.05463, 2023.
[209] M. Poli, S. Massaroli, E. Nguyen, D. Y. Fu, T. Dao, S. Baccus, Y. Bengio, S. Ermon, and C. Ré, “Hyena hierarchy: Towards larger convolutional language models,” 2023.
[210] M. Poli, J. Wang, S. Massaroli, J. Quesnelle, E. Nguyen, and A. Thomas, “StripedHyena: Moving Beyond Transformers with Hybrid Signal Processing Models,” 12 2023. [Online]. Available: https://github.com/togethercomputer/stripedhyena
[211] D. Y. Fu, S. Arora, J. Grogan, I. Johnson, S. Eyuboglu, A. W. Thomas, B. Spector, M. Poli, A. Rudra, and C. Ré, “Monarch mixer: A simple sub-quadratic gemm-based architecture,” 2023.
[212] G. J. McLachlan, S. X. Lee, and S. I. Rathnayake, “Finite mixture models,” Annual Review of Statistics and Its Application, vol. 6, pp. 355–378, 2019.
[213] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” arXiv preprint arXiv:2304.08485, 2023.
[214] S. Liu, H. Cheng, H. Liu, H. Zhang, F. Li, T. Ren, X. Zou, J. Yang, H. Su, J. Zhu, L. Zhang, J. Gao, and C. Li, “Llava-plus: Learning to use tools for creating multimodal agents,” arXiv preprint arXiv:2311.05437, 2023.
[215] S. Wu, H. Fei, L. Qu, W. Ji, and T.-S. Chua, “Next-gpt: Any-to-any multimodal llm,” arXiv preprint arXiv:2309.05519, 2023.
[216] N. N. Khasmakhi, M. Asgari-Chenaghlu, N. Asghar, P. Schaer, and D. Zühlke, “Convgenvismo: Evaluation of conversational generative vision models,” 2023.
[217] L. Sun, Y. Huang, H. Wang, S. Wu, Q. Zhang, C. Gao, Y. Huang, W. Lyu, Y. Zhang, X. Li et al., “Trustllm: Trustworthiness in large language models,” arXiv preprint arXiv:2401.05561, 2024.
[218] Microsoft. Deepspeed. [Online]. Available: https://github.com/microsoft/DeepSpeed
[219] HuggingFace. Transformers. [Online]. Available: https://github.com/huggingface/transformers
[220] Nvidia. Megatron. [Online]. Available: https://github.com/NVIDIA/Megatron-LM
[221] BMTrain. Bmtrain. [Online]. Available: https://github.com/OpenBMB/BMTrain
[222] EleutherAI. Gpt-neox. [Online]. Available: https://github.com/EleutherAI/gpt-neox
[223] Microsoft. LoRA. [Online]. Available: https://github.com/microsoft/LoRA
[224] ColossalAI. ColossalAI. [Online]. Available: https://github.com/hpcaitech/ColossalAI
[225] FastChat. FastChat. [Online]. Available: https://github.com/lm-sys/FastChat
[226] SkyPilot. SkyPilot. [Online]. Available: https://github.com/skypilot-org/skypilot
[227] vLLM. vLLM. [Online]. Available: https://github.com/vllm-project/vllm
[228] HuggingFace. Text-generation-inference. [Online]. Available: https://github.com/huggingface/text-generation-inference
[229] LangChain. LangChain. [Online]. Available: https://github.com/langchain-ai/langchain
[230] BentoML. OpenLLM. [Online]. Available: https://github.com/bentoml/OpenLLM
[231] Embedchain. Embedchain. [Online]. Available: https://github.com/embedchain/embedchain
[232] Microsoft. AutoGen. [Online]. Available: https://github.com/microsoft/autogen
[233] BabyAGI. BabyAGI. [Online]. Available: https://github.com/yoheinakajima/babyagi
[234] Guidance. Guidance. [Online]. Available: https://github.com/guidance-ai/guidance
[235] PromptTools. PromptTools. [Online]. Available: https://github.com/hegelai/prompttools
[236] Promptfoo. Promptfoo. [Online]. Available: https://github.com/promptfoo/promptfoo
[237] Facebook. Faiss. [Online]. Available: https://github.com/facebookresearch/faiss
[238] Milvus. Milvus. [Online]. Available: https://github.com/milvus-io/milvus
[239] Qdrant. Qdrant. [Online]. Available: https://github.com/qdrant/qdrant
[240] Weaviate. Weaviate. [Online]. Available: https://github.com/weaviate/weaviate
[241] LlamaIndex. Llama-index. [Online]. Available: https://github.com/run-llama/llama_index

APPENDIX

1. Open Source Toolkits For LLM Development and Deployment

There are various frameworks and libraries developed for LLM training, evaluation, and deployment, and covering every single framework is beyond this paper's scope. We instead provide a brief introduction to some of the most popular ones, grouped into different categories.

A. LLM Training/Inference Frameworks

Some of the popular frameworks that are useful for LLM training include (note that some of them can be used beyond LLM training too):

DeepSpeed [218] is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective. DeepSpeed has enabled some of the world's most powerful language models, such as MT-530B and BLOOM, and is an easy-to-use software suite that powers unprecedented scale and speed for both training and inference.
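With DeepSpeed, a standard PyTorch model is wrapped into a distributed training engine driven by a JSON-style configuration. The following is a minimal sketch of that workflow; the toy model, the ZeRO stage, and the optimizer settings are illustrative choices, and the script would normally be launched on one or more GPUs via the deepspeed launcher.

```python
# Minimal DeepSpeed training sketch; configuration values are illustrative.
# Typically launched with: deepspeed train_sketch.py
import torch
import deepspeed

model = torch.nn.Linear(1024, 2)  # toy model standing in for a real LLM

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 2},  # ZeRO: partition optimizer states and gradients
}

# Wrap the model into a DeepSpeed engine (returns engine, optimizer, dataloader, scheduler)
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

for _ in range(10):  # dummy training loop over random data
    x = torch.randn(8, 1024, device=engine.device)
    y = torch.randint(0, 2, (8,), device=engine.device)
    loss = torch.nn.functional.cross_entropy(engine(x), y)
    engine.backward(loss)  # DeepSpeed handles scaling and gradient partitioning
    engine.step()          # optimizer step and gradient zeroing
```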
Transformers [219] is a library by HuggingFace that provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio. Using pretrained models, one can reduce compute costs and carbon footprint, and save the time and resources required to train a model from scratch.
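For example, loading a pretrained checkpoint and generating a continuation takes only a few lines; GPT-2 is used below purely as a small illustrative checkpoint, and any causal language model hosted on the Hugging Face Hub could be substituted.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is a small illustrative checkpoint; any Hub model id can be used instead
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Large language models are", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```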
Megatron-LM [220] is a large-scale transformer training framework developed by the Applied Deep Learning Research team at NVIDIA. It supports efficient, model-parallel (tensor, sequence, and pipeline) and multi-node pre-training of transformer-based models such as GPT, BERT, and T5 using mixed precision.
BMTrain [221] is an efficient large model training toolkit that can be used to train large models with tens of billions of parameters. It can train models in a distributed manner while keeping the code as simple as stand-alone training.
GPT-NeoX [222] leverages many of the same features and technologies as the popular Megatron-DeepSpeed library but with substantially increased usability and novel optimizations.
The LoRA [223] library provides support for Low-Rank Adaptation of Large Language Models. It reduces the number of trainable parameters by learning pairs of rank-decomposition matrices while freezing the original weights. This vastly reduces the storage requirement for large language models adapted to specific tasks and enables efficient task-switching during deployment, all without introducing inference latency. LoRA also outperforms several other adaptation methods, including adapters, prefix-tuning, and fine-tuning.
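A rough sketch of how the library is used is given below; the rank r=8 and the layer sizes are purely illustrative. Selected dense layers are swapped for their LoRA counterparts, everything except the low-rank matrices is frozen, and only the small LoRA weights need to be saved.

```python
import torch
import torch.nn as nn
import loralib as lora

# Replace a dense layer with a LoRA-augmented layer of rank r=8 (illustrative value)
model = nn.Sequential(
    lora.Linear(768, 768, r=8),  # frozen weight W plus trainable rank-8 matrices A, B
    nn.ReLU(),
    nn.Linear(768, 10),
)

lora.mark_only_lora_as_trainable(model)  # freeze all non-LoRA parameters

# ... regular training loop ...

# Save only the small LoRA matrices instead of the full model
torch.save(lora.lora_state_dict(model), "lora_checkpoint.pt")
```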
The ColossalAI library [224] provides a collection of parallel components. It aims to let developers write distributed deep learning models the same way they write models for a single machine, and it offers user-friendly tools to kickstart distributed training and inference in a few lines. In terms of parallelism strategies, it supports data parallelism, pipeline parallelism, sequence parallelism, Zero Redundancy Optimizer (ZeRO) [140], and auto-parallelism.
B. Deployment Tools

We provide an overview of some of the most popular LLM deployment tools here.

FastChat [225] is an open platform for training, serving, and evaluating large language model based chatbots. FastChat's core features include the training and evaluation code for state-of-the-art models (e.g., Vicuna, MT-Bench) and a distributed multi-model serving system with a web UI and OpenAI-compatible RESTful APIs.

SkyPilot [226] is a framework for running LLMs, AI, and batch jobs on any cloud, offering maximum cost savings, highest GPU availability, and managed execution.

vLLM [227] is a fast and easy-to-use library for LLM inference and serving. vLLM seamlessly supports many Hugging Face models, including the following architectures: Aquila, Baichuan, BLOOM, ChatGLM, DeciLM, Falcon, GPT BigCode, LLaMA, LLaMA 2, Mistral, Mixtral, MPT, OPT, Qwen, Yi, and many more.
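A minimal offline-inference sketch is shown below; facebook/opt-125m is chosen only because it is small, and the sampling settings are arbitrary.

```python
from vllm import LLM, SamplingParams

# Small illustrative model; any Hugging Face model id supported by vLLM works
llm = LLM(model="facebook/opt-125m")
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["The capital of France is"], sampling)
for out in outputs:
    print(out.outputs[0].text)
```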
Text-generation-inference (TGI) [228] is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and more.

LangChain [229] is a framework for developing applications powered by language models. It enables applications that:

• Are context-aware: connect a language model to sources of context (prompt instructions, few-shot examples, content to ground its response in, etc.)

• Reason: rely on a language model to reason (about how to answer based on provided context, what actions to take, etc.)
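The sketch below illustrates the basic prompt-plus-model chaining pattern, assuming an OpenAI API key is configured in the environment; note that LangChain's import paths and chain interfaces have shifted across versions, so this reflects the classic LLMChain style rather than the only way to write it.

```python
# Classic LLMChain-style usage (import paths vary across LangChain versions).
# Assumes OPENAI_API_KEY is set in the environment.
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

prompt = PromptTemplate.from_template(
    "Answer the question using only the context below.\n"
    "Context: {context}\nQuestion: {question}\nAnswer:"
)
chain = LLMChain(llm=OpenAI(temperature=0), prompt=prompt)

print(chain.run(context="Paris is the capital of France.",
                question="What is the capital of France?"))
```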
OpenLLM [230] is an open-source platform designed to facilitate the deployment and operation of large language models (LLMs) in real-world applications. With OpenLLM, you can run inference on any open-source LLM, deploy it on the cloud or on-premises, and build powerful AI applications.

Embedchain [231] is an open-source RAG framework that makes it easy to create and deploy AI apps. Embedchain streamlines the creation of RAG applications, offering a seamless process for managing various types of unstructured data. It efficiently segments data into manageable chunks, generates relevant embeddings, and stores them in a vector database for optimized retrieval.

AutoGen [232] is a framework that enables the development of LLM applications using multiple agents that can converse with each other to solve tasks. AutoGen agents are customizable, conversable, and seamlessly allow human participation. They can operate in various modes that employ combinations of LLMs, human inputs, and tools.
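A two-agent sketch in the common assistant/user-proxy pattern might look like the following; the model name and API key handling are placeholders, and code execution is kept local for simplicity.

```python
# Two-agent conversation sketch (model name and key handling are placeholders).
from autogen import AssistantAgent, UserProxyAgent

llm_config = {"config_list": [{"model": "gpt-4", "api_key": "YOUR_API_KEY"}]}

assistant = AssistantAgent("assistant", llm_config=llm_config)
user_proxy = UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",  # fully automated run, no human in the loop
    code_execution_config={"work_dir": "coding", "use_docker": False},
)

# The user proxy sends the task and executes any code the assistant writes back
user_proxy.initiate_chat(
    assistant,
    message="Write a Python function that reverses a string and test it.",
)
```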
BabyAGI [233] is an autonomous artificial intelligence agent designed to generate and execute tasks based on given objectives. It harnesses cutting-edge technologies from OpenAI, Pinecone, LangChain, and Chroma to automate tasks and achieve specific goals.

C. Prompting Libraries

Guidance [234] is a programming paradigm that offers superior control and efficiency compared to conventional prompting and chaining. It allows users to constrain generation (e.g., with regexes and CFGs) as well as to interleave control (conditionals, loops) and generation seamlessly.

PromptTools [235] offers a set of open-source, self-hostable tools for experimenting with, testing, and evaluating LLMs, vector databases, and prompts. The core idea is to enable developers to evaluate using familiar interfaces like code, notebooks, and a local playground.

PromptBench [?] is a PyTorch-based Python package for evaluation of Large Language Models (LLMs). It provides user-friendly APIs for researchers to conduct evaluation of LLMs.

Promptfoo [236] is a tool for testing and evaluating LLM output quality. It systematically tests prompts, models, and RAG pipelines with predefined test cases.
D. VectorDB

Faiss [237] is a library developed by Facebook AI Research that provides efficient similarity search and clustering of dense vectors. It is designed for use with large-scale, high-dimensional data and supports several index types and algorithms for various use cases.
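As an example, an exact (brute-force) L2 index over random vectors, standing in for real embeddings, can be built and queried in a few lines; the dimensionality and corpus size below are arbitrary.

```python
import faiss
import numpy as np

d = 128  # embedding dimensionality (illustrative)
database = np.random.random((10_000, d)).astype("float32")  # stand-in for real embeddings
queries = np.random.random((5, d)).astype("float32")

index = faiss.IndexFlatL2(d)  # exact L2 search; IVF/HNSW indexes trade accuracy for speed
index.add(database)
distances, neighbors = index.search(queries, 4)  # 4 nearest neighbours per query
print(neighbors)
```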
Milvus [238] is an open-source vector database built to power embedding similarity search and AI applications. Milvus makes unstructured data search more accessible and provides a consistent user experience regardless of the deployment environment.

Qdrant [239] is a vector similarity search engine and vector database. It provides a production-ready service with a convenient API to store, search, and manage points (vectors with an additional payload), and it is tailored to extended filtering support.
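A small sketch with the official Python client, run here against the client's local in-memory mode with toy four-dimensional vectors, shows the upsert-then-search flow together with a payload; the collection name and values are illustrative.

```python
# Qdrant Python client sketch using local in-memory mode (toy 4-d vectors).
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(":memory:")  # or QdrantClient(url="http://localhost:6333") for a server

client.recreate_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=4, distance=Distance.COSINE),
)

client.upsert(
    collection_name="docs",
    points=[PointStruct(id=1, vector=[0.1, 0.2, 0.3, 0.4], payload={"source": "example"})],
)

hits = client.search(collection_name="docs", query_vector=[0.1, 0.2, 0.3, 0.4], limit=3)
print(hits)
```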
Weaviate [240] is an open-source, GraphQL-based vector search engine that enables similarity search on high-dimensional data. While it is open-source, the commercial version offers additional features, support, and managed services.

Some of the other popular options include LlamaIndex [241] and Pinecone.