0% found this document useful (0 votes)

159 views43 pages

Large Language Models: A Survey

This document provides a survey of large language models (LLMs). It discusses the evolution of language models from statistical models to modern transformer-based LLMs. It reviews popular LLM families like GPT, LLaMA, and PaLM and discusses their characteristics. It also surveys datasets and evaluation metrics used for LLM training and evaluation and compares the performance of popular LLMs on benchmarks.

Uploaded by

Lucas Vital Alves da Silva

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

159 views43 pages

Large Language Models: A Survey

Uploaded by

Lucas Vital Alves da Silva

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 43

Large Language Models: A Survey

Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu

Richard Socher, Xavier Amatriain, Jianfeng Gao

Abstract—Large Language Models (LLMs) have drawn a that have different starting points and velocity: statistical lan-
lot of attention due to their strong performance on a wide guage models, neural language models, pre-trained language
range of natural language tasks, since the release of ChatGPT models and LLMs.
in November 2022. LLMs’ ability of general-purpose language
understanding and generation is acquired by training billions of Statistical language models (SLMs) view text as a sequence
model’s parameters on massive amounts of text data, as predicted of words, and estimate the probability of text as the product
arXiv:2402.06196v1 [cs.CL] 9 Feb 2024

by scaling laws [1], [2]. The research area of LLMs, while very of their word probabilities. The dominating form of SLMs
recent, is evolving rapidly in many different ways. In this paper, are Markov chain models known as the n-gram models,
we review some of the most prominent LLMs, including three which compute the probability of a word conditioned on its
popular LLM families (GPT, LLaMA, PaLM), and discuss their
characteristics, contributions and limitations. We also give an
immediate proceeding n − 1 words. Since word probabilities
overview of techniques developed to build, and augment LLMs. are estimated using word and n-gram counts collected from
We then survey popular datasets prepared for LLM training, text corpora, the model needs to deal with data sparsity (i.e.,
fine-tuning, and evaluation, review widely used LLM evaluation assigning zero probabilities to unseen words or n-grams) by
metrics, and compare the performance of several popular LLMs using smoothing, where some probability mass of the model
on a set of representative benchmarks. Finally, we conclude is reserved for unseen n-grams [12]. N-gram models are
the paper by discussing open challenges and future research widely used in many NLP systems. However, these models
directions. are incomplete in that they cannot fully capture the diversity
and variability of natural language due to data sparsity.
I. I NTRODUCTION
Early neural language models (NLMs) [13], [14], [15], [16]
Language modeling is a long-standing research topic, dat- deal with data sparsity by mapping words to low-dimensional
ing back to the 1950s with Shannon’s application of informa- continuous vectors (embedding vectors) and predict the next
tion theory to human language, where he measured how well word based on the aggregation of the embedding vectors of
simple n-gram language models predict or compress natural its proceeding words using neural networks. The embedding
language text [3]. Since then, statistical language modeling vectors learned by NLMs define a hidden space where the
became fundamental to many natural language understanding semantic similarity between vectors can be readily computed
and generation tasks, ranging from speech recognition, ma- as their distance. This opens the door to computing semantic
chine translation, to information retrieval [4], [5], [6]. similarity of any two inputs regardless their forms (e.g., queries
vs. documents in Web search [17], [18], sentences in different
The recent advances on transformer-based large language languages in machine translation [19], [20]) or modalities (e.g.,
models (LLMs), pretrained on Web-scale text corpora, signif- image and text in image captioning [21], [22]). Early NLMs are
icantly extended the capabilities of language models (LLMs). task-specific models, in that they are trained on task-specific
For example, OpenAI’s ChatGPT and GPT-4 can be used not data and their learned hidden space is task-specific.
only for natural language processing, but also as general task
solvers to power Microsoft’s Co-Pilot systems, for instance, Pre-trained language models (PLMs), unlike early NLMs,
can follow human instructions of complex new tasks per- are task-agnostic. This generality also extends to the learned
forming multi-step reasoning when needed. LLMs are thus hidden embedding space. The training and inference of PLMs
becoming the basic building block for the development of follows the pre-training and fine-tuning paradigm, where lan-
general-purpose AI agents or artificial general intelligence guage models with recurrent neural networks [23] or trans-
(AGI). formers [24], [25], [26] are pre-trained on Web-scale unlabeled
text corpora for general tasks such as word prediction, and then
As the field of LLMs is moving fast, with new findings, finetuned to specific tasks using small amounts of (labeled)
models and techniques being published in a matter of months task-specific data. Recent surveys on PLMs include [8], [27],
or weeks [7], [8], [9], [10], [11], AI researchers and practi- [28].
tioners often find it challenging to figure out the best recipes
to build LLM-powered AI systems for their tasks. This paper Large language models (LLMs) mainly refer to
gives a timely survey of the recent advances on LLMs. We transformer-based neural language models 1 that contain
hope this survey will prove a valuable and accessible resource tens to hundreds of billions of parameters, which are pre-
for students, researchers and developers. trained on massive text data, such as PaLM [31], LLaMA
[32], and GPT-4 [33], as summarized in Table III. Compared
LLMs are large-scale, pre-trained, statistical language mod-
els based on neural networks. The recent success of LLMs is 1 Recently, several very promising non-transformer LLMs have been pro-
an accumulation of decodes of research and development of posed, such as the LLMs based on structured state space models [29], [30].
language models, which can be categorized into four waves See Section VII for more details.
Summarization

Multi choice QA
Simplification

Boolean QA Reading Comprehension Arithmetic

XNLI
Symbolic
Crosslingual QA
Comprehension Logical
Translation Common Sense
Self-refinement
Crosslingual Tasks Self-cirtisim

Reasoning
Multilingual
Completion Self-improvement
Step by step
Wikipedia QA Multi choice QA Few-shot
solving
Tool planning
Function Calling
Turn based Task definition Physical acting
Symbolic Task
reference decomposition Knowledge base
Pos/Neg example
utilization
World Instruction Virtual acting
knowledge API calling following Assignment
In-context planning Tool
learning utilization
Interacting
Coding
with users

Basic Emerging Augmented

LLM Capabilities

Fig. 1: LLM Capabilities.

to PLMs, LLMs are not only much larger in model size, but LLMs are used, and augmented for real-world applications
also exhibit stronger language understanding and generation Sections V and VI review popular datasets and benchmarks for
abilities, and more importantly, emergent abilities that are evaluating LLMs, and summarize the reported LLM evaluation
not present in smaller-scale language models. As illustrated results. Finally, Section VII concludes the paper by summa-
in Fig. 1, these emergent abilities include (1) in-context rizing the challenges and future research directions.
learning, where LLMs learn a new task from a small set
of examples presented in the prompt at inference time, (2)
instruction following, where LLMs, after instruction tuning, II. L ARGE L ANGUAGE M ODELS
can follow the instructions for new types of tasks without In this section we start with a review of early pre-trained
using explicit examples, and (3) multi-step reasoning, where neural language models as they are the base of LLMs, and
LLMs can solve a complex task by breaking down that task then focus our discussion on three families of LLMs: GPT,
into intermediate reasoning steps as demonstrated in the LlaMA, and PaLM. Table I provides an overview of some of
chain-of-thought prompt [34]. LLMs can also be augmented these models and their characteristics.
by using external knowledge and tools [35], [36] so that
they can effectively interact with users and environment [37],
and continually improve itself using feedback data collected A. Early Pre-trained Neural Language Models
through interactions (e.g. via reinforcement learning with
human feedback (RLHF)). Language modeling using neural networks was pioneered
by [38], [39], [40]. Bengio et al. [13] developed one of the first
Through advanced usage and augmentation techniques, neural language models (NLMs) that are comparable to n-gram
LLMs can be deployed as so-called AI agents: artificial entities models. Then, [14] successfully applied NLMs to machine
that sense their environment, make decisions, and take actions. translation. The release of RNNLM (an open source NLM
Previous research has focused on developing agents for specific toolkit) by Mikolov [41], [42] helped significantly popularize
tasks and domains. The emergent abilities demonstrated by NLMs. Afterwards, NLMs based on recurrent neural networks
LLMs make it possible to build general-purpose AI agents (RNNs) and their variants, such as long short-term memory
based on LLMs. While LLMs are trained to produce responses (LSTM) [19] and gated recurrent unit (GRU) [20], were widely
in static settings, AI agents need to take actions to interact with used for many natural language applications including machine
dynamic environment. Therefore, LLM-based agents often translation, text generation and text classification [43].
need to augment LLMs to e.g., obtain updated information
from external knowledge bases, verify whether a system action Then, the invention of the Transformer architecture [44]
produces the expected result, and cope with when things do marks another milestone in the development of NLMs. By
not go as expected, etc. We will discuss in detail LLM-based applying self-attention to compute in parallel for every word
agents in Section IV. in a sentence or document an “attention score” to model the
influence each word has on another, Transformers allow for
In the rest of this paper, Section II presents an overview of much more parallelization than RNNs, which makes it possible
state of the art of LLMs, focusing on three LLM families (GPT, to efficiently pre-train very big language models on large
LLaMA and PaLM) and other representative models. Section amounts of data on GPUs. These pre-trained language models
III discusses how LLMs are built. Section IV discusses how (PLMs) can be fine-tuned for many downstream tasks.
Paper Strcuture

II Large Language Models

Early Pre-trained III HOW LLMS ARE BUILT

A
Language Models
Dominant LLM
A Architectures E Model Pre-training
B Large Language
Model Families

B Data Cleaning F Fine-tuning and

Other Representative Instruction Tuning
C
LLMs
C Tokenizations G Alignment

D Positional Encoding H Decoding Strategies

I HOW LLMS ARE USED AND AUGMENTED

I Cost-Effective Training/Inference,
A LLM limitations Adaptation & Compression

B Using LLMs: Prompt Design

and Engineering

C Augmenting LLMs through V POPULAR DATASETS FOR LLMS

external knowledge - RAG

Datasets for Basic Tasks: language

A
D Using External Tools modeling/understanding/generation

Datasets for Emergent: ICL, reasoning,

E LLM Agents B
instruction following

C Datasets for Augmented: using

external knowledge/tools

VI PROMINENT LLMS’ PERFORMANCE

ON BENCHMARKS

A Popular Metrics for Evaluating LLMs

VII CHALLENGES AND FUTURE DIRECTIONS

B LLMs’ Performance on Different Tasks Smaller and more efficient

A
Language Models

B New Post-attention
Architectural Paradigms

C Multi-modal Models

D Improved LLM Usage and

Augmentation techniques

D Security and
Ethical/Responsible AI

Fig. 2: The paper structure.

We group early popular Transformer-based PLMs, based on BERT (Birectional Encoder Representations from Trans-
their neural architectures, into three main categories: encoder- formers) [24] is one of the most widely used encoder-only
only, decoder-only, and encoder-decoder models. Comprehen- language models. BERT consists of three modules: (1) an
sive surveys of early PLMs are provided in [43], [28]. embedding module that converts input text into a sequence
of embedding vectors, (2) a stack of Transformer encoders
that converts embedding vectors into contextual representation
1) Encoder-only PLMs: As the name suggests, the encoder- vectors, and (3) a fully connected layer that converts the
only models only consist of an encoder network. These models representation vectors (at the final layer) to one-hot vectors.
are originally developed for language understanding tasks, BERT is pre-trained uses two objectives: masked language
such as text classification, where the models need to predict a modeling (MLM) and next sentence prediction. The pre-trained
class label for an input text. Representative encoder-only mod- BERT model can be fine-tuned by adding a classifier layer
els include BERT and its variants, e.g., RoBERTa, ALBERT, for many language understanding tasks, ranging from text
DeBERTa, XLM, XLNet, UNILM, as to be described below.
TABLE I: High-level Overview of Popular Language Models
Type Model Name #Parameters Release Base Models Open #Tokens Training dataset
Source
BERT 110M, 340M 2018 - ✓ 137B BooksCorpus, English Wikipedia
RoBERTa 355M 2019 - ✓ 2.2T BooksCorpus, English Wikipedia, CC-NEWS,
STORIES (a subset of Common Crawl), Reddit
ALBERT 12M, 18M, 60M, 2019 - ✓ 137B BooksCorpus, English Wikipedia
Encoder-Only
235M
DeBERTa - 2020 - ✓ - BooksCorpus, English Wikipedia, STORIES, Red-
dit content
XLNet 110M, 340M 2019 - ✓ 32.89B BooksCorpus, English Wikipedia, Giga5, Com-
mon Crawl, ClueWeb 2012-B
GPT-1 120M 2018 - ✓ 1.3B BooksCorpus
Decoder-only
GPT-2 1.5B 2019 - ✓ 10B Reddit outbound
T5 (Base) 223M 2019 - ✓ 156B Common Crawl
MT5 (Base) 300M 2020 - ✓ - New Common Crawl-based dataset in 101 lan-
Encoder-Decoder
guages (m Common Crawl)
BART (Base) 139M 2019 - ✓ - Corrupting text
GPT-3 125M, 350M, 2020 × 300B Common Crawl (filtered), WebText2, Books1,
760M, 1.3B, 2.7B, Books2, Wikipedia
6.7B, 13B, 175B
CODEX 12B 2021 GPT ✓ - Public GitHub software repositories
GPT Family
WebGPT 760M, 13B, 175B 2021 GPT-3 × - ELI5
GPT-4 1.76T 2023 - × 13T -
LLaMA1 7B, 13B, 33B, 65B 2023 - ✓ 1T, 1.4T Online sources
LLaMA2 7B, 13B, 34B, 70B 2023 - ✓ 2T Online sources
Alpaca 7B 2023 LLaMA1 ✓ - GPT-3.5
Vicuna-13B 13B 2023 LLaMA1 ✓ - GPT-3.5
Koala 13B 2023 LLaMA ✓ - Dialogue data
LLaMA Family
Mistral-7B 7.3B 2023 ✓ - -
Code Llama 34 2023 LLaMA2 ✓ 500B Publicly available code
LongLLaMA 3B, 7B 2023 OpenLLaMA ✓ 1T -
LLaMA-Pro-8B 8.3B 2024 LLaMA2-7B ✓ 80B Code and math corpora
TinyLlama-1.1B 1.1B 2024 LLaMA1.1B ✓ 3T SlimPajama, Starcoderdata
PaLM 8B, 62B, 540B 2022 - × 780B Web documents, books, Wikipedia, conversations,
GitHub code
U-PaLM 8B, 62B, 540B 2022 - × 1.3B Web documents, books, Wikipedia, conversations,
GitHub code
PaLM-2 340B 2023 - ✓ 3.6T Web documents, books, code, mathematics, con-
PaLM Family
versational data
Med-PaLM 540B 2022 PaLM × 780B HealthSearchQA, MedicationQA, LiveQA
Med-PaLM 2 - 2023 PaLM 2 × - MedQA, MedMCQA, HealthSearchQA, LiveQA,
MedicationQA
FLAN 137B 2021 LaMDA-PT ✓ - Web documents, code, dialog data, Wikipedia
Gopher 280B 2021 - × 300B MassiveText
ERNIE 4.0 10B 2023 - × 4TB Chinese text
Retro 7.5B 2021 - × 600B MassiveText
LaMDA 137B 2022 - × 168B public dialog data and web documents
ChinChilla 70B 2022 - × 1.4T MassiveText
Galactia-120B 120B 2022 - 450B
CodeGen 16.1B 2022 - ✓ - THE PILE, BIGQUERY, BIGPYTHON
Other Popular LLMs
BLOOM 176B 2022 - ✓ 366B ROOTS
Zephyr 7.24B 2023 Mistral-7B ✓ 800B Synthetic data
Grok-0 33B 2023 - × - Online source
ORCA-2 13B 2023 LLaMA2 - 2001B -
StartCoder 15.5B 2023 - ✓ 35B GitHub
MPT 7B 2023 - ✓ 1T RedPajama, m Common Crawl, S2ORC, Common
Crawl
Mixtral-8x7B 46.7B 2023 - ✓ - Instruction dataset
Falcon 180B 180B 2023 - ✓ 3.5T RefinedWeb
Gemini 1.8B, 3.25B 2023 ✓ - Web documents, books, and code, image data,
audio data, video data
DeepSeek-Coder 1.3B, 6.7B, 33B 2024 - ✓ 2T GitHub’s Markdown and StackExchange
DocLLM 1B,7B 2024 - × 2T IIT-CDIP Test Collection 1.0, DocBank

classification, question answering to language inference. A larger mini-batches and learning rates. ALBERT [45] uses two
high-level overview of BERT framework is shown in Fig 3. As parameter-reduction techniques to lower memory consumption
BERT significantly improved state of the art on a wide range and increase the training speed of BERT: (1) splitting the
of language understanding tasks when it was published, the AI embedding matrix into two smaller matrices, and (2) using
community was inspired to develop many similar encoder-only repeating layers split among groups. DeBERTa (Decoding-
language models based on BERT. enhanced BERT with disentangled attention) [26] improves the
BERT and RoBERTa models using two novel techniques. The
RoBERTa [25] significantly improves the robustness of first is the disentangled attention mechanism, where each word
BERT using a set of model design choices and training strate- is represented using two vectors that encode its content and
gies, such as modifying a few key hyperparameters, removing position, respectively, and the attention weights among words
the next-sentence pre-training objective and training with much
Fig. 3: Overall pre-training and fine-tuning procedures for
BERT. Courtesy of [24]

Fig. 5: Cross-lingual language model pretraining. The MLM

are computed using disentangled matrices on their contents and objective is similar to BERT, but with continuous streams
relative positions, respectively. Second, an enhanced mask de- of text as opposed to sentence pairs. The TLM objective
coder is used to incorporate absolute positions in the decoding extends MLM to pairs of parallel sentences. To predict a
layer to predict the masked tokens in model pre-training. In masked English word, the model can attend to both the English
addition, a novel virtual adversarial training method is used for sentence and its French translation, and is encouraged to align
fine-tuning to improve models’ generalization. ELECTRA [46] English and French representations. Courtesy of [47].
uses a new pre-training task, known as replaced token detection
(RTD), which is empirically proven to be more sample-efficient
than MLM. Instead of masking the input, RTD corrupts it by all permutations of the factorization order. UNILM (UNIfied
replacing some tokens with plausible alternatives sampled from pre-trained Language Model) [49] is pre-trained using three
a small generator network. Then, instead of training a model types of language modeling tasks: unidirectional, bidirectional,
that predicts the original identities of the corrupted tokens, a and sequence-to-sequence prediction. This is achieved by
discriminative model is trained to predict whether a token in employing a shared Transformer network and utilizing specific
the corrupted input was replaced by a generated sample or not. self-attention masks to control what context the prediction is
RTD is more sample-efficient than MLM because the former conditioned on, as illustrated in Fig 6. The pre-trained model
is defined over all input tokens rather than just the small subset can be fine-tuned for both natural language understanding and
being masked out, as illustrated in Fig 4. generation tasks.

Fig. 4: A comparison between replaced token detection and

masked language modeling. Courtesy of [46].

Fig. 6: Overview of unified LM pre-training. The model

XLMs [47] extended BERT to cross-lingual language parameters are shared across the LM objectives (i.e., bidirec-
models using two methods: (1) a unsupervised method that tional LM, unidirectional LM, and sequence-to-sequence LM).
only relies on monolingual data, and (2) a supervised method Courtesy of [49].
that leverages parallel data with a new cross-lingual language
model objective, as illustrated in Fig 5. XLMs had obtained
state-of-the-art results on cross-lingual classification, unsuper- 2) Decoder-only PLMs: Two of the most widely used
vised and supervised machine translation, at the time they were decoder-only PLMs are GPT-1 and GPT-2, developed by
proposed. OpenAI. These models lay the foundation to more powerful
LLMs subsequently, i.e., GPT-3 and GPT-4.
There are also encoder-only language models that leverage
the advantages of auto-regressive (decoder) models for model GPT-1 [50] demonstrates for the first time that good
training and inference. Two examples are XLNet and UNILM. performance over a wide range of natural language tasks can be
XLNet [48] is based on Transformer-XL, pre-trained using a obtained by Generative Pre-Training (GPT) of a decoder-only
generalized autoregressive method that enables learning bidi- Transformer model on a diverse corpus of unlabeled text in a
rectional contexts by maximizing the expected likelihood over self-supervised learning fashion (i.e., next word/token predic-
tion), followed by discriminative fine-tuning on each specific B. Large Language Model Families
downstream task (with much fewer samples), as illustrated in
Large language models (LLMs) mainly refer to
Fig 7. GPT-1 paves the way for subsequent GPT models, with
transformer-based PLMs that contain tens to hundreds
each version improving upon the architecture and achieving
of billions of parameters. Compared to PLMs reviewed above,
better performance on various language tasks.
LLMs are not only much larger in model size, but also exhibit
stronger language understanding and generation and emergent
abilities that are not present in smaller-scale models. In what
follows, we review three LLM families: GPT, LLaMA, and
PaLM, as illustrated in Fig 8.
1) The GPT Family: Generative Pre-trained Transform-
ers (GPT) are a family of decoder-only Transformer-based
language models, developed by OpenAI. This family con-
sists of GPT-1, GPT-2, GPT-3, InstrucGPT, ChatGPT, GPT-4,
CODEX, and WebGPT. Although early GPT models, such as
GPT-1 and GPT-2, are open-source, recent models, such as
GPT-3 and GPT-4, are close-source and can only be accessed
Fig. 7: High-level overview of GPT pretraining, and fine-tuning via APIs. GPT-1 and GPT-2 models have been discussed in
steps. Courtesy of OpenAI. the early PLM subsection. We start with GPT-3 below.
GPT-3 [56] is a pre-trained autoregressive language model
with 175 billion parameters. GPT-3 is widely considered as
the first LLM in that it not only is much larger than previous
GPT-2 [51] shows that language models are able to learn
PLMs, but also for the first time demonstrates emergent
to perform specific natural language tasks without any explicit
abilities that are not observed in previous smaller PLMs. GPT-
supervision when trained on a large WebText dataset consisting
3 shows the emergent ability of in-context learning, which
of millions of webpages. The GPT-2 model follows the model
means GPT-3 can be applied to any downstream tasks without
designs of GPT-1 with a few modifications: Layer normal-
any gradient updates or fine-tuning, with tasks and few-shot
ization is moved to the input of each sub-block, additional
demonstrations specified purely via text interaction with the
layer normalization is added after the final self-attention block,
model. GPT-3 achieved strong performance on many NLP
initialization is modified to account for the accumulation on
tasks, including translation, question-answering, and the cloze
the residual path and scaling the weights of residual layers,
tasks, as well as several ones that require on-the-fly reasoning
vocabulary size is expanded to 50,25, and context size is
or domain adaptation, such as unscrambling words, using a
increased from 512 to 1024 tokens.
novel word in a sentence, 3-digit arithmetic. Fig 9 plots the
performance of GPT-3 as a function of the number of examples
3) Encoder-Decoder PLMs: In [52], Raffle et al. shows that in in-context prompts.
almost all NLP tasks can be cast as a sequence-to-sequence
generation task. Thus, an encoder-decoder language model, by CODEX [57], released by OpenAI in March 2023, is a
design, is a unified model in that it can perform all natural general-purpose programming model that can parse natural
language understanding and generation tasks. Representative language and generate code in response. CODEX is a de-
encoder-decoder PLMs we will review below are T5, mT5, scendant of GPT-3, fine-tuned for programming applications
MASS, and BART. on code corpora collected from GitHub. CODEX powers
Microsoft’s GitHub Copilot.
T5 [52] is a Text-to-Text Transfer Transformer (T5) model, WebGPT [58] is another descendant of GPT-3, fine-tuned to
where transfer learning is effectively exploited for NLP via an answer open-ended questions using a text-based web browser,
introduction of a unified framework in which all NLP tasks are facilitating users to search and navigate the web. Specifically,
cast as a text-to-text generation task. mT5 [53] is a multilingual WebGPT is trained in three steps. The first is for WebGPT
variant of T5, which is pre-trained on a new Common Crawl- to learn to mimic human browsing behaviors using human
based dataset consisting of texts in 101 languages. demonstration data. Then, a reward function is learned to
predict human preferences. Finally, WebGPT is refined to
MASS (MAsked Sequence to Sequence pre-training) [54]
optimize the reward function via reinforcement learning and
adopts the encoder-decoder framework to reconstruct a sen-
rejection sampling.
tence fragment given the remaining part of the sentence. The
encoder takes a sentence with randomly masked fragment To enable LLMs to follow expected human instructions,
(several consecutive tokens) as input, and the decoder predicts InstructGPT [59] is proposed to align language models with
the masked fragment. In this way, MASS jointly trains the user intent on a wide range of tasks by fine-tuning with
encoder and decoder for language embedding and generation, human feedback. Starting with a set of labeler-written prompts
respectively. and prompts submitted through the OpenAI API, a dataset
of labeler demonstrations of the desired model behavior is
BART [55] uses a standard sequence-to-sequence transla- collected. Then GPT-3 is fine-tuned on this dataset. Then, a
tion model architecture. It is pre-trained by corrupting text with dataset of human-ranked model outputs is collected to further
an arbitrary noising function, and then learning to reconstruct fine-tune the model using reinforcement learning. The method
the original text. is known Reinforcement Learning from Human Feedback
WizardLM Giraffe
WebGPT
Tulu
InstructGPT Guanaco
CODEX
Long LLaMA PaLM2
GPT3.5 Turbo Mistral
Med-PaLM PaLM
code-davinci Gorilla
text-davinci Stable Beluga2
Med-PaLM2 PaLM-E
GPT4 Turbo
GPT4 Vision Vigogne Code LLaMA
GPT3 U-PaLM FLAN-PaLM

GPT2 Koala Baize

GPT4
GPT1 Vicuna Alpaca

GPT Family LLaMA 1/2 Family PaLM Family

T
GP

Fig. 8: Popular LLM Families.

launch of ChatGPT (Chat Generative Pre-trained Transformer)

[60] on November 30, 2022. ChatGPT is chatbot that enables
users to steer a conversation to complete a wide range of
tasks such as question answering, information seeking, text
summarization, and more. ChatGPT is powered by GPT-3.5
(and later by GPT-4), a sibling model to InstructGPT, which
is trained to follow an instruction in a prompt and provide a
detailed response.

GPT-4 [33] is the latest and most powerful LLM in the

GPT family. Launched in March, 2023, GPT-4 is a multi-
modal LLM in that it can take image and text as inputs and
Fig. 9: GPT-3 shows that larger models make increasingly produce text outputs. While still less capable than humans
efficient use of in-context information. It shows in-context in some of the most challenging real-world scenarios, GPT-4
learning performance on a simple task requiring the model to exhibits human-level performance on various professional and
remove random symbols from a word, both with and without academic benchmarks, including passing a simulated bar exam
a natural language task description. Courtesy of [56]. with a score around the top 10% of test takers, as shown in
Fig 11. Like early GPT models, GPT-4 was first pre-trained to
predict next tokens on large text corpora, and then fine-tuned
with RLHF to align model behaviors with human-desired ones.
(RLHF), as shown in 10. The resultant InstructGPT models
have shown improvements in truthfulness and reductions in 2) The LLaMA Family: LLaMA is a collection of founda-
toxic output generation while having minimal performance tion language models, released by Meta. Unlike GPT models,
regressions on public NLP datasets. LLaMA models are open-source, i.e., model weights are
released to the research community under a noncommercial
license. Thus, the LLaMA family grows rapidly as these
models are widely used by many research groups to develop
better open-source LLMs to compete the closed-source ones or
to develop task-specific LLMs for mission-critical applications.

The first set of LLaMA models [32] was released in Febru-

ary 2023, ranging from 7B to 65B parameters. These models
are pre-trained on trillions of tokens, collected from publicly
available datasets. LLaMA uses the transformer architecture of
GPT-3, with a few minor architectural modifications, including
(1) using a SwiGLU activation function instead of ReLU,
(2) using rotary positional embeddings instead of absolute
positional embedding, and (3) using root-mean-squared layer-
Fig. 10: The high-level overview of RLHF. Courtesy of [59]. normalization instead of standard layer-normalization. The
open-source LLaMA-13B model outperforms the proprietary
GPT-3 (175B) model on most benchmarks, making it a good
The most important milestone of LLM development is the baseline for LLM research.
collected from ShareGPT. Preliminary evaluation using GPT-
4 as a evaluator shows that Vicuna-13B achieves more than
90% quality of OpenAI’s ChatGPT, and Google’s Bard while
outperforming other models like LLaMA and Stanford Alpaca
in more than 90% of cases. 13 shows the relative response
quality of Vicuna and a few other well-known models by
GPT-4. Another advantage of Vicuna-13B is its relative limited
computational demand for model training. The training cost of
Vicuna-13B is merely $300.

Fig. 11: GPT-4 performance on academic and professional Fig. 13: Relative Response Quality of Vicuna and a few other
exams, compared with GPT 3.5. Courtesy of [33]. well-known models by GPT-4. Courtesy of Vicuna Team.

Like Alpaca and Vicuna, the Guanaco models [63] are also
In July 2023, Meta, in partnership with Microsoft, released finetuned LLaMA models using instruction-following data. But
the LLaMA-2 collection [61], which include both foundation the finetuning is done very efficiently using QLoRA such
language models and Chat models finetuned for dialog, known that finetuning a 65B parameter model can be done on a
as LLaMA-2 Chat. The LLaMA-2 Chat models were reported single 48GB GPU. QLoRA back-propagates gradients through
to outperform other open-source models on many public a frozen, 4-bit quantized pre-trained language model into Low
benchmarks. Fig 12 shows the training process of LLaMA-2 Rank Adapters (LoRA). The best Guanaco model outperforms
Chat. The process begins with pre-training LLaMA-2 using all previously released models on the Vicuna benchmark,
publicly available online data. Then, an initial version of reaching 99.3% of the performance level of ChatGPT while
LLaMA-2 Chat is built via supervised fine-tuning. Subse- only requiring 24 hours of fine-tuning on a single GPU.
quently, the model is iteratively refined using RLHF, rejection
Koala [64] is yet another instruction-following language
sampling and proximal policy optimization. In the RLHF stage,
model built on LLaMA, but with a specific focus on interaction
the accumulation of human feedback for revising the reward
data that include user inputs and responses generated by highly
model is crucial to prevent the reward model from being
capable closed-source chat models such as ChatGPT. The
changed too much, which could hurt the stability of LLaMA
Koala-13B model performs competitively with state-of-the-art
model training.
chat models according to human evaluation based on real-
world user prompts.
Mistral-7B [65] is a 7B-parameter language model engi-
neered for superior performance and efficiency. Mistral-7B
outperforms the best open-source 13B model (LLaMA-2-13B)
across all evaluated benchmarks, and the best open-source
34B model (LLaMA-34B) in reasoning, mathematics, and code
generation. This model leverages grouped-query attention for
faster inference, coupled with sliding window attention to
effectively handle sequences of arbitrary length with a reduced
inference cost.
The LLaMA family is growing rapidly, as more instruction-
Fig. 12: Training of LLaMA-2 Chat. Courtesy of [61]. following models have been built on LLaMA or LLaMA-
2, including Code LLaMA [66], Gorilla [67], Giraffe [68],
Vigogne [69], Tulu 65B [70], Long LLaMA [71], and Stable
Alpaca [62] is fine-tuned from the LLaMA-7B model using Beluga2 [72], just to name a few.
52K instruction-following demonstrations generated in the
3) The PaLM Family: The PaLM (Pathways Language
style of self-instruct using GPT-3.5 (text-davinci-003). Alpaca
Model) family are developed by Google. The first PaLM
is very cost-effective for training, especially for academic
model [31] was announced in April 2022 and remained private
research. On the self-instruct evaluation set, Alpaca performs
until March 2023. It is a 540B parameter transformer-based
similarly to GPT-3.5, despite that Alpaca is much smaller.
LLM. The model is pre-trained on a high-quality text corpus
The Vicuna team has developed a 13B chat model, Vicuna- consisting of 780 billion tokens that comprise a wide range
13B, by fine-tuning LLaMA on user-shared conversations of natural language tasks and use cases. PaLM is pre-trained
on 6144 TPU v4 chips using the Pathways system, which [77]. Med-PaLM 2 scored up to 86.5% on the MedQA
enables highly efficient training across multiple TPU Pods. dataset (i.e., a benchmark combining six existing open ques-
PaLM demonstrates continued benefits of scaling by achiev- tion answering datasets spanning professional medical exams,
ing state-of-the-art few-shot learning results on hundreds of research, and consumer queries), improving upon Med-PaLM
language understanding and generation benchmarks. PaLM- by over 19% and setting a new state-of-the-art.
540B outperforms not only state-of-the-art fine-tuned models
on a suite of multi-step reasoning tasks, but also on par with C. Other Representative LLMs
humans on the recently released BIG-bench benchmark. In addition to the models discussed in the previous sub-
The U-PaLM models of 8B, 62B, and 540B scales are sections, there are other popular LLMs which do not belong
continually trained on PaLM with UL2R, a method of continue to those three model families, yet they have achieved great
training LLMs on a few steps with UL2’s mixture-of-denoiser performance and have pushed the LLMs field forward. We
objective [73]. An approximately 2x computational savings briefly describe these LLMs in this subsection.
rate is reported. FLAN: In [78], Wei et al. explored a simple method for
U-PaLM is later instruction-finetuned as Flan-PaLM [74]. improving the zero-shot learning abilities of language models.
Compared to other instruction finetuning work mentioned They showed that instruction tuning language models on a
above, Flan-PaLM’s finetuning is performed using a much collection of datasets described via instructions substantially
larger number of tasks, larger model sizes, and chain-of- improves zero-shot performance on unseen tasks. They take
thought data. As a result, Flan-PaLM substantially outperforms a 137B parameter pretrained language model and instruction
previous instruction-following models. For instance, Flan- tune it on over 60 NLP datasets verbalized via natural language
PaLM-540B, which is instruction-finetuned on 1.8K tasks, instruction templates. They call this instruction-tuned model
outperforms PaLM-540B by a large margin (+9.4% on av- FLAN. Fig 15 provides a comparison of instruction tuning
erage). The finetuning data comprises 473 datasets, 146 task with pretrain–finetune and prompting.
categories, and 1,836 total tasks, as illustrated in Fig 14.

Fig. 15: comparison of instruction tuning with pre-

train–finetune and prompting. Courtesy of [78].

Gopher: In [79], Rae et al. presented an analysis of

Transformer-based language model performance across a wide
Fig. 14: Flan-PaLM finetuning consist of 473 datasets in above range of model scales — from models with tens of millions of
task categories. Courtesy of [74]. parameters up to a 280 billion parameter model called Gopher.
These models were evaluated on 152 diverse tasks, achieving
state-of-the-art performance across the majority. The number
PaLM-2 [75] is a more compute-efficient LLM with bet-
of layers, the key/value size, and other hyper-parameters of
ter multilingual and reasoning capabilities, compared to its
different model sizes are shown in Fig 16.
predecessor PaLM. PaLM-2 is trained using a mixture of
objectives. Through extensive evaluations on English, multi-
lingual, and reasoning tasks, PaLM-2 significantly improves
the model performance on downstream tasks across different
model sizes, while simultaneously exhibiting faster and more
efficient inference than PaLM.
Med-PaLM [76] is a domain-specific PaLM, and is de-
signed to provide high-quality answers to medical questions.
Med-PaLM is finetuned on PaLM using instruction prompt Fig. 16: Model architecture details of Gopher with different
tuning, a parameter-efficient method for aligning LLMs to number of parameters. Courtesy of [78].
new domains using a few exemplars. Med-PaLM obtains very
encouraging results on many healthcare tasks, although it is
still inferior to human clinicians. Med-PaLM 2 improves Med- T0: In [80], Sanh et al. developed T0, a system for easily
PaLM via med-domain finetuning and ensemble prompting mapping any natural language tasks into a human-readable
prompted form. They converted a large set of supervised
datasets, each with multiple prompts with diverse wording.
These prompted datasets allow for benchmarking the ability
of a model to perform completely held-out tasks. Then, a
T0 encoder-decoder model is developed to consume textual
inputs and produces target responses. The model is trained on
a multitask mixture of NLP datasets partitioned into different
tasks.
ERNIE 3.0: In [81], Sun et al. proposed a unified frame-
work named ERNIE 3.0 for pre-training large-scale knowledge Fig. 18: Retro architecture. Left: simplified version where a
enhanced models. It fuses auto-regressive network and auto- sequence of length n = 12 is split into l = 3 chunks of size
encoding network, so that the trained model can be easily tai- m = 4. For each chunk, we retrieve k = 2 neighbours of r =
lored for both natural language understanding and generation 5 tokens each. The retrieval pathway is shown on top. Right:
tasks using zero-shot learning, few-shot learning or fine-tuning. Details of the interactions in the CCA operator. Causality is
They have trained ERNIE 3.0 with 10 billion parameters maintained as neighbours of the first chunk only affect the last
on a 4TB corpus consisting of plain texts and a large-scale token of the first chunk and tokens from the second chunk.
knowledge graph. Fig 17 illustrates the model architecture of Courtesy of [82].
Ernie 3.0.

Fig. 17: High-level model architecture of ERNIE 3.0. Courtesy

of [81].

Fig. 19: GLaM model architecture. Each MoE layer (the

RETRO: In [82], Borgeaud et al. enhanced auto-regressive bottom block) is interleaved with a Transformer layer (the
language models by conditioning on document chunks re- upper block). Courtesy of [84].
trieved from a large corpus, based on local similarity with pre-
ceding tokens. Using a 2-trillion-token database, the Retrieval-
Enhanced Transformer (Retro) obtains comparable perfor-
pre-trained on 1.56T words of public dialog data and web text.
mance to GPT-3 and Jurassic-1 [83] on the Pile, despite using
They showed that fine-tuning with annotated data and enabling
25% fewer parameters. As shown in Fig 18, Retro combines
the model to consult external knowledge sources can lead to
a frozen Bert retriever, a differentiable encoder and a chunked
significant improvements towards the two key challenges of
cross-attention mechanism to predict tokens based on an order
safety and factual grounding.
of magnitude more data than what is typically consumed
during training. OPT: In [86], Zhang et al. presented Open Pre-trained
GLaM: In [84], Du et al. proposed a family of LLMs Transformers (OPT), a suite of decoder-only pre-trained trans-
named GLaM (Generalist Language Model), which use a formers ranging from 125M to 175B parameters, which they
sparsely activated mixture-of-experts architecture to scale the share with researchers. The OPT models’ parameters are
model capacity while also incurring substantially less training shown in 20
cost compared to dense variants. The largest GLaM has 1.2
Chinchilla: In [2], Hoffmann et al. investigated the optimal
trillion parameters, which is approximately 7x larger than GPT-
model size and number of tokens for training a transformer
3. It consumes only 1/3 of the energy used to train GPT-3 and
language model under a given compute budget. By training
requires half of the computation flops for inference, while still
over 400 language models ranging from 70 million to over
achieving better overall zero, one and few-shot performance
16 billion parameters on 5 to 500 billion tokens, they found
across 29 NLP tasks. Fig 19 shows the high-level architecture
that for compute-optimal training, the model size and the
of GLAM.
number of training tokens should be scaled equally: for every
LaMDA: In [85], Thoppilan et al. presented LaMDA, a doubling of model size the number of training tokens should
family of Transformer-based neural language models special- also be doubled. They tested this hypothesis by training a
ized for dialog, which have up to 137B parameters and are predicted compute-optimal model, Chinchilla, that uses the
Fig. 20: Different OPT Models’ architecture details. Courtesy
of [86].
Fig. 21: Sparrow pipeline relies on human participation to
continually expand a training set. Courtesy of [90].
same compute budget as Gopher but with 70B parameters and
4% more more data.
Galactica: In [87], Taylor et al. introduced Galactica, a unified perspective for self-supervision in NLP and show how
large language model that can store, combine and reason about different pre-training objectives can be cast as one another
scientific knowledge. They trained on a large scientific corpus and how interpolating between different objectives can be
of papers, reference material, knowledge bases and many other effective. They proposed Mixture-of-Denoisers (MoD), a pre-
sources. Galactica performed well on reasoning, outperforming training objective that combines diverse pre-training paradigms
Chinchilla on mathematical MMLU by 41.3% to 35.7%, and together. This framework is known as Unifying Language
PaLM 540B on MATH with a score of 20.4% versus 8.8%. Learning (UL2). An overview of UL2 pretraining paradigm
is shown in Fig 21.
CodeGen: In [88], Nijkamp et al. trained and released
a family of large language models up to 16.1B parameters,
called CODEGEN, on natural language and programming
language data, and open sourced the training library JAX-
FORMER. They showed the utility of the trained model by
demonstrating that it is competitive with the previous state-of-
the-art on zero-shot Python code generation on HumanEval.
They further investigated the multi-step paradigm for program
synthesis, where a single program is factorized into multi-
ple prompts specifying sub-problems. They also constructed
an open benchmark, Multi-Turn Programming Benchmark
(MTPB), consisting of 115 diverse problem sets that are
factorized into multi-turn prompts.
AlexaTM: In [89], Soltan et al. demonstrated that mul- Fig. 22: An overview of UL2 pretraining paradigm. Courtesy
tilingual large-scale sequence-to-sequence (seq2seq) models, of [92].
pre-trained on a mixture of denoising and Causal Language
Modeling (CLM) tasks, are more efficient few-shot learners
than decoder-only models on various task. They trained a
20 billion parameter multilingual seq2seq model called Alexa BLOOM: In [93], Scao et al. presented BLOOM, a 176B-
Teacher Model (AlexaTM 20B) and showed that it achieves parameter open-access language model designed and built
state-of-the-art (SOTA) performance on 1-shot summarization thanks to a collaboration of hundreds of researchers. BLOOM
tasks, outperforming a much larger 540B PaLM decoder is a decoder-only Transformer language model trained on the
model. AlexaTM consist of 46 encoder layers, 32 decoder ROOTS corpus, a dataset comprising hundreds of sources in
layers, 32 attention heads, and dmodel = 4096. 46 natural and 13 programming languages (59 in total). An
Sparrow: In [90], Glaese et al. presented Sparrow, an overview of BLOOM architecture is shown in Fig 23.
information-seeking dialogue agent trained to be more helpful, GLM: In [94], Zeng et al. introduced GLM-130B, a
correct, and harmless compared to prompted language model bilingual (English and Chinese) pre-trained language model
baselines. They used reinforcement learning from human feed- with 130 billion parameters. It was an attempt to open-source
back to train their models with two new additions to help a 100B-scale model at least as good as GPT-3 (davinci) and
human raters judge agent behaviour. The high-level pipeline unveil how models of such a scale can be successfully pre-
of Sparrow model is shown in Fig 21. trained.
Minerva: In [91], Lewkowycz et al. introduced Minerva,
Pythia: In [95], Biderman et al. introduced Pythia, a suite
a large language model pretrained on general natural language
of 16 LLMs all trained on public data seen in the exact same
data and further trained on technical content, to tackle previous
order and ranging in size from 70M to 12B parameters. We
LLM struggle with quantitative reasoning (such as solving
provide public access to 154 checkpoints for each one of the
mathematics, science, and engineering problems).
16 models, alongside tools to download and reconstruct their
MoD: In [92], Tay et al. presented a generalized and exact training dataloaders for further study.
Palmyra [106], Camel [107], Yalm [108], MPT [109], ORCA-
2 [110], Gorilla [67], PAL [111], Claude [112], CodeGen 2
[113], Zephyr [114], Grok [115], Qwen [116], Mamba [30],
Mixtral-8x7B [117], DocLLM [118], DeepSeek-Coder [119],
FuseLLM-7B [120], TinyLlama-1.1B [121], LLaMA-Pro-8B
[122].
Fig 24 provides an overview of some of the most repre-
sentative LLM frameworks, and the relevant works that have
contributed to the success of LLMs and helped to push the
limits of LLMs.
Fig. 23: An overview of BLOOM architecture. Courtesy of
[93]. III. H OW LLM S A RE B UILT
In this section, we first review the popular architectures
used for LLMs, and then discuss data and modeling techniques
Orca: In [96], Mukherjee et al. develop Orca, a 13-billion ranging from data preparation, tokenization, to pre-training,
parameter model that learns to imitate the reasoning process instruction tuning, and alignment.
of large foundation models. Orca learns from rich signals Once the model architecture is chosen, the major steps
from GPT-4 including explanation traces; step-by-step thought involved in training an LLM includes: data preparation (col-
processes; and other complex instructions, guided by teacher lection, cleaning, deduping, etc.), tokenization, model pre-
assistance from ChatGPT. training (in a self-supervised learning fashion), instruction
StarCoder: In [97], Li et al. introduced StarCoder and tuning, and alignment. We will explain each of them in a
StarCoderBase. They are 15.5B parameter models with 8K separate subsection below. These steps are also illustrated in
context length, infilling capabilities and fast large-batch in- Fig 25.
ference enabled by multi-query attention. StarCoderBase is
trained on one trillion tokens sourced from The Stack, a A. Dominant LLM Architectures
large collection of permissively licensed GitHub repositories
with inspection tools and an opt-out process. They fine-tuned The most widely used LLM architectures are encoder-only,
StarCoderBase on 35B Python tokens, resulting in the creation decoder-only, and encoder-decoder. Most of them are based on
of StarCoder. They performed the most comprehensive evalu- Transformer (as the building block). Therefore we also review
ation of Code LLMs to date and showed that StarCoderBase the Transformer architecture here.
outperforms every open Code LLM that supports multiple pro- 1) Transformer: in a ground-breaking work [44], Vaswani
gramming languages and matches or outperforms the OpenAI et al. proposed the Transformer framework, which was orig-
code-cushman-001 model. inally designed for effective parallel computing using GPUs.
KOSMOS: In [98], Huang et al. introduced KOSMOS-1, The heart of Transformer is the (self-)attention mechanism,
a Multimodal Large Language Model (MLLM) that can per- which can capture long-term contextual information much
ceive general modalities, learn in context (i.e., few-shot), and more effectively using GPUs than the recurrence and convo-
follow instructions (i.e. zero-shot). Specifically, they trained lution mechanisms. Fig 26 provides a high-level overview of
KOSMOS-1 from scratch on web-scale multi-modal corpora, transformer work. In this section we provide an overview of the
including arbitrarily interleaved text and images, image-caption main elements and variants, see [44], [123] for more details.
pairs, and text data. Experimental results show that KOSMOS- The Transformer language model architecture, originally
1 achieves impressive performance on (i) language understand- proposed for machine translation, consists of an encoder and
ing, generation, and even OCR-free NLP (directly fed with a decoder. The encoder is composed of a stack of N = 6
document images), (ii) perception-language tasks, including identical Transformer layers. Each layer has two sub-layers.
multimodal dialogue, image captioning, visual question an- The first one is a multi-head self-attention layer, and the other
swering, and (iii) vision tasks, such as image recognition with one is a simple position-wise fully connected feed-forward
descriptions (specifying classification via text instructions). network. The decoder is composed of a stack of 6 identical
Gemini: In [99], Gemini team introduced a new family of layers. In addition to the two sub-layers in each encoder layer,
multimodal models, that exhibit promising capabilities across the decoder has a third sub-layer, which performs multi-head
image, audio, video, and text understanding. Gemini family attention over the output of the encoder stack. The attention
includes three versions: Ultra for highly-complex tasks, Pro function can be described as mapping a query and a set of key-
for enhanced performance and deployability at scale, and Nano value pairs to an output, where the query, keys, values, and
for on-device applications. Gemini architecture is built on top output are all vectors. The output is computed as a weighted
of Transformer decoders, and is trained to support 32k context sum of the values, where the weight assigned to each value
length (via using efficient attention mechanisms). is computed by a compatibility function of the query with the
corresponding key. Instead of performing a single attention
Some of the other popular LLM frameworks (or techniques function with dmodel dimensional keys, values and queries,
used for efficient developments of LLMs) includes Inner- it is found to be beneficial to linearly project the queries,
Monologue [100], Megatron-Turing NLG [101], LongFormer keys and values h with different, learned linear projections to
[102], OPT-IML [103], MeTaLM [104], Dromedary [105], dk , dk and dv dimensions, respectively. Positional encoding is
Fig. 24: Timeline of some of the most representative LLM frameworks (so far). In addition to large language models with our
#parameters threshold, we included a few representative works, which pushed the limits of language models, and paved the way
for their success (e.g. vanilla Transformer, BERT, GPT-1), as well as some small language models. ♣ shows entities that serve
not only as models but also as approaches. ♦ shows only approaches.

incorporated to fuse information about the relative or absolute (that can contain several words) with a single mask special
position of the tokens in the sequence. word, and the objective is then to predict the text that this
mask word replaces. Encoder-decoder models are best suited
2) Encoder-Only: For this family, at each stage, the atten-
for tasks about generating new sentences conditioned on a
tion layers can access all the words in the initial sentence.
given input, such as summarization, translation, or generative
The pre-training of these models usually consist of some-
question answering.
how corrupting a given sentence (for instance, by masking
random words in it) and tasking the model with finding or
reconstructing the initial sentence. Encoder models are great B. Data Cleaning
for tasks requiring an understanding of the full sequence,
such as sentence classification, named entity recognition, and Data quality is crucial to the performance of language
extractive question answering. One prominent encoder only models trained on them. Data cleaning techniques such as
model is BERT (Bidirectional Encoder Representations from filtering, deduplication, are shown to have a big impact on
Transformers), proposed in [24]. the model performance.
3) Decoder-Only: For these models, at each stage, for any As an example, in Falcon40B [124], Penedo et al. showed
word, the attention layers can only access the words positioned that properly filtered and deduplicated web data alone can lead
before that in the sentence. These models are also sometimes to powerful models; even significantly outperforming models
called auto-regressive models. The pretraining of these models from the state-of-the-art trained on The Pile. Despite extensive
is usually formulated as predicting the next word (or token) filtering, they were able to obtain five trillion tokens from
in the sequence. The decoder-only models are best suited for CommonCrawl. They also released an extract of 600 billion
tasks involving text generation. GPT models are prominent tokens from our REFINEDWEB dataset, and 1.3/7.5B param-
example of this model category. eters language models trained on it. 27 shows the Refinement
4) Encoder-Decoder: These models use both encoder and process of CommonCrawl data by this work.
decoder, and are sometimes called sequence-to-sequence mod- 1) Data Filtering: Data filtering aims to enhance the qual-
els. At each stage, the attention layers of the encoder can access ity of training data and the effectiveness of the trained LLMs.
all the words in the initial sentence, whereas the attention Common data filtering techniques include:
layers of the decoder only accesses the words positioned before
a given word in the input. These models are usually pre- Removing Noise: refers to eliminating irrelevant or noisy
trained using the objectives of encoder or decoder models, but data that might impact the model’s ability to generalize well.
usually involve something a bit more complex. For instance, As an example, one can think of removing false information
some models are pretrained by replacing random spans of text from the training data, to lower the chance of model generating
How LLMs Are Built?

Data Filtering
Removing Noise
Handling Outliers
Data Cleaning
Addressing Imbalances
Text Preprocessing
Deduplication

BytePairEncoding
Tokenizations WordPieceEncoding
SentencePieceEncoding

Absolute Positional Embeddings

Relative Positional Embeddings
Positional Encoding
Rotary Position Embeddings
Relative Positional Bias

Encoder-Only
Decoder-Only
LLM Architectures
Encoder-Decoder
...

Masked Language Modeling

Causal Language Modeling
Model Pre-training
Next Sentence Prediction
Mixture of Experts

Supervised Fine-tuning
General Fine-tuning
Fine-tuning and Instruction Tuning
Multi-turn Instructions
Instruction Following

Supervised learning
Reinforcement Learning from Human Feedback
Alignment
Direct Preference Optimization
Kahneman-Tversky Optimization

Greedy Search
Decoding Strategies Beam Search
Top-k Sampling
Top-p Sampling

Optimized Training
Zero Redundancy Optimizer
Receptance Weighted Key Value Cost-Effective Training/Inference,
Low-Rank Adaption Adaptation & Compression
Knowledge Distillation
Quantization

Fig. 25: This figure shows different components of LLMs.

2) Deduplication: De-duplication refers to the process of
removing duplicate instances or repeated occurrences of the
same data in a dataset. Duplicate data points can introduce
biases in the model training process and reduce the diversity, as
the model may learn from the same examples multiple times,
potentially leading to overfitting on those particular instances.
Some works [125] have shown that de-duplication improves
models’ ability to generalize to new, unseen data.
The de-duplication process is particularly important when
dealing with large datasets, as duplicates can unintentionally
inflate the importance of certain patterns or characteristics.
This is especially relevant in NLP tasks, where diverse and
representative training data is crucial for building robust lan-
guage models.
The specific de-duplication method can vary based on
the nature of the data and the requirements of the particular
language model being trained. It may involve comparing entire
data points or specific features to identify and eliminate du-
plicates. At the document level, existing works mainly rely on
the overlap ratio of high-level features (e.g. n-grams overlap)
Fig. 26: High-level overview of transformer work. Courtesy of between documents to detect duplicate samples.
[44].
C. Tokenizations
Tokenization referes to the process of converting a se-
quence of text into smaller parts, known as tokens. While
the simplest tokenization tool simply chops text into tokens
based on white space, most tokenization tools rely on a word
dictionary. However, out-of-vocabulary (OOV) is a problem
in this case because the tokenizer only knows words in its
dictionary. To increase the coverage of dictionaries, popular
tokenizers used for LLMs are based on sub-words, which can
be combined to form a large number of words, including the
words unseen in training data or words in different languages.
In what follows, we describe three popular tokenizers.
1) BytePairEncoding: BytePairEncoding is originally a
Fig. 27: Subsequent stages of Macrodata Refinement remove type of data compression algorithm that uses frequent patterns
nearly 90% of the documents originally in CommonCrawl. at byte level to compress the data. By definition, this algorithm
Courtesy of [124]. mainly tries to keep the frequent words in their original form
and break down ones that are not common. This simple
paradigm keeps the vocabulary not very large, but also good
enough to represent common words at the same time. Also
false responses. Two mainstream approaches for quality filter- morphological forms of the frequent words can be represented
ing includes: classifier-based, and heuristic-based frameworks. very well if suffix or prefix is also commonly presented in the
Handling Outliers: Identifying and handling outliers or training data of the algorithm.
anomalies in the data to prevent them from disproportionately 2) WordPieceEncoding: This algorithm is mainly used for
influencing the model. very well-known models such as BERT and Electra. At the
Addressing Imbalances: Balancing the distribution of beginning of training, the algorithm takes all the alphabet from
classes or categories in the dataset to avoid biases and ensure the training data to make sure that nothing will be left as UNK
fair representation. This is specially useful for responsible or unknown from the training dataset. This case happens when
model training and evaluation. the model is given an input that can not be tokenized by the
tokenizer. It mostly happens in cases where some characters are
Text Preprocessing: Cleaning and standardizing text data not tokenizable by it. Similar to BytePairEncoding, it tries to
by removing stop words, punctuation, or other elements that maximize the likelihood of putting all the tokens in vocabulary
may not contribute significantly to the model’s learning. based on their frequency.
Dealing with Ambiguities: Resolving or excluding am- 3) SentencePieceEncoding: Although both tokenizers de-
biguous or contradictory data that might confuse the model scribed before are strong and have many advantages compared
during training. This can help the model to provide more to white-space tokenization, they still take assumption of
definite and reliable answers. words being always separated by white-space as granted. This
assumption is not always true, in fact in some languages, words approaches used for pre-training like next sentence prediction
can be corrupted by many noisy elements such as unwanted [24], two most common ones include, next token prediction
spaces or even invented words. SentencePieceEncoding tries (autoregressive language modeling), and masked language
to address this issue. modeling.
In Autoregressive Language Modeling framework, given
D. Positional Encoding a sequence of n tokens x1 , ..., xn , the model tries to predict
1) Absolute Positional Embeddings: (APE) [44] has been next token xn+1 (and sometimes next sequence of tokens) in
used in the original Transformer model to preserve the infor- an auto-regressive fashion. One popular loss function in this
mation of sequence order. Therefore, the positional information case is the log-likelihood of predicted tokens as shown in Eq
of words is added to the input embeddings at the bottom of 2
N
both the encoder and decoder stacks. There are various options X
for positional encodings, either learned or fixed. In the vanilla LALM (x) = p(xi+n |xi , ..., xi+n−1 ) (1)
Transformer, sine and cosine functions are employed for this i=1
purpose. The main drawback of using APE in Transformers Given the auto-regressive nature of this framework, the
is the restriction to a certain number of tokens. Additionally, decoder-only models are naturally better suited to learn how
APE fails to account for the relative distances between tokens. to accomplish these task.
2) Relative Positional Embeddings: (RPE) [126] involves In Masked Language Modeling, some words are masked
extending self-attention to take into account the pairwise links in a sequence and the model is trained to predict the masked
between input elements. RPE is added to the model at two words based on the surrounding context. Sometimes people
levels: first as an additional component to the keys, and refer to this approach as denoising autoencoding, too. If we
subsequently as a sub-component of the values matrix. This denote the masked/corrupted samples in the sequence x, as x̃,
approach looks at the input as a fully-connected graph with then the training objective of this approach can be written as:
labels and directed edges. In the case of linear sequences, edges
N
can capture information about the relative position differences X
between input elements. A clipping distance, represented as k LM LM (x) = p(x̃|x\x̃) (2)
2 ≤ k ≤ n − 4, specifies the maximum limit on relative lo- i=1
cations. This allows the model to make reasonable predictions
for sequence lengths that are not part of the training data. And more recently, Mixture of Experts (MoE) [130],
[131] have become very popular in LLM space too. MoEs
3) Rotary Position Embeddings: Rotary Positional Em- enable models to be pre-trained with much less compute,
bedding (RoPE) [127] tackles problems with existing ap- which means one can dramatically scale up the model or
proaches. Learned absolute positional encodings can lack gen- dataset size with the same compute budget as a dense model.
eralizability and meaningfulness, particularly when sentences MoE consists of two main elements: Sparse MoE layers,
are short. Moreover, current methods like T5’s positional which are used instead of dense feed-forward network (FFN)
embedding face challenges with constructing a full attention layers, and have a certain number of “experts” (e.g. 8), in
matrix between positions. RoPE uses a rotation matrix to which each expert is a neural network. In practice, the experts
encode the absolute position of words and simultaneously in- are FFNs, but they can also be more complex networks. A gate
cludes explicit relative position details in self-attention. RoPE network or router, that determines which tokens are sent to
brings useful features like flexibility with sentence lengths, a which expert. It is worth noting that, one can send a token
decrease in word dependency as relative distances increase, to more than one expert. How to route a token to an expert
and the ability to improve linear self-attention with relative is one of the big decisions when working with MoEs - the
position encoding. GPT-NeoX-20B, PaLM, CODEGEN, and router is composed of learned parameters and is pretrained at
LLaMA are among models that take advantage of RoPE in the same time as the rest of the network. Fig 29 provides an
their architectures. illustration of a Switch Transformer encoder block, which are
4) Relative Positional Bias: The concept behind this type used in MoE.
of positional embedding is to facilitate extrapolation during
inference for sequences longer than those encountered in train- F. Fine-tuning and Instruction Tuning
ing. In [128] Press et al. proposed Attention with Linear Biases
(ALiBi). Instead of simply adding positional embeddings to Early language models such as BERT trained using self-
word embeddings, they introduced a bias to the attention scores supervision as explained in section III-E were not able to
of query-key pairs, imposing a penalty proportional to their perform specific tasks. In order for the foundation model to be
distance. In the BLOOM model, ALiBi is leveraged. useful it needed to be fine-tuned to a specific task with labeled
data (so-called supervised fine-tuning or SFT for short). For
E. Model Pre-training example, in the original BERT paper [24], the model was fine-
tuned to 11 different tasks. While more recent LLMs no longer
Pre-training is the very first step in large language model require fine-tuning to be used, they can still benefit from task
training pipeline, and it helps LLMs to acquire fundamental or data-specific fine-tuning. For example, OpenAI reports that
language understanding capabilities, which can be useful in a the much smaller GPT-3.5 Turbo model can outperform GPT-4
wide range of language related tasks. During pre-training, the when fine-tuned with task specific data 2 .
LLM is trained on a massive amount of (usually) unlabeled
texts, usually in a self-supervised manner. There are different 2 https://platform.openai.com/docs/guides/fine-tuning
(a) Absolute Positional Embeddings [129] (b) Relative Positional Embeddings

Fig. 28: Various positional encodings are employed in LLMs.

tuning [133]. We dive into the details of how to design

and engineer prompts in section IV-B, but in the context
of instruction tuning, it is important to understand that the
instruction is a prompt that specifies the task that the LLM
should accomplish. Instruction tuning datasets such as Natural
Instructions [134] include not only the task definition but other
components such as positive/negative examples or things to
avoid.
The specific approach and instruction datasets used to
instruction-tune an LLM varies, but, generally speaking, in-
struction tuned models outperform their original foundation
models they are based on. For example, InstructGPT [59]
Fig. 29: : Illustration of a Switch Transformer encoder block. outperforms GPT-3 on most benchmarks. The same is true
They replaced the dense feed forward network (FFN) layer for Alpaca [62] when compared to LLaMA.
present in the Transformer with a sparse Switch FFN layer
(light blue). . Courtesy of [131]. Self-Instruct [135], proposed by Wang et al. is also a
popular approach along this line, in which they introduced a
framework for improving the instruction-following capabilities
of pre-trained language models by bootstrapping their own
Fine-tuning does not need to be performed to a single generations. Their pipeline generates instructions, input, and
task though, and there are different approaches to multi-task output samples from a language model, then filters invalid or
fine-tuning (see e.g. Mahabi et al. [132]). Fine-tuning to one similar ones before using them to fine tune the original model.
or more tasks is known to improve results and reduce the
complexity of prompt engineering, and it can serve as an G. Alignment
alternative to retrieval augmented generation. Furthermore,
there are other reasons why it might be advisable to fine-tune. AI Alignment is the process of steering AI systems towards
For example, one might want to fine-tune to expose the model human goals, preferences, and principles. LLMs, pre-trained
to new or proprietary data that it has not been exposed to for word prediction, often exhibit unintended behaviors. For
during pre-training. example, they might generate contents that are toxic, harmful,
misleading and biased.
An important reason to fine-tune LLMs is to align the
responses to the expectations humans will have when providing Instruction tuning, discussed above, gets LLMs a step
instructions through prompts. This is the so-called instruction closer to being aligned. However, in many cases, it is important
to include further steps to improve the alignment of the model the kind of data it needs is far more abundant. As an example,
and avoid unintended behaviors 3 . We review the most popular every retail company has a lot of customer interaction data and
approaches to alignment in this subsection. whether that interaction was successful (e.g., purchase made)
or unsuccessful (e.g., no purchase made). However, They have
RLHF (reinforcement learning from human feedback) and little to no counterfactual data (i.e., what would have made
RLAIF (reinforcement learning from AI feedback) are two an unsuccessful customer interaction yl into a successful one
popular approaches. RLHF uses a reward model to learn yw ). Fig 31 shows a high-level comparison between KTO and
alignment from human feedback. This reward model, after other alignment approaches discussed above.
being tuned, is able to rate different outputs and score them
according to their alignment preferences given by humans. The
reward model gives feedback to the original LLM and this
feedback is used to tune the LLM further [137]. Reinforcement
learning from AI feedback on the other hand, directly connects
a pretrained and well-aligned model to the LLM and helps it
to learn from larger and more aligned models [138].
In another recent work (known as DPO) [139], Rafailov
et al. discussed that RLHF is a complex and often unstable
procedure, and tried to address this with a new approach. They
leveraged a mapping between reward functions and optimal
policies to show that this constrained reward maximization Fig. 31: LLM alignment involves supervised finetuning fol-
problem can be optimized exactly with a single stage of policy lowed by optimizing a human-centered loss (HALO). How-
training, essentially solving a classification problem on the ever, the paired preferences that existing approaches need are
human preference data. The resulting algorithm, which they hard-to-obtain. In contrast, KTO uses a far more abundant
called Direct Preference Optimization (DPO), is stable, per- kind of data, making it much easier to use in the real world.
formant, and computationally lightweight, eliminating the need Courtesy of [136].
for fitting a reward model, sampling from the LM during fine-
tuning, or performing significant hyperparameter tuning. They
observed that fine-tuning with DPO exceeds RLHF’s ability to
control sentiment of generations and improves response quality
in summarization. Fig 30 shows the high-level comparison H. Decoding Strategies
between DPO vs RLHF. Decoding refers to the process of text generation using pre-
trained LLMs. Given an input prompt, the tokenizer translates
each token in the input text into a corresponding token ID.
Then, the language model uses these token IDs as input and
predicts the next most likely token (or a sequence of tokens).
Finally, the model generates logits, which are converted to
probabilities using a softmax function. Different decoding
Fig. 30: DPO optimizes for human preferences while avoiding strategies have been proposed. Some of the most popular ones
reinforcement learning. Existing methods for fine-tuning lan- are greedy search, beam search, as well as different sample
guage models with human feedback first fit a reward model techniques such as top-K, top-P (Nucleus sampling).
to a dataset of prompts and human preferences over pairs of
1) Greedy Search: Greedy search takes the most probable
responses, and then use RL to find a policy that maximizes
token at each step as the next token in the sequence, discarding
the learned reward. In contrast, DPO directly optimizes for
all other potential options. As you can imagine, this is a simple
the policy best satisfying the preferences with a simple classi-
approach and can loose a lot of temporal consistency and
fication objective, without an explicit reward function or RL.
coherency. It only considers the most probable token at each
Courtesy of [139].
step, without considering the overall effect on the sequence.
This property makes it fast, but it also means that it can miss
out on better sequences that might have appeared with slightly
Even more recently Ethayarajh et al. proposed a new align- less probable next tokens.
ment approach called the Kahneman-Tversky Optimization
2) Beam Search: Unlike greedy search that only considers
(KTO) [136]. Unlike existing state-of-the-art approaches, KTO
the next most probable token, beam search takes into account
does not require paired preference data (x, yw , yl ), and it
the N most likely tokens, where N denotes the number of
only needs (x,y) and knowledge of whether y is desirable or
beams. This procedure is repeated until a predefined maxi-
undesirable. KTO-aligned models are shown to be good or
mum sequence length is reached or an end-of-sequence token
better than DPO-aligned models at scales from 1B to 30B,
appears. At this point, the sequence of tokens (AKA “beam”)
despite not using paired preferences. KTO is also far easier to
with the highest overall score is chosen as the output. For
use in the real world than preference optimization methods, as
example for beam size of 2 and maximum length of 5,
3 According to very recent research by Ethayarajh et al. [136], further the beam search needs to keep track of 25 = 32 possible
alignment besides SFT mainly improves models of at least 7B parameters. sequences. So it is more computationally intensive than greedy
For smaller models, SFT is sufficient. search.
3) Top-k Sampling: Top-k sampling is a technique that increasing the model size that can be efficiently trained. ZeRO
uses the probability distribution generated by the language eliminates memory redundancies in data- and model-parallel
model to select a token randomly from the k most likely training while retaining low communication volume and high
options. computational granularity, allowing one to scale the model
size proportional to the number of devices with sustained high
Suppose we have 6 tokens (A, B, C, D, E, F) and k=2, efficiency.
and P(A)= 30%, and P(B)= 20%, P(C)= P(D)= P(E)= P(F)=
12.5%. In top-k sampling, tokens C, D, E, F are disregarded, RWKV: In [141], Peng et al. proposed a novel model
and the model outputs A 60% of the time, and B, 40% of architecture, Receptance Weighted Key Value (RWKV), that
the time. This approach ensures that we prioritize the most combines the efficient parallelizable training of Transformers
probable tokens while introducing an element of randomness with the efficient inference of RNNs. Their approach leverages
in the selection process. a linear attention mechanism and allows them to formulate the
The randomness is usually introduced via the concept of model as either a Transformer or an RNN, which parallelizes
temperature. The temperature T is a parameter that ranges from computations during training and maintains constant compu-
0 to 1, which affects the probabilities generated by the softmax tational and memory complexity during inference, leading to
function, making the most likely tokens more influential. In the first non-transformer architecture to be scaled to tens of
practice, it simply consists of dividing the input logits by billions of parameters. RWKV architecture is shown in Fig
temperature value: 32. The Time Complexity comparison of RWKV with different

exi /T
sof tmax(xi ) = P x /T (3)
je
j

A low temperature setting significantly alters the proba-

bility distribution (and is commonly used in text generation
to control the level of “creativity” in the generated output),
while a large temperature prioritizes the tokens with higher
probabilities. Top-k is a creative way of sampling, and can be
used along with beam search. The sequence chosen by top-
k sampling may not be the sequence with highest probability
in beam search. But it’s important to remember that highest
scores do not always lead to more realistic or meaningful
sequences.
4) Top-p Sampling: Top-p sampling, also known as Nu-
cleus sampling, takes a slightly different approach from top-k
sampling. Instead of selecting the top k most probable tokens,
nucleus sampling chooses a cutoff value p such that the sum of
the probabilities of the selected tokens exceeds p. This forms
a “nucleus” of tokens from which to randomly choose the next
token. In other words, in top-p sampling the language model
examines the most probable tokens in descending order and
keeps adding them to the list until the sum of probabilities Fig. 32: RWKV architecture. Courtesy of [141].
surpasses the threshold p. As you can imagine, this could be
better specially for scenarios in which top-k tokens do not have
a large probability mass. Unlike top-k sampling, the number Transformers are provided in Fig 33.
of tokens included in the nucleus sampling is not fixed. This
variability often results in a more diverse and creative output,
making nucleus sampling popular for text generation related
tasks.

I. Cost-Effective Training/Inference/Adaptation/Compression
In this part, we review some of the popular approaches
used for more cost-friendly (and compute-friendly) training
and usage of LLMs.
1) Optimized Training: There are many frameworks de- Fig. 33: Time Complexity comparison of RWKV with different
veloped for optimized training of LLMs, here we introduce Transformers. Here T denotes the sequence length, d the
some of the prominent ones. feature dimension, and c is MEGA’s chunk size of quadratic
ZeRO: In [140], Rajbhandari et al. developed a novel attention. Courtesy of [141].
solution, Zero Redundancy Optimizer (ZeRO), to optimize
memory, vastly improving training speed of LLMs while
It is also referred to as an approach to distill the knowledge of
not a single model but in fact multiple models into a smaller
one. Creating smaller models by this approach yields smaller
model sizes that can be used even on edge devices. Knowledge
distillation as shown in Fig 35, illustrates a general setup of
this training scheme.

Fig. 34: An illustration of LoRA reparametrizan. Only A and

B trained during this process. Courtesy of [142].

2) Low-Rank Adaption (LoRA): Low-Rank Adaptation is

a popular and lightweight training technique that significantly Fig. 35: A generic knowledge distillation framework with
reduces the number of trainable parameters, and is based student and teacher (Courtesy of [144]).
on a crucial insight that the difference between the fine-
tuned weights for a specialized task and the initial pre-trained
weights often exhibits “low intrinsic rank” - meaning that Knowledge can be transferred by different forms of learn-
it can be approximated well by a low rank matrix [142]. ing: response distillation, feature distillation, and API distilla-
Training with LoRA is much faster, memory-efficient, and tion. Response distillation is concerned only with the outputs
produces smaller model weights (a few hundred MBs), that are of the teacher model and tries to teach the student model
easier to store and share. One property of low-rank matrices how to exactly or at least similarly perform (in the sense of
is that they can be represented as the product of two smaller prediction) as the teacher. Feature distillation not only uses
matrices. This realization leads to the hypothesis that this delta the last layer but also intermediate layers as well to create a
between fine-tuned weights and initial pre-trained weights can better inner representation for the student model. This helps the
be represented as the matrix product of two much smaller smaller model to have a similar representation as the teacher
matrices. By focusing on updating these two smaller matrices model.
rather than the entire original weight matrix, computational
efficiency can be substantially improved. API distillation is the process of using an API (typically
from an LLM provider such as OpenAI) to train smaller
Specifically, for a pre-trained weight matrix W0 ∈ Rd×k , models. In the case of LLMs, it is used to train the model
LoRA constrains its update by representing the latter with from the direct output of the larger model which makes it very
a low-rank decomposition W0 + ∆W = W0 + BA, where similar to response distillation. Many concerns are raised by
B ∈ Rd×r , A ∈ Rr×k , and the rank r ≪ min(d, k). During this type of distillation because in cases where the model itself
training, W0 is frozen and does not receive gradient updates, is not openly available, a (usually) paid API is exposed for end
while A and B contain trainable parameters. It is worth users. On the other hand, while users pay for each call, how to
mentioning that both W0 and ∆W = BA are multiplied with use the predictions is limited, for example, OpenAI prohibits
the same input, and their respective output vectors are summed usage of its API to create LLMs that later will be used to
coordinate-wise. For h = W0 x, their modified forward pass compete with it. The main value in such case is training data.
yields: h = W0 x + ∆W x = W0 x + BAx. Usually a random
Gaussian initialization is used for A, and zero initialization 4) Quantization: deep learning in its core, is a set of
for B, so ∆W = BA is zero at the beginning of training. mathematical functions applied to matrices, with a specific
They then scale ∆W x by αr, where α is a constant in r. This precision for model weights. Reducing the precision of the
reparametrization is illustrated in Figure 34 weights can be used to reduce the size of the model and also
make it faster. As an example, Float-32 operations compared
It is worth mentioning that LoRA can be applied to any a
to Int-8 operations are slower. This process, which is called
subset of weight matrices in a neural network to reduce the
quantization, can be applied in different phases. Main ap-
number of trainable parameters. In the Transformer architec-
proaches for model quantization can be categorized as: post
ture, there are four weight matrices in the self-attention module
training quantization and quantization-aware training. Post-
(Wq , Wk , Wv , Wo ), and two in the MLP module. Most of
training quantization is concerned with quantized trained mod-
the time, LoRA is focused on adapting the attention weights
els in two well-known methods: dynamic and static. Dynamic
only for downstream tasks, and freezes the MLP modules, so
post-training quantization computes the range of quantization
they are not trained in downstream tasks both for simplicity
on the runtime and is slower compared to static. Quantization-
and parameter-efficiency.
aware training adds quantization criteria into training, and
3) Knowledge Distillation: Knowledge distillation is the a quantized model is trained and optimized during training
process of learning from a larger model [143]. Earlier days of process. This approach ensures that the end model will have
best-performing models release have proven that this approach good performance and also does not need to be quantized after
is very useful even if it is used in an API distillation approach. training.
IV. H OW LLM S A RE U SED AND AUGMENTED psychological parlance, has been appropriated within the field
of artificial intelligence.
Once the LLMs are trained, we can use them to generate
desired outputs for a variety of tasks. LLMs can be used Hallucinations in LLMs can be broadly categorized into
directly through basic prompting. However, in order to exploit two types:
their full potential or to address some of the shortcomings,
1) Intrinsic Hallucinations: These directly conflict with
we need to augment the models through some external means.
the source material, introducing factual inaccuracies
In this section we first provide a brief overview of the main
or logical inconsistencies.
shortcoming of LLMs, with a deeper look at the issue of
2) Extrinsic Hallucinations: These, while not contra-
hallucination. We then describe how prompting and some aug-
dicting, are unverifiable against the source, encom-
mentation approaches can not only address those limitations
passing speculative or unconfirmable elements.
but also be used to augment the capabilities of LLMs going
as far as turning an LLM into a full-blown AI agent with the The definition of ’source’ in LLM contexts varies with the
ability to interface with the external world. task. In dialogue-based tasks, it refers to ’world knowledge’,
whereas in text summarization, it pertains to the input text
A. LLM limitations itself. This distinction plays a crucial role in evaluating and
interpreting hallucinations. The impact of hallucinations is also
It is important to remember that LLMs are trained to predict highly context-dependent. For instance, in creative endeavors
a token. While fine-tuning and alignment improves their per- like poem writing, hallucinations might be deemed acceptable
formance and adds different dimensions to their abilities, there or even beneficial.
are still some important limitations that come up, particularly
if they are used naively. Some of them include the following: LLMs, trained on diverse datasets including the internet,
books, and Wikipedia, generate text based on probabilistic
• They don’t have state/memory. LLMs on their own models without an inherent understanding of truth or falsity.
cannot remember even what was sent to them in the Recent advancements like instruct tuning and Reinforcement
previous prompt. That is an important limitation for Learning from Human Feedback (RLHF) have attempted to
many of the uses cases that require some form of state. steer LLMs towards more factual outputs, but the fundamental
probabilistic nature and its inherent limitations remain. A
• They are stochastic/probabilistic. If you send the same recent study, “Sources of Hallucination by Large Language
prompt to an LLM several times, you are likely to get Models on Inference Tasks” [146], highlights two key aspects
different responses. While there are parameters, and contributing to hallucinations in LLMs: the veracity prior and
in particular the temperature, to limit the variability the relative frequency heuristic, underscoring the complexities
in the response, this is an inherent property of their inherent in LLM training and output generation.
training that can create issues.
Effective automated measurement of hallucinations in
• They have stale information and, on their own, don’t LLMs requires a combination of statistical and model-based
have access to external data. An LLM on its own does metrics.
not even know about the current time or day and does
not have access to any information that was not present Statistical Metrics:
in its training set. • Metrics like ROUGE [147] and BLEU [148] are com-
• They are generally very large. This means that many mon for assessing text similarity, focusing on intrinsic
costly GPU machines are needed for training and hallucinations.
serving. In some cases, largest models have poor • Advanced metrics such as PARENT [149], PARENT-
SLAs, particularly in terms of latency. T [150], and Knowledge F1 [151] are utilized when
• They hallucinate. LLMs do not have a notion of structured knowledge sources are available. These
”truth” and they have usually been trained on a mix metrics, while effective, have limitations in capturing
of good and bad content. They can produce very syntactic and semantic nuances.
plausible but untruthful answers. Model-Based Metrics:
While the previous limitations can all become important • IE-Based Metrics: Utilize Information Extraction
for some applications, it is worth for us to dive a bit into the models to simplify knowledge into relational tuples,
last one, hallucinations, since it has gathered a lot of interest then compare these with the source.
over the past few months and it has also sparked many of the
prompt approaches and LLM augmentation methods we will • QA-Based Metrics: Assess the overlap between gen-
later describe. erated content and the source through a question-
answering framework (see [152]).
Hallucination: In the realm of Large Language Models
• NLI-Based Metrics: Use Natural Language Inference
(LLMs), the phenomenon of ”hallucinations” has garnered
datasets to evaluate the truthfulness of a generated
significant attention. Defined in the literature, notably in the
hypothesis based on a given premise (see [153]).
”Survey of Hallucination in Natural Language Generation”
paper [145], hallucination in an LLM is characterized as • Faithfulness Classification Metrics: Offer a refined
”the generation of content that is nonsensical or unfaithful assessment by creating task-specific datasets for a
to the provided source.” This terminology, although rooted in nuanced evaluation (see [154]).
How LLMs Are Used and Augmented

Statistical Metrics
Automated metrics IE-Based Metrics
Hallucination Model-Based Metrics QA-Based Metrics
Hallucination Quantification NLI-Based Metrics
Scoring
Human judgment
Comparative Analysis
A) LLM limitations

Prompt Design and Engineering

1) Chain of Thought 2) Tree of Thought 5) Expert Prompting 7) Rails 8) Automatic Prompt Engineering
Zero-Shot CoT 3) Self-Consistency 6) Chains Topical Rails Prompt Generation
Manual CoT Fact-Checking Rails Prompt Scoring
4) Reflection Jailbreaking Rails Refinement and Iteration
B) Using LLMs

Components of a RAG RAG Tools a) RAG-aware prompting techniques

Retrieval
Generation LangChain Meltano
Augmentation LlamaIndex Cohere Coral
HayStack Flowise AI
B) Augmenting LLMs through
external knowledge - RAG

a) Tool-aware prompting techniques

C) Using External Tools

Functionality of an LLM-based agent Prompt engineering techniques for agents

Tool Access and Utilization Reasoning without Observation
Decision Making Reason and Act
Dialog-Enabled Resolving Agents

D) LLM Agents

Fig. 36: How LLMs Are Used and Augmented.

Despite advances in automated metrics, human judgment providing mechanisms for user feedback.
remains a vital piece. It typically involves two methodologies:
• Data Management and Continuous Improvement.
1) Scoring: Human evaluators rate the level of halluci- Maintaining and analyzing a tracking set of hallucina-
nation within a predefined scale. tions is essential for ongoing model improvement.
2) Comparative Analysis: Evaluators compare gener-
• Prompt Engineering and Metaprompt Design. Many
ated content against baseline or ground-truth refer-
of the advanced prompt techniques described in IV-B
ences, adding an essential layer of subjective assess-
such as Retrieval Augmented Generation directly ad-
ment.
dress hallucination risks.
FactScore [155] is a recent example of a metric that can be • Model Selection and Configuration for Hallucination
used both for human and model-based evaluation. The metric Mitigation. For exemple, larger models with lower
breaks an LLM generation into “atomic facts”. The final score temperature settings usually perform better. Also,
is computed as the sum of the accuracy of each atomic fact, techniques such as RLHF or domain-sepcific fine-
giving each of them equal weight. Accuracy is a binary number tuning can mitigate hallucination risks.
that simply states whether the atomic fact is supported by the
source. The authors implement different automation strategies
that use LLMs to estimate this metric. B. Using LLMs: Prompt Design and Engineering

Finally, mitigating hallucinations in LLMs is a multifaceted A prompt in generative AI models is the textual input
challenge, requiring tailored strategies to suit various applica- provided by users to guide the model’s output. This could
tions. Those include: range from simple questions to detailed descriptions or specific
tasks. Prompts generally consist of instructions, questions,
• Product Design and User Interaction Strategies such input data, and examples. In practice, to elicit a desired
as use case design, structuring the input/output, or response from an AI model, a prompt must contain either
instructions or questions, with other elements being optional. Manual CoT is more effective than zero-shot. However,
Advanced prompts involve more complex structures, such as the effectiveness of this example-based CoT depends on the
”chain of thought” prompting, where the model is guided to choice of diverse examples, and constructing prompts with
follow a logical reasoning process to arrive at an answer. such examples of step by step reasoning by hand is hard and
error prone. That is where automatic CoT [157] comes into
Prompt engineering is a rapidly evolving discipline that play.
shapes the interactions and outputs of LLMs and other gen-
erative AI models. The essence of prompt engineering lies in 2) Tree of Thought (ToT): The Tree of Thought (ToT)
crafting the optimal prompt to achieve a specific goal with [158] prompting technique is inspired by the concept of
a generative model. This process is not only about instructing considering various alternative solutions or thought processes
the model but also involves some understanding of the model’s before converging on the most plausible one. ToT is based
capabilities and limitations, and the context within which it on the idea of branching out into multiple ”thought trees”
operates. where each branch represents a different line of reasoning.
This method allows the LLM to explore various possibilities
Prompt engineering transcends the mere construction of and hypotheses, much like human cognitive processes where
prompts; it requires a blend of domain knowledge, understand- multiple scenarios are considered before determining the most
ing of the AI model, and a methodical approach to tailor likely one.
prompts for different contexts. This might involve creating
templates that can be programmatically modified based on a A critical aspect of ToT is the evaluation of these reasoning
given dataset or context. For example, generating personalized paths. As the LLM generates different branches of thought,
responses based on user data might use a template that is each is assessed for its validity and relevance to the query.
dynamically filled with relevant user information. This process involves real-time analysis and comparison of
the branches, leading to a selection of the most coherent and
Furthermore, prompt engineering is an iterative and ex- logical outcome.
ploratory process, akin to traditional machine learning prac-
tices such as model evaluation or hyperparameter tuning. The ToT is particularly useful in complex problem-solving
rapid growth of this field suggests its potential to revolutionize scenarios where a single line of reasoning might not suffice.
certain aspects of machine learning, moving beyond traditional It allows LLMs to mimic a more human-like problem-solving
methods like feature or architecture engineering. On the other approach, considering a range of possibilities before arriving
hand, traditional engineering practices such as version con- at a conclusion. This technique enhances the model’s ability
trol and regression testing need to be adapted to this new to handle ambiguity, complexity, and nuanced tasks, making it
paradigm just like they were adapted to other machine learning a valuable tool in advanced AI applications.
approaches [156]. 3) Self-Consistency: Self-Consistency [159] utilizes an
In the following paragraphs we detail some of the most ensemble-based method, where the LLM is prompted to gen-
interesting and popular prompt engineering approaches. erate multiple responses to the same query. The consistency
among these responses serves as an indicator of their accuracy
1) Chain of Thought (CoT): The Chain of Thought (CoT) and reliability.
technique, initially described in the paper “Chain-of-Thought
The Self-Consistency approach is grounded in the principle
Prompting Elicits Reasoning in Large Language Models”[34]
that if an LLM generates multiple, similar responses to the
by Google researchers, represents a pivotal advancement in
same prompt, it is more likely that the response is accurate.
prompt engineering for Large Language Models (LLMs).
This method involves asking the LLM to tackle a query mul-
This approach hinges on the understanding that LLMs, while
tiple times, each time analyzing the response for consistency.
proficient in token prediction, are not inherently designed for
This technique is especially useful in scenarios where factual
explicit reasoning. CoT addresses this by guiding the model
accuracy and precision are paramount.
through essential reasoning steps.
The consistency of responses can be measured using vari-
CoT is based on making the implicit reasoning process of ous methods. One common approach is to analyze the overlap
LLMs explicit. By outlining the steps required for reasoning, in the content of the responses. Other methods may include
the model is directed closer to a logical and reasoned output, comparing the semantic similarity of responses or employing
especially in scenarios demanding more than simple informa- more sophisticated techniques like BERT-scores or n-gram
tion retrieval or pattern recognition. overlaps. These measures help in quantifying the level of
CoT prompting manifests in two primary forms: agreement among the responses generated by the LLM.
Self-Consistency has significant applications in fields
1) Zero-Shot CoT: This form involves instructing the where the veracity of information is critical. It is particularly
LLM to “think step by step”, prompting it to de- relevant in scenarios like fact-checking, where ensuring the
construct the problem and articulate each stage of accuracy of information provided by AI models is essential.
reasoning. By employing this technique, prompt engineers can enhance
2) Manual CoT: A more complex variant, it requires the trustworthiness of LLMs, making them more reliable for
providing step-by-step reasoning examples as tem- tasks that require high levels of factual accuracy.
plates for the model. While yielding more effective
results, it poses challenges in scalability and mainte- 4) Reflection: Reflection [160] involves prompting LLMs
nance. to assess and potentially revise their own outputs based on
reasoning about the correctness and coherence of their re- • Jailbreaking Rails: Prevent the LLM from generating
sponses. The concept of Reflection centers on the ability of responses that attempt to bypass its own operational
LLMs to engage in a form of self-evaluation. After generating constraints or guidelines.
an initial response, the model is prompted to reflect on its
own output, considering factors like factual accuracy, logical 8) Automatic Prompt Engineering (APE): Automatic
consistency, and relevance. This introspective process can lead Prompt Engineering (APE) [163] focuses on automating the
to the generation of revised or improved responses. process of prompt creation for Large Language Models
(LLMs). APE seeks to streamline and optimize the prompt
A key aspect of Reflection is the LLM’s capacity for design process, leveraging the capabilities of LLMs themselves
self-editing. By evaluating its initial response, the model can to generate and evaluate prompts. APE involves using LLMs
identify potential errors or areas of improvement. This iterative in a self-referential manner where the model is employed
process of generation, reflection, and revision enables the LLM to generate, score, and refine prompts. This recursive use of
to refine its output, enhancing the overall quality and reliability LLMs enables the creation of high-quality prompts that are
of its responses. more likely to elicit the desired response or outcome.
5) Expert Prompting: Expert Prompting [161] enhances the The methodology of APE can be broken down into several
capabilities of Large Language Models (LLMs) by simulating key steps:
the responses of experts in various fields. This method involves
prompting the LLMs to assume the role of an expert and re- • Prompt Generation: The LLM generates a range of
spond accordingly, providing high-quality, informed answers. potential prompts based on a given task or objective.
A key strategy within Expert Prompting is the multi-expert • Prompt Scoring: Each generated prompt is then
approach. The LLM is prompted to consider responses from evaluated for its effectiveness, often using criteria
multiple expert perspectives, which are then synthesized to like clarity, specificity, and likelihood of eliciting the
form a comprehensive and well-rounded answer. This tech- desired response.
nique not only enhances the depth of the response but also
incorporates a range of viewpoints, reflecting a more holistic • Refinement and Iteration: Based on these evalua-
understanding of the subject matter. tions, prompts can be refined and iterated upon, further
enhancing their quality and effectiveness.
6) Chains: Chains refer to the method of linking multiple
components in a sequence to handle complex tasks with Large C. Augmenting LLMs through external knowledge - RAG
Language Models (LLMs). This approach involves creating a
series of interconnected steps or processes, each contributing One of the main limitations of pre-trained LLMs is their
to the final outcome. The concept of Chains is based on lack of up-to-date knowledge or access to private or use-
the idea of constructing a workflow where different stages case-specific information. This is where retrieval augmented
or components are sequentially arranged. Each component in generation (RAG) comes into the picture [164]. RAG, illus-
a Chain performs a specific function, and the output of one trated in figure 37, involves extracting a query from the input
serves as the input for the next. This end-to-end arrangement prompt and using that query to retrieve relevant information
allows for more complex and nuanced processing, as each from an external knowledge source (e.g. a search engine or a
stage can be tailored to handle a specific aspect of the task. knowledge graph, see figure 38 ). The relevant information is
Chains can vary in complexity and structure, depending on then added to the original prompt and fed to the LLM in order
the requirements. In “PromptChainer: Chaining Large Lan- for the model to generate the final response. A RAG system
guage Model Prompts through Visual Programming” [162], includes three important components: Retrieval, Generation,
the authors not only describe the main challenges in designing Augmentation [165].
chains, but also describe a visual tool to support those tasks. a) RAG-aware prompting techniques: Because of the
7) Rails: Rails in advanced prompt engineering refer to importance of RAG to build advanced LLM systems, several
a method of guiding and controlling the output of Large RAG-aware prompting techniques have been developed re-
Language Models (LLMs) through predefined rules or tem- cently. One such technique is Forward-looking Active Retrieval
plates. This approach is designed to ensure that the model’s Augmented Generation (FLARE)
responses adhere to certain standards or criteria, enhancing the Forward-looking Active Retrieval Augmented Generation
relevance, safety, and accuracy of the output. The concept of (FLARE) [168] enhances the capabilities of Large Language
Rails involves setting up a framework or a set of guidelines Models (LLMs) by iteratively combining prediction and in-
that the LLM must follow while generating responses. These formation retrieval. FLARE represents an evolution in the
guidelines are typically defined using a modeling language or use of retrieval-augmented generation, aimed at improving the
templates known as Canonical Forms, which standardize the accuracy and relevance of LLM responses.
way natural language sentences are structured and delivered.
FLARE involves an iterative process where the LLM
Rails can be designed for various purposes, depending on actively predicts upcoming content and uses these predictions
the specific needs of the application: as queries to retrieve relevant information. This method con-
trasts with traditional retrieval-augmented models that typically
• Topical Rails: Ensure that the LLM sticks to a
retrieve information once and then proceed with generation. In
particular topic or domain.
FLARE, this process is dynamic and ongoing throughout the
• Fact-Checking Rails: Aimed at minimizing the gen- generation phase. In FLARE, each sentence or segment gener-
eration of false or misleading information. ated by the LLM is evaluated for confidence. If the confidence
Fig. 37: An example of synthesizing RAG with LLMs for question answering application [166].

tool usage by training an LLM to decide what tool to use

when, and even what parameters the API needs. Tools include
two different search engines, or a calculator. In the following
examples, the LLM decides to call an external Q&A tool,
a calculator, and a Wikipedia Search Engine More recently,
researchers at Berkeley have trained a new LLM called Gorilla
[67] that beats GPT-4 at the use of APIs, a specific but quite
general tool.
a) Tool-aware prompting techniques: Similarly to what
was described with RAG, several tool-aware prompting ap-
Fig. 38: This is one example of synthesizing the KG as a proaches have been developed to make usage of tools more
retriever with LLMs [167]. scalable. A popular technique is the so called Automatic Multi-
step Reasoning and Tool-use (ART).
Automatic Multi-step Reasoning and Tool-use (ART) [170]
is a prompt engineering technique that combines automated
level is below a certain threshold, the model uses the generated
chain of thought prompting with the use of external tools.
content as a query to retrieve relevant information, which is
ART represents a convergence of multiple prompt engineering
then used to regenerate or refine the sentence. This iterative
strategies, enhancing the ability of Large Language Models
process ensures that each part of the response is informed by
(LLMs) to handle complex tasks that require both reasoning
the most relevant and current information available.
and interaction with external data sources or tools.
For more details on RAG framework and its relevant works,
ART involves a systematic approach where, given a task
we refer the readers to this survey of retrieval augmented
and input, the system first identifies similar tasks from a task
generations [165].
library. These tasks are then used as examples in the prompt,
guiding the LLM on how to approach and execute the current
D. Using External Tools task. This method is particularly effective when tasks require a
Retrieving information from an external knowledge source combination of internal reasoning and external data processing
as described above is only one of the potential ways to augment or retrieval.
an LLM. More generally, an LLM can access any number
of external tools (e.g. an API to a service) to augment its E. LLM Agents
functionality. In that regards, RAG can be seen as a specific
The idea of AI agents has been well-explored in the history
instance of the broader category of the so called ”tools”.
of AI. An agent is typically an autonomous entity that can
Tools in this context are external functions or services that perceive the environment using its sensors, make a judgment
LLMs can utilize. These tools extend the range of tasks an based on the state it currently is, and accordingly act based on
LLM can perform, from basic information retrieval to complex the actions that are available to it.
interactions with external databases or APIs.
In the context of LLMs, an agent refers to a system based
In the paper ”Toolformer: Language Models Can Teach on a specialized instantiation of an (augmented) LLM that
Themselves to Use Tools” [169], the authors go beyond simple is capable of performing specific tasks autonomously. These
agents are designed to interact with users and environment to external data is not available, relying instead on a well-
make decisions based on the input and the intended goal of structured reasoning framework. This method is particularly
the interaction. Agents are based on LLMs equipped with the advantageous in scenarios where data retrieval is costly, slow,
ability to access and use tools, and to make decisions based on or uncertain, allowing the LLM-based agent to maintain a high
the given input. They are designed to handle tasks that require level of performance and reliability.
a degree of autonomy and decision-making, typically beyond
simple response generation. Reason and Act (ReAct)[176] prompts LLMs to generate
not only verbal reasoning but also actionable steps, thus
The functionalities of a generic LLM-based agent include: enhancing the model’s dynamic problem-solving capabilities.
ReAct is grounded in the principle of integrating reasoning
• Tool Access and Utilization: Agents have the capabil- with action. In this approach, the LLM is prompted to alternate
ity to access external tools and services, and to utilize between generating reasoning traces (explanations) and taking
these resources effectively to accomplish tasks. actions (steps or commands) in an interleaved manner. This
approach allows the model to dynamically reason about a prob-
• Decision Making: They can make decisions based on lem, and propose and take concrete actions simultaneously.
the input, context, and the tools available to them,
often employing complex reasoning processes. Dialog-Enabled Resolving Agents (DERA) [177] are spe-
cialized AI agents that can engage in dialogue, resolve queries,
As an example, an LLM that has access to a function (or and make decisions based on interactive exchanges. DERA
an API) such as weather API, can answer any question related is developed based on the idea of utilizing multiple agents
to the weather of the specific place. In other words, it can use within a dialog context, each with specific roles and functions.
APIs to solve problems. Furthermore, if that LLM has access These agents can include Researchers, who gather and analyze
to an API that allows to make purchases, a purchasing agent information, and Deciders, who make final judgments based
can be built to not only have capabilities to read information on the information provided. This division of roles allows for
from the external world, but also act on it [171]. a well-organized and efficient approach to problem-solving
and decision-making. DERA is particularly advantageous in
Fig. 40 shows another example of LLM-based agents for scenarios requiring complex decision-making and problem-
conversational information seeking [36], where an LLM is solving, such as those in medical diagnostics or customer ser-
augmented with a set of plug-and-play modules, including vice. The collaborative and interactive nature of DERA agents
a working memory that tracks the dialog state, a policy that allows them to handle intricate queries with a level of depth
makes an execution plan for the task and selects next system and nuance that single-agent systems might struggle with.
action, an action executor that performs an action selected by Moreover, this approach aligns well with human decision-
the policy (consolidating evidence from external knowledge, making processes, making AI reasoning more relatable and
or prompting the LLM to generate responses), and a utility trustworthy.
that accesses the alignment of the LLM’s responses with user
expectations or specific business requirements, and generate V. P OPULAR DATASETS FOR LLM S
feedback to improve agent performance.
Large language models exhibit promising accomplish-
For more details on LLM-based AI agents see recent survey ments, but the main question that arises is how effectively
[172], [173], [174]. they function and how their performance can be assessed in
a) Prompt engineering techniques for agents: Like specific tasks or applications.
RAG and Tools, prompt engineering techniques that specif- The evaluation of LLMs poses particular challenges due
ically address the needs of LLM-based agents have been to the evolving landscape of their applications. The original
developed. Three such examples are Reasoning without Ob- intent behind developing LLMs was to boost the performance
servation (ReWOO), Reason and Act (ReAct), and Dialog- of NLP tasks such as translation, summarization, question-
Enabled Resolving Agents (DERA). answering, and so on [178]. However, it is evident today
that these models are finding utility across diverse domains
Reasoning without Observation (ReWOO) [175] aims to
including code generation and finance. Moreover, the eval-
decouple reasoning from direct observations. ReWOO operates
uation of LLMs encompasses several critical considerations
by enabling LLMs to formulate comprehensive reasoning plans
such as fairness and bias, fact-checking, and reasoning. In
or meta-plans without immediate reliance on external data
this section, we outline the commonly used benchmarks for
or tools. This approach allows the agent to create a struc-
assessing LLMs. These benchmarks are categorized based on
tured framework for reasoning that can be executed once the
training or evaluating the LLM Capabilities.
necessary data or observations are available. In ReWOO, the
LLM initially develops a plan (a series of steps) that outlines
how to approach and solve a given problem. This meta- A. Datasets for Basic Tasks: language model-
planning phase is crucial as it sets the stage for the agent to ing/understanding/generation
process information once it becomes available. The execution This section provides an overview of the benchmarks and
phase then involves integrating actual data or observations into datasets suited to evaluate the basic abilities of LLMs.
the pre-specified plan, leading to coherent and contextually
relevant responses. ReWOO offers significant advantages in • Natural Questions [179] is a QA dataset that consists
terms of token efficiency and robustness to tool failure. It of real anonymized, aggregated queries submitted to
enables LLMs to handle tasks where immediate access to the Google search engine as questions. An annotator
Fig. 39: HuggingGPT: An agent-based approach to use tools and planning [image courtesy of [171]]

grams including a wide range of topics, including

fundamental programming concepts and standard li-
brary usage, and more. Each challenge comprises a
task description, a code solution, and three automated
test cases.
• HumanEval [182] is a dataset for code generation
task. This dataset consists of 164 hand-crafted pro-
gramming challenges. Each challenge is accompanied
by a function signature, docstring, code body, and mul-
tiple unit tests. The main intuition behind developing
this dataset is to guarantee the exclusion of its contents
from training datasets for code generation models.
• APPS [183] is designed for code generation task
focusing on the Python programming language. The
Fig. 40: A LLM-based agent for conversational information APPS dataset contains a collection of 232, 444 Python
seeking. Courtesy of [36]. programs. Each program in the dataset has an average
of 18 lines of Python code. Additionally, APPS offers
access to a repository of 10, 000 unique programming
exercises, each with text-based problem descriptions.
is presented with a question along with a Wikipedia
The final aspect to highlight is that the it includes test
page from the top 5 search results, and annotates a
cases.
long answer (typically a paragraph) and a short answer
(one or more entities) if present on the page, or marks • WikiSQL [184] is crafted for code generation task and
null if no long/short answer is present. it has 87,726 carefully labeled pairs of SQL queries
and corresponding natural language questions from
• MMLU [180] is intended to evaluate the knowl- Wikipedia tables. The SQL queries comprise three
edge gained in zero-shot and few-shot scenarios. That subsets: test sets (17, 284 examples), development
means that MMLU assesses both the general knowl- (9, 145 examples), and training (61, 297 examples).
edge and problem-solving ability of a model. It covers
57 subjects in STEM, humanities, social sciences, • TriviaQA [185] is designed for QA task. This
and other areas. The benchmark varies in complexity, dataset comprises more than 650, 000 question-
ranging from elementary to advanced professional. answer-evidence triples. There are 95, 000 question-
It is worth mentioning that the main contribution of answer pairs in this dataset, each authored by trivia en-
this dataset is for multi-task language understanding, thusiasts and supported by an average of six indepen-
question answering, and arithmetic reasoning. dently sourced evidence documents. These documents
are automatically acquired from Wikipedia or broader
• MBPP [181] stands for “Mostly Basic Python Prob- web search results. The dataset is categorized into
lems” and provides a benchmark for evaluating the two segments, including those with authentic answers
performance of models designed for code generation. from Wikipedia and web domains, and verified sets
The benchmark encompasses 974 short Python pro- embody the accurately answered questions along with
Fig. 41: Dataset applications.

their associated documents from both Wikipedia and M refers to the middle school examinations, whereas
online. RACE-H denotes the high school tests. Finally, RACE
is the synthesis of RACE-M and RACE-H.
• RACE [186] suits for reading comprehension task.
This dataset is based on English tests completed by • SQuAD [187] stands for “Stanford Question Answer-
Chinese students from middle school and high school, ing Dataset” and is a crowdsourced reading compre-
aged 12 to 18, and it contains roughly 28, 000 texts hension dataset based on Wikipedia articles. It has
and 100, 000 questions rigorously prepared by human approximately 100, 000 question-answer pairs con-
specialists, primarily English instructors. This dataset nected to more than 500 articles. The answers to
contains a wide range of subjects that were purpose- these questions are typically text fragments or spans
fully chosen to assess students’ comprehension and taken from the corresponding reading passages. The
reasoning abilities. This dataset is available in three questions may be unanswerable in some cases. The
subgroups: RACE-M, RACE-H, and RACE. RACE- dataset is divided into three sets: an 80% training set,
Fig. 42: Datasets licensed under different licenses.

a 10% development set, and a 10% hidden test set. • GSM8K [190] is designed to evaluate the model’s
ability for multi-step mathematical reasoning. GSM8K
• BoolQ [188] is a yes/no question-answering dataset
includes 8.5K linguistically diverse grade school math
where the goal is reading comprehension task. BoolQ
word problems written by humans. The dataset is split
includes 15, 942 examples. Each example is a triplet
into two sets: a training set with 7.5K problems,
that includes a question, a relevant paragraph, and
and a test set with 1K problems. These problems
the solution. Although the main intuition behind
need 2 to 8 steps to be solved. Solutions mainly
this dataset is for reading comprehension, it can be
are a series of elementary calculations using basic
used for reasoning, natural language inference, and
arithmetic operations.
question-answering tasks.
• MultiRC [189] is another dataset that fits reading • MATH [191] enables to assess how well models can
comprehension task. MultiRC contains brief para- solve math problems. MATH dataset hast 12, 500
graphs as well as multi-sentence questions that can problems from high school math competitions. Each
be answered using the information in the paragraph. problem in the dataset has a step-by-step solution and
The paragraphs in this dataset come from a variety a final answer enclosed in a box. The problems cover
of sources, including news, fiction, historical texts, a wide range of topics and have different levels of
Wikipedia articles, discussions on society and law, complexity. There are seven subjects in total. Further-
elementary school science textbooks, and 9/11 re- more, the difficulty of each problem is rated based
ports. Each question has many response choices, with on the AoPS standards on a scale from ′ 1′ to ′ 5′ . A
′ ′
one or more of them being correct. Answering the 1 shows the easiest problems in a subject, while ′ 5′
questions requires reasoning across several sentences. represents the most difficult. In terms of formatting,
MultiRC dataset encompasses around 6, 000 multi- all problems and solutions are presented using LATEX
sentence questions gathered from over 800 paragraphs. and the Asymptote vector graphics language.
On average, each question offers about two valid
• HellaSwag [192] is designed to assess commonsense
answer alternatives out of a total of five.
reasoning in LLMs. This benchmark includes 70, 000
multiple-choice questions. Each question is derived
B. Datasets for Emergent: ICL, reasoning (CoT), instruction
from one of two domains: ActivityNet or WikiHow,
following
and presents four answer choices regarding what
This section centers on the benchmarks and datasets em- might happen in the following situation. The correct
ployed to evaluate the emergent abilities of LLMs. answer provides an actual statement describing the
upcoming event, but the three wrong answers are C. Datasets for Augmented: using external knowledge/tools
created to confuse machines.
This section focuses on datasets designed for the aug-
• AI2 Reasoning Challenge (ARC) [193] is used mented abilities of LLMs.
for commonsense reasoning. This benchmark encom- • HotpotQA [198] is designed to cover a diverse and
passes 7, 787 science examination questions. These explainable question-answering dataset that necessi-
questions are in English, and most of them are set tates multi-hop reasoning. This dataset is derived from
up in a multiple-choice format. The questions have the English Wikipedia. It consists of roughly 113, 000
been divided into two groups: a Challenge Set with questions. Each question in the dataset comes with
2, 590 difficult questions and an Easy Set with 5,197 two paragraphs, called gold paragraphs, from two
questions. Each collection has also been pre-divided Wikipedia articles. Also, there is a list of sentences
into Train, Development, and Test subsets. in those paragraphs that crowdworkers have picked as
important for answering the question.
• PIQA [194] is intended to evaluate the language
representations on their knowledge of physical com- • ToolQA [199] is a question answering benchmark
monsense. In this dataset, the focus is on everyday to evaluate LLMs’ ability to use external tools for
situations with a preference for uncommon solutions. answering questions.
The central task is a multiple-choice question answer-
• GPT4Tools serves as an instructional dataset, gener-
ing, where a question (q) is provided along with two
ated by instructing advanced teachers (such as Chat-
potential solutions (s1, s2). Then, the best solution is
GPT), with instructions conditioned on visual content
chosen by whether a model or a human. For each
and tool descriptions. This process results in the
question, only one of the solutions is the correct
generation of instructions related to the use of tools.
answer.
There are three versions of this dataset. The first
version comprises 71,000 instruction-following data
• SIQA [195] provides a framework for evaluating mod-
points utilized to fine-tune the GPT4Tools model. The
els’ ability for commonsense reasoning about social
next version consists of manually cleaned instruction
situations. SIQA dataset has 38, 000 multiple-choice
data used for validation, covering instructions related
questions designed to assess emotional and social
to the tools from the first version. The last version is
intelligence in everyday circumstances. This dataset
cleaned instruction data used for testing and includes
covers a wide variety of social scenarios. In SIQA,
instructions related to some tools that are not present
the potential answers is a mixture of human-selected
in the first version.
responses and machine-generated ones that have been
filtered through adversarial processes.
VI. P ROMINENT LLM S ’ P ERFORMANCE ON
• OpenBookQA (OBQA) [196] is a new kind of B ENCHMARKS
question-answering dataset where answering its ques-
tions requires additional common and commonsense In this section we first provide an overview of some of
knowledge not contained in the book and rich text popular metrics used for evaluating the performance of LLMs
comprehension. This dataset includes around 6,000 under different scenarios. We then look at the performance
multiple-choice questions. Each question is linked to of prominent large language models on some of the popular
one core fact, as well as an additional collection datasets and benchmarks.
of over 6000 facts. The questions were developed
using a multi-stage crowdsourcing and expert filter- A. Popular Metrics for Evaluating LLMs
ing procedure. OpenBookQA questions are difficult Evaluating the performance of generative language models
because they need multi-hop reasoning with limited depends on the underlying task they are going to be used for.
background. Tasks that are mostly about selecting a choice out of given
ones (such as sentiment analysis), can be seen as simple as
• TruthfulQA [197] is designed specifically to eval- classification and their performance can be evaluated using
uate the truthfulness of language models in gen- classification metrics. Metrics such as accuracy, precision,
erating answers to questions. This dataset includes recall, F1, etc are applicable in this case. It is also important to
817 questions, written by authors, from 38 different note that the answers generated by the model for specific tasks
categories, including health, law, finance, and politics. such as multi-choice question answering are always either True
These questions are purposefully designed to chal- or False. If the answer is not in a set of options, it can be seen
lenge human responders, as they may contain common as False as well.
misunderstandings that lead to incorrect answers.
However, some tasks that are purely open-ended text gener-
• OPT-IML Bench [103] is a comprehensive bench- ation cannot be evaluated in the same way as for categorization.
mark for Instruction Meta-Learning. It covers 2000 Different metrics are required for the specific purpose of the
NLP tasks from 8 existing benchmarks. The OPT-IML evaluation. Code generation is a very different case in open-
Bench consists of a training set with 17.9 M examples, ended generative evaluations. The generated code must pass
a dev set with 145K samples, and a test set with 321K the test suite but on the other hand, it is also important
samples. to understand if a model is capable of generating different
TABLE II: LLM Datasets Overview.
Benchmark Name Evaluation Metric Leaderboard Source paperswithcode
HumanEval PASS@k Link Link Link
MBPP PASS@k, Accuracy - Link Link
APPS PASS@k, Accuracy - Link Link
WikiSQL Accuracy - Link Link
CoNaLa BLEU Link Link
CodeParrot PASS@k - Link -
HellaSwag Accuracy Link Link Link
AI2 Reasoning
Accuracy Link Link Link
Challenge (ARC)
BoolQ Accuracy - Link Link
MultiRC F1-score, Accuracy - Link Link
CNN/Daily Mail [200] Accuracy - Link -
SQuAD F1-score, EM Link Link Link
RACE Accuracy - Link Link
CNN/Daily Mail [201] ROUGE - Link Link
Drop F1-score, EM Link Link Link
QuAC F1-score, HEQ-Q, HEQ-D Link Link Link
TriviaQA EM, F1-score, Accuracy Link Link Link
Natural Questions EM, F1-score, Accuracy Link Link Link
StrategyQA Accuracy, Recall@10, SARI Link Link Link
CoQA F1-score Link Link Link
XSum ROUGE - Link Link
SAMSum ROUGE - - Link
WikiSum ROUGE - Link -
DialogSum ROUGE - Link Link
TruthfulQA MC1 , MC2, % true, % info, BLEURT Link Link Link
MMLU Accuracy Link Link Link
GSM8K Accuracy Link Link Link
PIQA Accuracy Link Link Link
SIQA Accuracy Link Link Link
OpenBookQA (OBQA) Accuracy Link Link Link
HotpotQA EM, F1-score, Joint EM, Joint F1-score, Link Link Link
MATH Accuracy - Link Link
CommonsenseQA Accuracy Link Link Link
Natural Instructions ROUGE-L, Human Link Link Link
BIG-bench Accuracy, Average - Link Link
Success rate, Precision, Recall, Incorrect
ToolTalk - Link Link
action rate, Percent of failing error types
MetaTool Accuracy, Precision, Recall, F1-score - Link Link
Successful Rate of Thought, Successful
GPT4Tools Rate of Action, Successful Rate of Ar- - Link Link
guments, Success Rate
Correctness, ROUGE, Error(API Hallu-
cination, Has Exception, Invalid Input
API-Bank - Link Link
Parameters, False API Call Format, API
Call, Miss Input Parameters)
Alpaca-CoT - - Link Link

solutions as a code, what is the probability of selecting the

correct one among them. Pass@k is a very good metric in this M
EM = (5)
case. It works in this manner that given a problem, different N
solutions as code are generated. They are tested for correctness
using different functionality tests. Afterward, from generated Human equivalence score (HEQ) on the other hand, is an
n solutions, and the respective c number of them being correct alternative to F1 score [203]. HEQ-Q represents the precision
equation 4 provides the final value. of individual questions, wherein an answer is deemed correct
if the model’s F1 score surpasses the average human F1 score.
Likewise, HEQ-D denotes the precision of each dialogue; it is
"
n−c
# deemed accurate when all questions within the dialogue meet
k the criteria of HEQ [182].
pass@k := E 1− n (4)
Problems k Evaluation of other generative tasks such as machine trans-
lation are based on metrics such as Rouge and BLEU. These
scores work well when there is a reference text as ground
Exact match (EM) is another metric that is mostly con- truth (such as translation) and a hypothesis that is generated
cerned with exact matches from (pre-defined) answers. It by the generative model, in our case the LLM. These scores
counts a prediction as correct if it exactly matches one of are mostly used for cases where the goal is to detect the
more than one desired reference text token by token. In some similarity of the answer and ground truth in a computation
cases, it can be the same as accuracy and the equation 5 shows manner. In a computation manner, it meant that nothing more
the mathematical definition. Here M is total number of correct than N-Grams would be used. However, metrics such as BERT-
answers and N is the total number of questions [202]. Score are also good for these cases but they are also heavily
TABLE III: LLM categories and respective definitions.
Classification Category Description
Small Number of parameters ≤ 1B
Medium 1B < Number of parameters ≤ 10B
Size
Large 10B < Number of parameters ≤ 100B
Very Large 100B < Number of parameters
Foundation model Pretrained language model
Type Instruction model Pretrained and instruction fine-tuned language model
Chat model Pretrained, instruction fine-tuned, and chat fine-tuned language model
Original model An original model released with either Foundation, Instruction, or Chat model
Origin
Tuned model Fine-tuned version of an original model
Publicly available Model and weights are available due to request to without request
Availability
Publicly unavailable Model and weights are not publicly available

TABLE IV: Different LLM categorization.

Model Size #Params (B) Type Availability Origin
Davinci-002 Very Large 175 Instruction Unavailable Tuned
Davinci-003 Very Large 175 Instruction Unavailable Tuned
GPT 3.5-turbo Large 20 Chat Unavailable Tuned
Falcon 7B Medium 7 Foundation Public Original
Alpaca Large 13 Chat Public Tuned
Pythia 7B Medium 7 Foundation Public Original
Pythia 12B Large 12 Foundation Public Original
LLAMA 7B Medium 7 Chat Public Original
LLAMA 2 7B Medium 7 Chat Public Tuned
LLAMA 2 7B Medium 7 Foundation Public Original
Vicuna 13B Large 13 Foundation Public Tuned
Vicuna 7B Medium 7 Foundation Public Tuned
Claude Large 93 Chat Unavailable Original
Claude 2 Very Large 137 Chat Unavailable Original

erroneous because another model is used to judge. Still, even we use is their primary use case. We consider each LLM to
today, evaluating purely generated content is very hard and be either: Foundation model (pretrained language model with
no completely fitting metric is not found, metrics are either no instruction fine-tuning and chat fine-tuning), Instruction
looking for simplistic features such as N-Gram, SkipGram, model (pretrained language model with only instruction fine-
etc, or they are models with unknown accuracy and preciseness tuning), and Chat model (pretrained language model with
[204]. instruction and chat fine-tuning). Apart from all the catego-
rization described, another category is required to distinguish
Generative evaluation metrics are also another type of eval- between original models and tuned ones. Original models are
uation metric for LLMs that use another LLM for evaluating those that have been released as a foundation model or a fine-
the answer. However, depending on the task itself, evaluation tuned one. Tuned models are those that grasped the original
can be possible in this way or not. Another dependency model and tuned it with different datasets or even different
that makes generative evaluation error-prone is reliance on training approaches. It is also good to note that original models
the prompt itself. RAGAS is one of the good examples that are usually foundation models that have been fine-tuned on
incorporate the usage of generative evaluation. specific datasets or even different approaches. Availability of
Various benchmarks and leaderboards have been proposed the model weights regardless of the license is another category
to address the most challenging question in the world of in our classification. Models that have their weights publicly
large language models: Which one is better? However not available (even through request) are noted as Public models
a simple answer can address this question. The answer de- while others are noted as Private. Table III shows all of these
pends on various aspects of large language models. Section V definitions and abbreviations used in the rest of the article.
shows the categorical presentation of different tasks and the Figure 43 illustrate these visually.
most important datasets in each category. We will follow the According to the provided categorizations, we can catego-
same categorization and provide a comparison based on each rize and label each notable LLM as shown in table IV. As can
category. After providing comparison for each category, we be seen from this table, models categorized as very large are
will provide a broad overview of aggregated performance by also unavailable as well.
averaging the reported performance metric on different tasks.
B. LLMs’ Performance on Different Tasks
Evaluating different LLMs can be seen also from different
perspectives. For example, a LLM with a drastically fewer Commonsense reasoning is one of the important capabili-
number of parameters is not completely comparable to one ties each model can obtain. This capability denotes the ability
with a larger number of parameters. From this perspective, we of the model to use prior knowledge in combination with
will categorize LLMs in four categories as well: small (less reasoning skills. In the case of HellaSwag for example, finding
than or equal to 1 billion parameters), medium (between 1 and the continuation of text is challenging because the given text
10 billion), large (between 10 and 100 billion), and very large contains a partial part of the story while the given choices
(more than 100 billion). Another classification for the LLMs as continuation are tricky to select, and without having prior
Large LM
10B < # of params <100B

Medium LM
1B < # of params <10B
Very Large LM
100B < # of params

Small LM Pretrained model with no instruction

or chat fine-tuning.
# of params <1B Example: MPT-7B
Fine tuned models that are originally
based on original models.
Example: Alpaca (based on LLaMA)
Parameters

Foundation
Tuned

Originality
Pretrained model that is
Large also fine-tuned on

Type
instruction following.
Fine tuning Language Instruction Example: MPT-7B-instruct
Models
Original
Chat Pretrained model that is
also fine-tuned on chat.
Example: MPT-7B-chat

Availability
Original models that are not fine
tuned or based on any other
pretrained model.
Example: LLaMA

Model weights are NOT publicly Model weights are publicly released
released and is NOT available. and is available.
Example: GPT-4 Private Public Example: LLaMA

Fig. 43: LLM categorizations.

knowledge about the world it is not possible. This specific kind From the results presented in Table V it is clear that GPT-4
of reasoning deserves high attention because it is related to achieves best results for HellaSwag while Davinci-003 is best
utilizing previous knowledge with open text-described scenes model for OBQA. It is also good to note that results for OBQA
or facts. As can be seen from table V not just Unavailable are not reported for all of the models and possibly davinci-003
models but also Public ones can achieve good results on is not the best model achieving highest results on OBQA.
various tests.

TABLE V: Commonsense reasoning comparison.

Not all models report their performance on all datasets, and
Model OBQA HellaSwag because of that, the number of models for which performance
Davinci-003 51 83.4
Falcon 7B 44.4 76.3
is reported in different tables varies.
Alpaca 43.4 73.9
Pythia 7B 37.2 64
Pythia 12B 43.2 68.1
LLAMA 7B 42.4 73
Dolly 6B 41.2 67.6
Dolly 12B 40.4 71
Alpaca 7B 43.4 73.9
TABLE VI: Symbolic reasoning comparison.
Alpaca Lora 7B 42.6 74
Model Cobjects Penguins
GPT-J 6.7B 38.2 66.2
GPT-NeoX 26 33.56
LLama 7B 42.4 73
OPT 66B 31.2 28.08
LLama 13B 42.2 76.2
Bloomberg GPT 34.8 37.67
Pythia 6.7B 37.2 64
BLOOM 176B 36.8 40.41
Pythia 12B 38 67.3
PaLM 540B 38 44.5
StableLM Tuned 33.4 53.6
Gopher-280B 49.2 40.6
Koala 13B 42.8 72.6
Chinchilla-70B 59.7 48.7
Mosaic mpt-7B 42.6 76.3
PaLM 2 61.2 65.8
LLAMA 2 70B - 87.33
LLAMA 65B - 86.09
Falcon 40B - 85.3
Falcon 180B - 88.86
MPT Instruct 30B - 84.31
MPT Instruct 7B - 77.91 World knowledge is mostly about general knowledge ques-
Yi 6B - 76.42
Yi 34B - 85.69 tions, for example, in Wikifact dataset questions such as ”Who
GPT-4 - 95.3 is the author of a specific well-known book” can be found and
Gemini Ultra - 87.8 references are also provided. Table VII shows the results.
TABLE VII: World knowledge comparison. TABLE IX: Arithmetic reasoning comparison.
Model TriviaQA NaturalQ WebQ ARC Model GSM8k MATH
BLOOM - - - 32.9 Gemini Ultra 94.4 53.2
BLOOM 176B - - - 50.85 GPT-4 87.1 42.5
Bloomberg GPT - - - 48.63 Gemini Pro 86.5 32.6
Chinchilla - 35.5 - - ToRA 70B 84.3 49.7
Codex + REPLUG 76.8 44.7 - - MathCoder-L-70B 83.9 -
GAL 120B - - - 67.9 MetaMath 70B 82.3 26
GLaM 62B/64E 75.8 32.5 15.5 50.3 MuggleMATH 70B 82.3 -
Gopher - 28.2 - - MathCoder-CL-34B 81.7 45.2
GPT-3 175B 71.2 29.9 41.5 85.2 ToRA-Code 34B 80.7 50.8
GPT-4 - - - 96.4 MetaMath-Mistral-7B 77.7 -
GPT-NeoX - - - 45.39 Arithmo2-Mistral-7B 76.4 -
LLaMA 13B - - - 52.7 ToRA-Code 13B 75.8 48.1
LLaMA 2 70B 85 33 - - Arithmo-Mistral-7B 74.7 -
LLaMA 33B - 24.9 - 57.8 MathCoder-CL-13B 74.1 35.9
LLaMA 65B 72.6 39.9 - - MuggleMATH 13B 74 -
LLaMA 7B - - - 47.6 CodeT5+ 73.8 -
Mistral 7B 69.9 28.8 - 55.5 KwaiYiiMath 13B 73.3 -
Neo-6B - 13.7 - - ToRA-Code 7B 72.6 44.6
OPT - - - 31.1 MathCoder-L-13B 72.6 29.9
OPT 66B - - - 44.54 MetaMath 13B 71 22.5
OPT-175B - - - 43.94 LLaMA 65B 69.7 10.6
OPT-175B - - - 25.6 MuggleMATH 7B 68.4 -
PaLM 2-L 86.1 37.5 28.2 95.1 MathCoder-CL-7B 67.8 23.3
PaLM 2-M 81.7 32 26.9 64.9 MetaMath 7B 66.4 19.4
PaLM 2-S 75.2 25.3 21.8 59.6 RFT 70B 64.8 -
PaLM-540B 81.4 39.6 43.5 87.1 MathCoder-L-7B 64.2 -
phi-1.5-web 1.3B - - - 44.9 Orca 2-13B 59.14 -
SparseGPT - - - 38.99 U-PaLM 58.5 -
SparseGPT - - - 39.85 PaLM-540B 58.1 8.8
SparseGPT - - - 41.3 LLaMA 2 70B 56.8 -
RFT 13B 55.3 -
LLaMA 33B 53.1 7.1
Mistral 7B 52.2 13.1
RFT 7B 51.2 -
For some specific use-case models, it is highly demanded to LLaMA 65B 50.9 20.5
have coding and code-generation capability. Table VIII shows Orca 2-7B 47.23 -
Text-davinci-002 40.7 19.1
the results of different models on coding capability. LLaMA 33B 35.6 3.9
GPT-Neo-2.7B 19.5 -
LLaMA 7B 18.1 2.9
PaLM 540B 17.9 8.8
TABLE VIII: Coding capability comparison. LLaMA 13B 17.8 3.9
LLaMA 7B 11 2.9
Model HumanEval
GPT-Neo-125M 7.5 -
Gemini Ultra 74.4
PaLM 8B 4.1 1.5
Gemini Pro 67.7
GPT-2 - 5.4
GPT-4 67
GPT-3 175B - 5.2
WizardCoder 15B 57.3
PaLM 62B - 4.4
phi-1 1.3B 50.6
GPT-3-13B - 3
Code Llama 48.8
LLaMA 7B 11 2.9
GPT-3.5 48.1
PaLM 8B - 1.5
OctoCoder 46.2
phi-1-small 45
PaLM 2-S 37.6
InstructCodeT5+ 16B 35 Large language models in some cases are hallucinating an-
Mistral 7B 30.5 swers simply because they are next-token prediction machines.
LLaMA 2 29.9
phi-1-base 29
Hallucination is one of the important factors in measuring
Codex-12B 28.81 how much a large language model is trustworthy and reliable.
PaLM 540B 26.2 Measuring hallucination on the other hand is also not easy as it
CodeT5+ 2B 24.2
LLaMA 65B 23.7
seems because each fact can be written in different styles and
LLaMA 33B 21.7 even the smallest changes in writing make it hard to detect.
PaLM 62B 15.9 It is fair to assume if any particular LLM is more capable
LLaMA 13B 15.8 to detect hallucination of false information in text, it is also
LaMDA 137B 14
MIM-350M 13.7 more trustworthy. HaluEval is one of the datasets that aims to
LLaMA 7B 10.5 measure hallucination in this field [205]. Evaluation can also be
PaLM 8B 3.6 performed by another model judging the response with regard
to the actual answer [206]. Table X shows the evaluation of
different models based on these datasets.
Arithmetic reasoning is another challenging reasoning ca-
VII. C HALLENGES AND F UTURE D IRECTIONS
pability to achieve. GSM8K for example contains grade school
mathematical questions with respect to their answers. Table IX As we have seen in the previous sections, large language
provides an insight for different model comparisons. models have achieved impressive results in the past 1-2 years.
TABLE X: Hallucination evaluation
Model HHEM HaluEval QA HaluEval Dialogue HaluEval Sum. HaluEval General
GPT 4 97 - - - -
GPT 4 Turbo 97 - - - -
GPT 3.5 Turbo 96.5 62.59 72.4 58.53 79.44
Davinci002 - 60.05 60.81 47.77 80.42
Davinci003 - 49.65 68.37 48.07 80.4
GPT-3 - 49.21 50.02 51.23 72.72
Google Gemini Pro 95.2 - - - -
Llama 2 70B 94.9 - - - -
Llama 2 7B 94.4 49.6 43.99 49.55 20.46
Llama 2 13B 94.1 - - - -
Cohere-Chat 92.5 - - - -
Cohere 91.5 - - - -
Claude 2 91.5 69.78 64.73 57.75 75
Claude 1 67.6 64.83 53.76 73.88
Microsoft Phi 2 91.5 - - - -
Google Palm 2 (beta) 91.4 - - - -
Mixtral 8x7B 90.7 - - - -
Amazon Titan Express 90.6 - - - -
Mistral 7B 90.6 - - - -
Google Palm 2 Chat (beta) 90 - - - -
Google Palm 2 87.9 - - - -
Google Palm 2 Chat 72.8 - - - -
ChatGLM - 47.93 44.41 48.57 30.92
Falcon - 39.66 29.08 42.71 18.98
Vicuna - 60.34 46.35 45.62 19.48
Alpaca - 6.68 17.55 20.63 9.54

At the same time this is still a new and extremely active approach since its inception. As described earlier, attention is
research area where the pace of innovation is increasing rather the main mechanism driving transformers. More recently, there
than slowing down. As in any other evolving area though, there has been promising research in alternative approaches that are
are still numerous challenges ahead. Here we briefly mention being labelled as post-attention.
some of the challenges and main active areas which are known
so far. An important class of such class of post-attention models
are the so called State Space Models (SSMs). While the notion
of State Space Models has a long history in machine learning,
A. Smaller and more efficient Language Models it should be noted that in the context of language models, SSM
This is a survey on large language models, and there is usually used in reference to the newer Structure State Space
has been an initial push towards ”larger is better” that has Model architecture or S4 for short (see Gu et al. [29]). Some
clearly been rewarded with ever larger models like GPT- recent models in this category are Mamba [30], Hyena [209],
4 getting better accuracy and performance in benchmarks. and Striped Hyena [210].
However, those large models are costly and inefficient in
While all of those models are very competitive in terms of
several dimensions (e.g. high latency). In response to all of
performance in leaderboards and efficiency, they also address
this, there is a current research trend to come up with Small
an important challenge in more traditional attention-based
Language Models (SLMs) as a cost-effective alternative to
architectures: the lack of support for larger context windows.
LLMs, particularly when used on specific tasks that might not
require the full generality of larger models. Prominent works Having a good answer to many prompts requires context.
in this direction include Phi-1 [207], Phi-1.5 [208], and Phi-2 For example, the response to ”Recommend some good movies
from Microsoft. for me” requires a lot of context about ”me” as well as what
movies are available and which ones I have not watched.
More generally, we should expect many research efforts in
Context length is especially important for RAG, where large
this area of how to train smaller and more efficient models.
portions of text might be retrieved and injected into the prompt
Techniques such as parameter-efficient fine-tuning (PEFT),
for generation (see section IV-C.
teacher/student, and other forms of distillation – see section
III-I – will continue to be used to build a smaller model out The longer the context length, the more tokens we can
of larger ones. squeeze into the context. The more information the model has
access to, the better its response will be. But on the other
B. New Post-attention Architectural Paradigms hand, with very long context, it would be hard for the model
to remember everything and efficiently process all the informa-
Transformer blocks have been a crucial and constant part of tion. Attention-based models are highly inefficient for longer
most of current LLM frameworks, and it’s a big question mark contexts and that is why we should expect more research in
how much longer this architecture will be in vogue, and what different mechanisms that enable processing longer contexts
will be the next big architectural break-through in the field of and generally come up with more efficient architectures.
deep learning (and NLP). Since AlexNet in 2012, we have seen
many architectures go in and out of fashion, including LSTM, That being said, new architectures might not only propose
GRU, seq2seq, but Transformers have been the dominant alternatives for the attention mechanism but rather rethink the
whole Transformer architecture. As an early example of this, deployed to better understand people preference and interests,
Monarch Mixer [211] proposes a new architecture that uses and provide more personalized interactions, whether in cus-
the same sub-quadratic primitive that achieves high hardware tomer service, content recommendation, or other applications.
efficiency on GPUs – Monarch matrices – along both sequence This involves better understanding of user preferences, and
length and model dimension. analyzing their past interactions and using them as the context.
We will continue to see research in the application and usage
On the other end of the spectrum, it is worth mentioning
of LLMs for not only personalization and recommendations,
that there are some attention-compatible architectural mecha-
but many other application areas using other machine learning
nisms that have been recently gaining steam and proving their
techniques.
value in creating better and more powerful LLMs. Probably
the best example of such mechanism is Mixture of Experts Finally, another important area of research we expect to
(MoE). MoEs have been around in machine learning for years, gather increased attention is that of LLM-based agents and
even before the Deep Learning Era [212], but they have been multi-agent systems [172], [173], [174]. The development of
gaining popularity since then, and particularly in the context LLM systems with access to external tools and decision-
of Transformer models and LLMs. making capabilities is both exciting and challenging. We will
see continued research and progress in this important area that
In LLMs, MoEs allow to train an extremely large model
some argue could lead to Artificial General Intelligence (AGI).
than is then only partially instantiated during inference
when some of the experts are turned off wherever the gat-
ing/weighting function has a low weight assigned to them. As E. Security and Ethical/Responsible AI
an example, the GLaM model has 1.2 trillion parameters, but Ensuring the robustness and security of LLMs against
during inference only 2 out of the 64 experts are used [84]. adversarial attacks and other vulnerabilities is a critical area
of research [217]. As LLMs are increasingly deployed in real-
MoEs are nowadays an important component of the so-
world applications, they need to be protected from potential
called frontier LLMs (i.e. the most advanced and capable
threats, to prevent them being used to manipulate people or
models). GPT-4 itself is rumored to be based on a MoE
spread mis-information.
architecture, and some of the best performing LLMs such as
Mixtral [117], are basically an MoE version of pre-existing Addressing ethical concerns and biases in LLMs is another
LLMs. active area of research. Efforts are being made to ensure that
LLMs are fair, unbiased, and capable of handling sensitive
Finally, it is important to note that MoEs can be used as a
information responsibly. As LLMs are being used more and
component of any architecture regardless of whether it is based
more by a large number of people on a daily basis, making
on attention or not. In fact, MoEs have also been applied to
sure they are unbiased and behave responsibly is crucial.
SSM-based LLMs like Mamba citepioro2024moemamba. We
should continue to see MoE-driven improvements in the future
regardless of the underlying architecture. VIII. C ONCLUSION
This paper present a survey of LLMs developed in the
C. Multi-modal Models past few years. We first provide an overview of early pre-
trained language models (e.g., as BERT), then review three
Future LLMs are expected to be multi-modal and handle
popular LLM families (GPT, LLaMA, PaLM), and other
a variety of data types, such as text, images, and videos,
representative LLMs. We then survey methods and techniques
audio, in a unified manner. This opens up possibilities for
of building, augmenting, and using LLMs. We review popular
more diverse applications in fields like question answering,
LLM datasets and benchmarks, and compare performance of
content generation, creative arts, and healthcare, robotics, and
a set of prominent models on public benchmarks. Finally, we
beyond. There are already several prominent multi-modal
present open challenges and future research directions.
LLMs out there, including: LLAVA [213], LLAVA-Plus [214],
GPT-4 [33], Qwen-vl [116], Next-GPT [215], but the trend is
expected to be continued. Evaluation of these models also is a R EFERENCES
new research topic, especially conversational generative vision [1] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess,
models [216]. Multi-modal LLMs can unlock huge potentials R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws
for neural language models,” arXiv preprint arXiv:2001.08361, 2020.
in a variety of tasks, and there has already been a descent
[2] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai,
progress in this direction, which needs a dedicated paper to E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark
discuss all its details. et al., “Training compute-optimal large language models,” arXiv
preprint arXiv:2203.15556, 2022.
D. Improved LLM Usage and Augmentation techniques [3] C. E. Shannon, “Prediction and entropy of printed english,” Bell system
technical journal, vol. 30, no. 1, pp. 50–64, 1951.
As we described in sectionIV, many of the shortcomings [4] F. Jelinek, Statistical methods for speech recognition. MIT press,
and limitations of LLMs such as hallucination can be ad- 1998.
dressed through advanced prompt engineering, use of tools, [5] C. Manning and H. Schutze, Foundations of statistical natural lan-
or other augmentation techniques. We should expect not only guage processing. MIT press, 1999.
continued, but accelerated research in this area. [6] C. D. Manning, An introduction to information retrieval. Cambridge
university press, 2009.
LLM-based systems are already starting to replace ma- [7] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min,
chine learning systems that were until recently using other B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language
approaches. As a clear example of this, LLMs are now being models,” arXiv preprint arXiv:2303.18223, 2023.
[8] C. Zhou, Q. Li, C. Li, J. Yu, Y. Liu, G. Wang, K. Zhang, C. Ji, Q. Yan, [31] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra,
L. He et al., “A comprehensive survey on pretrained foundation mod- A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al.,
els: A history from bert to chatgpt,” arXiv preprint arXiv:2302.09419, “Palm: Scaling language modeling with pathways,” arXiv preprint
2023. arXiv:2204.02311, 2022.
[9] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre- [32] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux,
train, prompt, and predict: A systematic survey of prompting methods T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama:
in natural language processing,” ACM Computing Surveys, vol. 55, Open and efficient foundation language models,” arXiv preprint
no. 9, pp. 1–35, 2023. arXiv:2302.13971, 2023.
[10] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, [33] OpenAI, “GPT-4 Technical Report,” https://arxiv.org/pdf/2303.
J. Xu, and Z. Sui, “A survey for in-context learning,” arXiv preprint 08774v3.pdf, 2023.
arXiv:2301.00234, 2022. [34] J. Wei, X. Wang, D. Schuurmans, M. Bosma, b. ichter,
[11] J. Huang and K. C.-C. Chang, “Towards reasoning in large language F. Xia, E. Chi, Q. V. Le, and D. Zhou, “Chain-of-thought
models: A survey,” arXiv preprint arXiv:2212.10403, 2022. prompting elicits reasoning in large language models,” in
[12] S. F. Chen and J. Goodman, “An empirical study of smoothing Advances in Neural Information Processing Systems, S. Koyejo,
techniques for language modeling,” Computer Speech & Language, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh,
vol. 13, no. 4, pp. 359–394, 1999. Eds., vol. 35. Curran Associates, Inc., 2022, pp. 24 824–24 837.
[Online]. Available: https://proceedings.neurips.cc/paper files/paper/
[13] Y. Bengio, R. Ducharme, and P. Vincent, “A neural probabilistic
2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf
language model,” Advances in neural information processing systems,
vol. 13, 2000. [35] G. Mialon, R. Dessı̀, M. Lomeli, C. Nalmpantis, R. Pasunuru,
R. Raileanu, B. Rozière, T. Schick, J. Dwivedi-Yu, A. Celikyil-
[14] H. Schwenk, D. Déchelotte, and J.-L. Gauvain, “Continuous space
maz et al., “Augmented language models: a survey,” arXiv preprint
language models for statistical machine translation,” in Proceedings
arXiv:2302.07842, 2023.
of the COLING/ACL 2006 Main Conference Poster Sessions, 2006,
pp. 723–730. [36] B. Peng, M. Galley, P. He, H. Cheng, Y. Xie, Y. Hu, Q. Huang,
L. Liden, Z. Yu, W. Chen, and J. Gao, “Check your facts and try
[15] T. Mikolov, M. Karafiát, L. Burget, J. Cernockỳ, and S. Khudanpur,
again: Improving large language models with external knowledge and
“Recurrent neural network based language model.” in Interspeech,
automated feedback,” arXiv preprint arXiv:2302.12813, 2023.
vol. 2, no. 3. Makuhari, 2010, pp. 1045–1048.
[16] A. Graves, “Generating sequences with recurrent neural networks,” [37] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao,
arXiv preprint arXiv:1308.0850, 2013. “React: Synergizing reasoning and acting in language models,” arXiv
preprint arXiv:2210.03629, 2022.
[17] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck, “Learning
deep structured semantic models for web search using clickthrough [38] D. E. Rumelhart, G. E. Hinton, R. J. Williams et al., “Learning internal
data,” in Proceedings of the 22nd ACM international conference on representations by error propagation,” 1985.
Information & Knowledge Management, 2013, pp. 2333–2338. [39] J. L. Elman, “Finding structure in time,” Cognitive science, vol. 14,
[18] J. Gao, C. Xiong, P. Bennett, and N. Craswell, Neural Approaches to no. 2, pp. 179–211, 1990.
Conversational Information Retrieval. Springer Nature, 2023, vol. 44. [40] M. V. Mahoney, “Fast text compression with neural networks.” in
[19] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning FLAIRS conference, 2000, pp. 230–234.
with neural networks,” Advances in neural information processing [41] T. Mikolov, A. Deoras, D. Povey, L. Burget, and J. Černockỳ, “Strate-
systems, vol. 27, 2014. gies for training large scale neural network language models,” in 2011
[20] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, “On IEEE Workshop on Automatic Speech Recognition & Understanding.
the properties of neural machine translation: Encoder-decoder ap- IEEE, 2011, pp. 196–201.
proaches,” arXiv preprint arXiv:1409.1259, 2014. [42] tmikolov. rnnlm. [Online]. Available: https://www.fit.vutbr.cz/
∼imikolov/rnnlm/
[21] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár,
J. Gao, X. He, M. Mitchell, J. C. Platt et al., “From captions to [43] S. Minaee, N. Kalchbrenner, E. Cambria, N. Nikzad, M. Chenaghlu,
visual concepts and back,” in Proceedings of the IEEE conference and J. Gao, “Deep learning–based text classification: a comprehensive
on computer vision and pattern recognition, 2015, pp. 1473–1482. review,” ACM computing surveys (CSUR), vol. 54, no. 3, pp. 1–40,
[22] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: 2021.
A neural image caption generator,” in Proceedings of the IEEE [44] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.
conference on computer vision and pattern recognition, 2015, pp. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,”
3156–3164. Advances in neural information processing systems, vol. 30, 2017.
[23] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, [45] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut,
and L. Zettlemoyer, “Deep contextualized word representations. corr “Albert: A lite bert for self-supervised learning of language represen-
abs/1802.05365 (2018),” arXiv preprint arXiv:1802.05365, 2018. tations,” arXiv preprint arXiv:1909.11942, 2019.
[24] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training [46] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, “Electra: Pre-
of deep bidirectional transformers for language understanding,” arXiv training text encoders as discriminators rather than generators,” arXiv
preprint arXiv:1810.04805, 2018. preprint arXiv:2003.10555, 2020.
[25] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, [47] G. Lample and A. Conneau, “Cross-lingual language model pretrain-
L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert ing,” arXiv preprint arXiv:1901.07291, 2019.
pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
[48] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and
[26] P. He, X. Liu, J. Gao, and W. Chen, “Deberta: Decoding-enhanced bert Q. V. Le, “Xlnet: Generalized autoregressive pretraining for language
with disentangled attention,” arXiv preprint arXiv:2006.03654, 2020. understanding,” Advances in neural information processing systems,
[27] X. Han, Z. Zhang, N. Ding, Y. Gu, X. Liu, Y. Huo, J. Qiu, Y. Yao, vol. 32, 2019.
A. Zhang, L. Zhang et al., “Pre-trained models: Past, present and [49] L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao,
future,” AI Open, vol. 2, pp. 225–250, 2021. M. Zhou, and H.-W. Hon, “Unified language model pre-training for
[28] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang, “Pre-trained natural language understanding and generation,” Advances in neural
models for natural language processing: A survey,” Science China information processing systems, vol. 32, 2019.
Technological Sciences, vol. 63, no. 10, pp. 1872–1897, 2020. [50] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improv-
[29] A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences with ing language understanding by generative pre-training,” 2018.
structured state spaces,” 2022. [51] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al.,
[30] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with “Language models are unsupervised multitask learners,” OpenAI blog,
selective state spaces,” arXiv preprint arXiv:2312.00752, 2023. vol. 1, no. 8, p. 9, 2019.
[52] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Available: [https://huggingface.co/stabilityai/StableBeluga2](https://
Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning huggingface.co/stabilityai/StableBeluga2)
with a unified text-to-text transformer,” The Journal of Machine [73] Y. Tay, J. Wei, H. W. Chung, V. Q. Tran, D. R. So, S. Shakeri, X. Gar-
Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020. cia, H. S. Zheng, J. Rao, A. Chowdhery et al., “Transcending scaling
[53] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, laws with 0.1% extra compute,” arXiv preprint arXiv:2210.11399,
A. Barua, and C. Raffel, “mt5: A massively multilingual pre-trained 2022.
text-to-text transformer,” arXiv preprint arXiv:2010.11934, 2020. [74] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus,
[54] K. Song, X. Tan, T. Qin, J. Lu, and T.-Y. Liu, “Mass: Masked Y. Li, X. Wang, M. Dehghani, S. Brahma et al., “Scaling instruction-
sequence to sequence pre-training for language generation,” arXiv finetuned language models,” arXiv preprint arXiv:2210.11416, 2022.
preprint arXiv:1905.02450, 2019. [75] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos,
[55] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, S. Shakeri, E. Taropa, P. Bailey, Z. Chen et al., “Palm 2 technical
V. Stoyanov, and L. Zettlemoyer, “Bart: Denoising sequence-to- report,” arXiv preprint arXiv:2305.10403, 2023.
sequence pre-training for natural language generation, translation, and [76] K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung,
comprehension,” arXiv preprint arXiv:1910.13461, 2019. N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl et al., “Large language
[56] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, models encode clinical knowledge,” arXiv preprint arXiv:2212.13138,
A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language mod- 2022.
els are few-shot learners,” Advances in neural information processing [77] K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, L. Hou,
systems, vol. 33, pp. 1877–1901, 2020. K. Clark, S. Pfohl, H. Cole-Lewis, D. Neal et al., “Towards expert-
[57] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Ka- level medical question answering with large language models,” arXiv
plan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., preprint arXiv:2305.09617, 2023.
“Evaluating large language models trained on code,” arXiv preprint [78] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du,
arXiv:2107.03374, 2021. A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot
[58] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, learners,” arXiv preprint arXiv:2109.01652, 2021.
C. Hesse, S. Jain, V. Kosaraju, W. Saunders et al., “Webgpt: Browser- [79] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song,
assisted question-answering with human feedback,” arXiv preprint J. Aslanides, S. Henderson, R. Ring, S. Young et al., “Scaling language
arXiv:2112.09332, 2021. models: Methods, analysis & insights from training gopher,” arXiv
[59] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, preprint arXiv:2112.11446, 2021.
C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language [80] V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai,
models to follow instructions with human feedback,” Advances in A. Chaffin, A. Stiegler, T. L. Scao, A. Raja et al., “Multi-
Neural Information Processing Systems, vol. 35, pp. 27 730–27 744, task prompted training enables zero-shot task generalization,” arXiv
2022. preprint arXiv:2110.08207, 2021.
[60] OpenAI. (2022) Introducing chatgpt. [Online]. Available: https: [81] Y. Sun, S. Wang, S. Feng, S. Ding, C. Pang, J. Shang, J. Liu, X. Chen,
//openai.com/blog/chatgpt Y. Zhao, Y. Lu et al., “Ernie 3.0: Large-scale knowledge enhanced pre-
[61] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, training for language understanding and generation,” arXiv preprint
N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama arXiv:2107.02137, 2021.
2: Open foundation and fine-tuned chat models,” arXiv preprint [82] S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Mil-
arXiv:2307.09288, 2023. lican, G. B. Van Den Driessche, J.-B. Lespiau, B. Damoc, A. Clark
[62] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, et al., “Improving language models by retrieving from trillions of
and T. B. Hashimoto, “Alpaca: A strong, replicable instruction- tokens,” in International conference on machine learning. PMLR,
following model,” Stanford Center for Research on Foundation Mod- 2022, pp. 2206–2240.
els. https://crfm. stanford. edu/2023/03/13/alpaca. html, vol. 3, no. 6, [83] O. Lieber, O. Sharir, B. Lenz, and Y. Shoham, “Jurassic-1: Technical
p. 7, 2023. details and evaluation,” White Paper. AI21 Labs, vol. 1, p. 9, 2021.
[63] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: Ef-
[84] N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun,
ficient finetuning of quantized llms,” arXiv preprint arXiv:2305.14314,
Y. Zhou, A. W. Yu, O. Firat et al., “Glam: Efficient scaling of
2023.
language models with mixture-of-experts,” in International Conference
[64] X. Geng, A. Gudibande, H. Liu, E. Wallace, P. Abbeel, S. Levine, on Machine Learning. PMLR, 2022, pp. 5547–5569.
and D. Song, “Koala: A dialogue model for academic research,” Blog
[85] R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-
post, April, vol. 1, 2023.
T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du et al., “Lamda: Language
[65] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, models for dialog applications,” arXiv preprint arXiv:2201.08239,
D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier et al., 2022.
“Mistral 7b,” arXiv preprint arXiv:2310.06825, 2023.
[86] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen,
[66] B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, C. Dewan, M. Diab, X. Li, X. V. Lin et al., “Opt: Open pre-trained
J. Liu, T. Remez, J. Rapin et al., “Code llama: Open foundation models transformer language models,” arXiv preprint arXiv:2205.01068, 2022.
for code,” arXiv preprint arXiv:2308.12950, 2023.
[87] R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Sar-
[67] S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez, “Gorilla: Large avia, A. Poulton, V. Kerkez, and R. Stojnic, “Galactica: A large
language model connected with massive apis,” 2023. language model for science,” arXiv preprint arXiv:2211.09085, 2022.
[68] A. Pal, D. Karkhanis, M. Roberts, S. Dooley, A. Sundararajan, and [88] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou,
S. Naidu, “Giraffe: Adventures in expanding context lengths in llms,” S. Savarese, and C. Xiong, “Codegen: An open large language
arXiv preprint arXiv:2308.10882, 2023. model for code with multi-turn program synthesis,” arXiv preprint
[69] B. Huang, “Vigogne: French instruction-following and chat models,” arXiv:2203.13474, 2022.
https://github.com/bofenghuang/vigogne, 2023. [89] S. Soltan, S. Ananthakrishnan, J. FitzGerald, R. Gupta, W. Hamza,
[70] Y. Wang, H. Ivison, P. Dasigi, J. Hessel, T. Khot, K. R. Chandu, H. Khan, C. Peris, S. Rawls, A. Rosenbaum, A. Rumshisky et al.,
D. Wadden, K. MacMillan, N. A. Smith, I. Beltagy et al., “How far can “Alexatm 20b: Few-shot learning using a large-scale multilingual
camels go? exploring the state of instruction tuning on open resources,” seq2seq model,” arXiv preprint arXiv:2208.01448, 2022.
arXiv preprint arXiv:2306.04751, 2023. [90] A. Glaese, N. McAleese, M. Trebacz, J. Aslanides, V. Firoiu,
[71] S. Tworkowski, K. Staniszewski, M. Pacek, Y. Wu, H. Michalewski, T. Ewalds, M. Rauh, L. Weidinger, M. Chadwick, P. Thacker et al.,
and P. Miłoś, “Focused transformer: Contrastive training for context “Improving alignment of dialogue agents via targeted human judge-
scaling,” arXiv preprint arXiv:2307.03170, 2023. ments,” arXiv preprint arXiv:2209.14375, 2022.
[72] D. Mahan, R. Carlow, L. Castricato, N. Cooper, [91] A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski,
and C. Laforte, “Stable beluga models.” [Online]. V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo et al.,
“Solving quantitative reasoning problems with language models,” [113] E. Nijkamp, H. Hayashi, C. Xiong, S. Savarese, and Y. Zhou,
Advances in Neural Information Processing Systems, vol. 35, pp. “Codegen2: Lessons for training llms on programming and natural
3843–3857, 2022. languages,” arXiv preprint arXiv:2305.02309, 2023.
[92] Y. Tay, M. Dehghani, V. Q. Tran, X. Garcia, D. Bahri, T. Schuster, [114] L. Tunstall, E. Beeching, N. Lambert, N. Rajani, K. Rasul, Y. Belkada,
H. S. Zheng, N. Houlsby, and D. Metzler, “Unifying language learning S. Huang, L. von Werra, C. Fourrier, N. Habib et al., “Zephyr: Direct
paradigms,” arXiv preprint arXiv:2205.05131, 2022. distillation of lm alignment,” arXiv preprint arXiv:2310.16944, 2023.
[93] T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, [115] X. team. Grok. [Online]. Available: https://grok.x.ai/
R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé et al., “Bloom: A 176b-
[116] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou,
parameter open-access multilingual language model,” arXiv preprint
and J. Zhou, “Qwen-vl: A frontier large vision-language model with
arXiv:2211.05100, 2022.
versatile abilities,” arXiv preprint arXiv:2308.12966, 2023.
[94] A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu,
[117] mixtral. mixtral. [Online]. Available: https://mistral.ai/news/
W. Zheng, X. Xia et al., “Glm-130b: An open bilingual pre-trained
mixtral-of-experts/
model,” arXiv preprint arXiv:2210.02414, 2022.
[95] S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O’Brien, [118] D. Wang, N. Raman, M. Sibue, Z. Ma, P. Babkin, S. Kaur, Y. Pei,
E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff et al., A. Nourbakhsh, and X. Liu, “Docllm: A layout-aware generative
“Pythia: A suite for analyzing large language models across train- language model for multimodal document understanding,” 2023.
ing and scaling,” in International Conference on Machine Learning. [119] D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi,
PMLR, 2023, pp. 2397–2430. Y. Wu, Y. K. Li, F. Luo, Y. Xiong, and W. Liang, “Deepseek-coder:
[96] S. Mukherjee, A. Mitra, G. Jawahar, S. Agarwal, H. Palangi, and When the large language model meets programming – the rise of code
A. Awadallah, “Orca: Progressive learning from complex explanation intelligence,” 2024.
traces of gpt-4,” arXiv preprint arXiv:2306.02707, 2023. [120] F. Wan, X. Huang, D. Cai, X. Quan, W. Bi, and S. Shi, “Knowledge
[97] R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, fusion of large language models,” 2024.
M. Marone, C. Akiki, J. Li, J. Chim et al., “Starcoder: may the source [121] P. Zhang, G. Zeng, T. Wang, and W. Lu, “Tinyllama: An open-source
be with you!” arXiv preprint arXiv:2305.06161, 2023. small language model,” 2024.
[98] S. Huang, L. Dong, W. Wang, Y. Hao, S. Singhal, S. Ma, T. Lv, [122] C. Wu, Y. Gan, Y. Ge, Z. Lu, J. Wang, Y. Feng, P. Luo, and Y. Shan,
L. Cui, O. K. Mohammed, Q. Liu et al., “Language is not all you “Llama pro: Progressive llama with block expansion,” 2024.
need: Aligning perception with language models,” arXiv preprint
[123] X. Amatriain, A. Sankar, J. Bing, P. K. Bodigutla, T. J. Hazen, and
arXiv:2302.14045, 2023.
M. Kazi, “Transformer models: an introduction and catalog,” 2023.
[99] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut,
J. Schalkwyk, A. M. Dai, A. Hauth et al., “Gemini: a family of highly [124] G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, A. Cappelli,
capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023. H. Alobeidli, B. Pannier, E. Almazrouei, and J. Launay, “The refined-
web dataset for falcon llm: outperforming curated corpora with web
[100] W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, data, and web data only,” arXiv preprint arXiv:2306.01116, 2023.
J. Tompson, I. Mordatch, Y. Chebotar et al., “Inner monologue:
Embodied reasoning through planning with language models,” arXiv [125] D. Hernandez, T. Brown, T. Conerly, N. DasSarma, D. Drain, S. El-
preprint arXiv:2207.05608, 2022. Showk, N. Elhage, Z. Hatfield-Dodds, T. Henighan, T. Hume et al.,
“Scaling laws and interpretability of learning from repeated data,”
[101] S. Smith, M. Patwary, B. Norick, P. LeGresley, S. Rajbhandari, arXiv preprint arXiv:2205.10487, 2022.
J. Casper, Z. Liu, S. Prabhumoye, G. Zerveas, V. Korthikanti
et al., “Using deepspeed and megatron to train megatron-turing [126] P. Shaw, J. Uszkoreit, and A. Vaswani, “Self-attention with relative
nlg 530b, a large-scale generative language model,” arXiv preprint position representations,” arXiv preprint arXiv:1803.02155, 2018.
arXiv:2201.11990, 2022. [127] J. Su, Y. Lu, S. Pan, B. Wen, and Y. Liu, “Roformer: En-
[102] I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long- hanced transformer with rotary position embedding,” arXiv preprint
document transformer,” arXiv preprint arXiv:2004.05150, 2020. arXiv:2104.09864, 2021.
[103] S. Iyer, X. V. Lin, R. Pasunuru, T. Mihaylov, D. Simig, P. Yu, K. Shus- [128] O. Press, N. A. Smith, and M. Lewis, “Train short, test long: Attention
ter, T. Wang, Q. Liu, P. S. Koura et al., “Opt-iml: Scaling language with linear biases enables input length extrapolation,” arXiv preprint
model instruction meta learning through the lens of generalization,” arXiv:2108.12409, 2021.
arXiv preprint arXiv:2212.12017, 2022. [129] G. Ke, D. He, and T.-Y. Liu, “Rethinking positional encoding in
[104] Y. Hao, H. Song, L. Dong, S. Huang, Z. Chi, W. Wang, S. Ma, language pre-training,” arXiv preprint arXiv:2006.15595, 2020.
and F. Wei, “Language models are general-purpose interfaces,” arXiv [130] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton,
preprint arXiv:2206.06336, 2022. and J. Dean, “Outrageously large neural networks: The sparsely-gated
[105] Z. Sun, Y. Shen, Q. Zhou, H. Zhang, Z. Chen, D. Cox, Y. Yang, mixture-of-experts layer,” arXiv preprint arXiv:1701.06538, 2017.
and C. Gan, “Principle-driven self-alignment of language mod- [131] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling
els from scratch with minimal human supervision,” arXiv preprint to trillion parameter models with simple and efficient sparsity,” The
arXiv:2305.03047, 2023. Journal of Machine Learning Research, vol. 23, no. 1, pp. 5232–5270,
[106] W. E. team, “Palmyra-base Parameter Autoregressive Language 2022.
Model,” https://dev.writer.com, 2023. [132] R. K. Mahabadi, S. Ruder, M. Dehghani, and J. Henderson,
[107] ——, “Camel-5b instructgpt,” https://dev.writer.com, 2023. “Parameter-efficient multi-task fine-tuning for transformers via shared
[108] Yandex. Yalm. [Online]. Available: https://github.com/yandex/ hypernetworks,” 2021.
YaLM-100B [133] S. Zhang, L. Dong, X. Li, S. Zhang, X. Sun, S. Wang, J. Li, R. Hu,
[109] M. Team et al., “Introducing mpt-7b: a new standard for open-source, T. Zhang, F. Wu, and G. Wang, “Instruction tuning for large language
commercially usable llms,” 2023. models: A survey,” 2023.
[110] A. Mitra, L. D. Corro, S. Mahajan, A. Codas, C. Simoes, S. Agarwal, [134] S. Mishra, D. Khashabi, C. Baral, and H. Hajishirzi, “Cross-task
X. Chen, A. Razdaibiedina, E. Jones, K. Aggarwal, H. Palangi, generalization via natural language crowdsourcing instructions,” arXiv
G. Zheng, C. Rosset, H. Khanpour, and A. Awadallah, “Orca 2: preprint arXiv:2104.08773, 2021.
Teaching small language models how to reason,” 2023. [135] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi,
[111] L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and and H. Hajishirzi, “Self-instruct: Aligning language model with self
G. Neubig, “Pal: Program-aided language models,” in International generated instructions,” arXiv preprint arXiv:2212.10560, 2022.
Conference on Machine Learning. PMLR, 2023, pp. 10 764–10 799. [136] K. Ethayarajh, W. Xu, D. Jurafsky, and D. Kiela. Kto. [Online].
[112] Anthropic. claude. [Online]. Available: https://www.anthropic.com/ Available: https://github.com/ContextualAI/HALOs/blob/main/assets/
news/introducing-claude report.pdf
[137] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and Association for Computational Linguistics, vol. 10, pp. 1066–1083,
D. Amodei, “Deep reinforcement learning from human preferences,” 2022. [Online]. Available: https://aclanthology.org/2022.tacl-1.62
Advances in neural information processing systems, vol. 30, 2017. [154] S. Santhanam, B. Hedayatnia, S. Gella, A. Padmakumar, S. Kim,
[138] H. Lee, S. Phatale, H. Mansoor, K. Lu, T. Mesnard, C. Bishop, V. Car- Y. Liu, and D. Z. Hakkani-Tür, “Rome was built in 1776: A case study
bune, and A. Rastogi, “Rlaif: Scaling reinforcement learning from on factual correctness in knowledge-grounded response generation,”
human feedback with ai feedback,” arXiv preprint arXiv:2309.00267, ArXiv, vol. abs/2110.05456, 2021.
2023. [155] S. Min, K. Krishna, X. Lyu, M. Lewis, W. tau Yih, P. W. Koh, M. Iyyer,
[139] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and L. Zettlemoyer, and H. Hajishirzi, “Factscore: Fine-grained atomic
C. Finn, “Direct preference optimization: Your language model is evaluation of factual precision in long form text generation,” 2023.
secretly a reward model,” arXiv preprint arXiv:2305.18290, 2023. [156] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner,
[140] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, “Zero: Memory V. Chaudhary, and M. Young, “Machine learning: The high interest
optimizations toward training trillion parameter models,” in SC20: In- credit card of technical debt,” in SE4ML: Software Engineering for
ternational Conference for High Performance Computing, Networking, Machine Learning (NIPS 2014 Workshop), 2014.
Storage and Analysis. IEEE, 2020, pp. 1–16. [157] Z. Zhang, A. Zhang, M. Li, and A. Smola, “Automatic chain of thought
[141] B. Peng, E. Alcaide, Q. Anthony, A. Albalak, S. Arcadinho, H. Cao, prompting in large language models,” 2022.
X. Cheng, M. Chung, M. Grella, K. K. GV et al., “Rwkv: Reinventing [158] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and
rnns for the transformer era,” arXiv preprint arXiv:2305.13048, 2023. K. Narasimhan, “Tree of thoughts: Deliberate problem solving with
[142] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, large language models,” 2023.
and W. Chen, “Lora: Low-rank adaptation of large language models,” [159] P. Manakul, A. Liusie, and M. J. F. Gales, “Selfcheckgpt: Zero-
arXiv preprint arXiv:2106.09685, 2021. resource black-box hallucination detection for generative large lan-
[143] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a guage models,” 2023.
neural network,” arXiv preprint arXiv:1503.02531, 2015. [160] N. Shinn, F. Cassano, E. Berman, A. Gopinath, K. Narasimhan,
[144] J. Gou, B. Yu, S. J. Maybank, and D. Tao, “Knowledge distillation: and S. Yao, “Reflexion: Language agents with verbal reinforcement
A survey,” International Journal of Computer Vision, vol. 129, pp. learning,” 2023.
1789–1819, 2021. [161] S. J. Zhang, S. Florin, A. N. Lee, E. Niknafs, A. Marginean, A. Wang,
[145] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. K. Tyser, Z. Chin, Y. Hicke, N. Singh, M. Udell, Y. Kim, T. Buonassisi,
Bang, A. Madotto, and P. Fung, “Survey of hallucination in natural A. Solar-Lezama, and I. Drori, “Exploring the mit mathematics and
language generation,” ACM Comput. Surv., vol. 55, no. 12, mar 2023. eecs curriculum using large language models,” 2023.
[Online]. Available: https://doi.org/10.1145/3571730 [162] T. Wu, E. Jiang, A. Donsbach, J. Gray, A. Molina, M. Terry, and C. J.
[146] N. McKenna, T. Li, L. Cheng, M. J. Hosseini, M. Johnson, and Cai, “Promptchainer: Chaining large language model prompts through
M. Steedman, “Sources of hallucination by large language models on visual programming,” 2022.
inference tasks,” 2023. [163] Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis, H. Chan, and
[147] C.-Y. Lin, “ROUGE: A package for automatic evaluation of J. Ba, “Large language models are human-level prompt engineers,”
summaries,” in Text Summarization Branches Out. Barcelona, Spain: 2023.
Association for Computational Linguistics, Jul. 2004, pp. 74–81. [164] P. S. H. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin,
[Online]. Available: https://aclanthology.org/W04-1013 N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and
[148] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for D. Kiela, “Retrieval-augmented generation for knowledge-intensive
automatic evaluation of machine translation,” in Proceedings of the NLP tasks,” CoRR, vol. abs/2005.11401, 2020. [Online]. Available:
40th Annual Meeting of the Association for Computational Linguistics, https://arxiv.org/abs/2005.11401
P. Isabelle, E. Charniak, and D. Lin, Eds. Philadelphia, Pennsylvania, [165] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, and
USA: Association for Computational Linguistics, Jul. 2002, pp. 311– H. Wang, “Retrieval-augmented generation for large language models:
318. [Online]. Available: https://aclanthology.org/P02-1040 A survey,” arXiv preprint arXiv:2312.10997, 2023.
[149] B. Dhingra, M. Faruqui, A. Parikh, M.-W. Chang, D. Das, and [166] A. W. Services. (Year of publication, e.g., 2023) Question answering
W. Cohen, “Handling divergent reference texts when evaluating using retrieval augmented generation with foundation models in
table-to-text generation,” in Proceedings of the 57th Annual Meeting amazon sagemaker jumpstart. Accessed: Date of access, e.g.,
of the Association for Computational Linguistics, A. Korhonen, December 5, 2023. [Online]. Available: https://shorturl.at/dSV47
D. Traum, and L. Màrquez, Eds. Florence, Italy: Association [167] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, and X. Wu, “Unifying large
for Computational Linguistics, Jul. 2019, pp. 4884–4895. [Online]. language models and knowledge graphs: A roadmap,” arXiv preprint
Available: https://aclanthology.org/P19-1483 arXiv:2306.08302, 2023.
[150] Z. Wang, X. Wang, B. An, D. Yu, and C. Chen, “Towards faithful [168] Z. Jiang, F. F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi-Yu, Y. Yang,
neural table-to-text generation with content-matching constraints,” J. Callan, and G. Neubig, “Active retrieval augmented generation,”
in Proceedings of the 58th Annual Meeting of the Association 2023.
for Computational Linguistics, D. Jurafsky, J. Chai, N. Schluter,
[169] T. Schick, J. Dwivedi-Yu, R. Dessı̀, R. Raileanu, M. Lomeli, L. Zettle-
and J. Tetreault, Eds. Online: Association for Computational
moyer, N. Cancedda, and T. Scialom, “Toolformer: Language models
Linguistics, Jul. 2020, pp. 1072–1086. [Online]. Available: https:
can teach themselves to use tools,” 2023.
//aclanthology.org/2020.acl-main.101
[170] B. Paranjape, S. Lundberg, S. Singh, H. Hajishirzi, L. Zettlemoyer,
[151] H. Song, W.-N. Zhang, J. Hu, and T. Liu, “Generating persona consis-
and M. T. Ribeiro, “Art: Automatic multi-step reasoning and tool-use
tent dialogues by exploiting natural language inference,” Proceedings
for large language models,” 2023.
of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, pp.
8878–8885, Apr. 2020. [171] Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and Y. Zhuang, “Hugginggpt:
Solving ai tasks with chatgpt and its friends in huggingface,” arXiv
[152] O. Honovich, L. Choshen, R. Aharoni, E. Neeman, I. Szpektor, preprint arXiv:2303.17580, 2023.
and O. Abend, “q 2 : Evaluating factual consistency in knowledge-
grounded dialogues via question generation and question answering,” [172] Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang, J. Wang,
in Proceedings of the 2021 Conference on Empirical Methods in S. Jin, E. Zhou et al., “The rise and potential of large language model
Natural Language Processing, M.-F. Moens, X. Huang, L. Specia, based agents: A survey,” arXiv preprint arXiv:2309.07864, 2023.
and S. W.-t. Yih, Eds. Online and Punta Cana, Dominican Republic: [173] L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen,
Association for Computational Linguistics, Nov. 2021, pp. 7856–7870. J. Tang, X. Chen, Y. Lin et al., “A survey on large language model
[Online]. Available: https://aclanthology.org/2021.emnlp-main.619 based autonomous agents,” arXiv preprint arXiv:2308.11432, 2023.
[153] N. Dziri, H. Rashkin, T. Linzen, and D. Reitter, “Evaluating attribution [174] Z. Durante, Q. Huang, N. Wake, R. Gong, J. S. Park, B. Sarkar,
in dialogue systems: The BEGIN benchmark,” Transactions of the R. Taori, Y. Noda, D. Terzopoulos, Y. Choi, K. Ikeuchi, H. Vo, L. Fei-
Fei, and J. Gao, “Agent ai: Surveying the horizons of multimodal CoRR, vol. abs/2110.14168, 2021. [Online]. Available: https:
interaction,” arXiv preprint arXiv:2401.03568, 2024. //arxiv.org/abs/2110.14168
[175] B. Xu, Z. Peng, B. Lei, S. Mukherjee, Y. Liu, and D. Xu, “Rewoo: [191] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang,
Decoupling reasoning from observations for efficient augmented lan- D. Song, and J. Steinhardt, “Measuring mathematical problem solving
guage models,” 2023. with the MATH dataset,” CoRR, vol. abs/2103.03874, 2021. [Online].
[176] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, Available: https://arxiv.org/abs/2103.03874
“React: Synergizing reasoning and acting in language models,” 2023. [192] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi, “Hellaswag:
[177] V. Nair, E. Schumacher, G. Tso, and A. Kannan, “Dera: Enhanc- Can a machine really finish your sentence?” 2019.
ing large language model completions with dialog-enabled resolving [193] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick,
agents,” 2023. and O. Tafjord, “Think you have solved question answering? try
[178] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, arc, the AI2 reasoning challenge,” CoRR, vol. abs/1803.05457, 2018.
C. Wang, Y. Wang, W. Ye, Y. Zhang, Y. Chang, P. S. Yu, Q. Yang, [Online]. Available: http://arxiv.org/abs/1803.05457
and X. Xie, “A survey on evaluation of large language models,” 2023. [194] Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi, “PIQA:
[179] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, reasoning about physical commonsense in natural language,” CoRR,
C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, vol. abs/1911.11641, 2019. [Online]. Available: http://arxiv.org/abs/
L. Jones, M. Kelcey, M.-W. Chang, A. M. Dai, J. Uszkoreit, 1911.11641
Q. Le, and S. Petrov, “Natural questions: A benchmark for [195] M. Sap, H. Rashkin, D. Chen, R. L. Bras, and Y. Choi, “Socialiqa:
question answering research,” Transactions of the Association for Commonsense reasoning about social interactions,” CoRR, vol.
Computational Linguistics, vol. 7, pp. 452–466, 2019. [Online]. abs/1904.09728, 2019. [Online]. Available: http://arxiv.org/abs/1904.
Available: https://aclanthology.org/Q19-1026 09728
[180] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and [196] T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal, “Can a suit of
J. Steinhardt, “Measuring massive multitask language understanding,” armor conduct electricity? A new dataset for open book question
2021. answering,” CoRR, vol. abs/1809.02789, 2018. [Online]. Available:
[181] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, http://arxiv.org/abs/1809.02789
E. Jiang, C. Cai, M. Terry, Q. Le et al., “Program synthesis with large [197] S. Lin, J. Hilton, and O. Evans, “Truthfulqa: Measuring how models
language models,” arXiv preprint arXiv:2108.07732, 2021. mimic human falsehoods,” arXiv preprint arXiv:2109.07958, 2021.
[182] E. Choi, H. He, M. Iyyer, M. Yatskar, W.-t. Yih, Y. Choi, P. Liang, [198] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov,
and L. Zettlemoyer, “QuAC: Question answering in context,” in and C. D. Manning, “Hotpotqa: A dataset for diverse, explainable
Proceedings of the 2018 Conference on Empirical Methods in Natural multi-hop question answering,” CoRR, vol. abs/1809.09600, 2018.
Language Processing, E. Riloff, D. Chiang, J. Hockenmaier, and [Online]. Available: http://arxiv.org/abs/1809.09600
J. Tsujii, Eds. Brussels, Belgium: Association for Computational
Linguistics, Oct.-Nov. 2018, pp. 2174–2184. [Online]. Available: [199] Y. Zhuang, Y. Yu, K. Wang, H. Sun, and C. Zhang, “Toolqa: A
https://aclanthology.org/D18-1241 dataset for llm question answering with external tools,” arXiv preprint
arXiv:2306.13304, 2023.
[183] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo,
C. Burns, S. Puranik, H. He, D. Song, and J. Steinhardt, “Measuring [200] D. Chen, J. Bolton, and C. D. Manning, “A thorough examination
coding challenge competence with apps,” NeurIPS, 2021. of the cnn/daily mail reading comprehension task,” in Association for
Computational Linguistics (ACL), 2016.
[184] V. Zhong, C. Xiong, and R. Socher, “Seq2sql: Generating structured
queries from natural language using reinforcement learning,” arXiv [201] R. Nallapati, B. Zhou, C. Gulcehre, B. Xiang et al., “Abstractive text
preprint arXiv:1709.00103, 2017. summarization using sequence-to-sequence rnns and beyond,” arXiv
preprint arXiv:1602.06023, 2016.
[185] M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer, “TriviaQA:
A large scale distantly supervised challenge dataset for reading [202] Y. Bai and D. Z. Wang, “More than reading comprehension: A survey
comprehension,” in Proceedings of the 55th Annual Meeting of the on datasets and metrics of textual question answering,” arXiv preprint
Association for Computational Linguistics (Volume 1: Long Papers), arXiv:2109.12264, 2021.
R. Barzilay and M.-Y. Kan, Eds. Vancouver, Canada: Association [203] H.-Y. Huang, E. Choi, and W.-t. Yih, “Flowqa: Grasping flow in
for Computational Linguistics, Jul. 2017, pp. 1601–1611. [Online]. history for conversational machine comprehension,” arXiv preprint
Available: https://aclanthology.org/P17-1147 arXiv:1810.06683, 2018.
[186] G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy, “RACE: Large-scale [204] S. Lee, J. Lee, H. Moon, C. Park, J. Seo, S. Eo, S. Koo, and H. Lim, “A
ReAding comprehension dataset from examinations,” in Proceedings survey on evaluation metrics for machine translation,” Mathematics,
of the 2017 Conference on Empirical Methods in Natural Language vol. 11, no. 4, p. 1006, 2023.
Processing, M. Palmer, R. Hwa, and S. Riedel, Eds. Copenhagen,
[205] J. Li, X. Cheng, W. X. Zhao, J.-Y. Nie, and J.-R. Wen, “Halueval:
Denmark: Association for Computational Linguistics, Sep. 2017, pp.
A large-scale hallucination evaluation benchmark for large language
785–794. [Online]. Available: https://aclanthology.org/D17-1082
models,” in Proceedings of the 2023 Conference on Empirical Methods
[187] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “SQuAD: 100,000+ in Natural Language Processing, 2023, pp. 6449–6464.
questions for machine comprehension of text,” in Proceedings of
[206] Simon Mark Hughes, “Hughes hallucination evaluation model
the 2016 Conference on Empirical Methods in Natural Language
(hhem) leaderboard,” 2024, https://huggingface.co/spaces/vectara/
Processing, J. Su, K. Duh, and X. Carreras, Eds. Austin, Texas:
Hallucination-evaluation-leaderboard, Last accessed on 2024-01-21.
Association for Computational Linguistics, Nov. 2016, pp. 2383–2392.
[Online]. Available: https://aclanthology.org/D16-1264 [207] S. Gunasekar, Y. Zhang, J. Aneja, C. C. T. Mendes, A. Del Giorno,
[188] C. Clark, K. Lee, M. Chang, T. Kwiatkowski, M. Collins, and S. Gopi, M. Javaheripi, P. Kauffmann, G. de Rosa, O. Saarikivi et al.,
K. Toutanova, “Boolq: Exploring the surprising difficulty of natural “Textbooks are all you need,” arXiv preprint arXiv:2306.11644, 2023.
yes/no questions,” CoRR, vol. abs/1905.10044, 2019. [Online]. [208] Y. Li, S. Bubeck, R. Eldan, A. Del Giorno, S. Gunasekar, and Y. T.
Available: http://arxiv.org/abs/1905.10044 Lee, “Textbooks are all you need ii: phi-1.5 technical report,” arXiv
[189] D. Khashabi, S. Chaturvedi, M. Roth, S. Upadhyay, and D. Roth, preprint arXiv:2309.05463, 2023.
“Looking beyond the surface:a challenge set for reading compre- [209] M. Poli, S. Massaroli, E. Nguyen, D. Y. Fu, T. Dao, S. Baccus,
hension over multiple sentences,” in Proceedings of North American Y. Bengio, S. Ermon, and C. Ré, “Hyena hierarchy: Towards larger
Chapter of the Association for Computational Linguistics (NAACL), convolutional language models,” 2023.
2018. [210] M. Poli, J. Wang, S. Massaroli, J. Quesnelle, E. Nguyen, and
[190] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, A. Thomas, “StripedHyena: Moving Beyond Transformers with
M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and Hybrid Signal Processing Models,” 12 2023. [Online]. Available:
J. Schulman, “Training verifiers to solve math word problems,” https://github.com/togethercomputer/stripedhyena
[211] D. Y. Fu, S. Arora, J. Grogan, I. Johnson, S. Eyuboglu, A. W. Thomas, A PPENDIX
B. Spector, M. Poli, A. Rudra, and C. Ré, “Monarch mixer: A simple
sub-quadratic gemm-based architecture,” 2023. 1. Open Source Toolkits For LLM Development and
[212] G. J. McLachlan, S. X. Lee, and S. I. Rathnayake, “Finite mixture Deployment
models,” Annual review of statistics and its application, vol. 6, pp.
355–378, 2019. There are various frameworks and libraries developed for
[213] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” arXiv LLM training, evaluation, and deployment, and covering every
preprint arXiv:2304.08485, 2023. single framework is out of this paper’s scope. But we try to
[214] S. Liu, H. Cheng, H. Liu, H. Zhang, F. Li, T. Ren, X. Zou, provide a brief introduction of some of the most popular ones,
J. Yang, H. Su, J. Zhu, L. Zhang, J. Gao, and C. Li, “Llava-plus: grouped into different categories.
Learning to use tools for creating multimodal agents,” arXiv preprint
arXiv:2311.05437, 2023.
[215] S. Wu, H. Fei, L. Qu, W. Ji, and T.-S. Chua, “Next-gpt: Any-to-any
A. LLM Training/Inference Frameworks
multimodal llm,” arXiv preprint arXiv:2309.05519, 2023. Some of the popular frameworks which are useful for LLM
[216] N. N. Khasmakhi, M. Asgari-Chenaghlu, N. Asghar, P. Schaer, and training includes (note that some of them can be used beyond
D. Zühlke, “Convgenvismo: Evaluation of conversational generative
vision models,” 2023.
LLM training too):
[217] L. Sun, Y. Huang, H. Wang, S. Wu, Q. Zhang, C. Gao, Y. Huang, DeepSpeed [218] is a deep learning optimization library
W. Lyu, Y. Zhang, X. Li et al., “Trustllm: Trustworthiness in large that makes distributed training and inference easy, efficient,
language models,” arXiv preprint arXiv:2401.05561, 2024.
and effective. DeepSpeed enables world’s most powerful lan-
[218] Microsoft. Deepspeed. [Online]. Available: https://github.com/
microsoft/DeepSpeed
guage models like MT-530B and BLOOM. It is an easy-
to-use deep learning optimization software suite that powers
[219] HuggingFace. Transformers. [Online]. Available: https://github.com/
huggingface/transformers unprecedented scale and speed for both training and inference.
[220] Nvidia. Megatron. [Online]. Available: https://github.com/NVIDIA/
With DeepSpeed you can:
Megatron-LM Transformers [219] is library by HuggingFace which
[221] BMTrain. Bmtrain. [Online]. Available: https://github.com/OpenBMB/ provides thousands of pretrained models to perform tasks on
BMTrain
different modalities such as text, vision, and audio. Using
[222] EleutherAI. gpt-neox. [Online]. Available: https://github.com/ pretrained models one can reduce compute costs, carbon
EleutherAI/gpt-neox
footprint, and save the time and resources required to train
[223] microsoft. Lora. [Online]. Available: https://github.com/microsoft/
LoRA a model from scratch.
[224] ColossalAI. Colossalai. [Online]. Available: https://github.com/ Megatron-LM [220] is a large, powerful transformer
hpcaitech/ColossalAI developed by the Applied Deep Learning Research team
[225] FastChat. Fastchat. [Online]. Available: https://github.com/lm-sys/ at NVIDIA. It contains efficient, model-parallel (tensor, se-
FastChat
quence, and pipeline), and multi-node pre-training of trans-
[226] skypilot. skypilot. [Online]. Available: https://github.com/skypilot-org/ former based models such as GPT, BERT, and T5 using mixed
skypilot
precision.
[227] vllm. vllm. [Online]. Available: https://github.com/vllm-project/vllm
[228] huggingface. text-generation-inference. [Online]. Available: https: BMTrain [221] is an efficient large model training toolkit
//github.com/huggingface/text-generation-inference that can be used to train large models with tens of billions of
[229] langchain. langchain. [Online]. Available: https://github.com/ parameters. It can train models in a distributed manner while
langchain-ai/langchain keeping the code as simple as stand-alone training.
[230] bentoml. Openllm. [Online]. Available: https://github.com/bentoml/
OpenLLM GPT-NeoX [222] leverages many of the same features and
[231] embedchain. embedchain. [Online]. Available: https://github.com/ technologies as the popular Megatron-DeepSpeed library but
embedchain/embedchain with substantially increased usability and novel optimizations.
[232] microsoft. autogen. [Online]. Available: https://github.com/microsoft/
autogen LoRA [223] library provides the support for Low-Rank
[233] babyagi. babyagi. [Online]. Available: https://github.com/
Adaptation of Large Language Models. It reduces the number
yoheinakajima/babyagi of trainable parameters by learning pairs of rank-decompostion
[234] guidance. guidance. [Online]. Available: https://github.com/ matrices while freezing the original weights. This vastly
guidance-ai/guidance reduces the storage requirement for large language models
[235] prompttools. prompttools. [Online]. Available: https://github.com/ adapted to specific tasks and enables efficient task-switching
hegelai/prompttools during deployment all without introducing inference latency.
[236] promptfoo. promptfoo. [Online]. Available: https://github.com/ LoRA also outperforms several other adaptation methods in-
promptfoo/promptfoo cluding adapter, prefix-tuning, and fine-tuning.
[237] facebook. faiss. [Online]. Available: https://github.com/
facebookresearch/faiss ColossalAI library [224] provides a collection of parallel
[238] milvus. milvus. [Online]. Available: https://github.com/milvus-io/ components. It aims to support developers to write their
milvus distributed deep learning models just like how they write their
[239] qdrant. qdrant. [Online]. Available: https://github.com/qdrant/qdrant model on their laptop. They provide user-friendly tools to
[240] weaviate. weaviate. [Online]. Available: https://github.com/weaviate/ kickstart distributed training and inference in a few lines. In
weaviate terms of Parallelism strategies, they support: Data Parallelism,
[241] llama index. llama-index. [Online]. Available: https://github.com/ Pipeline Parallelism, Sequence Parallelism, Zero Redundancy
run-llama/llama index Optimizer (ZeRO) [140], and Auto-Parallelism.
B. Deployment Tools C. Prompting Libraries
We provide an overview of some of the most popular LLM Guidance [234] is a programming paradigm that offers
deployment tools here. superior control and efficiency compared to conventional
FastChat [225] is an open platform for training, serv- prompting and chaining. It allows users to constrain generation
ing, and evaluating large language model based chatbots. (e.g. with regex and CFGs) as well as to interleave control
FastChat’s core features include: The training and evaluation (conditional, loops) and generation seamlessly.
code for state-of-the-art models (e.g., Vicuna, MT-Bench), and PromptTools [235] offers a set of open-source, self-
a distributed multi-model serving system with web UI and hostable tools for experimenting with, testing, and evaluating
OpenAI-compatible RESTful APIs. LLMs, vector databases, and prompts. The core idea is to
Skypilot [226] is a framework for running LLMs, AI, enable developers to evaluate using familiar interfaces like
and batch jobs on any cloud, offering maximum cost savings, code, notebooks, and a local playground.
highest GPU availability, and managed execution. PromptBench [?] is a Pytorch-based Python package for
vLLM [227] is a fast and easy-to-use library for LLM in- Evaluation of Large Language Models (LLMs). It provides
ference and serving. vLLM seamlessly supports many Hugging user-friendly APIs for researchers to conduct evaluation on
Face models, including the following architectures: Aquila, LLMs.
Baichuan, BLOOM, ChatGLM, DeciLM, Falcon, GPT Big- Promptfoo [236] is a tool for testing and evaluating LLM
Code, LLaMA, LLaMA 2, Mistral, Mixtral, MPT, OPT, Qwen, output quality. It systematically test prompts, models, and
Yi, and many more. RAGs with predefined test cases.
text-generation-inference [228] is a toolkit for deploying
and serving Large Language Models (LLMs). TGI enables D. VectorDB
high-performance text generation for the most popular open-
Faiss [237] is a library developed by Facebook AI Re-
source LLMs, including Llama, Falcon, StarCoder, BLOOM,
search that provides efficient similarity search and clustering
GPT-NeoX, and more.
of dense vectors. It is designed for use with large-scale,
LangChain [229] is a framework for developing applica- high-dimensional data and supports several index types and
tions powered by language models. It enables applications that: algorithms for various use cases.
• Are context-aware: connect a language model to Milvus [238] is an open-source vector database built to
sources of context (prompt instructions, few shot ex- power embedding similarity search and AI applications. Mil-
amples, content to ground its response in, etc.) vus makes unstructured data search more accessible, and pro-
vides a consistent user experience regardless of the deployment
• Reason: rely on a language model to reason (about
environment.
how to answer based on provided context, what ac-
tions to take, etc.) Qdrant [239] is a vector similarity search engine and
vector database. It provides a production-ready service with a
OpenLLM [230] is an open-source platform designed to convenient API to store, search, and manage points—vectors
facilitate the deployment and operation of large language mod- with an additional payload Qdrant is tailored to extended
els (LLMs) in real-world applications. With OpenLLM, you filtering support. environment.
can run inference on any open-source LLM, deploy them on
the cloud or on-premises, and build powerful AI applications. Weaviate [240] is an open-source, GraphQL-based vec-
tor search engine that enables similarity search on high-
Embedchain [231] is an Open Source RAG Framework
dimensional data. While it is open-source, the commercial ver-
that makes it easy to create and deploy AI apps. Embedchain
sion offers additional features, support, and managed services.
streamlines the creation of RAG applications, offering a seam-
less process for managing various types of unstructured data. Some of the other popular options includes LlamaIndex
It efficiently segments data into manageable chunks, generates [241] and Pinecone.
relevant embeddings, and stores them in a vector database for
optimized retrieval.
Autogen [232] is a framework that enables the devel-
opment of LLM applications using multiple agents that can
converse with each other to solve tasks. AutoGen agents
are customizable, conversable, and seamlessly allow human
participation. They can operate in various modes that employ
combinations of LLMs, human inputs, and tools.
BabyAGI [233] is an autonomous Artificial Intelligence
agent, that is designed to generate and execute tasks based on
given objectives. It harnesses cutting-edge technologies from
OpenAI, Pinecone, LangChain, and Chroma to automate tasks
and achieve specific goals. In this blog post, we will dive
into the unique features of BabyAGI and explore how it can
streamline task automation.

Merge and The Strong Minimalist Thesis
No ratings yet
Merge and The Strong Minimalist Thesis
88 pages
CS321 Grosse Lecture Notes
No ratings yet
CS321 Grosse Lecture Notes
169 pages
Complexity Theory, Systems Theory, and Multiple Intersecting Social Inequealities
No ratings yet
Complexity Theory, Systems Theory, and Multiple Intersecting Social Inequealities
23 pages
AI Study Material - Class-9 Unit - 1
43% (7)
AI Study Material - Class-9 Unit - 1
9 pages
Goals of Machine Learning in Artificial Intelligence
No ratings yet
Goals of Machine Learning in Artificial Intelligence
3 pages
A Metaontology For Ontological Engineeri
No ratings yet
A Metaontology For Ontological Engineeri
13 pages
Evolution of Artificial Intelligence: Lee Spector
No ratings yet
Evolution of Artificial Intelligence: Lee Spector
3 pages
First Order Logic: Artificial Intelligence
No ratings yet
First Order Logic: Artificial Intelligence
16 pages
Formal Language Theory
No ratings yet
Formal Language Theory
69 pages
Foundations of Multi Paradigm Modelling PDF
No ratings yet
Foundations of Multi Paradigm Modelling PDF
298 pages
Theoretical Computer Science An Introduction
No ratings yet
Theoretical Computer Science An Introduction
214 pages
Coding Theory: A Bird S Eye View: Continued Block Codes: Basics
No ratings yet
Coding Theory: A Bird S Eye View: Continued Block Codes: Basics
32 pages
Module-5:: Network Analysis
No ratings yet
Module-5:: Network Analysis
22 pages
Sec16 Paper Tramer
No ratings yet
Sec16 Paper Tramer
19 pages
Primitive Recursive Function
No ratings yet
Primitive Recursive Function
29 pages
Organizing and Delivering Impromptu Speech
No ratings yet
Organizing and Delivering Impromptu Speech
13 pages
An Introduction To R: W. N. Venables, D. M. Smith and The R Core Team
No ratings yet
An Introduction To R: W. N. Venables, D. M. Smith and The R Core Team
105 pages
Fields of Logic and Computation Essays Dedicated To Yuri Gurevich On The Occasion of His 70th Birthday PDF
No ratings yet
Fields of Logic and Computation Essays Dedicated To Yuri Gurevich On The Occasion of His 70th Birthday PDF
636 pages
Smullyan
No ratings yet
Smullyan
6 pages
Dirac Notation and Basic Linear Algebra For Quantum Computing
No ratings yet
Dirac Notation and Basic Linear Algebra For Quantum Computing
4 pages
Quantum Mechanics - Study Notes
0% (1)
Quantum Mechanics - Study Notes
18 pages
Concrete Semantics With Isabelle/HOL
No ratings yet
Concrete Semantics With Isabelle/HOL
308 pages
Knowledge Representation First Order Logic
No ratings yet
Knowledge Representation First Order Logic
49 pages
The Story of Life Great Discoveries in Biology 2018046625 9780393631562
No ratings yet
The Story of Life Great Discoveries in Biology 2018046625 9780393631562
625 pages
Pemberton 1993 Modeling Theory and Composing Process Models
No ratings yet
Pemberton 1993 Modeling Theory and Composing Process Models
20 pages
IS 7118 Unit-4 N-Grams
100% (2)
IS 7118 Unit-4 N-Grams
93 pages
Intro
No ratings yet
Intro
11 pages
Quantum Algorithms
No ratings yet
Quantum Algorithms
17 pages
Emergence: A Wikipedia Production
100% (2)
Emergence: A Wikipedia Production
69 pages
2023-24 AhaGuru Brochure
No ratings yet
2023-24 AhaGuru Brochure
8 pages
Knowledge Representation and Inference
No ratings yet
Knowledge Representation and Inference
15 pages
Information Retrieval Data Structures & Algorithms - William B. Frakes
No ratings yet
Information Retrieval Data Structures & Algorithms - William B. Frakes
630 pages
Ontology
No ratings yet
Ontology
7 pages
Knowledge Representation and Reasoning
No ratings yet
Knowledge Representation and Reasoning
52 pages
Eurocentrism in The History and Philosopny of Mathematics
No ratings yet
Eurocentrism in The History and Philosopny of Mathematics
6 pages
Methodologies For Mechatronic Systems Design: Attributes and Popularity
No ratings yet
Methodologies For Mechatronic Systems Design: Attributes and Popularity
11 pages
Social Contract PDF
No ratings yet
Social Contract PDF
66 pages
Digital Ecosystem: The Journey of A Metaphor
No ratings yet
Digital Ecosystem: The Journey of A Metaphor
40 pages
A Brief History of Systems Engineering: (Or, How in The Heck Did We Get Here?)
No ratings yet
A Brief History of Systems Engineering: (Or, How in The Heck Did We Get Here?)
19 pages
Hybrid Neural Network
No ratings yet
Hybrid Neural Network
13 pages
A Review On Large Language Models Architectures Applications Taxonomies Open Issues and Challenges
No ratings yet
A Review On Large Language Models Architectures Applications Taxonomies Open Issues and Challenges
36 pages
CIUnit 3
No ratings yet
CIUnit 3
19 pages
Fundamentals of Generative AI
No ratings yet
Fundamentals of Generative AI
17 pages
Conceptual Framework For Quantitative Text Analysis
No ratings yet
Conceptual Framework For Quantitative Text Analysis
16 pages
Compiler Design - Theory Tools and Examples PDF
No ratings yet
Compiler Design - Theory Tools and Examples PDF
320 pages
Semantic Web: Abstra CT
No ratings yet
Semantic Web: Abstra CT
15 pages
Lecture 05 - Part A First Order Logic (FOL) : Dr. Shazzad Hosain
No ratings yet
Lecture 05 - Part A First Order Logic (FOL) : Dr. Shazzad Hosain
80 pages
Roman Language and Literature
100% (1)
Roman Language and Literature
2 pages
Natural Language Generation For The Semantic Web: Unsupervised Template Extraction
100% (1)
Natural Language Generation For The Semantic Web: Unsupervised Template Extraction
66 pages
Truth About Science PDF
No ratings yet
Truth About Science PDF
11 pages
m8 Fol
No ratings yet
m8 Fol
27 pages
Book
No ratings yet
Book
554 pages
GRE Computer Science Syllabus
100% (1)
GRE Computer Science Syllabus
2 pages
Neural Networks and Fuzzy Logic
From Everand
Neural Networks and Fuzzy Logic
C. Naga Bhaskar
No ratings yet
LLM Survey
100% (1)
LLM Survey
43 pages
paper-1
No ratings yet
paper-1
44 pages
Survey On Large Language Models
No ratings yet
Survey On Large Language Models
52 pages
A Survey of Large Language Models
No ratings yet
A Survey of Large Language Models
58 pages
A Survey Large Language Models
No ratings yet
A Survey Large Language Models
58 pages
Downloed Papers
No ratings yet
Downloed Papers
700 pages
Gensim for Natural Language Processing: Definitive Reference for Developers and Engineers
From Everand
Gensim for Natural Language Processing: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Pre-Trained Models For Natural Language Processing: A Survey
No ratings yet
Pre-Trained Models For Natural Language Processing: A Survey
31 pages
G GENAI I m1 l2 File en 2
No ratings yet
G GENAI I m1 l2 File en 2
2 pages
Unit 123(Nlp)
No ratings yet
Unit 123(Nlp)
3 pages
September Solutions - AI Inference Software and Solutions Catalogue 2023 09 12
No ratings yet
September Solutions - AI Inference Software and Solutions Catalogue 2023 09 12
154 pages
5 NLP Cheat Sheets - Beginners - Professional
No ratings yet
5 NLP Cheat Sheets - Beginners - Professional
6 pages
AI UNIT-1
No ratings yet
AI UNIT-1
5 pages
Super a i Generative a i Narrow a i and Chat Bots
No ratings yet
Super a i Generative a i Narrow a i and Chat Bots
25 pages
Fine-Tuning GPT-3 For Legal Rule Classification
No ratings yet
Fine-Tuning GPT-3 For Legal Rule Classification
20 pages
Rajasthan Technical University
No ratings yet
Rajasthan Technical University
1 page
NLP Assignment-7 Solution
No ratings yet
NLP Assignment-7 Solution
5 pages
Natural Language Processing in Action Understanding analyzing and generating text with Python 1st Edition by Hannes Hapke, Cole Howard, Hobson Lane 1617294632 9781617294631 download
100% (1)
Natural Language Processing in Action Understanding analyzing and generating text with Python 1st Edition by Hannes Hapke, Cole Howard, Hobson Lane 1617294632 9781617294631 download
47 pages
AI and Laww
No ratings yet
AI and Laww
2 pages
The Role of Artificial Intelligence in Revolutionizing Healthcare A Comprehensive Review
No ratings yet
The Role of Artificial Intelligence in Revolutionizing Healthcare A Comprehensive Review
11 pages
Synopsis Final 1 - BE
No ratings yet
Synopsis Final 1 - BE
9 pages
Tamil Langu PDF
No ratings yet
Tamil Langu PDF
338 pages
Applied Neural Networks with TensorFlow 2 API Oriented Deep Learning with Python 1st Edition Orhan Gazi Yalcın Yalçın Orhan - Download the ebook now to never miss important content
100% (2)
Applied Neural Networks with TensorFlow 2 API Oriented Deep Learning with Python 1st Edition Orhan Gazi Yalcın Yalçın Orhan - Download the ebook now to never miss important content
69 pages
10 Ai - 240312 - 113910
No ratings yet
10 Ai - 240312 - 113910
123 pages
Sohail DataScientist
No ratings yet
Sohail DataScientist
3 pages
Resume Analyser: Automated Resume Ranking Software
No ratings yet
Resume Analyser: Automated Resume Ranking Software
6 pages
NLP Lab Tasks
No ratings yet
NLP Lab Tasks
16 pages
Artificial Intelligence Course Descriptor
No ratings yet
Artificial Intelligence Course Descriptor
8 pages
Paper On AI
No ratings yet
Paper On AI
3 pages
MCQ Artificial Intelligence Class 10 Computer Vision
100% (3)
MCQ Artificial Intelligence Class 10 Computer Vision
41 pages
Machine Learning For Absolute Beginners A Plain English Introduction 2 edition Edition Oliver Theobald 2024 Scribd Download
100% (2)
Machine Learning For Absolute Beginners A Plain English Introduction 2 edition Edition Oliver Theobald 2024 Scribd Download
55 pages
Unity AI Game Programming - Second Edition - Sample Chapter
100% (1)
Unity AI Game Programming - Second Edition - Sample Chapter
23 pages
Top Large Language Models (LLMS) Interview Questions & Answers - by Youssef Hosni - Level Up Coding
No ratings yet
Top Large Language Models (LLMS) Interview Questions & Answers - by Youssef Hosni - Level Up Coding
63 pages
Resume-Vaishali Singh
No ratings yet
Resume-Vaishali Singh
1 page
Masterarbeit / Master'S Thesis: Text Analytics For Conceptual Modelling"
No ratings yet
Masterarbeit / Master'S Thesis: Text Analytics For Conceptual Modelling"
87 pages
A Rule-Based Afan Oromo Grammar Checker
No ratings yet
A Rule-Based Afan Oromo Grammar Checker
5 pages