Quick Start Guide to Large Language Models
Strategies and Best Practices for Using ChatGPT and Other LLMs
Sinan Ozdemir
Addison-Wesley
Contents at a Glance
Preface
Part I: Introduction to Large Language Models
1. Overview of Large Language Models
What Are Large Language Models (LLMs)?
Popular Modern LLMs
Domain-Specific LLMs
Applications of LLMs
Summary
2. Launching an Application with Proprietary Models
Introduction
The Task
Solution Overview
The Components
Putting It All Together
The Cost of Closed-Source
Summary
3. Prompt Engineering with GPT-3
Introduction
Prompt Engineering
Working with Prompts Across Models
Building a Q/A bot with ChatGPT
Summary
4. Optimizing LLMs with Customized Fine-Tuning
Introduction
Transfer Learning and Fine-Tuning: A Primer
A Look at the OpenAI Fine-Tuning API
Preparing Custom Examples with the OpenAI CLI
Our First Fine-Tuned LLM!
Case Study 2: Amazon Review Category Classification
Summary
Part II: Getting the Most Out of LLMs
5. Advanced Prompt Engineering
Introduction
Prompt Injection Attacks
Input/Output Validation
Batch Prompting
Prompt Chaining
Chain of Thought Prompting
Re-visiting Few-shot Learning
Testing and Iterative Prompt Development
Conclusion
6. Customizing Embeddings and Model Architectures
Introduction
Case Study – Building a Recommendation System
Conclusion
7. Moving Beyond Foundation Models
Introduction
Case Study—Visual Q/A
Case Study—Reinforcement Learning from Feedback
Conclusion
8. Fine-Tuning Open-Source LLMs
Overview of T5
Building Translation/Summarization Pipelines with T5
9. Deploying Custom LLMs to the Cloud
Overview of Cloud Deployment
Best Practices for Cloud Deployment
Preface
This practical guide to the use of LLMs in NLP provides an overview of the
key concepts and techniques used in LLMs and explains how these models
work and how they can be used for various NLP tasks. The book also
covers advanced topics, such as fine-tuning, alignment, and information
retrieval while providing practical tips and tricks for training and
optimizing LLMs for specific NLP tasks.
This work addresses a wide range of topics in the field of Large Language
Models, including the basics of LLMs, launching an application with
proprietary models, fine-tuning GPT-3 with custom examples, prompt
engineering, building a recommendation engine, combining Transformers,
and deploying custom LLMs to the cloud. It offers an in-depth look at the
various concepts, techniques, and tools used in the field of Large Language
Models.
1
Overview of Large Language Models
Ever since an advanced artificial intelligence (AI) deep learning model called the Transformer
was introduced by a team at Google Brain in 2017, it has become the standard for tackling
various natural language processing (NLP) tasks in academia and industry. It is likely that you
have interacted with a Transformer model today without even realizing it, as Google uses
BERT to enhance its search engine by better understanding users’ search queries. The GPT
family of models from OpenAI have also received attention for their ability to generate
human-like text and images.
These Transformers now power applications such as GitHub’s Copilot (developed by OpenAI
in collaboration with Microsoft), which can convert comments and snippets of code into fully
functioning source code that can even call upon other LLMs (like in Listing 1.1) to perform
NLP tasks.
Listing 1.1 Using the Copilot LLM to get an output from Facebook’s BART LLM
from transformers import pipeline

def classify_text(email):
    """
    Use Facebook's BART model to classify an email into "spam" or "not spam"
    Args:
        email (str): The email to classify
    Returns:
        str: The classification of the email
    """
    # COPILOT START. EVERYTHING BEFORE THIS COMMENT WAS INPUT TO COPILOT
    classifier = pipeline(
        'zero-shot-classification', model='facebook/bart-large-mnli')
    labels = ['spam', 'not spam']
    hypothesis_template = 'This email is {}.'
    results = classifier(
        email, labels, hypothesis_template=hypothesis_template)
    return results['labels'][0]
    # COPILOT END
In this listing, I gave Copilot only a Python function definition and the comments I wrote, and it generated all of the code to make the function do what the comments describe. No cherry-picking here, just a fully working Python function that I can call like this:
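For example (the email string below is my own illustration, not the original example from the text):

classify_text('hi I am spam')  # expected to come back as 'spam'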
It appears we are surrounded by LLMs, but just what are they doing under the hood? Let’s
find out!
Large language models (LLMs) are AI models that are usually (but not necessarily) derived
from the Transformer architecture and are designed to understand and generate human
language, code, and much more. These models are trained on vast amounts of text data,
allowing them to capture the complexities and nuances of human language. LLMs can
perform a wide range of language tasks, from simple text classification to text generation,
with high accuracy, fluency, and style.
In the healthcare industry, LLMs are being used for electronic medical record (EMR)
processing, clinical trial matching, and drug discovery. In finance, LLMs are being utilized
for fraud detection, sentiment analysis of financial news, and even trading strategies. LLMs
are also used for customer service automation via chatbots and virtual assistants. With their
versatility and highly performant natures, Transformer-based LLMs are becoming an
increasingly valuable asset in a variety of industries and applications.
Note
I will use the term understand a fair amount in this text. I am usually referring
to “Natural Language Understanding” (NLU) which is a research branch of
NLP that focuses on developing algorithms and models that can accurately
interpret human language. As we will see, NLU models excel at tasks such as
classification, sentiment analysis, and named entity recognition. However, it is
important to note that while these models can perform complex language
tasks, they do not possess true understanding in the way humans do.
The success of LLMs and Transformers is due to the combination of several ideas. Most of these ideas had been around for years and were being actively researched at the same time. Mechanisms such as attention, transfer learning, and scaling up neural networks, which provide the scaffolding for Transformers, were all seeing breakthroughs right around the same time. Figure 1.1 outlines some of the biggest advancements in NLP in the last few
decades, all leading up to the invention of the Transformer.
Figure 1.1 A brief history of modern NLP highlights, including the use of deep learning to tackle language modeling, advancements in large-scale semantic token embeddings (Word2vec), sequence-to-sequence models with attention (something we will see in more depth later in this chapter), and finally the Transformer in 2017.
The Transformer architecture itself is quite impressive. It can be highly parallelized and scaled in ways that previous state-of-the-art NLP models could not be, allowing it to be trained on much larger datasets and for longer than earlier NLP models. The Transformer uses a
special kind of attention calculation called self-attention to allow each word in a sequence to
“attend to” (look to for context) all other words in the sequence, enabling it to capture long-
range dependencies and contextual relationships between words. Of course, no architecture is
perfect. Transformers are still limited to an input context window, which represents the maximum length of text they can process at any given moment.
Since the advent of the Transformer in 2017, the ecosystem around using and deploying
Transformers has only exploded. The aptly named Transformers library and its supporting packages have made it accessible for practitioners to use, train, and share models, greatly accelerating adoption; the library is now used by thousands of organizations and counting. Popular LLM repositories like Hugging Face have popped up, providing the masses with access to powerful open-source models. In short, using and productionizing a Transformer has never
been easier.
My goal is to guide you on how to use, train, and optimize all kinds of LLMs for practical
applications while giving you just enough insight into the inner workings of the model to
know how to make optimal decisions about model choice, data format, fine-tuning
parameters, and so much more.
My aim is to make using Transformers accessible for software developers, data scientists,
analysts, and hobbyists alike. To do that, we should start on a level playing field and learn a
bit more about LLMs.
Definition of LLMs
To back up only slightly, we should first talk about the specific NLP task that LLMs and Transformers are being used to solve, and which provides the foundation for their ability to solve a multitude of tasks. Language modeling is a subfield of NLP that involves the creation of statistical/deep learning models for predicting the likelihood of a sequence of tokens in a specified vocabulary (a limited and known set of tokens). There are generally two kinds of language modeling tasks out there: autoencoding tasks and autoregressive tasks (Figure 1.2).
Note
The term token refers to the smallest unit of semantic meaning, created by breaking down a sentence or piece of text into smaller units; tokens are the basic inputs for an LLM. Tokens can be words but can also be “sub-words”, as we
will see in more depth throughout this book. Some readers may be familiar
with the term “n-gram” which refers to a sequence of n consecutive tokens.
Figure 1.2 Both the autoencoding and autoregressive language modeling tasks involve filling in a missing token, but only the autoencoding task allows for context to be seen on both sides of the missing token.
Autoregressive language models are trained to predict the next token in a sentence, based
only on the previous tokens in the phrase. These models correspond to the decoder part of the
transformer model, and a mask is applied to the full sentence so that the attention heads can
only see the tokens that came before. Autoregressive models are ideal for text generation and
a good example of this type of model is GPT.
Autoencoding language models are trained to reconstruct the original sentence from a
corrupted version of the input. These models correspond to the encoder part of the
transformer model and have access to the full input without any mask. Autoencoding models
create a bidirectional representation of the whole sentence. They can be fine-tuned for a
variety of tasks such as text generation, but their main application is sentence classification or
token classification. A typical example of this type of model is BERT.
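To make the distinction concrete, here is a minimal sketch using Hugging Face's pipeline API: BERT fills in a masked token using context from both sides (autoencoding), while GPT-2 continues a prompt using only the tokens that came before (autoregressive). The prompts and model choices here are illustrative.

from transformers import pipeline

# Autoencoding: BERT predicts a masked token using context on both sides of it
fill_mask = pipeline('fill-mask', model='bert-base-uncased')
print(fill_mask('The woman went to the store to buy a [MASK] of milk.'))

# Autoregressive: GPT-2 predicts the next tokens using only the preceding context
generator = pipeline('text-generation', model='gpt2')
print(generator('The woman went to the store to buy a', max_new_tokens=10))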
To summarize, Large Language Models (LLMs) are language models that are either
autoregressive, autoencoding, or a combination of the two. Modern LLMs are usually based
on the Transformer architecture (which is what we will use in this book), but they can be based on other architectures. The defining features of LLMs are their large size and large training datasets, which enable them to perform complex language tasks, such as text generation and classification,
with high accuracy and with little to no fine-tuning.
Table 1.1 shows the disk size, memory usage, number of parameters, and approximate size of
the pre-training data for several popular large language models (LLMs). Note that these sizes
are approximate and may vary depending on the specific implementation and hardware used.
But size isn’t everything. Let’s look at some of the key characteristics of LLMs and then dive
into how LLMs learn to read and write.
Key Characteristics of LLMs
The original Transformer architecture is a sequence-to-sequence model with two main components:
An encoder which excels at understanding and processing text by building contextualized representations of the input tokens
A decoder which excels at generating text by using a modified type of attention to predict the next best token
As shown in Figure 1.3, the Transformer has many other sub-components that we won’t get into that promote faster training, generalizability, and better performance. Today’s LLMs are, for the most part, variants of the original Transformer. Models like BERT and GPT keep only the encoder or only the decoder of the Transformer (respectively) in order to build models that excel at understanding and generating text (also respectively).
Figure 1.3 The original Transformer has two main components: an encoder which is great at
understanding text, and a decoder which is great at generating text. Putting them together
makes the entire model a “sequence to sequence” model.
Nearly every modern LLM keeps some combination of these two components, which gives us three broad buckets of models:
Autoregressive models, such as GPT, which keep only the decoder and excel at generating text one token at a time
Autoencoding models, such as BERT, which keep only the encoder and excel at processing and understanding text
Combinations of autoregressive and autoencoding, like T5, which can use the encoder and decoder to be more versatile and flexible in generating text. It has been shown that these combination models can generate more diverse and creative text in different contexts compared to pure decoder-based autoregressive models due to their ability to capture additional context using the encoder.
Figure 1.4 A breakdown of the key characteristics of LLMs based on how they are derived
from the original Transformer architecture.
Figure 1.4 shows the breakdown of the key characteristics of LLMs based on these three
buckets.
No matter how the LLM is constructed and what parts of the Transformer it is using, they all
care about context (Figure 1.5). The goal is to understand each token as it relates to the other
tokens in the input text. Ever since the popularization of Word2vec around 2013, NLP practitioners and researchers have been curious about the best ways of combining semantic meaning (basically, word definitions) and context (from the surrounding tokens) to create the most meaningful token embeddings possible. The Transformer relies on the attention
calculation to make this combination a reality.
Figure 1.5 LLMs are great at understanding context. The word “Python” can have different
meanings depending on the context. We could be talking about a snake, or a pretty cool
coding language.
Choosing what kind of Transformer derivation you want isn’t enough. Just choosing the
encoder doesn’t mean your Transformer is magically good at understanding text. Let’s take a
look at how these LLMs actually learn to read and write.
Pre-training
Every LLM on the market has been pre-trained on a large corpus of text data and on specific
language modeling related tasks. During pre-training, the LLM tries to learn and understand
general language and relationships between words. Every LLM is trained on different corpora
and on different tasks.
BERT, for example, was originally pre-trained on two publicly available text corpora (Figure
1.6):
English Wikipedia - a collection of articles from the English version of Wikipedia, a free online encyclopedia. It contains a range of topics and writing styles, making it a diverse and representative sample of English-language text.
The BookCorpus - a large collection of fiction and non-fiction books totaling roughly 800 million words. It was created by scraping book text from the web and includes a range of genres, from romance and mystery to science fiction and history. The books in the corpus were selected to have a minimum length of 2,000 words and to be written in English by authors with verified identities.
BERT was also pre-trained on two specific language modeling tasks (Figure 1.7):
The Masked Language Modeling (MLM) task (AKA the autoencoding task)—this helps
BERT recognize token interactions within a single sentence.
The Next Sentence Prediction Task—this helps BERT understand how tokens interact with
each other between sentences.
Figure 1.6 BERT was originally pre-trained on English Wikipedia and the BookCorpus. More
modern LLMs are trained on datasets thousands of times larger.
Figure 1.7 BERT was pre-trained on two tasks: the autoencoding language modeling task
(referred to as the “masked language modeling” task) to teach it individual word embeddings
and the “next sentence prediction” task to help it learn to embed entire sequences of text.
Pre-training on these corpora allowed BERT (mainly via the self-attention mechanism) to
learn a rich set of language features and contextual relationships. The use of large, diverse
corpora like these has become a common practice in NLP research, as it has been shown to
improve the performance of models on downstream tasks.
Note
The pre-training process for an LLM can evolve over time as researchers find
better ways of training LLMs and phase out methods that don’t help as much.
For example, within a year of the original Google BERT release, which used the Next Sentence Prediction (NSP) pre-training task, a BERT variant called RoBERTa (yes, most of these LLM names will be fun) from Facebook AI was shown to not require the NSP task to match, and even beat, the original BERT model’s performance in several areas.
Depending on which LLM you decide to use, it will likely be pre-trained differently from the rest. This is what sets LLMs apart from each other. Some LLMs, including OpenAI’s GPT family of models, are trained on proprietary data sources in order to give their parent companies an edge over their competitors.
We will not revisit the idea of pre-training often in this book because it’s not exactly the “quick” part of a “quick start guide”, but it is worth knowing how these models were pre-trained: it is this pre-training that lets us apply something called transfer learning to achieve the state-of-the-art results we want, which is a big deal!
Transfer Learning
Transfer learning is a technique used in machine learning to leverage the knowledge gained
from one task to improve performance on another related task. Transfer learning for LLMs
involves taking an LLM that has been pre-trained on one corpus of text data and then fine-
tuning it for a specific “downstream” task, such as text classification or text generation, by
updating the model’s parameters with task-specific data.
The idea behind transfer learning is that the pre-trained model has already learned a lot of
information about the language and relationships between words, and this information can be
used as a starting point to improve performance on a new task. Transfer learning allows
LLMs to be fine-tuned for specific tasks with much smaller amounts of task-specific data
than it would require if the model were trained from scratch. This greatly reduces the amount
of time and resources required to train LLMs. Figure 1.8 provides a visual representation of
this relationship.
Figure 1.8 The general transfer learning loop involves pre-training a model on a generic
dataset on some generic self-supervised task and then fine-tuning the model on a task-specific
dataset.
Fine-tuning
Once an LLM has been pre-trained, it can be fine-tuned for specific tasks. Fine-tuning involves
training the LLM on a smaller, task-specific dataset to adjust its parameters for the specific
task at hand. This allows the LLM to leverage its pre-trained knowledge of the language to
improve its accuracy for the specific task. Fine-tuning has been shown to drastically improve
performance on domain-specific and task-specific tasks and lets LLMs adapt quickly to a
wide variety of NLP applications.
Figure 1.9 shows the basic fine-tuning loop that we will use for our models in later chapters.
Whether they are open-sourced or closed-sourced the loop is more or less the same:
1. We define the model we want to fine-tune as well as any fine-tuning parameters (e.g.,
learning rate)
2. We will aggregate some training data (the format and other characteristics depend on the
model we are updating)
3. We compute losses (a measure of error) and gradients (information about how to change the model to minimize error)
4. We update the model through backpropagation so that it makes fewer errors the next time around
If some of that went over your head, not to worry: we will rely on pre-built tools from
Hugging Face’s Transformers package (Figure 1.9) and OpenAI’s Fine-tuning API to abstract
away a lot of this so we can really focus on our data and our models.
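As a concrete (if minimal) sketch of that loop using Hugging Face's Trainer, the snippet below fine-tunes a small BERT-style model for binary text classification. The dataset, label count, and hyperparameters are placeholder choices of mine, not the exact settings used later in this book.

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# 1. Define the model to fine-tune and its fine-tuning parameters
model_name = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# 2. Aggregate some training data (IMDb movie reviews as a stand-in dataset)
dataset = load_dataset('imdb')
def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, padding='max_length', max_length=128)
tokenized = dataset.map(tokenize, batched=True)

# 3 & 4. The Trainer computes losses/gradients and updates the model for us
args = TrainingArguments(output_dir='./results', num_train_epochs=1,
                         per_device_train_batch_size=16, learning_rate=2e-5)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized['train'].shuffle(seed=42).select(range(2000)))
trainer.train()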
Note
You will not need a Hugging Face account or key to follow along and use any
of this code apart from very specific advanced exercises where I will call it
out.
Figure 1.9 The Transformers package from Hugging Face provides a neat and clean
interface for training and fine-tuning LLMs.
Attention
The original paper that introduced the Transformer was titled “Attention Is All You Need”. Attention is a mechanism used in deep learning models (not just Transformers)
that assigns different weights to different parts of the input, allowing the model to prioritize
and emphasize the most important information while performing tasks like translation or
summarization. Essentially, attention allows a model to “focus” on different parts of the input
dynamically, leading to improved performance and more accurate results. Before the
popularization of attention, most neural networks processed all inputs equally and the models
relied on a fixed representation of the input to make predictions. Modern LLMs that rely on
attention can dynamically focus on different parts of input sequences, allowing them to weigh
the importance of each part in making predictions.
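To give a flavor of what "assigning weights to different parts of the input" looks like, here is a toy NumPy sketch of the scaled dot-product attention calculation at the heart of self-attention. Real Transformers use learned projections and multiple attention heads; the random vectors here are purely illustrative.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Three tokens, each represented by a 4-dimensional query, key, and value vector
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))

scores = Q @ K.T / np.sqrt(K.shape[-1])  # how strongly each token attends to every other token
weights = softmax(scores, axis=-1)       # each row of attention weights sums to 1
contextualized = weights @ V             # each token's output mixes in information from all tokens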
To recap, LLMs are pre-trained on large corpora and sometimes fine-tuned on smaller
datasets for specific tasks. Recall that one of the factors behind the Transformer’s
effectiveness as a language model is that it is highly parallelizable, allowing for faster training
and efficient processing of text. What really sets the Transformer apart from other deep
learning architectures is its ability to capture long-range dependencies and relationships
between tokens using attention. In other words, attention is a crucial component of
Transformer-based LLMs, and it enables them to effectively retain information between
training loops and tasks (i.e., transfer learning), while being able to process lengthy swaths of text with ease.
Attention is often credited as the mechanism most responsible for helping LLMs learn (or at least
recognize) internal world models and human-identifiable rules. A Stanford study in 2019
showed that certain attention calculations in BERT corresponded to linguistic notions of
syntax and grammar rules. For example, the researchers found that BERT was able to identify direct
objects of verbs, determiners of nouns, and objects of prepositions with remarkably high
accuracy from only its pre-training. These relationships are presented visually in Figure 1.10.
There is research that explores what other kinds of “rules” LLMs are able to learn simply by
pre-training and fine-tuning. One example is a series of experiments led by researchers at
Harvard that explored an LLM’s ability to learn the rules of a synthetic task like the game
of Othello (Figure 1.11). They found evidence that an LLM was able to understand the rules
of the game simply by training on historical move data.
Figure 1.10 Research has probed into LLMs to uncover that they seem to be recognizing
grammatical rules even when they were never explicitly told these rules.
Figure 1.11 LLMs may be able to learn all kinds of things about the world, whether it be the
rules and strategy of a game or the rules of human language.
For any LLM to learn any kind of rule, however, it has to convert what we perceive as text
into something machine readable. This is done through a process called embedding.
Embeddings
Embeddings are the mathematical representations of tokens as vectors (lists of numbers) that are meant to carry semantic meaning and, depending on the model, other information such as a token’s position in the text. They are what an LLM actually operates on.
Figure 1.12 An example of how BERT uses three layers of embedding for a given piece of
text. Once the text is tokenized, each token is given an embedding and then the values are
added up, so each token ends up with an initial embedding before any attention is calculated.
We won’t focus too much on the individual layers of LLM embeddings in this text unless they
serve a more practical purpose but it is good to know about some of these parts and how they
look under the hood!
LLMs learn different embeddings for tokens based on their pre-training and can further
update these embeddings during fine-tuning.
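As a quick peek under the hood (the sentence below is arbitrary and purely illustrative), the snippet pulls BERT's three embedding layers out of the Hugging Face implementation and shows that each token's initial embedding is built from a token embedding, a segment (token type) embedding, and a position embedding, as described in Figure 1.12.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

inputs = tokenizer('I love my pet python', return_tensors='pt')
emb = model.embeddings  # BERT's embedding module

token_embeddings = emb.word_embeddings(inputs['input_ids'])
segment_embeddings = emb.token_type_embeddings(inputs['token_type_ids'])
position_ids = torch.arange(inputs['input_ids'].shape[1]).unsqueeze(0)
position_embeddings = emb.position_embeddings(position_ids)

# The initial per-token embedding is (roughly) the sum of the three,
# followed by layer normalization and dropout inside `emb`
initial_embeddings = emb(inputs['input_ids'], token_type_ids=inputs['token_type_ids'])
print(initial_embeddings.shape)  # (1, number_of_tokens, 768)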
Tokenization
Tokenization, as mentioned previously, involves breaking text down into the smallest unit of
understanding - tokens. These tokens are the pieces of information that are embedded into
semantic meaning and act as inputs to the attention calculations which leads to ... well, the
LLM actually learning and working. Tokens make up an LLM’s static vocabulary and don’t always represent entire words. Tokens can represent punctuation, individual characters, or
even a sub-word if a word is not known to the LLM. Nearly all LLMs also have special
tokens that have specific meaning to the model. For example, the BERT model has a few
special tokens including the [CLS] token which BERT automatically injects as the first token
of every input and is meant to represent an encoded semantic meaning for the entire input
sequence.
Readers may be familiar with techniques like stop words removal, stemming, and truncation
which are used in traditional NLP. These techniques are not used nor are they necessary for
LLMs. LLMs are designed to handle the inherent complexity and variability of human
language, including the usage of stop words like “the” and “an” and variations in word forms
like tenses and misspellings. Altering the input text to an LLM using these techniques could
potentially harm the performance of the model by reducing the contextual information and
altering the original meaning of the text.
Tokenization can also involve several preprocessing steps like casing, which refers to the
capitalization of the tokens. There are two types of casing: uncased and cased. In uncased
tokenization, all the tokens are lowercased and usually accents from letters are stripped, while
in cased tokenization, the capitalization of the tokens is preserved. The choice of casing can
impact the performance of the model, as capitalization can provide important information
about the meaning of a token. An example of this can be found in Figure 1.13.
Note
It is worth mentioning that even the concept of casing has some bias to it
depending on the model. To uncase a text - lowercasing and stripping of
accents - is a pretty Western style preprocessing step. I myself speak Turkish
and know that the umlaut (e.g. the Ö in my last name) matters and can actually
help the LLM understand the word being said. Any language model that has
not been sufficiently trained on diverse corpora may have trouble parsing and
utilizing these bits of context.
Figure 1.13 The choice of uncased versus cased tokenization depends on the task. Simple
tasks like text classification usually prefer uncased tokenization while tasks that derive
meaning from case like Named Entity Recognition prefer a cased tokenization.
Figure 1.14 shows an example of tokenization, and in particular, an example of how LLMs
tend to handle Out of Vocabulary (OOV) phrases. OOV phrases are simply phrases/words
that the LLM doesn’t recognize as a token and has to split up into smaller sub-words. For
example, my name (Sinan) is not a token in most LLMs (story of my life) so in BERT, the
tokenization scheme will split my name up into two tokens (assuming uncased tokenization):
sin - the first part of my name, which happens to be a token the model already knows
##an - a special sub-word token that is different from the word “an” and is used only as a means to split up unknown words
Figure 1.14 Any LLM has to deal with words they’ve never seen before. How an LLM
tokenizes text can matter if we care about the token limit of an LLM.
Some LLMs limit the number of tokens we can input at any one time so how an LLM
tokenizes text can matter if we are trying to be mindful about this limit.
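A quick sketch of this in code, using BERT's uncased tokenizer from Hugging Face (the example sentence is mine, and the exact sub-word splits shown in the comments are what I would expect rather than guaranteed output):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# An out-of-vocabulary word gets split into known sub-word tokens
print(tokenizer.tokenize('Sinan loves a beautiful day'))
# e.g. ['sin', '##an', 'loves', 'a', 'beautiful', 'day']

# encode() also injects special tokens like [CLS] and [SEP] around the input
print(tokenizer.convert_ids_to_tokens(tokenizer.encode('Sinan loves a beautiful day')))
# e.g. ['[CLS]', 'sin', '##an', 'loves', 'a', 'beautiful', 'day', '[SEP]']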
So far, we have talked a lot about language modeling - predicting missing/next tokens in a phrase - but modern LLMs can also borrow from other fields of AI to make them more performant and, more importantly, more aligned - meaning that the AI is performing in accordance with a human’s expectation. Put another way, an aligned LLM has an objective that matches a human’s objective.
Alignment in language models refers to how well the model can respond to input prompts
that match the user’s expectations. Standard language models predict the next word based on
the preceding context, but this can limit their usefulness for specific instructions or prompts.
Researchers are coming up with scalable and performant ways of aligning language models to
a user’s intent. One such broad method of aligning language models is through the
incorporation of reinforcement learning (RL) into the training loop.
RL with Human Feedback (RLHF) is a popular method of aligning pre-trained LLMs that
uses human feedback to enhance their performance. It allows the LLM to learn from feedback
on its own outputs from a relatively small, high-quality batch of human feedback, thereby
overcoming some of the limitations of traditional supervised learning. RLHF has shown
significant improvements in modern LLMs like ChatGPT. RLHF is one example of
approaching alignment with RL, but there are other emerging approaches like RL with AI
feedback (e.g. Constitutional AI).
Let’s take a look at some of the popular LLMs we’ll be using in this book.
Popular Modern LLMs
BERT, T5, and GPT are three popular LLMs developed by Google, Google, and OpenAI
respectively. These models differ in their architecture pretty greatly even though they all share
the Transformer as a common ancestor. Other widely used variants of LLMs in the
Transformer family include RoBERTa, BART (which we saw earlier performing some text
classification), and ELECTRA.
BERT
BERT (Figure 1.15) is an autoencoding model that uses attention to build a bidirectional
representation of a sentence, making it ideal for sentence classification and token
classification tasks.
Figure 1.15 BERT was one of the first LLMs and continues to be popular for many NLP tasks
that involve fast processing of large amounts of text.
BERT uses the encoder of the Transformer and ignores the decoder to become exceedingly
good at processing/understanding massive amounts of text very quickly relative to other,
slower LLMs that focus on generating text one token at a time. BERT-derived architectures,
therefore, are best for working with and analyzing large corpora quickly when we don’t need
to write free text.
BERT itself doesn’t classify text or summarize documents but it is often used as a pre-trained
model for downstream NLP tasks. BERT has become a widely used and highly regarded
LLM in the NLP community, paving the way for the development of even more advanced
language models.
GPT
Figure 1.16 The GPT family of models excels at generating free text aligned with a user’s
intent.
GPT relies on the decoder portion of the Transformer and ignores the encoder to become
exceptionally good at generating text one token at a time. GPT-based models are best for
generating text given a rather large context window. They can also be used to
process/understand text as we will see in an upcoming chapter. GPT-derived architectures are
ideal for applications that require the ability to freely write text.
T5
T5 is a pure encoder/decoder transformer model that was designed to perform several NLP
tasks, from text classification to text summarization and generation, right off the shelf. It is
one of the first popular models to be able to boast such a feat, in fact. Before T5, LLMs like
BERT and GPT-2 generally had to be fine-tuned using labeled data before they could be
relied on to perform such specific tasks.
T5 uses both the encoder and decoder of the Transformer to become highly versatile in both
processing and generating text. T5-based models can perform a wide range of NLP tasks,
from text classification to text generation, due to their ability to build representations of the
input text using the encoder and generate text using the decoder (Figure 1.17). T5-derived
architectures are ideal for applications that require both the ability to process and understand
text and generate text freely.
Images
Figure 1.17 T5 was one of the first LLMs to show promise in solving multiple tasks at once
without any fine-tuning.
T5’s ability to perform multiple tasks with no fine-tuning spurred the development of other
versatile LLMs that can perform multiple tasks with efficiency and accuracy with little/no
fine-tuning. GPT-3, released around the same time as T5, also boasted this ability.
These three LLMs are highly versatile and are used for various NLP tasks, such as text
classification, text generation, machine translation, and sentiment analysis, among others.
These three LLMs, along with flavors (variants) of them will be the main focus of this book
and our applications.
Domain-Specific LLMs
Domain-specific LLMs are LLMs that are trained specifically in a particular subject area,
such as biology or finance. Unlike general-purpose LLMs, these models are designed to
understand the specific language and concepts used within the domain they were trained on.
One example is BioGPT, a domain-specific LLM whose pre-training encoded biomedical knowledge and domain-specific jargon into the LLM. It can be fine-tuned on smaller datasets, making it adaptable for specific biomedical tasks and reducing the need for large amounts of labeled data.
The advantage of using domain-specific LLMs lies in their training on a specific set of texts.
This allows them to better understand the language and concepts used within their specific
domain, leading to improved accuracy and fluency for NLP tasks that are contained within
that domain. By comparison, general-purpose LLMs may struggle to handle the language and
concepts used in a specific domain as effectively.
Applications of LLMs
As we’ve already seen, applications of LLMs vary widely and researchers continue to find
novel applications of LLMs to this day. We will use LLMs in this book in generally three
ways:
1. Using a pre-trained LLM’s underlying ability to process and generate text with no further fine-tuning as part of a larger architecture
2. Fine-tuning a pre-trained LLM to perform a very specific task using transfer learning
3. Asking a pre-trained LLM to solve a task it was pre-trained to solve or could reasonably intuit
These methods use LLMs in different ways and while all options take advantage of an LLM’s
pre-training, only option 2 requires any fine-tuning. Let’s take a look at some specific
applications of LLMs.
Text Classification
The text classification task assigns a label to a given piece of text. This task is commonly
used in sentiment analysis, where the goal is to classify a piece of text as positive, negative,
or neutral, or in topic classification, where the goal is to classify a piece of text into one or
more predefined categories. Models like BERT can be fine-tuned to perform classification
with relatively little labeled data as seen in Figure 1.19.
Figure 1.19 A peek at the architecture of using BERT to achieve fast and accurate text
classification results. Classification layers usually act on that special [CLS] token that BERT
uses to encode the semantic meaning of the entire input sequence.
Text classification remains one of the most globally recognizable and solvable NLP tasks
because when it comes down to it, sometimes we just need to know whether this email is
“spam” or not and get on with our days!
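For instance, with no training of our own we can load an already fine-tuned BERT-style checkpoint off the shelf. The checkpoint name and example sentence below are one illustrative choice, not a recommendation from the original text.

from transformers import pipeline

# A DistilBERT encoder fine-tuned for binary sentiment classification
classifier = pipeline('text-classification',
                      model='distilbert-base-uncased-finetuned-sst-2-english')
print(classifier('I cannot wait to get back to reading this book!'))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]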
Translation Tasks
A harder and yet still classic NLP task is machine translation where the goal is to
automatically translate text from one language to another while preserving meaning and
context. Traditionally, this task is quite difficult because it requires sufficient examples and domain knowledge of both languages to accurately gauge how well the model is doing, but modern LLMs seem to have an easier time with it, again thanks to their pre-training and efficient attention calculations.
One of the first applications of attention even before Transformers was for machine
translation tasks where AI models were expected to translate from one human language to
another. T5 was one of the first LLMs to tout the ability to perform multiple tasks off the
shelf (Figure 1.20). One of these tasks was the ability to translate English into a few
languages and back.
Figure 1.20 T5 could perform many NLP tasks off the shelf, including grammar correction,
summarization, and translation.
Since T5, language translation in LLMs has only gotten better and more diverse. Models like
GPT-3 and the latest T5 models can translate between dozens of languages with relative ease.
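Because T5 was pre-trained with task prefixes, translation between a handful of languages works right out of the box through Hugging Face's pipeline API. The t5-small checkpoint and example sentence below are illustrative choices of mine.

from transformers import pipeline

# T5 can translate English to French (among other languages) off the shelf
translator = pipeline('translation_en_to_fr', model='t5-small')
print(translator('Where is the nearest train station?')[0]['translation_text'])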
Of course, this bumps up against one major known limitation of LLMs: they are mostly trained from an English-speaking (usually American) point of view, so most LLMs can handle English well and non-English languages, well, not as well.
SQL Generation
If we consider SQL as a language, then converting English to SQL is really not that different
from converting English to French (Figure 1.21). Modern LLMs can already do this at a basic
level off the shelf, but more advanced SQL queries often require some fine-tuning.
Figure 1.21 Using GPT-3 to generate functioning SQL code from an (albeit simple) Postgres
schema
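A hedged sketch of the idea using OpenAI's completion API (pre-1.0 Python client). The schema, prompt format, and model name here are illustrative placeholders, not the exact setup shown in Figure 1.21.

import os
import openai

openai.api_key = os.environ.get('OPENAI_API_KEY', '')

prompt = (
    '### Postgres table: users(id, email, signup_date)\n'
    '### Task: count the users who signed up in the last 30 days\n'
    'SELECT'
)

response = openai.Completion.create(
    model='text-davinci-003',  # a GPT-3 completion model
    prompt=prompt,
    max_tokens=100,
    temperature=0,
    stop=[';']
)
print('SELECT' + response['choices'][0]['text'])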
If we expand our thinking of what can be considered a “translation” then a lot of new
opportunities lie ahead of us. For example, what if we wanted to “translate” between English
and a series of wavelengths that a brain might interpret and execute as motor functions? I’m not a neuroscientist or anything, but that seems like a fascinating area of research!
Free Text Generation
What first caught the world’s eye in terms of modern LLMs like ChatGPT was their ability to freely write blogs, emails, and even academic papers. This notion of text generation is why many LLMs are affectionately referred to as “Generative AI”, although that term is a bit reductive and imprecise. I will not often use the term “Generative AI”, as the specific word “generative” has its own meaning in machine learning as the counterpart to “discriminative” modeling. (For more on that, check out my first book, The Principles of Data Science.)
We could for example prompt (ask) ChatGPT to help plan out a blog post like in Figure 1.22.
Even if you don’t agree with the results, this can help humans with the “tabula rasa” problem
and give us something to at least edit and start from rather than staring at a blank page for too
long.
Figure 1.22 ChatGPT can help ideate, scaffold, and even write entire blog posts
Note
I would be remiss if I didn’t mention the controversy that LLMs like this can
cause at the academic level. Just because an LLM can write entire blogs or
even essays doesn’t mean we should let them. Just like how the internet caused
some to believe that we’d never need books again, some argue that ChatGPT
means that we’ll never need to write anything again. As long as institutions are
aware of how to use this technology and proper regulations/rules are put in
place, students and teachers alike can use ChatGPT and other text-generation-
focused AIs safely and ethically.
We will be using ChatGPT to solve a few tasks in this book. We will rely on ChatGPT’s
ability to contextualize information in its context window and freely write back (usually)
accurate responses. We will mostly be interacting with ChatGPT through the Playground and
the API provided by OpenAI as this model is not open source.
LLMs encode information directly into their parameters via pre-training and fine-tuning but
keeping them up to date with new information is tricky. We either have to further fine-tune
the model on new data or run the pre-training steps again from scratch. To dynamically keep
information fresh, we will architect our own information retrieval system with a vector
database (don’t worry we will go into more details on all of this in the next chapter). Figure
1.23 shows an outline of the architecture we will build.
Figure 1.23 Our neural semantic search system will be able to take in new information
dynamically and be able to retrieve relevant documents quickly and accurately given a user’s
query using LLMs.
We will then add onto this system by building a ChatGPT-based chatbot to conversationally
answer questions from our users.
Chatbots
Everyone loves a good chatbot, right? Well, whether you love them or hate them, LLMs’
capacity for holding a conversation is evident through systems like ChatGPT and even GPT-3
(as seen in Figure 1.24). The way we architect chatbots using LLMs will be quite different
from the traditional way of designing chatbots through intents, entities, and tree-based
conversation flows. These concepts will be replaced by system prompts, context, and
personas – all of which we will dive into in the coming chapters.
Figure 1.24 ChatGPT isn’t the only LLM that can hold a conversation. We can use GPT-3 to
construct a simple conversational chatbot. The text highlighted in green represents GPT-3’s
output. Note that before the chat even begins, I inject context to GPT-3 that would not be shown to the end-user but that GPT-3 needs in order to provide accurate responses.
We have our work cut out for us. I’m excited to be on this journey with you and I’m excited
to get started!
Summary
LLMs are advanced AI models that have revolutionized the field of NLP. LLMs are highly
versatile and are used for a variety of NLP tasks, including text classification, text generation,
and machine translation. They are pre-trained on large corpora of text data and can then be
fine-tuned for specific tasks.
Using LLMs in this fashion has become a standard step in the development of NLP models.
In our first case study, we will explore the process of launching an application with
proprietary models like GPT-3 and ChatGPT. We will get a hands-on look at the practical
aspects of using LLMs for real-world NLP tasks, from model selection and fine-tuning to
deployment and maintenance.
2
Introduction
In the last chapter, we explored the inner workings of language models and
the impact that modern LLMs have had on NLP tasks like text
classification, generation, and machine translation. There is another
powerful application of LLMs that has been gaining traction in recent years:
semantic search.
Now you might be thinking that it’s time to finally learn the best ways to
talk to ChatGPT and GPT-4 to get the optimal results, and we will start to
do that in the next chapter, I promise. In the meantime, I want to show you
what else we can build on top of this novel transformer architecture. While
text-to-text generative models like GPT are extremely impressive in their
own right, one of the most versatile solutions that AI companies offer is the
ability to generate text embeddings based on powerful LLMs.
Figure 2.1 Vectors that represent similar phrases should be close together
and those that represent dissimilar phrases should be far apart. In this case,
if a user wants a trading card they might ask for “a vintage magic card”. A
proper semantic search system should embed the query in such a way that it
ends up near relevant results (like “magic card”) and far apart from non
relevant items (like “a vintage magic kit”) even if they share certain
keywords.
This map from text to vectors can be thought of as a kind of hash with
meaning. We can’t really reverse vectors back into text; rather, they are a representation of the text that has the added benefit of letting us compare points while they are in their encoded state.
Without further ado, let’s get right into it, shall we?
The Task
A traditional search engine would generally take what you type in and then
give you a bunch of links to websites or items that contain those words or
permutations of the characters that you typed in. So if you typed in
“Vintage Magic the Gathering Cards” on a marketplace, you would get
items with a title/description that contains combinations of those words.
That’s a pretty standard way to search, but it’s not always the best way. For
example, I might get vintage magic sets to help me learn how to pull a rabbit out of a hat. Fun, but not what I asked for.
The terms you input into a search engine may not always align with the
exact words used in the items you want to see. It could be that the words in
the query are too general, resulting in a slew of unrelated findings. This
issue often extends beyond just differing words in the results; the same
words might carry different meanings than what was searched for. This is
where semantic search comes into play, as exemplified by the earlier-
mentioned Magic: The Gathering cards scenario.
A semantic search system can understand the meaning and context of your
search query and match it against the meaning and context of the
documents that are available to retrieve. This kind of system can find
relevant results in a database without having to rely on exact keyword or n-
gram matching but rather rely on a pre-trained LLM to understand the
nuance of the query and the documents (Figure 2.2).
Figure 2.2 A traditional keyword-based search might rank a vintage magic
kit with the same weight as the item we actually want whereas a semantic
search system can understand the actual concept we are searching for
The asymmetric part of asymmetric semantic search refers to the fact that
there is generally an imbalance between the semantic information (basically
the size) of the input query and the documents/information that the search
system has to retrieve. For example, the search system is trying to match
“magic the gathering card” to paragraphs of item descriptions on a
marketplace. The four-word search query has much less information than
the paragraphs but nonetheless it is what we are comparing.
Asymmetric semantic search systems can get very accurate and relevant
search results, even if you don’t use the exact right words in your search.
They rely on the learnings of LLMs rather than the user being able to know
exactly what needle to search for in the haystack.
These systems are not perfect, however. Among other limitations, they struggle with nuanced concepts, such as sarcasm or irony, that rely on localized cultural knowledge.
Solution Overview
The general flow of our asymmetric semantic search system will follow
these steps:
PART I - Ingesting documents (Figure 2.3)
1. Collect documents for embedding
2. Create text embeddings to encode semantic information
3. Store the embeddings in a database so they can be retrieved later given a query
PART II - Retrieving documents (Figure 2.4)
Figure 2.4 Zooming in on Part II, when retrieving documents we will have
to embed our query using the same embedding scheme as we used for the
documents and then compare them against the previously stored documents
and return the best (closest) document
The Components
Text Embedder
As we now know, at the heart of any semantic search system is the text
embedder. This is the component that takes in a text document, or a single
word or phrase, and converts it into a vector. The vector is unique to that
text and should capture the contextual meaning of the phrase.
The choice of the text embedder is critical as it determines the quality of the
vector representation of the text. We have many options in how we
vectorize with LLMs, both open and closed source. To get off the ground more quickly, we are going to use OpenAI’s closed-source “Embeddings”
product. In a later section, I’ll go over some open-source options.
Figure 2.5 shows how the cosine similarity would help us retrieve
documents given a query.
Figure 2.5 In an ideal semantic search scenario, the Cosine Similarity
(formula given at the top) gives us a computationally efficient way to
compare pieces of text at scale, given that embeddings are tuned to place
semantically similar pieces of text near each other (bottom). We start by
embedding all items – including the query (bottom left) and then checking
the angle between them. The smaller the angle, the larger the cosine
similarity (bottom right)
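A tiny sketch of that comparison in code, assuming we already have embedding vectors for the query and for each document (the variable names doc_vectors and query_vector are mine):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# One embedding per document, plus one for the user's query
document_embeddings = np.array(doc_vectors)        # shape (num_documents, embedding_dim)
query_embedding = np.array(query_vector).reshape(1, -1)

# Higher cosine similarity means a smaller angle, i.e. a closer semantic match
similarities = cosine_similarity(query_embedding, document_embeddings)[0]
best_match_index = int(np.argmax(similarities))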
We could also turn to other similarity metrics like the dot product or the Euclidean distance, but OpenAI embeddings have a special property. The magnitudes (lengths) of their vectors are normalized to length 1, which basically means that we benefit mathematically on two fronts:
The cosine similarity and the dot product give exactly the same result for unit-length vectors, so we can use whichever is cheaper to compute
The cosine similarity and the Euclidean distance produce identical rankings of nearest neighbors
OpenAI’s Embedding Engines
It’s worth noting that OpenAI provides several engine options that can be
used for text embedding. Each engine may provide different levels of
accuracy and may be optimized for different types of text data. At the time
of writing, the engine used in the code block is the most recent and the one
they recommend using.
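A minimal sketch of getting an embedding from OpenAI, using the pre-1.0 openai Python client and the text-embedding-ada-002 engine mentioned later in this chapter (check OpenAI's documentation for the currently recommended engine):

import os
import openai

openai.api_key = os.environ.get('OPENAI_API_KEY', '')

response = openai.Embedding.create(
    input=['a vintage magic the gathering card'],
    engine='text-embedding-ada-002'
)
embedding = response['data'][0]['embedding']  # a list of 1,536 floats for ada-002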
A bi-encoder involves training two BERT models, one to encode the input
text and the other to encode the output text (Figure 2.6). The two models
are trained simultaneously on a large corpus of text data, with the goal of
maximizing the similarity between corresponding pairs of input and output
text. The resulting embeddings capture the semantic relationship between
the input and output text.
Figure 2.6 A bi-encoder is trained in a unique way with two clones of a
single LLM trained in parallel to learn similarities between documents. For
example, a bi-encoder can learn to associate questions to paragraphs so
they appear near each other in a vector space
Listing 2.2 Getting text embeddings from a pre-trained open source bi-
encoder
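A minimal sketch using the Sentence Transformers library; the checkpoint below is one reasonable open-source bi-encoder pre-trained for semantic search, not necessarily the one used in the book's repository.

from sentence_transformers import SentenceTransformer

# Load an open-source bi-encoder pre-trained for (asymmetric) semantic search
model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-cos-v1')

texts = [
    'a vintage magic the gathering card',
    'a vintage magic kit for aspiring magicians',
]
embeddings = model.encode(texts)  # a NumPy array of shape (2, 768)
print(embeddings.shape)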
Different algorithms may perform better on different types of text data and
will have different vector sizes. The choice of algorithm can have a
significant impact on the quality of the resulting embeddings. Additionally,
open-source alternatives may require more customization and fine-tuning
than closed-source products, but they also provide greater flexibility and
control over the embedding process. For more examples of using open-
source bi-encoders to embed text, check out the code portion of this book!
Document Chunker
Once we have our text embedding engine set up, we need to consider the
challenge of embedding large documents. It is often not practical to embed
entire documents as a single vector, particularly when dealing with long
documents such as books or research papers. One solution to this problem is
to use document chunking, which involves dividing a large document into
smaller, more manageable chunks for embedding.
One common concern of this method is that we might accidentally cut off
some important text between chunks, splitting up the context. To mitigate
this, we can set overlapping windows with a specified amount of tokens to
overlap so we have tokens shared between chunks. This of course introduces some redundancy, but that is often fine in service of higher accuracy, even at the cost of a bit more storage and compute.
And now let’s chunk this document by getting chunks of at most a certain token size (Listing 2.4).

import nltk  # nltk.download('punkt') is needed once for sentence splitting

def overlapping_chunks(text, max_tokens=500, overlapping_factor=5):
    '''
    max_tokens: tokens we want per chunk
    overlapping_factor: number of sentences to start each chunk with,
        carried over from the end of the previous chunk
    '''
    # Split into sentences; token counts below use a simple whitespace
    # split for illustration rather than a model-specific tokenizer
    sentences = nltk.sent_tokenize(text)
    chunks, current = [], []
    for sentence in sentences:
        if current and sum(len(s.split()) for s in current) + len(sentence.split()) > max_tokens:
            chunks.append(' '.join(current))
            # Carry the last few sentences over so neighboring chunks share context
            current = current[-overlapping_factor:] if overlapping_factor else []
        current.append(sentence)
    if current:
        chunks.append(' '.join(current))
    return chunks
To help aid our chunking method, we could search for custom natural
delimiters. We would identify natural white spaces within the text and use
them to create more meaningful units of text that will end up in document
chunks that will eventually get embedded (Figure 2.7).
Figure 2.7 Max-token chunking (on the left) and natural whitespace
chunking (on the right) can be done with or without overlap. The natural
whitespace chunking tends to end up with non-uniform chunk sizes.
The most common double whitespace turned out to be two newline characters in a row, which is actually how I distinguished between pages earlier, and that makes sense: the most natural whitespace in a book is by page. In other cases, we
may have found natural whitespace between paragraphs as well. This
method is very hands-on and requires a good amount of familiarity and
knowledge of the source documents.
We can also turn to more machine learning to get slightly more creative
with how we architect document chunks.
Let’s try to cluster together those chunks we found from the textbook in our
last section (Listing 2.6).
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Assume you have a list of text embeddings called `embeddings`
# First, compute the cosine similarity matrix between all pairs of embeddings
cosine_sim_matrix = cosine_similarity(embeddings)

# Then, cluster the chunks by (1 - similarity) so semantically close chunks end up together
# (agglomerative clustering and this threshold are one reasonable choice; tune for your corpus)
distances = 1 - np.clip(cosine_sim_matrix, -1.0, 1.0)
agg = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.1,
    metric='precomputed', linkage='complete')
cluster_labels = agg.fit_predict(distances)

# Print how many embeddings landed in each cluster
for label, count in zip(*np.unique(cluster_labels, return_counts=True)):
    print(f'Cluster {label}: {count} embeddings')

Cluster 0: 2 embeddings
Cluster 1: 3 embeddings
Cluster 2: 4 embeddings
...
This approach tends to yield chunks that are more cohesive semantically but
suffer from pieces of content being out of context with surrounding text.
This approach works well when the chunks you start with are known to not necessarily relate to each other (i.e., when chunks are more independent of one another).
Pinecone
Pinecone is a managed, closed-source vector database that we will use to store and retrieve our document embeddings; it offers a free tier that is more than enough to get us started.
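A minimal sketch of working with Pinecone, assuming the pinecone-client 2.x API. The index name, Pinecone environment, and vector dimension (1,536 for OpenAI's ada-002 embeddings) are placeholder choices, and chunk_embedding / query_embedding are assumed to come from the text embedder above.

import os
import pinecone

pinecone.init(api_key=os.environ.get('PINECONE_KEY', ''), environment='us-east-1-aws')

# Create an index sized for our embeddings, then connect to it
if 'semantic-search' not in pinecone.list_indexes():
    pinecone.create_index('semantic-search', dimension=1536, metric='cosine')
index = pinecone.Index('semantic-search')

# Upsert (id, vector, metadata) tuples, then query with an embedded user query
index.upsert([('doc-1', chunk_embedding, {'text': 'a vintage magic the gathering card'})])
results = index.query(vector=query_embedding, top_k=3, include_metadata=True)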
Open-source Alternatives
Open-source alternatives to Pinecone exist as well, such as Pgvector (a vector-similarity extension for PostgreSQL), which comes up again when we discuss costs at the end of this chapter.
Re-ranking the Retrieved Results
After retrieving potential results from a vector database given a query using a similarity metric like cosine similarity, it is often useful to re-rank them to ensure
that the most relevant results are presented to the user (Figure 2.9). One
way to re-rank results is by using a cross-encoder, which is a type of
transformer model that takes pairs of input sequences and predicts a score
indicating how relevant the second sequence is to the first. By using a
cross-encoder to re-rank search results, we can take into account the entire
query context rather than just individual keywords. This of course will add
some overhead and worsen our latency but it could help us in terms of
performance. I will take the time to outline some results in a later section to
compare and contrast using and not using a cross-encoder.
Figure 2.9 A cross-encoder (left) takes in two pieces of text and outputs a
similarity score without returning a vectorized format of the text. A bi-
encoder (right), on the other hand, embeds a bunch of pieces of text into
vectors up front and then retrieves them later in real time given a query
(e.g. looking up “I’m a Data Scientist”)
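A small sketch of re-ranking with an open-source cross-encoder. The checkpoint name is one common choice pre-trained on MS MARCO passage ranking, and the query and candidate texts are illustrative.

from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

query = 'a vintage magic the gathering card'
candidates = [
    'A near-mint Magic: The Gathering card from an early set.',
    'A vintage magic kit: learn to pull a rabbit out of a hat.',
]

# Score each (query, candidate) pair, then sort candidates by descending relevance
scores = cross_encoder.predict([(query, candidate) for candidate in candidates])
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)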
We now need a place to put all of these components so that users can access
the documents in a fast, secure, and easy way. To do this, let’s create an
API.
FastAPI
import hashlib
import os

import openai
import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

openai.api_key = os.environ.get('OPENAI_API_KEY', '')
pinecone_key = os.environ.get('PINECONE_KEY', '')

def my_hash(s):
    # Return the MD5 hash of the input string as a hexadecimal string
    return hashlib.md5(s.encode()).hexdigest()

class DocumentInputRequest(BaseModel):
    # define input to /document/ingest
    ...

class DocumentInputResponse(BaseModel):
    # define output from /document/ingest
    ...

class DocumentRetrieveRequest(BaseModel):
    # define input to /document/retrieve
    ...

class DocumentRetrieveResponse(BaseModel):
    # define output from /document/retrieve
    ...

@app.post("/document/ingest", response_model=DocumentInputResponse)
async def document_ingest(request: DocumentInputRequest):
    # Parse request data and chunk it
    # Create embeddings and metadata for each chunk
    # Upsert embeddings and metadata to Pinecone
    # Return number of upserted chunks
    return DocumentInputResponse(chunks_count=num_chunks)

if __name__ == "__main__":
    uvicorn.run("api:app", host="0.0.0.0", port=8000)
For the full file, be sure to check out the code repository for this book!
We now have a solution for each of our components. To recap: our system ingests documents by chunking them, embedding each chunk, and upserting the embeddings into Pinecone (Part I); it retrieves documents by embedding an incoming query, finding the closest stored chunks, and optionally re-ranking them with a cross-encoder before returning results (Part II).
With all of these moving parts, let’s take a look at our final system
architecture in Figure 2.10.
Figure 2.10 Our complete semantic search architecture using two closed-
source systems (OpenAI and Pinecone) and an open source API framework
(FastAPI)
We now have a complete end to end solution for our semantic search. Let’s
see how well the system performs against a validation set.
Performance
I’ve outlined a solution to the problem of semantic search, but I want to also
talk about how to test how these different components work together. For
this, let’s use a well-known dataset to run against: the BoolQ dataset - a
question answering dataset for yes/no questions containing nearly 16K
examples. This dataset has (question, passage) pairs indicating, for a given question, which passage is the best one to answer it.
Table 2.2 outlines a few trials I ran and coded up in the code for this book. I
use combinations of embedders, re-ranking solutions, and a bit of fine-
tuning to try and see how well the system performs on two fronts:
1. Performance - I want the system to surface the most relevant passage for a given question. I will measure this as accuracy against the labeled (question, passage) pairs in the BoolQ validation set.
2. Latency - I want to see how long it takes to run through these examples using Pinecone, so for each embedder, I reset the index, uploaded new vectors, and used cross-encoders in my laptop’s memory to keep things simple and standardized. I will measure latency in the minutes it took to run against the validation set of the BoolQ dataset.
There are a few things we could try to improve performance further, for example:
1. Fine-tuning the cross-encoder for more epochs and spending more time finding optimal learning parameters (e.g., weight decay, learning rate scheduler, etc.)
Note that the models I used for the cross-encoder and the bi-encoder were
both specifically pre-trained on data that is similar to asymmetric semantic
search. This is important because we want the embedder to produce vectors
for both short queries and long documents and place them near each other
when they are related.
Let’s assume we want to keep things simple to get off the ground and use only the OpenAI embedder and do no re-ranking (row 1) in our
application. Let’s consider the costs associated with using FastAPI,
Pinecone, and OpenAI for text embeddings.
We have a few components in play and not all of them are free. Fortunately
FastAPI is an open-source framework and does not require any licensing
fees. Our cost with FastAPI is hosting which could be on a free tier
depending on what service we use. I like Render which has a free tier but
also pricing starts at $7/month for 100% uptime. At the time of writing,
Pinecone offers a free tier with a limit of 100,000 embeddings and up to 3
indexes, but beyond that, they charge based on the number of embeddings
and indexes used. Their Standard plan charges $49/month for up to 1
million embeddings and 10 indexes.
OpenAI offers a free tier of their text embedding service, but it is limited to
100,000 requests per month. Beyond that, they charge $0.0004 per 1,000
tokens for the embedding engine we used - Ada-002. If we assume an
average of 500 tokens per document, the cost per document would be
$0.0002. For example, if we wanted to embed 1 million documents, it
would cost approximately $200.
FastAPI cost = $7/month, Pinecone cost = $49/month, OpenAI embedding cost ≈ $200 (one time, for 1 million documents)
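As a quick back-of-the-envelope sketch (using the time-of-writing prices quoted above, which will change), the arithmetic looks like this:

# Back-of-the-envelope cost estimate using the prices quoted above
fastapi_hosting_per_month = 7.00       # Render paid tier
pinecone_standard_per_month = 49.00    # up to 1M embeddings and 10 indexes

ada_002_price_per_1k_tokens = 0.0004   # OpenAI Ada-002 embedding price
avg_tokens_per_document = 500          # assumed average document length
num_documents = 1_000_000

embedding_cost = num_documents * (avg_tokens_per_document / 1000) * ada_002_price_per_1k_tokens
monthly_cost = fastapi_hosting_per_month + pinecone_standard_per_month

print(f"One-time embedding cost: ${embedding_cost:,.2f}")  # $200.00
print(f"Recurring monthly cost:  ${monthly_cost:,.2f}")    # $56.00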
These costs can quickly add up as the system scales, and it may be worth
exploring open-source alternatives or other strategies to reduce costs - like
using open-source bi-encoders for embedding or Pgvector as your vector
database.
Summary
With all of these components accounted for, our pennies added up, and alternatives available at every step of the way, I'll leave you to it. Enjoy setting up your new semantic search system, and be sure to check out the complete code - including a fully working FastAPI app with instructions on how to deploy it - in the book's code repository. Experiment to your heart's content to make this work as well as possible for your domain-specific data.
Stay tuned for our next chapter where we will build on this API with a
chatbot built using GPT-4 and our retrieval system.
3
Introduction
By the end of this chapter, you will have the skills and knowledge needed to create powerful LLM-based applications that leverage the full potential of these cutting-edge models.
Prompt Engineering
Alignment in language models refers to how well the model can understand
and respond to input prompts that are in line with what the user expected. In
standard language modeling, a model is trained to predict the next word or
sequence of words based on the context of the preceding words. However,
this approach does not allow for specific instructions or prompts to be given
to the model, which can limit its usefulness for certain applications.
Prompt engineering can be challenging if the language model has not been
aligned with the prompts, as it may generate irrelevant or incorrect
responses. However, some language models have been developed with extra
alignment features, such as Constitutional AI-driven Reinforcement
Learning from AI Feedback (RLAIF) from Anthropic or Reinforcement
Learning with Human Feedback (RLHF) in OpenAI’s GPT series, which
can incorporate explicit instructions and feedback into the model’s training.
These alignment techniques can improve the model’s ability to understand
and respond to specific prompts, making them more useful for applications
such as question-answering or language translation (Figure 3.2).
Figure 3.2 Even modern LLMs like GPT-3 need alignment to behave how
we want them to. The original GPT-3 model released in 2020 is a pure
auto-regressive language model and tries to “complete the thought” and
gives me some misinformation pretty freely. In January 2022, GPT-3’s first
aligned version was released (InstructGPT) and was able to answer
questions in a more succinct and accurate manner.
This chapter will focus on language models that have been specifically
designed and trained to be aligned with instructional prompts. These models
have been developed with the goal of improving their ability to understand
and respond to specific instructions or tasks. These include models like
GPT-3, ChatGPT (closed-source models from OpenAI), FLAN-T5 (an
open-source model from Google), and Cohere’s command series (closed-
source), which have been trained using large amounts of data and
techniques such as transfer learning and fine-tuning to be more effective at
generating responses to instructional prompts. Through this exploration, we
will see the beginnings of fully working NLP products and features that
utilize these models, and gain a deeper understanding of how to leverage
aligned language models’ full capabilities.
Just Ask
The first and most important rule of prompt engineering for instruction
aligned language models is to be clear and direct in what you are asking for.
When we give an LLM a task to complete, we want to make sure that we
are communicating that task as clearly as possible. This is especially true
for simple tasks that are straightforward for the LLM to accomplish.
Note
A space designated for the LLM to give its answer, to which we add the intentionally similar output prefix "Turkish:"
These three elements are all part of a direct set of instructions with an
organized answer area. By giving GPT-3 this clearly constructed prompt, it
will be able to recognize the task being asked of it and fill in the answer
correctly (Figure 3.4).
Figure 3.4 This more fleshed out version of our just ask prompt has three
components: a clear and concise set of instructions, our input prefixed by
an explanatory label and a prefix for our output followed by a colon and no
further whitespace.
We can expand on this even further by asking GPT-3 to output multiple options for our corrected grammar, with the results given back as a numbered list (Figure 3.5).
Figure 3.5 Part of giving clear and direct instructions is telling the LLM
how to structure the output. In this example, we ask GPT-3 to give
grammatically correct versions as a numbered list.
Few-shot Learning
When it comes to more complex tasks that require a deeper understanding
of language, giving an LLM a few examples can go a long way in helping
an LLM produce accurate and consistent outputs. Few-shot learning is a
powerful technique that involves providing an LLM with a few examples of
a task to help it understand the context and nuances of the problem.
Few-shot learning has been a pretty major focus of research in the field of
LLMs. The creators of GPT-3 even recognized the potential of this
technique, which is evident from the fact that the original GPT-3 research
paper was titled “Language Models are Few-Shot Learners”.
Few-shot learning is particularly useful for tasks that require a certain tone,
syntax, or style, and for fields where the language used is specific to a
particular domain. Figure 3.6 shows an example of asking GPT-3 to classify
a review as being subjective or not. Basically this is a binary classification
task.
Figure 3.6 A simple binary classification for whether a given review is subjective or not. The top two examples show how LLMs can intuit a task's answer from only a few examples, while the bottom two examples show the same prompt structure without any examples (referred to as "zero-shot"), where the model cannot seem to answer the way we want it to.
In the figure, we can see that the few-shot examples are more likely to produce the expected results because the LLM can look back at the examples and intuit the underlying pattern.
Few-shot learning opens up new possibilities for how we can interact with
LLMs. With this technique, we can provide an LLM with an understanding
of a task without explicitly providing instructions, making it more intuitive
and user-friendly. This breakthrough capability has paved the way for the
development of a wide range of LLM-based applications, from chatbots to
language translation tools.
Output Structuring
LLMs can generate text in a variety of formats, sometimes with too much
variety. It can be helpful to structure the output in a specific way to make it
easier to work with and integrate into other systems. We've actually seen this earlier in the chapter, when we asked GPT-3 to give us an answer as a numbered list. We can also make an LLM give back structured data formats like JSON (JavaScript Object Notation) as the output (Figure 3.7).
Figure 3.7 Simply asking GPT-3 to give a response back as a JSON (top)
does give back a valid JSON but the keys are also in Turkish which may not
be what we want. We can be more specific in our instruction by giving a
one-shot example (bottom) which makes the LLM output the translation in
the exact JSON format we requested.
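To make this concrete, here is a minimal sketch of what a one-shot JSON prompt like the one in the figure might look like as a Completions API call (the prompt wording, translations, and model name here are illustrative, not the exact prompt from the figure):

import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

# One-shot example showing the exact JSON shape we want back,
# followed by the new input we actually want translated
prompt = (
    'Translate the English text to Turkish and respond as JSON.\n\n'
    'English: Where is the nearest restaurant?\n'
    'JSON: {"english": "Where is the nearest restaurant?", "turkish": "En yakın restoran nerede?"}\n\n'
    'English: How much does this cost?\n'
    'JSON:'
)

response = openai.Completion.create(
    model="text-davinci-003",  # illustrative model choice
    prompt=prompt,
    max_tokens=100,
    temperature=0,
)
print(response["choices"][0]["text"].strip())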
Prompting Personas
Specific word choices in our prompts can greatly influence the output of the
model. Even small changes to the prompt can lead to vastly different
results. For example, adding or removing a single word can cause the LLM
to shift its focus or change its interpretation of the task. In some cases, this
may result in incorrect or irrelevant responses, while in other cases, it may
produce the exact output desired.
Personas may not always be used for positive purposes. Just like any tool or technology, some people may use LLMs to evoke harmful messages, as when we asked an LLM to imitate an anti-Semite in the last figure. By feeding LLMs prompts that promote hate speech or other harmful content, individuals can generate text that perpetuates harmful ideas and reinforces negative stereotypes. Creators of LLMs tend to take steps to mitigate this potential misuse, such as implementing content filters and working with human moderators to review the model's output. Individuals who want to use LLMs must also be responsible and ethical when doing so, and consider the potential impact of their actions (or the actions the LLM takes on their behalf) on others.
In this section, we will explore how to work with prompts across models,
taking into account the unique features and limitations of each model to
develop effective prompts that can guide language models to generate the
desired output.
ChatGPT
Some LLMs can take in more than just a single "prompt". Models that are aligned to conversational dialogue, like ChatGPT, can take in a system prompt and multiple "user" and "assistant" prompts (Figure 3.8). The system prompt is meant to be a general directive for the conversation and will generally include overarching rules and personas to follow. The user and assistant prompts are the messages from the user and the LLM, respectively. For any LLM you choose to look at, be sure to check its documentation for specifics on how to structure input prompts.
Figure 3.8 ChatGPT takes in an overall system prompt as well as any
number of user and assistant prompts that simulate an ongoing
conversation.
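For reference, here is a minimal sketch of how that system/user/assistant structure looks when calling ChatGPT through the openai Python package (the system prompt and messages are just examples):

import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        # Overarching rules/persona for the whole conversation
        {"role": "system", "content": "You are a helpful assistant that answers concisely."},
        # Prior turns of the simulated conversation
        {"role": "user", "content": "Translate 'How are you?' to Turkish."},
        {"role": "assistant", "content": "Nasılsın?"},
        # The newest user message we want a reply to
        {"role": "user", "content": "Now translate 'Good morning.'"},
    ],
)
print(response["choices"][0]["message"]["content"])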
Cohere
Let’s return to our simple translation example. Let’s ask OpenAI and
Cohere to translate something from English to Turkish (Figure 3.10).
It seems that the Cohere model I chose required a bit more structuring than the OpenAI version. That doesn't mean that Cohere's model is worse than GPT-3; it just means that we need to think about how our prompt is structured for a given LLM.
It wouldn’t be fair to talk about prompt engineering and not talk about
open-source models like GPT-J and FLAN-T5. When working with them,
prompt engineering is a critical step to get the most out of their pre-training
and fine-tuning which we will start to cover in the next chapter. These
models can generate high-quality text output just like their closed-source
counterparts but unlike closed-source models like GPT and Cohere, open-
source models offer greater flexibility and control over prompt engineering,
enabling developers to customize prompts and tailor output to specific use
cases during fine-tuning.
Let’s build a very simple Q/A bot using ChatGPT and the semantic retrieval
system we built in the last chapter. Recall that one of our API endpoints is
used to retrieve documents from our BoolQ dataset given a natural query.
Note
2. Search for context in our knowledge with every new user message
3. Inject any context we find from our DB directly into ChatGPT’s system
prompt
Figure 3.12 A 10,000 foot view of our chatbot that uses ChatGPT to
provide a conversational interface in front of our semantic search API.
To dig into it one step deeper, Figure 3.13 shows how this will work at the
prompt level, step by step:
Figure 3.13 Starting from the top left and reading left to right, these four
states represent how our bot is architected. Every time a user says
something that surfaces a confident document from our knowledge base,
that document is inserted directly into the system prompt where we tell
ChatGPT to only use documents from our knowledge base.
Let’s wrap all of this logic into a Python class that will have a skeleton like
in Listing 3.1.
Listing 3.1 A ChatGPT Q/A bot
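The full listing is in the book's code repository; as a rough sketch (the retrieval helper and the confidence threshold below are hypothetical stand-ins for the semantic search API we built earlier), the skeleton might look something like this:

import openai

def query_semantic_search_api(query):
    """Hypothetical stand-in for a call to our /document/retrieve endpoint."""
    # e.g. requests.post(SEARCH_URL, json={"query": query}).json()
    return {"text": "...", "score": 0.0}

class ChatbotGPT:
    """A sketch of a ChatGPT Q/A bot grounded in our semantic search API."""

    def __init__(self, system_prompt, threshold=0.8):
        # The running conversation always starts with the system prompt
        self.conversation = [{"role": "system", "content": system_prompt}]
        self.threshold = threshold  # minimum confidence needed to inject a document

    def display_conversation(self):
        for turn in self.conversation:
            print(f"{turn['role']}: {turn['content']}")

    def user_turn(self, message):
        self.conversation.append({"role": "user", "content": message})

        # Search the knowledge base with every new user message
        best_result = query_semantic_search_api(message)
        if best_result and best_result["score"] >= self.threshold:
            # Inject the confident document directly into the system prompt
            self.conversation[0]["content"] += f"\n\nRelevant document:\n{best_result['text']}"

        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=self.conversation,
        )
        assistant_message = response["choices"][0]["message"]["content"]
        self.conversation.append({"role": "assistant", "content": assistant_message})
        return assistant_message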
Figure 3.14 Asking our bot about information from the BoolQ dataset
yields cohesive and conversational answers whereas when I ask about
Barack Obama’s age (which is information not present in the knowledge
base) the AI politely declines to answer even though that is general
knowledge it would try to use otherwise.
As a part of testing, I decided to try something out of the box and built a
new namespace in the same vector database (Thank you, Pinecone) and I
chunked documents out of a PDF of a Star Wars-themed card game I like. I
wanted to use the chatbot to ask basic questions about the game and let
ChatGPT retrieve portions of the manual to answer my questions. Figure
3.15 was the result!
Figure 3.15 The same architecture and system prompt against a new
knowledge base of a card game manual. Now I can ask questions in the
manual but my questions from BoolQ are no longer in scope.
Not bad at all if I may say so.
Summary
4
Introduction
So far, we've exclusively used LLMs, both open- and closed-source, as they come off the shelf. We were relying on the power of the Transformer's attention mechanisms and their speed of computation to perform pretty complex problems with relative ease. As you can probably guess, that isn't always enough.
In this chapter, we will delve into the world of fine-tuning Large Language Models (LLMs) to unlock their full potential. Fine-tuning updates off-the-shelf models and empowers them to achieve higher-quality results; it can also lead to token savings and often lower-latency requests. While GPT-like LLMs' pre-training on extensive text data enables impressive few-shot learning capabilities, fine-tuning takes things a step further by refining the model on a multitude of examples, resulting in superior performance across various tasks.
Test set: A third collection of labeled examples that is separate from both
the training and validation sets. It is used to evaluate the final performance
of the model after the training and fine-tuning processes are complete. The
test set provides a final, unbiased estimate of the model's ability to
generalize to new, unseen data.
Loss function: The loss function quantifies the difference between the
model's predictions and the actual target values. It serves as a metric of
error to evaluate the model's performance and guide the optimization
process. During training, the goal is to minimize the loss function to
achieve better predictions.
This streamlined process saves time and resources while ensuring the
development of high-quality models capable of generating accurate and
relevant responses in a wide array of applications. We will dive deep into
open-source fine-tuning and the benefits/drawbacks it offers in a later
chapter.
The GPT-3 API offers developers access to one of the most advanced LLMs
available. The API provides a range of fine-tuning capabilities, allowing
users to adapt the model to specific tasks, languages, and domains. This
section will discuss the key features of the GPT-3 fine-tuning API, the
supported methods, and best practices for successfully fine-tuning models.
The GPT-3 fine-tuning API bundles these capabilities into a fairly painless workflow: it supports a range of fine-tuning options and methods for tailoring the model to your specific tasks, languages, or domains. The rest of this section walks through the API's tools and techniques that make it such a valuable resource.
Let’s introduce our first case study. We will be working with the
amazon_reviews_multi dataset (previewed in Figure 4.3). This dataset is a
collection of product reviews from Amazon, spanning multiple product
categories and languages (English, Japanese, German, French, Chinese and
Spanish). Each review in the dataset is accompanied by a rating on a scale
of 1 to 5 stars, with 1 being the lowest and 5 being the highest. The goal of
this case study is to fine-tune a pre-trained model from OpenAI to perform
sentiment classification on these reviews, enabling it to predict the number
of stars given to a review. Taking a page out of my own book (albeit one
just a few pages ago), let's start by taking a look at the data.
Figure 4.3 A snippet of the amazon_reviews_multi dataset shows our input
context (review titles and bodies) and our response - the thing we are trying
to predict - the number of stars the review was for (1-5).
The columns we will care about for this round of fine-tuning are review_title, review_body, and stars.
Our goal will be to use the context of the title and body of the review and
predict the rating that was given.
In general, there are a few items to consider when selecting data for fine-
tuning:
Data quality: Ensure that the data used for fine-tuning is of high quality,
free from noise, and accurately represents the target domain or task. This
will enable the model to learn effectively from the given examples.
Data diversity: Make sure your dataset is diverse, covering a broad range
of scenarios to help the model generalize well across different situations.
Data balancing: Maintaining a balanced distribution of examples across different tasks and domains helps prevent overfitting and biases in the model's performance. Balance can be achieved from unbalanced datasets by undersampling majority classes, oversampling minority classes, or adding synthetic data. Our sentiment classes are perfectly balanced because the dataset was curated that way, but check out an even harder example in our code base where we attempt the very unbalanced category classification task.
Data Quantity: The total amount of data used to fine-tune the model. Generally, larger language models like LLMs require extensive data to capture and learn various patterns effectively, but they need less if the LLM was pre-trained on similar enough data. The exact quantity needed can vary based
on the complexity of the task at hand. Any dataset should not only be
extensive but also diverse and representative of the problem space to avoid
potential biases and ensure robust performance across a wide range of
inputs. While a large quantity of data can help to improve model
performance, it also increases the computational resources required for
model training and fine-tuning. This trade-off needs to be considered in the
context of the specific project requirements and resources.
Before diving into fine-tuning, we need to prepare the data by cleaning and
formatting it according to the API's requirements. This includes the
following:
Splitting the data: Divide the dataset into training, validation, and test
sets, maintaining a random distribution of examples across each set. If
necessary, consider using stratified sampling to ensure that each set contains
a representative proportion of the different sentiment labels, thus preserving
the overall distribution of the dataset.
Shuffle the training data: Shuffling the training data before fine-tuning helps avoid biases in the learning process by ensuring that the model encounters examples in a random order, reducing the risk of learning unintended patterns based on the order of the examples. It also improves model generalization by exposing the model to a more diverse range of instances at each stage of training, which helps prevent overfitting because the model is less likely to memorize the training examples and instead focuses on learning the underlying patterns. Figure 4.5 shows the benefits of shuffling training data. Note that data are ideally shuffled before every single epoch to reduce the chance of the model over-fitting as much as possible.
Creating the OpenAI JSONL format: OpenAI's API expects the
training data to be in JSONL (newline-delimited JSON) format. For each
example in the training and validation sets, create a JSON object with two
fields: “prompt” (the input) and “completion” (the target class). The
“prompt” field should contain the review text, and the “completion” field
should store the corresponding sentiment label (stars). Save these JSON
objects as newline-delimited records in separate files for the training and
validation sets.
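To make the format concrete, a single record in one of these files might look like the following sketch (the exact prompt template and separator are illustrative; note the leading space in the completion, which matters for tokenization, as we will see later):

import json

record = {
    # Review title and body combined into one prompt, ending in a separator
    "prompt": "Great pieces of jewelry for the price\n\nVery happy with this purchase.\n\n###\n\n",
    # The target class: the star rating, with a leading space
    "completion": " 4",
}

# Each record is one JSON object per line of the .jsonl file
with open("sentiment_example.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")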
Figure 4.5 Unshuffled data makes for bad training data! It gives the model
room to overfit on specific batches of data and overall lowers the quality of
the responses. The top three graphs represent a model trained on unshuffled
training data and the accuracy is horrible compared to a model trained on
shuffled data, seen in the bottom three graphs.
The following listing (Listing 4.1) loads the Amazon Reviews dataset and
converts the 'train' subset into a pandas DataFrame. Then, it preprocesses
the DataFrame using the custom prepare_df_for_openai function,
which combines the review title and review body into a prompt, creates a
new completion column, and filters the DataFrame to only include English
reviews. Finally, it removes duplicate rows based on the 'prompt' column
and returns a DataFrame with only the 'prompt' and 'completion' columns.
Listing 4.1 Generating a JSONL file for our sentiment training data
english_training_df = prepare_df_for_openai(training_df)

# export the prompts and completions to a JSONL file (the filename here is illustrative)
english_training_df.to_json(
    "amazon-english-full-train-sentiment.jsonl", orient='records', lines=True
)
The OpenAI Command Line Interface (CLI) simplifies the process of fine-
tuning and interacting with the API. The CLI allows you to submit fine-
tuning requests, monitor training progress, and manage your models, all
from your command line. Ensure that you have the OpenAI CLI installed
and configured with your API key before proceeding with the fine-tuning
process.
To install the OpenAI CLI, you can use pip, the Python package manager. First, make sure you have Python 3.6 or later installed on your system. Then, follow these steps:
a. Run pip install openai from your terminal. This command installs the OpenAI Python package, which includes the CLI.
Before you can use the OpenAI CLI, you need to configure it with your API
key. To do this, set the OPENAI_API_KEY environment variable to your
API key value. You can find your API key in your OpenAI account
dashboard.
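The CLI reads the key from that environment variable directly. If you are also calling the API from Python, as we do throughout this chapter, a minimal sketch of picking up the same variable looks like this:

import os
import openai

# Assumes the OPENAI_API_KEY environment variable has already been set
openai.api_key = os.getenv("OPENAI_API_KEY")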
Learning rate: The learning rate determines the size of the steps the
model takes during optimization. A smaller learning rate leads to slower
convergence but potentially better accuracy, while a larger learning rate
speeds up training but may cause the model to overshoot the optimal
solution.
Batch size: Batch size refers to the number of training examples used in a
single iteration of model updates. A larger batch size can lead to more
stable gradients and faster training, while a smaller batch size may result in
a more accurate model but slower convergence.
OpenAI has done a lot of work to find optimal settings for most cases, so
we will lean on their recommendations for our first attempt. The only thing
we will change is to train for 1 epoch instead of the default 4. We're doing
this because we want to see how the performance looks before investing too
much time and money. Experimenting with different values and using
techniques like grid search will help you find the optimal hyperparameter
settings for your task and dataset, but be mindful that this process can be
time-consuming and costly.
Let’s kick off our first fine-tuning! Listing 4.2 makes a call to OpenAI to
train an ada model (fastest, cheapest, weakest) for 1 epoch on our training
and validation data.
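The full listing lives in the book's repository; as a hedged sketch, a call like this with the (0.x-era) openai Python package would look roughly like the following. The file names are placeholders, and the hyperparameters mirror the choices described above:

import openai

# Upload the prepared JSONL files (placeholder file names)
train_file = openai.File.create(
    file=open("amazon-train.jsonl", "rb"), purpose="fine-tune"
)
valid_file = openai.File.create(
    file=open("amazon-valid.jsonl", "rb"), purpose="fine-tune"
)

# Kick off a fine-tune of the ada base model for a single epoch
fine_tune = openai.FineTune.create(
    training_file=train_file["id"],
    validation_file=valid_file["id"],
    model="ada",
    n_epochs=1,
)
print(fine_tune["id"])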
After one epoch (further metrics shown in Figure 4.6), our classifier is
getting above 63% accuracy on the holdout testing dataset! Remember the
testing subset was not given to OpenAI but rather we held it out for final
model comparisons.
Figure 4.6 Our model is performing pretty well after only one epoch on de-
duplicated shuffled training data
63% accuracy might sound low, but hear me out: predicting the exact number of stars is tricky because people aren't always consistent in what they write versus how they finally rate the product, so I'll offer two more metrics:
Relaxing our accuracy calculation to be binary (did the model predict <= 3 stars when the review was <= 3 stars, and > 3 stars otherwise) yields 92%, so the model can clearly tell the difference between "good" and "bad" reviews.
So you know what? Not bad. Our classifier is definitely learning the
difference between good and bad so the next logical thought might be, “let’s
keep the training going”! We only trained for a single epoch so more epochs
must be better, right?
This process of taking smaller steps in training and updating already fine-tuned models for more training steps/epochs, potentially with new labeled data points, is called incremental learning (also known as continuous or online learning). Incremental learning often results in more controlled learning, which can be ideal when working with smaller datasets or when you want to preserve some of the model's general knowledge. Let's try some incremental learning! We'll take our already fine-tuned ada model, let it run for 3 more epochs on the same data, and see the results in Figure 4.7.
Figure 4.7: The model’s performance seems to barely move during a
further 3 epochs of incremental learning after a successful single epoch. 4x
the cost for 1.02x the performance? No thank you.
Uh oh, more epochs didn’t seem to really do anything, but nothing is set in
stone until we test on our holdout test data subset and compare it to our first
model. Table 4.1 shows our results:
Figure 4.8 The playground and the API for GPT-3 like models (including
our fine-tuned ada model as seen in this figure) offer token probabilities
that we can use to check the model’s confidence on a particular
classification. Note that the main option is " 1" with a leading space, just like we have in our training data, but one of the tokens near the top of the list is "1" with no leading space. These are two separate tokens according to many LLMs, which is why I am calling it out so often; it can be easy to forget and mix them up.
Listing 4.3 Getting token probabilities from the OpenAI API
import math
# Select a random prompt from the test dataset
prompt = english_test_df['prompt'].sample(1).iloc[0]
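# (A hedged sketch of how such a call can look; the fine-tuned model ID below is a
#  placeholder, and openai is assumed to be imported and configured earlier in the file.)
# Ask our fine-tuned model for a one-token completion plus the top 5 token log-probabilities
res = openai.Completion.create(
    model="ada:ft-your-org-2023-placeholder",  # placeholder fine-tuned model ID
    prompt=prompt,
    max_tokens=1,
    temperature=0,
    logprobs=5,
)

# Convert log-probabilities to probabilities and print them, highest first
top_logprobs = res["choices"][0]["logprobs"]["top_logprobs"][0]
probabilities = {token: math.exp(logprob) for token, logprob in top_logprobs.items()}

print("Prompt:\n", prompt)
print("Predicted Star:", res["choices"][0]["text"].strip())
print("Probabilities:")
for token, prob in sorted(probabilities.items(), key=lambda kv: -kv[1]):
    print(f"{token.strip()}: {prob:.4f}")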
Output:
Prompt:
Great pieces of jewelry for the price
###
Predicted Star: 4
Probabilities:
4: 0.9831
5: 0.0165
3: 0.0002
2: 0.0001
1: 0.0001
Between quantitative and qualitative measures, let's assume we believe our model is ready to go into production - or at least a dev or staging environment for further testing. Let's take a minute to talk about how we can incorporate our new model into our applications.
2. Use the OpenAI API normally: Use the OpenAI API to make requests to your fine-tuned model. When making requests, replace the base model's name with your fine-tuned model's unique identifier. Listing 4.3 offers an example of doing this.
Summary
We will revisit the idea of fine-tuning in later chapters with some more
complicated examples while also exploring the fine-tuning strategies for
open-source models for even further cost reductions.
Part II: Getting the most out of LLMs
5
Introduction
Let’s begin our journey into advanced prompt engineering with a look at
how people might take advantage of the prompts we work so hard on.
As more people start to use LLMs like ChatGPT and GPT-3 in production,
well-engineered prompts will become considered part of a company’s
proprietary information. Perhaps your bot becomes very popular and
someone decides they want to steal your idea. Using prompt injection, they
may have a shot. If an attacker tweets the following at the bot:
A simple prompt injection attack tricking the LLM into revealing the original prompt, which can then be exploited and copied in a competing application
There are different ways to phrase this attack text but the above method is
on the simpler side. Using this method of prompt injection, one could
potentially steal the prompt of a popular application using a popular LLM
and create a clone with near identical quality of responses. There are
already websites out there that document prompts that popular companies
use (which we won’t link to out of respect) so this issue is already on the
rise.
Avoiding prompts that are extremely short as they are more likely to be
exploited. The longer the prompt, the more difficult it is to reveal.
Using unique and complex prompt structures that are less likely to be
guessed by attackers. This might include incorporating specific domain
knowledge.
Employing input/output validation techniques to filter out potential attack
patterns before they reach the LLM and filtering out responses that contain
sensitive information with a post-processing step (more on this in the next
section).
Input/Output Validation
When working with LLMs, it is important to ensure that the input you
provide is clean and free of errors (both grammatical and factual) or
malicious content. This is especially important if you are working with
user-generated content, such as text from social media, transcripts, or online
forums. To protect your LLMs and ensure accurate results, it is a good idea
to implement input sanitization and data validation processes to filter out
any potentially harmful content.
For example, consider a scenario where you are using an LLM to generate
responses to customer inquiries on your website. If you allow users to enter
their own questions or comments directly into a prompt, it is important to
sanitize the input to remove any potentially harmful or offensive content.
This can include things like profanity, personal information, or spam, or
keywords that might indicate a prompt injection attack. Some companies, like OpenAI, offer a moderation service (free in OpenAI's case!) to help monitor for harmful/offensive text. If we can catch that kind of text before it reaches the LLM, we can handle the error more appropriately and avoid wasting tokens and money on garbage input.
In the above figure, the first prompt demonstrates how an LLM can be
instructed to hide sensitive information. However, the second prompt
indicates a potential security vulnerability via injection as the LLM happily
divulges private information if told to ignore previous instructions. It is
important to consider these types of scenarios when designing prompts for
LLMs and implement appropriate safeguards to protect against potential
vulnerabilities.
Example—Using MNLI to Build Validation Pipelines
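The outputs below came from a zero-shot classification pipeline built on an MNLI-trained model. A minimal sketch of that kind of pipeline (the checkpoint and candidate labels here are illustrative) looks like this:

from transformers import pipeline

# A BART model fine-tuned on MNLI, used as an off-the-shelf zero-shot classifier
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

candidate_labels = ["offensive", "safe"]

# Classify a candidate LLM output before showing it to the user
result = classifier(
    "Unfortunately, I cannot help you with that request.",
    candidate_labels,
    multi_label=True,  # score each label independently
)
print(result)  # {'sequence': ..., 'labels': [...], 'scores': [...]}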
'''
{'sequence': ' Unfortunately, I cannot help you w
'labels': ['offensive', 'safe'],
'scores': [0.9724587202072144, 0.005793550983071
'''
'''
{'sequence': " What do you mean you can't access
'labels': ['offensive', 'safe'],
'scores': [0.7064529657363892, 0.000636537268292
'''
'''
{'sequence': ' Absolutely! I can help you get int
'labels': ['safe', 'offensive'],
'scores': [0.36239179968833923, 0.02562042325735
'''
We can see that the confidence levels probably aren't exactly what we might expect. We would want to adjust the labels to be more robust for scalability, but this gives us a great start using an off-the-shelf LLM.
If we are thinking of post-processing outputs, which would add time to our overall latency, we might also want to consider some methods to make our LLM predictions more efficient.
Batch Prompting
Prompt Chaining
Prompt chaining involves using one LLM output as the input to another
LLM in order to complete a more complex or multi-step task. This can be a
powerful way to leverage the capabilities of multiple LLMs and achieve
results that would not be possible with a single model.
For example, suppose you want a generalized LLM to write an email back to someone indicating interest in working with them (as shown in Figure 5.5). Our prompt may be as simple as asking an LLM to write an email back, like so:
Figure 5.5 A simple prompt with a clear instruction to respond to an email
with interest. The incoming email has some pretty clear indicators of how
Charles is feeling that the LLM seems to not be taking into account.
This simple and direct prompt to write an email back to a person indicating
interest outputted a generically good email while being kind and
considerate. We could call this a success but perhaps we can do better.
Figure 5.6 A two-prompt chain where the first call to the LLM asks the model to describe the email sender's emotional state and the second call takes in the whole context from the first call and asks the LLM to respond to the email with interest. The resulting email is more attuned to Charles's emotional state.
By chaining together the first prompt's output as the input to a second call with additional instructions, we can encourage the LLM to write more effective and accurate content by forcing it to think about the task in multiple steps (a quick code sketch follows this list):
1. The first call to the LLM is asked to determine how the person is feeling, acknowledging the frustration that Charles expressed in his email.
2. The second call to the LLM asks for the response, but now has insight into how the other person is feeling and can write a more empathetic and appropriate response.
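A minimal sketch of this two-call chain with the chat API (the prompt wording is illustrative and the incoming email is a placeholder) might look like this:

import openai

incoming_email = "..."  # placeholder for the incoming email from Charles

def chat(prompt):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]

# Call 1: determine how the sender is feeling
emotion = chat(f"How is the author of this email feeling?\n\nEmail:\n{incoming_email}")

# Call 2: write the reply, now grounded in that emotional context
reply = chat(
    f"The author of the email below is feeling: {emotion}\n\n"
    f"Write an email back indicating interest in working with them and "
    f"acknowledging how they feel.\n\nEmail:\n{incoming_email}"
)
print(reply)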
Specialization: Each LLM in the chain can focus on its area of expertise,
allowing for more accurate and relevant results in the overall solution.
Flexibility: The modular nature of chaining allows for the easy addition,
removal, or replacement of LLMs in the chain to adapt the system to new
tasks or requirements.
Task Decomposition: We should break down the complex task into more
manageable subtasks that can be addressed by individual LLMs.
The original prompt sees the attack input text and outputs the prompt, which would be unfortunate; however, the second call to the LLM generates the output seen by the user, which no longer contains the original prompt.
You can also use output sanitization to ensure that your LLM outputs are
free from injection attacks. For example, you can use regular expressions or
other validation criteria like the Levenshtein distance or some semantic
model to check that the output of the model is not too similar to the prompt
and block any output that does not conform to that criteria from reaching
the end-user.
Prompt stuffing occurs when a user provides too much information in their
prompt, leading to confusing or irrelevant outputs from the LLM. This often
happens when the user tries to anticipate every possible scenario and
includes multiple tasks or examples in the prompt, which can overwhelm
the LLM and lead to inaccurate results.
Let’s say we want to use GPT to help us draft a marketing plan for a new
product (Figure 5.8). We would want our marketing plan to include specific
information like budget and timeline. Let’s further suppose that not only do
we want a marketing plan, we want advice on how to approach higher ups
with the plan and account for potential pushback. If we wanted to address
all of this in a single prompt, it may look something like Figure 5.8.
This prompt has at least a dozen different tasks for the LLM, ranging from writing an entire marketing plan to outlining potential concerns from key stakeholders. This is likely too much for the LLM to do in one shot. A few of those tasks include:
Research and cite relevant industry statistics and trends to support the plan
Outline key people in an organization who will need to sign off on the
plan
When I ran this prompt through GPT-3's Playground a few times (with default parameters except for the max length, to allow for a longer-form piece of content), I saw many problems. The main problem was that the model usually refused to generate anything beyond the marketing plan - which itself often didn't even include all of the items I requested. The LLM often would not list the key people, let alone their concerns and how to address them. The plan itself was usually over 600 words, so it couldn't even follow that basic instruction.
That’s not to say the marketing plan itself wasn’t alright. It was a bit generic
but it hit most of the key points we asked it to. The problem was that when
we ask too much of an LLM, it often simply starts to select which tasks to
solve and ignores the others.
In extreme cases, prompt stuffing can arise when a user fills the LLM’s
input token limit with too much information in the hopes that the LLM will
simply “figure it out” which can lead to incorrect or incomplete responses
or hallucinations of facts. An example of reaching the token limit would be
if we want an LLM to output a SQL statement to query a database given the
database’s structure and a natural language query, that could quickly reach
the input limit if we had a huge database with many tables and fields.
There are a few ways to try and avoid the problem of prompt stuffing. First
and foremost, it is important to be concise and specific in the prompt and
only include the necessary information for the LLM. This allows the LLM
to focus on the specific task at hand and produce more accurate results that
address all the points you want it to. Additionally we can implement
chaining to break up the multi-task workflow into multiple prompts (as
shown in Figure 5.9). We could for example have one prompt to generate
the marketing plan, and then use that plan as input to ask the LLM to
identify key people, and so on.
Figure 5.9 A potential workflow of chained prompts would have one prompt
generate the plan, another generate the stakeholders, and a final prompt to
create ways to address those concerns.
Prompt stuffing can also negatively impact the performance and efficiency
of GPT, as the model may take longer to process a cluttered or overly
complex prompt and generate an output. By providing concise and well-
structured prompts, you can help GPT perform more effectively and
efficiently.
Now that we have explored the dangers of prompt stuffing and how to
avoid it, let's turn our attention to an important security and privacy topic:
prompt injection.
Figure 5.10 visualizes this example. The full code for this example can be
found in our code repository.
Figure 5.10 Our multimodal prompt chain - starting with a user in the top
left submitting an image - uses 4 LLMs (3 open source and Cohere) to take
in an image, caption it, categorize it, come up with follow up questions, and
answer them with a given confidence.
Speaking of chains, let’s look at one of the most useful advancements in
prompting to date—chain of thought.
Example—Basic Arithmetic
More recent LLMs like ChatGPT and GPT-4 are more likely than their
predecessors to output chains of thought even without being prompted to.
Figure 5.11 shows the same exact prompt in GPT-3 and ChatGPT.
Figure 5.11 (Top) A basic arithmetic question with multiple choice proves to be too difficult for DaVinci. (Middle) When we ask DaVinci to first think about the question by adding "Reason through step by step" at the end of the prompt, we are using a "chain of thought" prompt and it gets it right! (Bottom) ChatGPT and GPT-4 don't need to be told to reason through the problem because they are already aligned to think through the chain of thought.
Let's revisit the concept of few-shot learning, the technique that allows large language models to quickly adapt to new tasks with minimal training data. We saw examples of few-shot learning in chapter 3. As the technology of Transformer-based LLMs continues to advance and more people adopt it into their architectures, few-shot learning has emerged as a crucial methodology for getting the most out of these state-of-the-art models, enabling them to efficiently learn and perform a wider array of tasks than the LLM originally promises.
I want to take a step deeper with few-shot learning to see if we can improve
an LLM's performance in a particularly challenging domain: math!
Note how the dataset includes << >> markers for equations, just like how ChatGPT and GPT-4 do it. This is because those models were in part trained on similar datasets with similar notation.
So that means they should be good at this problem already, right? Well
that’s the point of this example. Let’s assume our goal is to try and make an
LLM as good as possible at this task and let’s begin with the most basic
prompt I can think of, just asking an LLM to solve it.
Figure 5.14 gives us a baseline accuracy (defined by the model giving the exactly correct answer) for our prompt baseline - just asking with clear instruction and formatting - across four LLMs:
ChatGPT (gpt-3.5-turbo)
DaVinci (text-davinci-003)
Cohere (command-xlarge-nightly)
FLAN-T5 (open-source)
Figure 5.14 Just asking our four models a sample of our arithmetic
questions in the format displayed in Figure 5.13 gives us a baseline to
improve upon. ChatGPT seems to be the best at this task (not surprising)
Just Ask with no chain of thought - The baseline prompt we tested in the
previous section where we have a clear instruction set and formatting.
Just Ask with chain of thought - effectively the same prompt but also
giving the LLM room to reason out the answer first.
Listing 5.2 Load up the GSM8K dataset and define our first two prompts
'''
Janet’s ducks lay 16 eggs per day. She eats three ...
'''
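As a sketch, loading the dataset with the Hugging Face datasets library looks roughly like this (the prompt templates themselves are defined as described in the text):

from datasets import load_dataset

# GSM8K: grade-school math word problems with worked (chain-of-thought) solutions
gsm8k = load_dataset("gsm8k", "main")

example = gsm8k["train"][0]
print(example["question"])  # the word problem
print(example["answer"])    # step-by-step reasoning ending in "#### <final answer>"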
Our new prompt (visualized in Figure 5.15) asks the LLM to reason through the answer before giving the final answer. Testing this variant against our baseline will reveal the answer to our first big question: do we want to include a chain of thought in our prompt? The answer might be "obviously yes we do," but it's worth testing, mainly because including a chain of thought means including more tokens in our context window, which, as we have seen time and time again, means more money. If chain of thought does not deliver significant results, it may not be worth including at all.
Figure 5.15 Our first prompt variant expands on our baseline prompt
simply by giving the LLM space to reason out the answer first. ChatGPT is
getting the answer right now for this example.
Listing 5.3 shows an example of running these prompts through our testing
dataset. For a full run of all of our prompts, check out this book’s code
repository.
Listing 5.3 Running through a test set with our prompt variants
# Define a function to format k-shot examples for our GSM prompts
def format_k_shot_gsm(examples, cot=True):
    if cot:
        # If cot=True, include the reasoning in the formatted examples
        return '\n###\n'.join(
            [f'Question: {e["question"]}\nReasoning: {e["reasoning"]}\nAnswer: {e["answer"]}'
             for e in examples]  # note: the "reasoning" and "answer" keys are assumed here
        )
    else:
        # If cot=False, exclude the reasoning from the formatted examples
        return '\n###\n'.join(
            [f'Question: {e["question"]}\nAnswer: {e["answer"]}'
             for e in examples]
        )
--------------
--------------
# BEGIN ITERATING OVER GSM TEST SET
Our first results are shown in Figure 5.16, where we compare the accuracy
of our first two prompt choices between our four LLMs.
Figure 5.16 Asking the LLM to produce a chain of thought (the right bars)
already gives us a huge boost in all of our models compared to no chain of
thought (the left bars)
Random 3-shot - Taking a random set of 3 examples from the training set
with chain of thought included in the example to help the LLM understand
how to reason through the problem.
Figure 5.17 shows both an example of our new prompt variant and how that variant performed against our test set. The results seem clear: including these random examples plus CoT looks really promising. This seems to answer our question:
Figure 5.17 Including random 3-shot examples (example shown above)
from the training set seems to improve the LLM even more (graph below).
Note that “Just Ask (with CoT)” is the same performance as the last section
and “Random K=3” is our net new results. This can be thought of as a “0-
shot” approach vs a “3-shot” approach because the real difference between
the two is in the number of examples we are giving the LLM.
Amazing, we are making progress. Let's ask just two more questions. Listing 5.4 shows how we can prototype this by encoding all training examples of GSM8K. We can then use these embeddings to include only semantically similar examples in our few-shot prompt.
Listing 5.4 Encoding the questions in the GSM8K training set to retrieve
dynamically
from sentence_transformers import SentenceTransformer
from random import sample
from sentence_transformers import util
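# (A hedged sketch of the retrieval step: encode every training question once, then pull the
#  k most semantically similar solved examples for each test question. The checkpoint and the
#  gsm8k_train variable are assumptions, e.g. load_dataset("gsm8k", "main")["train"].)
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Pre-compute embeddings for every training question
train_questions = [e["question"] for e in gsm8k_train]
train_embeddings = model.encode(train_questions, convert_to_tensor=True)

def closest_k_examples(test_question, k=3):
    query_embedding = model.encode(test_question, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, train_embeddings, top_k=k)[0]
    return [gsm8k_train[hit["corpus_id"]] for hit in hits]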
Figure 5.18 shows what this new prompt would look like.
Figure 5.18 This third variant selects the most semantically similar
examples from the training set. We can see that our examples are also about
easter egg hunting.
Figure 5.19 shows the performance of this third variant against our best-performing variant so far (random 3-shot with chain of thought [CoT]). The graph also includes a third section for semantically similar examples without CoT, to further convince us that CoT is helpful no matter what.
Figure 5.19 Including semantically similar examples (denoted by
“closest”) gives us yet another boost! Note the first set of bars has
semantically similar examples but no CoT and it performs worse so CoT is
still crucial here!
Things are looking good, but let me ask one more question to really be
rigorous.
The more examples we include, the more tokens we need but in theory, the
more context we give the model. Let’s test a few options for K assuming we
still need chain of thought. Figure 5.20 shows performance of 4 values of
K.
Figure 5.20 A single example seems to not be enough, and 5 or more
actually shows a hit in performance for OpenAI. 3 examples seems to be the
sweet spot for OpenAI. Interestingly, the Cohere model is only getting better
with more examples, which could be an area of further iteration.
We have tried many variants (visualized in Figure 5.21) and the following
table (Table 5.1) summarizes our results:
Figure 5.21 Performance of all variants we attempted
Table 5.1 Our final results on prompt engineering to solve the GSM task
(numbers are accuracy on our sample test set) Bolded numbers represent
the best accuracy for that model.
We can see some pretty drastic differences in results depending on our level of prompt engineering effort. As for the poor performance of our open-source model FLAN-T5, we will revisit this problem in a later chapter when we attempt to fine-tune open-source models on this dataset to try to compete with OpenAI's models.
As we did in our last example, when designing effective and consistent prompts for LLMs, you will most likely need to try many variations and iterations of similar prompts to find the best one possible. There are a few key best practices to keep in mind that will make this process faster and easier, help you get the most out of your LLM outputs, and ensure that you are creating reliable, consistent, and accurate outputs.
It is important to test your prompts and prompt versions and see how they
perform in practice. This will allow you to identify any issues or problems
with your prompts and make adjustments as needed. This can come in the
form of “unit tests” where you have a set of expected inputs and outputs
that the model should adhere to. Anytime the prompt changes, even if it is
just a single word, running the prompt against these tests will help you be
confident that your new prompt version is working properly. Through
testing and iteration, you can continuously improve your prompts and get
better and better results from your LLMs.
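For example, a lightweight version of such a prompt "unit test" (the expected inputs/outputs and the call_llm helper below are hypothetical) might look like this sketch:

# A minimal prompt regression test; the cases and the call_llm helper are hypothetical
test_cases = [
    {"input": "I loved this movie!", "expected": "positive"},
    {"input": "Terrible. Would not recommend.", "expected": "negative"},
]

def run_prompt_tests(prompt_template, call_llm):
    """Re-run every expected input/output pair against the current prompt version."""
    failures = []
    for case in test_cases:
        output = call_llm(prompt_template.format(text=case["input"])).strip().lower()
        if case["expected"] not in output:
            failures.append((case, output))
    return failures

# Run this every time the prompt changes, even by a single word:
# failures = run_prompt_tests("Classify the sentiment of: {text}\nSentiment:", call_llm=my_llm)
# assert not failures, failures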
Conclusion
Advanced prompting techniques can enhance the capabilities of LLMs
while being both challenging and rewarding. We saw how dynamic few-
shot learning, chain of thought prompting, and multimodal LLMs can
broaden the scope of tasks that we want to tackle effectively. We also dug
into how implementing security measures, such as using MNLI as an off-the-shelf output validator or using chaining to protect against injection attacks, can help address the responsible use of LLMs.
Happy Prompting!
6
Introduction
To further broaden our horizons, we will utilize lessons learned from earlier
chapters and dive into the world of fine-tuning embedding models and
customizing pre-trained LLM architectures to unlock even greater potential
in our LLM implementations. By refining the very foundations of these
models, we can cater to specific business use cases and foster improved
performance.
The majority of this chapter will explore the role of embeddings and model
architectures in designing a recommendation engine using a real-world
dataset as our case study. Our objective is to highlight the importance of
customizing embeddings and model architectures in achieving better
performance and results tailored to specific use cases.
Recommendation systems need to take into account both user features and
item features to generate personalized suggestions. User features can
include demographic information like age, browsing history, and past item
interactions (which will be the focus of our work in this chapter), while
item features can encompass characteristics like genre, price, or popularity.
However, these factors alone may not paint the complete picture, as human
mood and context also play a significant role in shaping preferences. For
instance, a user's interest in a particular item might change depending on
their current emotional state or the time of day.
On the other hand, collaborative filtering can be further divided into user-
based and item-based approaches. User-based collaborative filtering finds
users with similar preferences and recommends items that those users have
liked or interacted with. Item-based collaborative filtering, instead, focuses
on finding items that are similar to those the user has previously liked,
based on the interactions of other users. In both cases, the underlying
principle is to leverage the wisdom of the crowd to make personalized
recommendations.
In our case study, we plan to fine-tune a bi-encoder (like the one we saw in
chapter 2) to generate embeddings for anime features. Our goal is to
minimize the cosine similarity loss in such a way that the similarity
between embeddings reflects how common it is that users like both animes.
To be clear about how this combines collaborative filtering with semantic similarity: the generated anime descriptions are the semantic input to the bi-encoder, while the label for each pair of animes comes from a collaborative filtering signal - how often users who promote one anime also promote the other. The embedder is trained so that its cosine similarities approximate those co-liking labels.
4. Run our system on a testing set of user preference data to decide which
embedder was responsible for the best anime title recommendations
2. Identify highly-rated animes: For each anime title that the user has
rated as a 9 or 10 (a promoting score on the NPS scale), identify k other
relevant animes by finding nearest matches in the anime’s embedding
space. From these, we consider both how often an anime was recommended
and how high the resulting cosine score was in the embedding space to take
the top k results for the user. Figure 6.3 outlines this process. The pseudo
code would look like:
Figure 6.3 Step 2 takes in the user and finds k animes for each anime the user promoted (gave a score of 9 or 10). For example, if the user promoted 4 animes (6345, 4245, 249, 120) and we set k=3, the system will first retrieve 12 semantically similar animes (3 per promoted anime, with duplicates allowed) and then de-duplicate any animes that came up multiple times by weighting them slightly more than the original cosine scores alone. We then take the top k unique recommended anime titles, considering both the cosine scores to promoted animes and how often they occurred in the original list of 12.
given: user, k=3
promoted_animes = all anime titles that the user rated 9 or 10
relevant_animes = []
for each promoted_anime in promoted_animes:
    add to relevant_animes the k animes with the highest cosine similarity to promoted_anime, along with their cosine scores
The github has the full code to run this step, with examples too! For example, given k=3 and user id 205282, the result of step 2 would be the following dictionary, where each key represents a different embedding model and the values are anime title IDs with their corresponding cosine similarity scores to promoted titles the user liked:
final_relevant_animes = {
'text-embedding-ada-002': { '6351': 0.921, '172
'paraphrase-distilroberta-base-v1': { '17835':
}
3. Score relevant animes: For each of the relevant animes identified in the
previous step, if the anime is not present in the testing set for that user,
ignore it. If we have a user rating for the anime in the testing set, we assign
a score to the recommended anime given the NPS-inspired rules:
If the rating in the testing set for the user and the recommended anime was
9 or 10, the anime is considered a “Promoter” and the system receives +1
points.
Overall, this approach of creating our own custom description field from several individual fields should ultimately result in a recommendation engine that delivers more accurate and relevant content suggestions. Listing 6.2 provides a snippet of the code used to generate these descriptions.
Listing 6.2 Generating custom descriptions from multiple anime fields
import re
import string

def clean_text(text):
    # Remove non-printable characters
    text = ''.join(filter(lambda x: x in string.printable, text))
    # Replace multiple whitespace characters with a single space
    text = re.sub(r'\s{2,}', ' ', text).strip()
    return text.strip()

def get_anime_description(anime_row):
    """
    Generates a custom description for an anime title from several of its fields.
    """
    ...
    description = (
        f"{anime_row['Name']} is a {anime_type}.\n"
        ...  # NOTE: I am omitting over a dozen other rows here for brevity
        f"Its genres are {anime_row['Genres']}\n"
    )
    return clean_text(description)
To calculate the Jaccard similarity, we first find the people who like both
Anime A and Anime B. In this case, it's Bob and Carol.
Next, we find the total number of distinct people who like either Anime A
or Anime B. Here, we have Alice, Bob, Carol, David, Ethan, and Frank.
So, the Jaccard similarity between Anime A and Anime B, based on the people who like them, is 2/6, or about 0.33 (33%). This means that 33% of the distinct people who like either show have similar tastes in anime, as they enjoy both Anime A and Anime B. Figure 6.6 shows another example.
Figure 6.6 To convert our raw ratings into pairs of animes with associated
scores, we will consider every pair of anime titles and compute the jaccard
score between promoting users.
We will apply this logic to calculate the Jaccard similarity for every pair of
animes using a training set of the ratings DataFrame and only keep scores
above a certain threshold as “positive examples” (label of 1) and the rest
will be considered “negative” (label of 0).
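As a quick sketch, the toy calculation from the example above looks like this in code (the split of who likes which show is one assignment consistent with that example, and the threshold value is an assumption):

# Toy example: promoters of each anime (one split consistent with the example above)
anime_a_promoters = {"Alice", "Bob", "Carol"}
anime_b_promoters = {"Bob", "Carol", "David", "Ethan", "Frank"}

intersection = anime_a_promoters & anime_b_promoters  # {'Bob', 'Carol'}
union = anime_a_promoters | anime_b_promoters          # 6 distinct people

jaccard_similarity = len(intersection) / len(union)    # 2 / 6 ≈ 0.33
print(round(jaccard_similarity, 2))                    # 0.33

# Label for the bi-encoder: 1 if the score clears our chosen threshold, else 0
threshold = 0.3  # assumed threshold for illustration
label = 1 if jaccard_similarity >= threshold else 0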
Important Note: We are free to label any anime pair with a value between -1 and 1, but I am only using 0 and 1 here because I'm only using promoting scores to create my data, so it isn't fair to say that a low Jaccard score between two animes means users totally disagree on them. That's not necessarily true! If I expanded this case study, I would want to explicitly label a pair of animes as -1 if and only if users genuinely rated them in opposite ways (most users who promote one are detractors of the other).
Once we have Jaccard scores for anime IDs, we will need to convert them into tuples of anime descriptions and the cosine label (in our case, either 0 or 1), and then we are ready to update our open-source embedders and experiment with different token windows (shown in Figure 6.7).
Figure 6.7 Jaccard scores are converted into cosine labels and then fed into
our bi-encoder so it may attempt to learn patterns between the generated
anime descriptions and how users co-like the titles.
The histogram of token lengths shows that 384 would capture most of our animes in their entirety and would truncate the rest.
384 = 256 + 128, the sum of two powers of 2, and we like powers of 2.
Modern hardware components, especially GPUs, are designed to perform optimally with sizes that are powers of 2 (or sums of them), so they can split up workloads evenly.
Why not 512 then to capture more training data? I still want to be
conservative here. The more I increase this max token window size, the
more data I would need to train the system because we are adding
parameters to our model and therefore there is more to learn. It will also
take more time and compute resources to load, run, and update the larger
model.
• For what it’s worth, I did initially try this process with an embedding size
of 512 and got worse results while taking about 20% longer on my
machine.
We gauge how well the model learned by checking the change in the cosine similarity, which jumped up into the high 0.8s and 0.9s. That's great!
...
batch_size=16,
shuffle=True
)
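For reference, a minimal sketch of this kind of bi-encoder fine-tuning with the sentence-transformers library (the checkpoint, number of epochs, warmup steps, and the training_pairs variable below are assumptions) looks like this:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Pairs of generated anime descriptions with their cosine labels (0 or 1)
train_examples = [
    InputExample(texts=[desc_a, desc_b], label=float(label))
    for desc_a, desc_b, label in training_pairs  # assumed list built from the Jaccard step
]

model = SentenceTransformer("sentence-transformers/paraphrase-distilroberta-base-v1")
model.max_seq_length = 384  # the token window discussed above

train_dataloader = DataLoader(train_examples, batch_size=16, shuffle=True)
train_loss = losses.CosineSimilarityLoss(model)

# Fine-tune so that the description embeddings of co-liked animes end up close together
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)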
With our fine-tuned bi-encoder, we can generate embeddings for new anime
descriptions and compare them with the embeddings of our existing anime
database. By calculating the cosine similarity between the embeddings, we
can recommend animes that are most similar to the user's preferences.
It’s worth noting that once we go through the process of fine-tuning a single
custom embedder using our user preference data, we can then pretty easily
swap out different models with similar architectures and run the same code,
rapidly expanding our universe of embedder options. For this case study, I also fine-tuned another LLM called all-mpnet-base-v2, which (at the time of writing) is regarded as a very good open-source embedder for semantic search and clustering purposes. It is a bi-encoder as well, so we can simply swap out references to our RoBERTa model with mpnet and change virtually no code (see the GitHub repository for the complete case study).
Summary of Results
Figure 6.9 shows our final results for our four embedder candidates across
lengthening recommendation windows (how many recommendations we
show the user).
Figure 6.9 Our larger open-source model (anime_encoder_bigger)
consistently outperforms OpenAI’s embedder in recommending anime titles
to our users based on historical preferences.
Each tick on the x-axis represents showing the user a list of that many anime titles, and the y-axis is an aggregated score for the embedder using the scoring system outlined before. That system further rewards the model when a correct recommendation is placed closer to the front of the list, and likewise punishes it more when something the user is a detractor of appears near the beginning of the list.
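The function below is a minimal illustration of this kind of position-weighted scoring; the 1/rank weighting is an assumption chosen for clarity, not necessarily the exact weighting used in the case study.

def score_recommendations(recommended_ids, promoted_ids, detracted_ids):
    """Position-weighted score: early correct hits earn more; early disliked titles cost more."""
    total = 0.0
    for rank, anime_id in enumerate(recommended_ids, start=1):
        position_weight = 1.0 / rank  # items near the front of the list count the most
        if anime_id in promoted_ids:
            total += position_weight
        elif anime_id in detracted_ids:
            total -= position_weight
    return total

# Example: a hit in first place outweighs a disliked title in third place (1.0 - 0.33 ≈ 0.67)
print(score_recommendations([101, 202, 303], promoted_ids={101}, detracted_ids={303}))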
A few observations about these results:
• The model might require more than 384 tokens to capture all possible relationships.
• All models start to degrade in performance as they are expected to recommend more and more titles, which is fair: the more titles any system recommends, the less confident it becomes as it goes further down the list.
Exploring Exploration
• Design new metrics for quality assurance to test our embedders on a more holistic scale.
• Build new training datasets that use other metrics, like correlation coefficients, instead of Jaccard similarity.
Honestly, that last one is a bit pie in the sky and would really work best if we could also combine it with some chain-of-thought prompting on a different LLM. Still, this is a big question, and sometimes that means we need big ideas and big answers. So I leave it to you now; go have big ideas!
Conclusion
Introduction
Figure 7.1 A Visual Question and Answering (VQA) system generally takes in two modalities (types) of data, image and text, and returns a human-readable answer to the question. This image outlines one of the most basic approaches to this problem, where the image and text are encoded by separate encoders and a final layer predicts a single word as an answer.
DistilBERT is a distilled version of the popular BERT model that has been
optimized for speed and memory efficiency. It is a pre-trained model that
uses knowledge distillation to transfer knowledge from the larger BERT
model to a smaller and more efficient one. This allows it to run faster and
consume less memory while still retaining much of the performance of the
larger model.
DistilBERT should have prior knowledge of language that will help during
training, thanks to transfer learning. This allows it to understand natural
language text with high accuracy.
The Vision Transformer (ViT) has been pre-trained, much like BERT, on a dataset of images known as ImageNet, and therefore should hold prior knowledge of image structures that should help during training as well. This allows it to understand and extract relevant features from images with high accuracy.
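As a quick sketch of loading the two encoders (these checkpoints are common public defaults, not necessarily the exact ones used in the case study):

from transformers import AutoModel, AutoTokenizer, ViTModel

# Text encoder: DistilBERT for the question
text_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
text_encoder = AutoModel.from_pretrained("distilbert-base-uncased")

# Image encoder: ViT pre-trained on ImageNet-21k
image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")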
We should note that when we use ViT, we should try to use the same image preprocessing steps that it used during pre-training, so that the model has an easier time learning the new image sets. This is not strictly necessary, and it has its pros and cons:
1. Consistency with pre-training: Reusing the original preprocessing keeps new images in the format and value ranges that the model already expects, which generally makes it easier for the model to transfer what it learned.
2. Incompatibility with new data: In some cases, the new dataset may have unique properties or structures that are not well suited to the original preprocessing steps, which could lead to suboptimal performance if the preprocessing steps are not adapted accordingly.
We will re-use the ViT image preprocessor for now. Figure 7.2 shows a sample of an image before preprocessing and the same image after it has gone through ViT’s standard preprocessing steps.
Figure 7.2 Image systems like the Vision Transformer (ViT) generally have to standardize images to a set format with pre-defined normalization steps so that each image is processed as fairly and consistently as possible. For some images (like the downed tree in the top row), the preprocessing takes away context for the sake of standardization across all images.
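As a brief example of reusing ViT’s preprocessing via the transformers library (the checkpoint and image path are placeholders; older versions of the library call this class ViTFeatureExtractor):

from PIL import Image
from transformers import ViTImageProcessor

# Load the same resize / center-crop / normalize pipeline that ViT was pre-trained with
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
image = Image.open("example.jpg")  # hypothetical image file
pixel_values = processor(images=image, return_tensors="pt").pixel_values
print(pixel_values.shape)  # typically torch.Size([1, 3, 224, 224])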
When we feed our text and image inputs into their respective models
(DistilBERT and Vision Transformer), they produce output tensors that
contain useful feature representations of the inputs. However, these features
are not necessarily in the same format, and they may have different
dimensionalities.
To address this, we use linear projection layers to project the output tensors
of the text and image models onto a shared dimensional space. This allows
us to fuse the features extracted from the text and image inputs effectively.
The shared dimensional space makes it possible to combine the text and
image features (by averaging them in our case) and feed them into the
decoder (GPT-2) to generate a coherent and relevant textual response.
But how will GPT-2 accept these inputs from the encoding models? The answer is a type of attention mechanism known as cross-attention. The three internal components of attention - query, key, and value - haven’t really come up before in this book because, frankly, we haven’t needed to understand why they exist; we simply relied on their ability to learn patterns in our data for us. Now it’s time to take a closer look at how these components interact, so we can fully understand how cross-attention works.
Figure 7.5 Our VQA system needs to fuse the encoded knowledge from the image and text encoders and pass that fusion to the GPT-2 model via the cross-attention mechanism, which takes the fused key and value vectors (see Figure 7.4 for more on that) from the encoders and lets our decoder, GPT-2, use them to scale its own attention calculations.
# The hidden state sizes of our three models (DistilBERT, ViT, and GPT-2) are all the same:
# 768
# 768
# 768
In our case, all models have the same hidden state size, so in theory we don’t need to project anything, but it is still good practice to include projection layers so that the model has a trainable layer that translates our text/image representations into something more meaningful for the decoder.
Before getting deeper into the code, I should note that not all of the code that powers this example is in these pages, but all of it lives in the notebooks on the GitHub repository. I highly recommend following along using both!
When creating a novel PyTorch Module (which we are doing), the main methods we need to define are the constructor (__init__), which will instantiate our three Transformer models and potentially freeze layers to speed up training (more on that in the next chapter), and the forward method, which will take in inputs and potentially labels to generate an output and a loss value (remember, loss is the same as error: the lower, the better). The forward method will take the following as inputs:
input_ids: A tensor containing the input IDs for the text tokens. These IDs
are generated by the tokenizer based on the input text. The shape of the
tensor is [batch_size, sequence_length].
labels: A tensor containing the ground truth labels for the target text. The
shape of the tensor is [batch_size, target_sequence_length]. These labels are
used to compute the loss during training but won’t exist at inference time
because if we had the labels then we wouldn’t need this model!
Listing 7.2 shows a snippet of the code it takes to create a custom model from our three separate Transformer-based models (DistilBERT, ViT, and GPT-2). The full class can, of course, be found in the repository for your copying and pasting needs.
...
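Because Listing 7.2 is abridged here, the following is only a rough sketch of the kind of module it describes; the class name, fusion strategy, and other details are illustrative simplifications rather than the book’s actual code.

import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel

class SimpleVQAModel(nn.Module):
    """Illustrative multimodal module: text and image encoders fused into a GPT-2 decoder."""

    def __init__(self, text_encoder, image_encoder, decoder_checkpoint="gpt2", hidden_size=768):
        super().__init__()
        self.text_encoder = text_encoder    # e.g., the DistilBERT model loaded earlier
        self.image_encoder = image_encoder  # e.g., the ViT model loaded earlier
        # Trainable projections into a shared space the decoder can consume
        self.text_projection = nn.Linear(text_encoder.config.hidden_size, hidden_size)
        self.image_projection = nn.Linear(image_encoder.config.hidden_size, hidden_size)
        # GPT-2 with cross-attention layers enabled so it can attend to the fused encoding
        decoder_config = GPT2Config.from_pretrained(decoder_checkpoint, add_cross_attention=True)
        self.decoder = GPT2LMHeadModel.from_pretrained(decoder_checkpoint, config=decoder_config)

    def forward(self, input_ids, attention_mask, pixel_values, labels):
        text_features = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        image_features = self.image_encoder(pixel_values=pixel_values).last_hidden_state
        # Project both modalities into the shared space and fuse them by averaging
        fused = (
            self.text_projection(text_features).mean(dim=1)
            + self.image_projection(image_features).mean(dim=1)
        ) / 2
        # Teacher forcing: the answer tokens act as decoder input and labels; GPT-2 shifts the
        # labels internally to compute the language-modeling loss. Generation at inference time
        # would instead go through self.decoder.generate with the same encoder_hidden_states.
        return self.decoder(
            input_ids=labels,
            encoder_hidden_states=fused.unsqueeze(1),
            labels=labels,
        )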
With a model defined and properly adjusted for cross-attention, let’s take a
look at the data that will power our engine.
Our Data—Visual QA
Listing 7.3 shows a function I wrote to parse the image files and create a dataset that we can use with HuggingFace’s Trainer object.
data = []
images_used = defaultdict(int)
# Create a dictionary to map each question_id to its annotation
annotations_dict = {annotation["question_id"]: annotation for annotation in annotations}
...
return data
...
evaluation_strategy="epoch",
logging_dir="./logs",
logging_steps=10,
fp16=device.type == 'cuda',  # mixed precision on a GPU saves memory
save_strategy='epoch'
)
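To show where those arguments end up, here is a hedged sketch of wiring everything into HuggingFace’s Trainer; the variable names are placeholders that follow the general shape of the notebook rather than its exact code.

from transformers import Trainer

trainer = Trainer(
    model=vqa_model,               # our custom multimodal module
    args=training_args,            # the TrainingArguments shown (abridged) above
    train_dataset=train_dataset,   # built by the parsing function in Listing 7.3
    eval_dataset=eval_dataset,
    data_collator=data_collator,   # batches pixel values, token IDs, and labels together
)
trainer.train()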
There’s a lot of code that powers this example, so once again I would highly recommend following along with the notebook on the GitHub repository for the full code and comments!
Summary of Results
Figure 7.7 shows a sample of images with a few questions asked of each. Note that some of the responses are more than a single token, which is an immediate benefit of having an LLM as our decoder, as opposed to outputting a single token like standard VQA systems.
Figure 7.7 Our VQA system is not half bad at answering out-of-sample questions about images, even though we used pretty small models (in terms of number of parameters, and especially compared to what is considered state of the art today). Each percentage is the aggregated token prediction probability that GPT-2 generated while answering the given question. Clearly it is getting some questions wrong, and with more training and data we can reduce errors even further!
But this is only a sample of data and not a very holistic representation of
performance. To showcase how our model training went, Figure 7.8 shows
the drastic change in our language modeling loss value after only one
epoch.
Figure 7.8 After only one epoch, our VQA system showed a massive drop in
validation loss which is great!
Our model is far from perfect and will require some more advanced training strategies and a ton more data before it can really be considered state of the art, but you know what? Free data, free models, and the (mostly) free compute power of my own laptop yielded a not-half-bad VQA system.
Let’s step away from the idea of pure language modeling and image processing for just a moment and step into the world of a novel way of fine-tuning language models using their powerful cousin: reinforcement learning.
We have seen, over and over, the remarkable capabilities of language models in this book. Usually we have dealt with relatively objective tasks like classification, and when the task was more subjective, like semantic retrieval and anime recommendations, we had to take some time to define an objective quantitative metric to guide the model’s fine-tuning and overall system performance. In general, defining what constitutes “good” output text can be challenging, as it is often subjective and task/context-dependent. Different applications may require different “good” attributes, such as creativity for storytelling, readability for summarization, or code functionality for code snippets.
The training process basically breaks down into three core steps, shown in Figure 7.9.
Figure 7.9 The core steps of reinforcement-learning-based LLM training: pre-training an LLM, defining (and potentially training) a reward model, and using that reward model to update the LLM from step 1.
We are going to perform this process in its entirety in the next chapter, but to set up this pretty complicated process, I am going to outline a simpler version first. In our version, we will take a pre-trained LLM off the shelf (FLAN-T5), use an already defined and trained reward model, and really focus on step 3: the reinforcement learning loop.
We have seen and used FLAN-T5 (visualized in Figure 7.10, an image taken from the original FLAN-T5 paper) before, so this should hopefully be a refresher. FLAN-T5 is an encoder-decoder model (effectively a pure Transformer model), which means it already has trained cross-attention layers built in, and it has the benefit of being instruction-fine-tuned (as GPT-3.5, ChatGPT, and GPT-4 were). We are going to use the open-source “small” version of the model.
Figure 7.10 FLAN-T5 is an encoder-decoder architecture that has been
instruction-fine-tuned and is open-sourced.
In the next chapter, we will perform our own version of instruction fine-
tuning but for now, we will borrow this already instruction-fine-tuned LLM
from the good people at Google AI and move on to define a reward model.
A reward model has to take in the output of an LLM (in our case, a sequence of text) and return a scalar (single number) reward, which should numerically represent feedback on that output. The feedback can come from an actual human, which would be very slow to run, or from another language model, or even from a more complicated system that ranks potential model outputs and converts those rankings into rewards. As long as we are assigning a scalar reward to each output, it is a viable reward system.
In the next chapter, we will be doing some really interesting work to define our own reward model, but for now we will again rely on the hard work of others and use the following pre-built LLMs:
• A classifier that scores text for grammatical acceptability (trained on the CoLA dataset)
• A classifier that scores how neutral the sentiment of a piece of text is
I should note that by choosing these classifiers to form the basis of our reward system, I am implicitly trusting their performance. I checked their descriptions on the HuggingFace model repository to see how they were trained and what performance metrics I could find, but in general, be aware that the reward system plays a big role in this process; if it is not aligned with how you would truly reward text sequences, you are in for some trouble.
A snippet of the code that translates generated text into scores (rewards)
using a weighted sum of logits from our two models can be found in Listing
7.5.
texts = [
    'The Eiffel Tower in Paris is the tallest structure in the city',
    'This is a bad book',
    'this is a bad books'
]
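Because Listing 7.5 is abridged here, the snippet below is only a rough sketch of how such a weighted sum of logits could be computed; the checkpoints, label indices, and weights are my assumptions rather than the book’s exact choices.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical reward models: one for grammatical acceptability (CoLA), one for sentiment neutrality
cola_tokenizer = AutoTokenizer.from_pretrained("textattack/roberta-base-CoLA")
cola_model = AutoModelForSequenceClassification.from_pretrained("textattack/roberta-base-CoLA")
sentiment_tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment")
sentiment_model = AutoModelForSequenceClassification.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment")

def class_logit(model, tokenizer, text, label_index):
    """Raw logit the model assigns to one class for a piece of text."""
    with torch.no_grad():
        logits = model(**tokenizer(text, return_tensors="pt", truncation=True)).logits
    return logits[0, label_index].item()

def get_reward(text, cola_weight=1.0, neutral_weight=0.5):
    # Weighted sum mirroring the 1 * CoLA + 0.5 * neutrality aggregation used later in the training loop
    acceptable = class_logit(cola_model, cola_tokenizer, text, label_index=1)        # index 1 ≈ "acceptable"
    neutral = class_logit(sentiment_model, sentiment_tokenizer, text, label_index=1)  # index 1 ≈ "neutral"
    return cola_weight * acceptable + neutral_weight * neutral

rewards = [torch.tensor([get_reward(t)]) for t in texts]  # one scalar reward tensor per text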
With a model and a reward system ready to go, we just need to introduce one final net-new component: our reinforcement learning library, TRL. The TRL library supports both pure decoder models like GPT-2 and GPT-Neo (more on that in the next chapter) as well as sequence-to-sequence models like FLAN-T5. All models can be optimized using what is known as Proximal Policy Optimization (PPO). Honestly, I won’t go into how it works in this book, but it’s definitely something to look up if you’re curious. TRL also has many examples on its GitHub page if you want to see even more applications.
Figure 7.11 shows the high-level process of our (for now) simplified RLF loop.
Figure 7.11 Our first Reinforcement Learning from Feedback loop has our
pre-trained LLM (FLAN-T5) learning from a pre-curated dataset and a pre-
built reward system. In the next chapter, we will see this loop performed
with much more customization and rigor.
Let’s jump into defining our training loop with some code to really see
some results here.
1. Instantiate two versions of our model:
a. Our “reference” model, the original FLAN-T5 model, which will never be updated
b. Our “current” model, which will get updated after every batch of data
2. Grab a batch of data from a source (in our case, a corpus of news articles I found on HuggingFace)
3. Calculate rewards from our two reward models and aggregate them into a single scalar (number) as a weighted sum of the two rewards
4. Pass the rewards to the TRL package, which calculates two things:
a. How to update the “current” model slightly, based on the rewards
b. How far the generated text drifts from text generated by the “reference” model (a divergence penalty that keeps the model from straying too far from its original behavior)
5. TRL updates the “current” model from the batch of data, logs anything to a reporting system (I like the free Weights & Biases platform), and starts over from the beginning of the steps! A rough sketch of this loop appears below.
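Here is a minimal sketch of that loop using TRL’s classic PPOTrainer interface; the dataloader, generation settings, and get_reward helper are placeholders, and exact API details vary between TRL versions. The abridged snippet that follows it shows the reward-aggregation portion of the same loop.

import torch
from transformers import AutoTokenizer
from trl import AutoModelForSeq2SeqLMWithValueHead, PPOConfig, PPOTrainer

flan_t5_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
current_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained("google/flan-t5-small")    # gets updated
reference_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained("google/flan-t5-small")  # frozen baseline

ppo_trainer = PPOTrainer(PPOConfig(batch_size=16, learning_rate=1e-5),
                         current_model, reference_model, flan_t5_tokenizer)

for batch in news_dataloader:  # step 2: grab a batch of news articles (placeholder dataloader)
    query_tensors = [flan_t5_tokenizer.encode("summarize: " + text, return_tensors="pt").squeeze()
                     for text in batch["text"]]
    response_tensors = [ppo_trainer.generate(query, max_new_tokens=50).squeeze()
                        for query in query_tensors]
    responses = [flan_t5_tokenizer.decode(r, skip_special_tokens=True) for r in response_tensors]
    # Step 3: one scalar reward tensor per response (weighted CoLA + neutrality, as sketched earlier)
    rewards = [torch.tensor([get_reward(response)]) for response in responses]
    # Steps 4-5: PPO computes the update and the divergence from the reference model, then applies it
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)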
game_data["response"] = [flan_t5_tokenizer.decode(r.squeeze()) for r in response_tensors]
# Calculate rewards from the cleaned responses
game_data["clean_response"] = [flan_t5_tokenizer.decode(r.squeeze(), skip_special_tokens=True) for r in response_tensors]
game_data['cola_scores'] = get_cola_scores(game_data['clean_response'])
game_data['neutral_scores'] = get_sentiment_scores(game_data['clean_response'])
rewards = game_data['neutral_scores']
transposed_lists = zip(game_data['cola_scores'], game_data['neutral_scores'])
# Calculate a weighted sum (1 x CoLA score + 0.5 x neutrality score) for each response
rewards = [1 * values[0] + 0.5 * values[1] for values in transposed_lists]
rewards = [torch.tensor([_]) for _ in rewards]
Summary of Results
Figure 7.13 shows how rewards were given out over a training loop of two epochs. We can see that as the system progressed, we were giving out more rewards, which is generally a good sign. I should note that the rewards started out pretty high, so FLAN-T5 was already giving relatively neutral and readable responses, and I would not expect drastic changes in the summaries.
Figure 7.13 Our system is giving out more rewards as training progresses
(the graph is smoothed to see the overall movement).
But what do these adjusted generations look like? Figure 7.14 shows a sample of generated summaries before and after our RL fine-tuning.
Figure 7.14 Our fine-tuned model barely differs in most summaries but does tend to use more neutral-sounding words that are grammatically correct and easy to read.
Conclusion