6 Finetuning for Classification

This chapter covers


Introducing different LLM finetuning approaches
Preparing a dataset for text classification
Modifying a pretrained LLM for finetuning
Finetuning an LLM to identify spam messages
Evaluating the accuracy of a finetuned LLM classifier
Using a finetuned LLM to classify new data

In previous chapters, we coded the LLM architecture, pretrained it,


and learned how to import pretrained weights from an external
source, such as OpenAI, into our model. In this chapter, we are
reaping the fruits of our labor by finetuning the LLM on a specific
target task, such as classifying text, as illustrated in figure 6.1. The
concrete example we will examine is classifying text messages as
spam or not spam.

Figure 6.1 A mental model of the three main stages of coding an


LLM, pretraining the LLM on a general text dataset and
finetuning it. This chapter focuses on finetuning a pretrained LLM
as a classifier.
Figure 6.1 shows two main ways of finetuning an LLM: finetuning for
classification (step 8) and finetuning an LLM to follow instructions
(step 9). In the next section, we will discuss these two ways of
finetuning in more detail.


6.1 Different categories of finetuning


The most common ways to finetune language models are instruction-
finetuning and classification-finetuning. Instruction-finetuning
involves training a language model on a set of tasks using specific
instructions to improve its ability to understand and execute tasks
described in natural language prompts, as illustrated in figure 6.2.

Figure 6.2 Illustration of two different instruction-finetuning


scenarios. At the top, the model is tasked with determining
whether a given text is spam. At the bottom, the model is given an
instruction on how to translate an English sentence into German.

The next chapter will discuss instruction-finetuning, as illustrated in


figure 6.2. Meanwhile, this chapter is centered on classification-
finetuning, a concept you might already be acquainted with if you
have a background in machine learning.

In classification-finetuning, the model is trained to recognize a specific set of class labels, such as "spam" and "not spam." Examples of classification tasks extend beyond large language models and email filtering; they include identifying different species of plants from images, categorizing news articles into topics like sports, politics, or technology, and distinguishing between benign and malignant tumors in medical imaging.

The key point is that a classification-finetuned model is restricted to predicting classes it has encountered during its training. For instance, it can determine whether something is "spam" or "not spam," as illustrated in figure 6.3, but it can't say anything else about the input text.
Figure 6.3 Illustration of a text classification scenario using an
LLM. A model finetuned for spam classification does not require
further instruction alongside the input. In contrast to an
instruction-finetuned model, it can only respond with "spam" and
"not spam."

In contrast to the classification-finetuned model depicted in figure 6.3, an instruction-finetuned model typically has the capability to undertake a broader range of tasks. We can view a classification-finetuned model as highly specialized, and generally, it is easier to develop a specialized model than a generalist model that works well across various tasks.

Choosing the right approach

Instruction-finetuning improves a model's ability to understand and generate responses based on specific user instructions. Instruction-finetuning is best suited for models that need to handle a variety of tasks based on complex user instructions, improving flexibility and interaction quality. Classification-finetuning, on the other hand, is ideal for projects requiring precise categorization of data into predefined classes, such as sentiment analysis or spam detection.

While instruction-finetuning is more versatile, it demands larger datasets and greater computational resources to develop models proficient in various tasks. In contrast, classification-finetuning requires less data and compute power, but its use is confined to the specific classes on which the model has been trained.


6.2 Preparing the dataset


In the remainder of this chapter, we will modify and classification-finetune the GPT model we implemented and pretrained in the previous chapters. We begin with downloading and preparing the dataset, as illustrated in figure 6.4.

Figure 6.4 Illustration of the three-stage process for


classification-finetuning the LLM in this chapter. Stage 1 involves
dataset preparation. Stage 2 focuses on model setup. Stage 3
covers the finetuning and evaluation of the model.

To provide an intuitive and useful example of classification-finetuning, we will work with a text message dataset that consists of spam and non-spam messages.

Note that these text messages are typically sent via phone, not email. However, the same steps also apply to email classification, and interested readers can find links to email spam classification datasets in the References section in appendix B.

The first step is to download the dataset via the following code:

Listing 6.1 Downloading and unzipping the dataset


import urllib.request
import zipfile
import os
from pathlib import Path

url = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip"
zip_path = "sms_spam_collection.zip"
extracted_path = "sms_spam_collection"
data_file_path = Path(extracted_path) / "SMSSpamCollection.tsv"

def download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path):
    if data_file_path.exists():
        print(f"{data_file_path} already exists. Skipping download and extraction.")
        return

    with urllib.request.urlopen(url) as response:    #A Download the file
        with open(zip_path, "wb") as out_file:
            out_file.write(response.read())

    with zipfile.ZipFile(zip_path, "r") as zip_ref:  #B Unzip the file
        zip_ref.extractall(extracted_path)

    original_file_path = Path(extracted_path) / "SMSSpamCollection"
    os.rename(original_file_path, data_file_path)    #C Add a .tsv file extension
    print(f"File downloaded and saved as {data_file_path}")

download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path)

After executing the preceding code, the dataset is saved as a tab-separated text file, SMSSpamCollection.tsv , in the sms_spam_collection folder. We can load it into a pandas DataFrame as follows:

import pandas as pd

df = pd.read_csv(data_file_path, sep="\t", header=None, names=["Label", "Text"])
df    #A Displays the data frame in a notebook environment

The resulting data frame of the spam dataset is shown in figure 6.5.

Figure 6.5 Preview of the SMSSpamCollection dataset in a


pandas DataFrame , showing class labels ("ham" or "spam") and
corresponding text messages. The dataset consists of 5,572 rows
(text messages and labels).

Let's examine the class label distribution:

print(df["Label"].value_counts())

Executing the previous code, we find that the data contains "ham" (i.e., not spam) far more frequently than "spam":

Label
ham 4825
spam 747
Name: count, dtype: int64

For simplicity, and because we prefer a small dataset for educational purposes (which will facilitate faster finetuning of the large language model), we choose to undersample the dataset to include 747 instances from each class. While there are several other methods to handle class imbalances, these are beyond the scope of a book on large language models. Readers interested in exploring methods for dealing with imbalanced data can find additional information in the References section in appendix B.

We use the following code to undersample the dataset and create a balanced dataset:

Listing 6.2 Creating a balanced dataset


def create_balanced_dataset(df):
    num_spam = df[df["Label"] == "spam"].shape[0]    #A Count the spam instances
    ham_subset = df[df["Label"] == "ham"].sample(num_spam, random_state=123)    #B Randomly sample the same number of ham instances
    balanced_df = pd.concat([ham_subset, df[df["Label"] == "spam"]])    #C Combine the ham subset with all spam instances
    return balanced_df

balanced_df = create_balanced_dataset(df)
print(balanced_df["Label"].value_counts())

After executing the previous code to balance the dataset, we can see that we now have equal amounts of spam and non-spam messages:
Label
ham 747
spam 747
Name: count, dtype: int64

Next, we convert the "string" class labels "ham" and "spam" into integer class labels 0 and 1, respectively:

balanced_df["Label"] = balanced_df["Label"].map({"ham": 0, "spam": 1})

This process is similar to converting text into token IDs. However, instead of using the GPT vocabulary, which consists of more than 50,000 words, we are dealing with just two token IDs: 0 and 1.

We create a random_split function to split the dataset into three parts: 70% for training, 10% for validation, and 20% for testing. (These ratios are common in machine learning to train, adjust, and evaluate models.)

Listing 6.3 Splitting the dataset


def random_split(df, train_frac, validation_frac):
    df = df.sample(frac=1, random_state=123).reset_index(drop=True)    #A Shuffle the entire DataFrame

    train_end = int(len(df) * train_frac)    #B Calculate the split indices
    validation_end = train_end + int(len(df) * validation_frac)

    #C Split the DataFrame
    train_df = df[:train_end]
    validation_df = df[train_end:validation_end]
    test_df = df[validation_end:]

    return train_df, validation_df, test_df

train_df, validation_df, test_df = random_split(balanced_df, 0.7, 0.1)    #D The test fraction is the remaining 0.2


Additionally, we save the dataset as CSV (comma-separated value) files, which we can reuse later:

train_df.to_csv("train.csv", index=None)
validation_df.to_csv("validation.csv", index=None)
test_df.to_csv("test.csv", index=None)

In this section, we downloaded the dataset, balanced it, and split it into training and evaluation subsets. In the next section, we will set up the PyTorch data loaders that will be used to train the model.

6.3 Creating data loaders

In this section, we develop PyTorch data loaders that are conceptually similar to the ones we implemented in chapter 2.

Previously, in chapter 2, we utilized a sliding window technique to generate uniformly sized text chunks, which were then grouped into batches for more efficient model training. Each chunk functioned as an individual training instance.

However, in this chapter, we are working with a spam dataset that contains text messages of varying lengths. To batch these messages as we did with the text chunks in chapter 2, we have two primary options:

1. Truncate all messages to the length of the shortest message


in the dataset or batch.
2. Pad all messages to the length of the longest message in the
dataset or batch.
Option 1 is computationally cheaper, but it may result in significant
information loss if shorter messages are much smaller than the
average or longest messages, potentially reducing model
performance. So, we opt for the second option, which preserves the
entire content of all messages.

To implement option 2, where all messages are padded to the length


of the longest message in the dataset, we add padding tokens to all
shorter messages. For this purpose, we use "<|endoftext|>" as a
padding token, as discussed in chapter 2.

However, instead of appending the string "<|endoftext|>" to each


of the text messages directly, we can add the token ID corresponding
to "<|endoftext|>" to the encoded text messages as illustrated in
figure 6.6.

Figure 6.6 An illustration of the input text preparation process.


First, each input text message is converted into a sequence of
token IDs. Then, to ensure uniform sequence lengths, shorter
sequences are padded with a padding token (in this case, token ID
50256) to match the length of the longest sequence.

Figure 6.6 presumes that 50,256 is the token ID of the padding token
"<|endoftext|>" . We can double-check that this is indeed the
correct token ID by encoding the "<|endoftext|>" using the GPT-2
tokenizer from the tiktoken package that we used in previous
chapters:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
print(tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))

Executing the preceding code indeed returns [50256] .
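To make the padding idea from figure 6.6 concrete before we wrap it into a dataset class, here is a minimal sketch (the message and target length are made up for illustration) that pads one encoded text to a fixed length with token ID 50256:

pad_token_id = 50256                                      # token ID of "<|endoftext|>"
encoded = tokenizer.encode("Hello, do you like tea?")     # hypothetical short message
max_length = 10                                           # illustrative target length
padded = encoded + [pad_token_id] * (max_length - len(encoded))
print(padded)                                             # original token IDs followed by 50256 padding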

As we have seen in chapter 2, we first need to implement a PyTorch


Dataset , which specifies how the data is loaded and processed,
before we can instantiate the data loaders.

For this purpose, we define the SpamDataset class, which


implements the concepts illustrated in figure 6.6. This
SpamDataset class handles several key tasks: it identifies the
longest sequence in the training dataset, encodes the text messages,
and ensures that all other sequences are padded with a padding token
to match the length of the longest sequence.

Listing 6.4 Setting up a PyTorch Dataset class


import torch
from torch.utils.data import Dataset

class SpamDataset(Dataset):
    def __init__(self, csv_file, tokenizer, max_length=None, pad_token_id=50256):
        self.data = pd.read_csv(csv_file)

        #A Pre-tokenize the texts
        self.encoded_texts = [
            tokenizer.encode(text) for text in self.data["Text"]
        ]

        if max_length is None:
            self.max_length = self._longest_encoded_length()
        else:
            self.max_length = max_length
            #B Truncate sequences if they are longer than max_length
            self.encoded_texts = [
                encoded_text[:self.max_length]
                for encoded_text in self.encoded_texts
            ]

        #C Pad all sequences to the longest sequence
        self.encoded_texts = [
            encoded_text + [pad_token_id] * (self.max_length - len(encoded_text))
            for encoded_text in self.encoded_texts
        ]

    def __getitem__(self, index):
        encoded = self.encoded_texts[index]
        label = self.data.iloc[index]["Label"]
        return (
            torch.tensor(encoded, dtype=torch.long),
            torch.tensor(label, dtype=torch.long)
        )

    def __len__(self):
        return len(self.data)

    def _longest_encoded_length(self):
        max_length = 0
        for encoded_text in self.encoded_texts:
            encoded_length = len(encoded_text)
            if encoded_length > max_length:
                max_length = encoded_length
        return max_length

The SpamDataset class loads data from the CSV files we created earlier, tokenizes the text using the GPT-2 tokenizer from tiktoken , and allows us to pad or truncate the sequences to a uniform length determined by either the longest sequence or a predefined maximum length. This ensures each input tensor is of the same size, which is necessary to create the batches in the training data loader we implement next:

train_dataset = SpamDataset(
csv_file="train.csv",
max_length=None,
tokenizer=tokenizer
)

Note that the longest sequence length is stored in the dataset's max_length attribute. If you are curious to see the number of tokens in the longest sequence, you can use the following code:

print(train_dataset.max_length)

The code outputs 120, showing that the longest sequence contains no more than 120 tokens, a common length for text messages. It's worth noting that the model can handle sequences of up to 1,024 tokens, given its context length limit. If your dataset includes longer texts, you can pass max_length=1024 when creating the training dataset in the preceding code to ensure that the data does not exceed the model's supported input (context) length.

Next, we pad the validation and test sets to match the length of the longest training sequence. It's important to note that any validation and test set samples exceeding the length of the longest training example are truncated using encoded_text[:self.max_length] in the SpamDataset code we defined earlier. This truncation is optional; you could also set max_length=None for both the validation and test sets, provided there are no sequences exceeding 1,024 tokens in these sets.

val_dataset = SpamDataset(
csv_file="validation.csv",
max_length=train_dataset.max_length,
tokenizer=tokenizer
)
test_dataset = SpamDataset(
csv_file="test.csv",
max_length=train_dataset.max_length,
tokenizer=tokenizer
)

Exercise 6.1 Increasing the context length

Pad the inputs to the maximum number of tokens the model supports and observe how it impacts the predictive performance.
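A minimal sketch of one way to set up this exercise, assuming the SpamDataset class and tokenizer defined above (the _1024 names are hypothetical); these padded datasets would then replace the original ones when creating the data loaders below:

train_dataset_1024 = SpamDataset("train.csv", max_length=1024, tokenizer=tokenizer)
val_dataset_1024 = SpamDataset("validation.csv", max_length=1024, tokenizer=tokenizer)
test_dataset_1024 = SpamDataset("test.csv", max_length=1024, tokenizer=tokenizer)
print(train_dataset_1024.max_length)    # 1024 instead of 120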

Using the datasets as inputs, we can instantiate the data loaders similarly to what we did in chapter 2. However, in this case, the targets represent class labels rather than the next tokens in the text. For instance, choosing a batch size of 8, each batch will consist of 8 training examples of length 120 and the corresponding class label of each example, as illustrated in figure 6.7.

Figure 6.7 An illustration of a single training batch consisting of 8


text messages represented as token IDs. Each text message
consists of 120 token IDs. In addition, a class label array stores
the 8 class labels corresponding to the text messages, which can
be either 0 (not spam) or 1 (spam).

The following code creates the training, validation, and test set data loaders that load the text messages and labels in batches of size 8, as illustrated in figure 6.7:

Listing 6.5 Creating PyTorch data loaders


from torch.utils.data import DataLoader

num_workers = 0    #A Using 0 worker processes ensures compatibility with most systems
batch_size = 8
torch.manual_seed(123)

train_loader = DataLoader(
dataset=train_dataset,
batch_size=batch_size,
shuffle=True,
num_workers=num_workers,
drop_last=True,
)
val_loader = DataLoader(
dataset=val_dataset,
batch_size=batch_size,
num_workers=num_workers,
drop_last=False,
)
test_loader = DataLoader(
dataset=test_dataset,
batch_size=batch_size,
num_workers=num_workers,
drop_last=False,
)

To ensure that the data loaders are working and are indeed returning batches of the expected size, we iterate over the training loader and then print the tensor dimensions of the last batch:

for input_batch, target_batch in train_loader:
    pass
print("Input batch dimensions:", input_batch.shape)
print("Label batch dimensions", target_batch.shape)

The output is as follows:

Input batch dimensions: torch.Size([8, 120])
Label batch dimensions torch.Size([8])

As we can see, the input batches consist of 8 training examples with 120 tokens each, as expected. The label tensor stores the class labels corresponding to the 8 training examples.

Lastly, to get an idea of the dataset size, let's print the total number of batches in each dataset:

print(f"{len(train_loader)} training batches")


print(f"{len(val_loader)} validation batches")
print(f"{len(test_loader)} test batches")

The number of batches in each dataset is as follows:

130 training batches


19 validation batches
38 test batches

This concludes the data preparation in this chapter. Next, we will prepare the model for finetuning.


6.4 Initializing a model with pretrained weights

In this section, we prepare the model we will use for the classification-finetuning to identify spam messages. We start with initializing the pretrained model we worked with in the previous chapter, as illustrated in figure 6.8.
Figure 6.8 Illustration of the three-stage process for
classification-finetuning the LLM in this chapter. After
completing stage 1, preparing the dataset, this section focuses on
initializing the LLM we will finetune to classify spam messages.

We start the model preparation process by reusing the configurations from chapter 5:

CHOOSE_MODEL = "gpt2-small (124M)"
INPUT_PROMPT = "Every effort moves"

BASE_CONFIG = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "drop_rate": 0.0,        # Dropout rate
    "qkv_bias": True         # Query-key-value bias
}

model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

BASE_CONFIG.update(model_configs[CHOOSE_MODEL])

assert train_dataset.max_length <= BASE_CONFIG["context_length"], (
    f"Dataset length {train_dataset.max_length} exceeds model's context "
    f"length {BASE_CONFIG['context_length']}. Reinitialize data sets with "
    f"`max_length={BASE_CONFIG['context_length']}`"
)
Next, we import the download_and_load_gpt2 function from the gpt_download.py file we downloaded in chapter 5. Furthermore, we also reuse the GPTModel class and the load_weights_into_gpt function from chapter 5 to load the downloaded weights into the GPT model:

Listing 6.6 Loading a pretrained GPT model


from gpt_download import download_and_load_gpt2
from chapter05 import GPTModel, load_weights_into_gpt

model_size = CHOOSE_MODEL.split(" ")[-1].lstrip("(").rstrip(")")
settings, params = download_and_load_gpt2(model_size=model_size, models_dir="gpt2")

model = GPTModel(BASE_CONFIG)
load_weights_into_gpt(model, params)
model.eval()

After loading the model weights into the GPTModel , we use the text generation utility function from the previous chapters to ensure that the model generates coherent text:

from chapter04 import generate_text_simple


from chapter05 import text_to_token_ids, token_ids_to_text

text_1 = "Every effort moves you"


token_ids = generate_text_simple(
model=model,
idx=text_to_token_ids(text_1, tokenizer),
max_new_tokens=15,
context_size=BASE_CONFIG["context_length"]
)
print(token_ids_to_text(token_ids, tokenizer))
As we can see based on the following output, the model generates coherent text, which is an indicator that the model weights have been loaded correctly:

Every effort moves you forward.


The first step is to understand the importance of your work

Now, before we start finetuning the model as a spam classifier, let's see if the model can perhaps already classify spam messages by prompting it with instructions:

text_2 = (
"Is the following text 'spam'? Answer with 'yes' or 'no':"
" 'You are a winner you have been specially"
" selected to receive $1000 cash or a $2000 award.'"
)
token_ids = generate_text_simple(
model=model,
idx=text_to_token_ids(text_2, tokenizer),
max_new_tokens=23,
context_size=BASE_CONFIG["context_length"]
)
print(token_ids_to_text(token_ids, tokenizer))


The model output is as follows:

Is the following text 'spam'? Answer with 'yes' or 'no': 'You are
The following text 'spam'? Answer with 'yes' or 'no': 'You are a w
Based on the output, it's apparent that the model struggles with following instructions.

This is anticipated, as it has undergone only pretraining and lacks instruction-finetuning, which we will explore in the upcoming chapter.

The next section prepares the model for classification-finetuning.

6.5 Adding a classification head

In this section, we modify the pretrained large language model to prepare it for classification-finetuning. To do this, we replace the original output layer, which maps the hidden representation to a vocabulary of 50,257 tokens, with a smaller output layer that maps to two classes: 0 ("not spam") and 1 ("spam"), as shown in figure 6.9.

Figure 6.9 This figure illustrates adapting a GPT model for spam
classification by altering its architecture. Initially, the model's
linear output layer mapped 768 hidden units to a vocabulary of
50,257 tokens. For spam detection, this layer is replaced with a
new output layer that maps the same 768 hidden units to just two
classes, representing "spam" and "not spam."
As shown in figure 6.9, we use the same model as in previous chapters except for replacing the output layer.

Output layer nodes

We could technically use a single output node since we are dealing with a binary classification task. However, this would require modifying the loss function, as discussed in an article in the References section in appendix B. Therefore, we choose a more general approach where the number of output nodes matches the number of classes. For example, for a 3-class problem, such as classifying news articles as "Technology", "Sports", or "Politics", we would use three output nodes, and so forth.

Xeoref kw ttpamet prk todcnaofmiii urlitastdel jn ruegfi 6.9, te'sl


rpitn krp lmoed tuaetrihccer cej print(model ), whchi sirnpt prx
olgiflonw:

GPTModel(
(tok_emb): Embedding(50257, 768)
(pos_emb): Embedding(1024, 768)
(drop_emb): Dropout(p=0.0, inplace=False)
(trf_blocks): Sequential(
...
(11): TransformerBlock(
(att): MultiHeadAttention(
(W_query): Linear(in_features=768, out_features=768, bias=True)
(W_key): Linear(in_features=768, out_features=768, bias=True)
(W_value): Linear(in_features=768, out_features=768, bias=True)
(out_proj): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(ff): FeedForward(
(layers): Sequential(
(0): Linear(in_features=768, out_features=3072, bias=True)
(1): GELU()
(2): Linear(in_features=3072, out_features=768, bias=True)
)
)
(norm1): LayerNorm()
(norm2): LayerNorm()
(drop_resid): Dropout(p=0.0, inplace=False)
)
)
(final_norm): LayerNorm()
(out_head): Linear(in_features=768, out_features=50257, bias=False)
)

Above, we can see the architecture we implemented in chapter 4 neatly laid out. As discussed in chapter 4, the GPTModel consists of embedding layers followed by 12 identical transformer blocks (only the last block is shown for brevity), followed by a final LayerNorm and the output layer, out_head .

Next, we replace the out_head with a new output layer, as illustrated in figure 6.9, that we will finetune.

Finetuning selected layers versus all layers

Since we start with a pretrained model, it's not necessary to finetune all model layers. This is because, in neural network-based language models, the lower layers generally capture basic language structures and semantics that are applicable across a wide range of tasks and datasets. So, finetuning only the last layers (layers near the output), which are more specific to nuanced linguistic patterns and task-specific features, can often be sufficient to adapt the model to new tasks. A nice side effect is that it is computationally more efficient to finetune only a small number of layers. Interested readers can find more information, including experiments, on which layers to finetune in the References section for this chapter in appendix B.

To get the model ready for classification-finetuning, we first freeze the model, meaning that we make all layers non-trainable:

for param in model.parameters():


param.requires_grad = False

Then, as shown in figure 6.9, we replace the output layer ( model.out_head ), which originally maps the layer inputs to 50,257 dimensions (the size of the vocabulary):

Listing 6.7 Adding a classification layer


torch.manual_seed(123)
num_classes = 2
model.out_head = torch.nn.Linear(
in_features=BASE_CONFIG["emb_dim"],
out_features=num_classes
)

Note that in the preceding code, we use BASE_CONFIG["emb_dim"] , which is equal to 768 in the "gpt2-small (124M)" model, to keep the code below more general. This means we can also use the same code to work with the larger GPT-2 model variants.

This new model.out_head output layer has its requires_grad attribute set to True by default, which means that it's the only layer in the model that will be updated during training.

Technically, training the output layer we just added is sufficient. However, as I found in experiments, finetuning additional layers can noticeably improve the predictive performance of the finetuned model. (For more details, refer to the References in appendix B.)

Additionally, we configure the last transformer block and the final LayerNorm module, which connects this block to the output layer, to be trainable, as depicted in figure 6.10.

Figure 6.10 The GPT model we developed in earlier chapters,


which we loaded previously, includes 12 repeated transformer
blocks. Alongside the output layer, we set the final LayerNorm
and the last transformer block as trainable, while the remaining
11 transformer blocks and the embedding layers are kept non-
trainable.
To make the final LayerNorm and the last transformer block trainable, as illustrated in figure 6.10, we set their respective requires_grad to True :

for param in model.trf_blocks[-1].parameters():
    param.requires_grad = True

for param in model.final_norm.parameters():
    param.requires_grad = True
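As a quick sanity check, and not part of the original listing, we can count how many parameters are now trainable; only the new output head, the final LayerNorm, and the last transformer block should contribute:

trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable_params:,} of {total_params:,}")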

Exercise 6.2 Finetuning the whole model

Instead of finetuning just the final transformer block, finetune the entire model and assess the impact on predictive performance.
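A minimal sketch of the setup for this exercise, assuming the model defined above; every parameter is made trainable again before rerunning the same training code from section 6.7 (at the cost of more memory and longer training time):

for param in model.parameters():
    param.requires_grad = True    # unfreeze all layers so the optimizer updates the entire model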

Even though we added a new output layer and marked certain layers as trainable or non-trainable, we can still use this model in a similar way to previous chapters. For instance, we can feed it an example text identical to how we have done it in earlier chapters. For example, consider the following example text:

inputs = tokenizer.encode("Do you have time")
inputs = torch.tensor(inputs).unsqueeze(0)
print("Inputs:", inputs)
print("Inputs dimensions:", inputs.shape)  # shape: (batch_size, num_tokens)

As the print output shows, the preceding code encodes the inputs into a tensor consisting of 4 input tokens:

Inputs: tensor([[5211, 345, 423, 640]])


Inputs dimensions: torch.Size([1, 4])
Then, we can pass the encoded token IDs to the model as usual:

with torch.no_grad():
    outputs = model(inputs)
print("Outputs:\n", outputs)
print("Outputs dimensions:", outputs.shape)  # shape: (batch_size, num_tokens, num_classes)

The output tensor looks like the following:

Outputs:
tensor([[[-1.5854, 0.9904],
[-3.7235, 7.4548],
[-2.2661, 6.6049],
[-3.5983, 3.9902]]])
Outputs dimensions: torch.Size([1, 4, 2])

In chapters 4 and 5, a similar input would have produced an output tensor of [1, 4, 50257], where 50,257 represents the vocabulary size. As in previous chapters, the number of output rows corresponds to the number of input tokens (in this case, 4). However, each output's embedding dimension (the number of columns) is now reduced to 2 instead of 50,257 since we replaced the output layer of the model.

Remember that we are interested in finetuning this model so that it returns a class label indicating whether a model input is spam or not spam. To achieve this, we don't need to finetune all 4 output rows but can focus on a single output token. In particular, we will focus on the last row corresponding to the last output token, as illustrated in figure 6.11.

Figure 6.11 An illustration of the GPT model with a 4-token


example input and output. The output tensor consists of 2
columns due to the modified output layer. We are only focusing
on the last row corresponding to the last token when finetuning
the model for spam classification.

To extract the last output token, illustrated in figure 6.11, from the output tensor, we use the following code:

print("Last output token:", outputs[:, -1, :])

This prints the following:

Last output token: tensor([[-3.5983, 3.9902]])

Before we proceed to the next section, let's recap our discussion. We will focus on converting the values into a class-label prediction. But first, let's understand why we are particularly interested in the last output token, and not the 1st, 2nd, or 3rd output token.

In chapter 3, we explored the attention mechanism, which establishes a relationship between each input token and every other input token. Subsequently, we introduced the concept of a causal attention mask, commonly used in GPT-like models. This mask restricts a token's focus to only its current position and those before it, ensuring that each token can only be influenced by itself and the preceding tokens, as illustrated in figure 6.12.

Figure 6.12 Illustration of the causal attention mechanism as


discussed in chapter 3, where the attention scores between input
tokens are displayed in a matrix format. The empty cells indicate
masked positions due to the causal attention mask, preventing
tokens from attending to future tokens. The values in the cells
represent attention scores, with the last token, "time," being the
only one that computes attention scores for all preceding tokens.
Given the causal attention mask setup shown in figure 6.12, the last token in a sequence accumulates the most information since it is the only token with access to data from all the previous tokens. Therefore, in our spam classification task, we focus on this last token during the finetuning process.

Having modified the model, the next section will detail the process of transforming the last token into class label predictions and calculate the model's initial prediction accuracy. Following this, we will finetune the model for the spam classification task in the subsequent section.
Exercise 6.3 Finetuning the first versus last token

Rather than finetuning the last output token, try finetuning the first output token and observe the changes in predictive performance when finetuning the model in later sections.
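A sketch of the key change for this exercise, reusing the inputs tensor from above: select the first instead of the last output token. The loss and accuracy functions in the following sections would need the same [:, 0, :] adjustment.

with torch.no_grad():
    outputs = model(inputs)
print("First output token:", outputs[:, 0, :])    # instead of outputs[:, -1, :]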


6.6 Calculating the classification loss and accuracy

So far in this chapter, we have prepared the dataset, loaded a pretrained model, and modified it for classification-finetuning. Before we proceed with the finetuning itself, only one small part remains: implementing the model evaluation functions used during finetuning, as illustrated in figure 6.13. We will tackle this in this section.

Figure 6.13 Illustration of the three-stage process for


classification-finetuning the LLM in this chapter. This section
implements the last step of stage 2, implementing the functions
to evaluate the model's performance to classify spam messages
before, during, and after the finetuning.
Before implementing the evaluation utilities, let's briefly discuss how we convert the model outputs into class label predictions.

In the previous chapter, we computed the token ID of the next token generated by the LLM by converting the 50,257 outputs into probabilities via the softmax function and then returning the position of the highest probability via the argmax function. In this chapter, we take the same approach to calculate whether the model outputs a "spam" or "not spam" prediction for a given input, as shown in figure 6.14, with the only difference being that we work with 2-dimensional instead of 50,257-dimensional outputs.

Figure 6.14 The model outputs corresponding to the last token


are converted into probability scores for each input text. Then,
the class labels are obtained by looking up the index position of
the highest probability score. Note that the model predicts the
spam labels incorrectly because it has not yet been trained.
To illustrate figure 6.14 with a concrete example, let's consider the last token output from the previous section:

print("Last output token:", outputs[:, -1, :])

The values of the tensor corresponding to the last token are as follows:

Last output token: tensor([[-3.5983, 3.9902]])


We can obtain the class label via the following code:

probas = torch.softmax(outputs[:, -1, :], dim=-1)


label = torch.argmax(probas)
print("Class label:", label.item())

In this case, the code returns 1, meaning the model predicts that the input text is "spam." Using the softmax function here is optional because the largest outputs directly correspond to the highest probability scores, as mentioned in chapter 5. Hence, we can simplify the code as follows, without using softmax :

logits = outputs[:, -1, :]


label = torch.argmax(logits)
print("Class label:", label.item())

This concept can be used to compute the so-called classification accuracy, which measures the percentage of correct predictions across a dataset.

To determine the classification accuracy, we apply the argmax-based prediction code to all examples in the dataset and calculate the proportion of correct predictions by defining a calc_accuracy_loader function:

Listing 6.8 Calculating the classification accuracy


def calc_accuracy_loader(data_loader, model, device, num_batches=None):
    model.eval()
    correct_predictions, num_examples = 0, 0

    if num_batches is None:
        num_batches = len(data_loader)
    else:
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            input_batch, target_batch = input_batch.to(device), target_batch.to(device)

            with torch.no_grad():
                logits = model(input_batch)[:, -1, :]    #A Logits of the last output token
            predicted_labels = torch.argmax(logits, dim=-1)

            num_examples += predicted_labels.shape[0]
            correct_predictions += (predicted_labels == target_batch).sum().item()
        else:
            break
    return correct_predictions / num_examples
Let's use the function to determine the classification accuracies across the various datasets, estimated from 10 batches for efficiency:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

torch.manual_seed(123)
train_accuracy = calc_accuracy_loader(train_loader, model, device, num_batches=10)
val_accuracy = calc_accuracy_loader(val_loader, model, device, num_batches=10)
test_accuracy = calc_accuracy_loader(test_loader, model, device, num_batches=10)

print(f"Training accuracy: {train_accuracy*100:.2f}%")
print(f"Validation accuracy: {val_accuracy*100:.2f}%")
print(f"Test accuracy: {test_accuracy*100:.2f}%")

Via the device setting, the model automatically runs on a GPU if a GPU with Nvidia CUDA support is available and otherwise runs on a CPU. The output is as follows:

Training accuracy: 46.25%


Validation accuracy: 45.00%
Test accuracy: 48.75%

As we can see, the prediction accuracies are near a random prediction, which would be 50% in this case. To improve the prediction accuracies, we need to finetune the model.

However, before we begin finetuning the model, we need to define the loss function that we will optimize during the training process. Our objective is to maximize the spam classification accuracy of the model, which means that the preceding code should output the correct class labels: 0 for non-spam and 1 for spam texts.

However, classification accuracy is not a differentiable function, so we use the cross entropy loss as a proxy to maximize accuracy. This is the same cross entropy loss discussed in chapter 5.
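For instance, the cross entropy loss can be computed directly on the last-token logits from earlier. PyTorch's cross_entropy expects raw logits and an integer class label; the target of 1 ("spam") below is an assumed label for illustration:

logits = outputs[:, -1, :]     # tensor([[-3.5983, 3.9902]])
target = torch.tensor([1])     # assumed "spam" label for illustration
loss = torch.nn.functional.cross_entropy(logits, target)
print(loss)                    # small, since class 1 already has the larger logit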

Accordingly, the calc_loss_batch function remains the same as in chapter 5, with one adjustment: we focus on optimizing only the last token, model(input_batch)[:, -1, :] , rather than all tokens, model(input_batch) :

def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)[:, -1, :]  # Logits of the last output token
    loss = torch.nn.functional.cross_entropy(logits, target_batch)
    return loss

We use the calc_loss_batch function to compute the loss for a single batch obtained from the previously defined data loaders. To calculate the loss for all batches in a data loader, we define the calc_loss_loader function, which is identical to the one described in chapter 5:

Listing 6.9 Calculating the classification loss


def calc_loss_loader(data_loader, model, device, num_batches=None):
    total_loss = 0.
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)
    else:    #A Ensure the number of batches does not exceed the batches in the data loader
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            total_loss += loss.item()
        else:
            break
    return total_loss / num_batches

Similar to calculating the training accuracy, we now compute the initial loss for each data set:

with torch.no_grad():    #B Disable gradient tracking for efficiency because we are not training yet
    train_loss = calc_loss_loader(train_loader, model, device, num_batches=5)
    val_loss = calc_loss_loader(val_loader, model, device, num_batches=5)
    test_loss = calc_loss_loader(test_loader, model, device, num_batches=5)

print(f"Training loss: {train_loss:.3f}")
print(f"Validation loss: {val_loss:.3f}")
print(f"Test loss: {test_loss:.3f}")

The initial loss values are as follows:

Training loss: 3.095
Validation loss: 2.583
Test loss: 2.322

In the next section, we will implement a training function to finetune the model, which means adjusting the model to minimize the training set loss. Minimizing the training set loss will help increase the classification accuracy, our overall goal.


6.7 Finetuning the model on supervised data


In this section, we define and use the training function to finetune the pretrained LLM and improve its spam classification accuracy. The training loop, illustrated in figure 6.15, is the same overall training loop we used in chapter 5, with the only difference being that we calculate the classification accuracy instead of generating a sample text for evaluating the model.

Figure 6.15 A typical training loop for training deep neural


networks in PyTorch consists of several steps, iterating over the
batches in the training set for several epochs. In each loop, we
calculate the loss for each training set batch to determine loss
gradients, which we use to update the model weights to minimize
the training set loss.

The training function implementing the concepts shown in figure 6.15 also closely mirrors the train_model_simple function used for pretraining the model in chapter 5.

The only two distinctions are that we now track the number of training examples seen ( examples_seen ) instead of the number of tokens, and we calculate the accuracy after each epoch instead of printing a sample text:

Listing 6.10 Finetuning the model to classify spam


def train_classifier_simple(model, train_loader, val_loader, optimizer, device,
                            num_epochs, eval_freq, eval_iter, tokenizer):
    # Initialize lists to track losses and examples seen
    train_losses, val_losses, train_accs, val_accs = [], [], [], []
    examples_seen, global_step = 0, -1

    # Main training loop
    for epoch in range(num_epochs):
        model.train()    #A Set model to training mode

        for input_batch, target_batch in train_loader:
            optimizer.zero_grad()    #B Reset loss gradients from the previous batch iteration
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward()    #C Calculate loss gradients
            optimizer.step()    #D Update model weights using loss gradients
            examples_seen += input_batch.shape[0]    #E Track examples instead of tokens
            global_step += 1

            #F Optional evaluation step
            if global_step % eval_freq == 0:
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter)
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                print(f"Ep {epoch+1} (Step {global_step:06d}): "
                      f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

        #G Calculate accuracy after each epoch
        train_accuracy = calc_accuracy_loader(
            train_loader, model, device, num_batches=eval_iter
        )
        val_accuracy = calc_accuracy_loader(
            val_loader, model, device, num_batches=eval_iter
        )

        print(f"Training accuracy: {train_accuracy*100:.2f}% | ", end="")
        print(f"Validation accuracy: {val_accuracy*100:.2f}%")
        train_accs.append(train_accuracy)
        val_accs.append(val_accuracy)

    return train_losses, val_losses, train_accs, val_accs, examples_seen

The evaluate_model function used in the preceding train_classifier_simple is identical to the one we used in chapter 5:

def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    model.eval()
    with torch.no_grad():
        train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)
        val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
    model.train()
    return train_loss, val_loss

Next, we initialize the optimizer, set the number of training epochs, and initiate the training using the train_classifier_simple function. We will discuss the choice of the number of training epochs after we evaluate the results. The training takes about 6 minutes on an M3 MacBook Air laptop computer and less than half a minute on a V100 or A100 GPU:

import time

start_time = time.time()
torch.manual_seed(123)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.1)
num_epochs = 5

train_losses, val_losses, train_accs, val_accs, examples_seen = train_classifier_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=50, eval_iter=5,
    tokenizer=tokenizer
)

end_time = time.time()
execution_time_minutes = (end_time - start_time) / 60
print(f"Training completed in {execution_time_minutes:.2f} minutes.")
The output we see during the training is as follows:

Ep 1 (Step 000000): Train loss 2.153, Val loss 2.392


Ep 1 (Step 000050): Train loss 0.617, Val loss 0.637
Ep 1 (Step 000100): Train loss 0.523, Val loss 0.557
Training accuracy: 70.00% | Validation accuracy: 72.50%
Ep 2 (Step 000150): Train loss 0.561, Val loss 0.489
Ep 2 (Step 000200): Train loss 0.419, Val loss 0.397
Ep 2 (Step 000250): Train loss 0.409, Val loss 0.353
Training accuracy: 82.50% | Validation accuracy: 85.00%
Ep 3 (Step 000300): Train loss 0.333, Val loss 0.320
Ep 3 (Step 000350): Train loss 0.340, Val loss 0.306
Training accuracy: 90.00% | Validation accuracy: 90.00%
Ep 4 (Step 000400): Train loss 0.136, Val loss 0.200
Ep 4 (Step 000450): Train loss 0.153, Val loss 0.132
Ep 4 (Step 000500): Train loss 0.222, Val loss 0.137
Training accuracy: 100.00% | Validation accuracy: 97.50%
Ep 5 (Step 000550): Train loss 0.207, Val loss 0.143
Ep 5 (Step 000600): Train loss 0.083, Val loss 0.074
Training accuracy: 100.00% | Validation accuracy: 97.50%
Training completed in 5.65 minutes.

Similar to chapter 5, we then use matplotlib to plot the loss function for the training and validation set:

Listing 6.11 Plotting the classification loss


import matplotlib.pyplot as plt

def plot_values(epochs_seen, examples_seen, train_values, val_values, label="loss"):
    fig, ax1 = plt.subplots(figsize=(5, 3))

    #A Plot training and validation values against epochs
    ax1.plot(epochs_seen, train_values, label=f"Training {label}")
    ax1.plot(epochs_seen, val_values, linestyle="-.", label=f"Validation {label}")
    ax1.set_xlabel("Epochs")
    ax1.set_ylabel(label.capitalize())
    ax1.legend()

    #B Create a second x-axis for the number of examples seen
    ax2 = ax1.twiny()
    ax2.plot(examples_seen, train_values, alpha=0)  # Invisible plot for aligning ticks
    ax2.set_xlabel("Examples seen")

    fig.tight_layout()  #C Adjust layout to make room
    plt.savefig(f"{label}-plot.pdf")
    plt.show()

epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
examples_seen_tensor = torch.linspace(0, examples_seen, len(train_losses))

plot_values(epochs_tensor, examples_seen_tensor, train_losses, val_losses)

The resulting loss curves are shown in the plot in figure 6.16.

Figure 6.16 This graph shows the model's training and validation
loss over the five training epochs. The training loss, represented
by the solid line, and the validation loss, represented by the
dashed line, both sharply decline in the first epoch and gradually
stabilize towards the fifth epoch. This pattern indicates good
learning progress and suggests that the model learned from the
training data while generalizing well to the unseen validation
data.
As we can see based on the sharp downward slope in figure 6.16, the model is learning well from the training data, and there is little to no indication of overfitting; that is, there is no noticeable gap between the training and validation set losses.

Choosing the number of epochs

Earlier, when we initiated the training, we set the number of epochs to 5. The number of epochs depends on the dataset and the task's difficulty, and there is no universal solution or recommendation. An epoch number of 5 is usually a good starting point. If the model overfits after the first few epochs, as a loss plot like the one shown in figure 6.16 could indicate, we may need to reduce the number of epochs. Conversely, if the trendline suggests that the validation loss could improve with further training, we should increase the number of epochs. In this concrete case, 5 epochs was a reasonable number, as there is no sign of early overfitting, and the validation loss is close to 0.

Using the same plot_values function, let's now also plot the classification accuracies:

epochs_tensor = torch.linspace(0, num_epochs, len(train_accs))
examples_seen_tensor = torch.linspace(0, examples_seen, len(train_accs))

plot_values(epochs_tensor, examples_seen_tensor, train_accs, val_accs, label="accuracy")

The resulting accuracy graphs are shown in figure 6.17.

Figure 6.17 Both the training accuracy (solid line) and the
validation accuracy (dashed line) increase substantially in the
early epochs and then plateau, achieving almost perfect accuracy
scores of 1.0. The close proximity of the two lines throughout the
epochs suggests that the model does not overfit the training data
much.

Based on the accuracy plot in figure 6.17, the model achieves a relatively high training and validation accuracy after epochs 4 and 5.

However, it's important to note that we previously set eval_iter=5 when using the train_classifier_simple function, which means our estimates of the training and validation performance were based on only 5 batches for efficiency during training.

Now, we will calculate the performance metrics for the training, validation, and test sets across the entire dataset by running the following code, this time without defining the eval_iter value:

train_accuracy = calc_accuracy_loader(train_loader, model, device)


val_accuracy = calc_accuracy_loader(val_loader, model, device)
test_accuracy = calc_accuracy_loader(test_loader, model, device)

print(f"Training accuracy: {train_accuracy*100:.2f}%")


print(f"Validation accuracy: {val_accuracy*100:.2f}%")
print(f"Test accuracy: {test_accuracy*100:.2f}%")

The resulting accuracy values are as follows:

Training accuracy: 97.21%


Validation accuracy: 97.32%
Test accuracy: 95.67%


The training and test set performances are almost identical.

A slight discrepancy between the training and test set accuracies suggests minimal overfitting of the training data. Typically, the validation set accuracy is somewhat higher than the test set accuracy because model development often involves tuning hyperparameters to perform well on the validation set, which might not generalize as effectively to the test set.

This situation is common, but the gap could potentially be minimized by adjusting the model's settings, such as increasing the dropout rate ( drop_rate ) or the weight_decay parameter in the optimizer configuration.
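As a sketch of what such an adjustment could look like (the specific values are arbitrary starting points, not recommendations from this chapter), the dropout rate and weight decay could be increased before repeating the finetuning; note that the model must be re-created from the updated configuration, with the pretrained weights reloaded, for a new dropout rate to take effect:

BASE_CONFIG["drop_rate"] = 0.1    # enable dropout; requires re-creating the model from this config
optimizer = torch.optim.AdamW(
    model.parameters(), lr=5e-5,
    weight_decay=0.3              # stronger weight decay than the 0.1 used earlier
)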

6.8 Using the LLM as a spam classifier

After finetuning and evaluating the model in the previous sections, we are now in the final stage of this chapter, as illustrated in figure 6.18: using the model to classify spam messages.

Figure 6.18 Illustration of the three-stage process for


classification-finetuning the LLM in this chapter. This section
implements the final step of stage 3, using the finetuned model to
classify new spam messages.

Finally, let's use the finetuned GPT-based spam classification model. The following classify_review function follows data preprocessing steps similar to those we used in the SpamDataset implemented earlier in this chapter. Then, after processing the text into token IDs, the function uses the model to predict an integer class label, similar to what we implemented in section 6.6, and returns the corresponding class name:

Listing 6.12 Using the model to classify new texts


def classify_review(text, model, tokenizer, device, max_length=None, pad_token_id=50256):
    model.eval()

    input_ids = tokenizer.encode(text)  #A Prepare the inputs to the model
    supported_context_length = model.pos_emb.weight.shape[1]

    input_ids = input_ids[:min(max_length, supported_context_length)]  #B Truncate sequences if they are too long

    input_ids += [pad_token_id] * (max_length - len(input_ids))  #C Pad sequences to the longest sequence

    input_tensor = torch.tensor(input_ids, device=device).unsqueeze(0)  #D Add a batch dimension

    with torch.no_grad():  #E Model inference without gradient tracking
        logits = model(input_tensor)[:, -1, :]  #F Logits of the last output token
    predicted_label = torch.argmax(logits, dim=-1).item()

    return "spam" if predicted_label == 1 else "not spam"  #G Return the classified result

Let's try this classify_review function on an example text:

text_1 = (
    "You are a winner you have been specially"
    " selected to receive $1000 cash or a $2000 award."
)

print(classify_review(
    text_1, model, tokenizer, device, max_length=train_dataset.max_length
))

The model correctly predicts "spam" . Next, let's try another example:

text_2 = (
    "Hey, just wanted to check if we're still on"
    " for dinner tonight? Let me know!"
)

print(classify_review(
    text_2, model, tokenizer, device, max_length=train_dataset.max_length
))

Also, here, the model makes a correct prediction and returns a "not spam" label.

Finally, let's save the model in case we want to reuse it later without having to train it again, using the torch.save method we introduced in the previous chapter.

torch.save(model.state_dict(), "review_classifier.pth")


Once saved, the model can be loaded as follows:

model_state_dict = torch.load("review_classifier.pth")
model.load_state_dict(model_state_dict)

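After reloading the weights, it is good practice to switch the model to evaluation mode before classifying new text. A brief usage sketch with a hypothetical message:

model.eval()    # disable dropout for deterministic inference
print(classify_review(
    "Congratulations! Call now to claim your free prize.",    # hypothetical example text
    model, tokenizer, device, max_length=train_dataset.max_length
))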

6.9 Summary
There are different strategies for finetuning LLMs,
including classification-finetuning (this chapter) and
instruction-finetuning (next chapter).
Classification-finetuning involves replacing the output
layer of an LLM with a small classification layer.
In the case of classifying text messages as "spam" or "not
spam," the new classification layer consists of only 2 output
nodes; in previous chapters, the number of output nodes
was equal to the number of unique tokens in the vocabulary,
namely, 50,257.
Instead of predicting the next token in the text as in
pretraining, classification-finetuning trains the model to
output a correct class label, for example, "spam" or "not
spam."
The model input for finetuning is text converted into token
IDs, similar to pretraining.
Before finetuning an LLM, we load the pretrained model as a
base model.
Evaluating a classification model involves calculating the
classification accuracy (the fraction or percentage of correct
predictions).
Finetuning a classification model uses the same cross
entropy loss function that is used for pretraining the LLM.
