6 Finetuning For Classification - Build A Large Language Model (From Scratch)
The first step is to download the dataset via the following code:
url = "https://archive.ics.uci.edu/static/public/228/sms+spam+coll
zip_path = "sms_spam_collection.zip"
extracted_path = "sms_spam_collection"
data_file_path = Path(extracted_path) / "SMSSpamCollection.tsv"
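The download-and-extract step itself is not included in this excerpt; a minimal sketch using Python's standard library (and assuming the extracted archive contains a file named SMSSpamCollection, which we rename to the .tsv path defined above) could look like this:
import urllib.request
import zipfile

# Download the zip archive
with urllib.request.urlopen(url) as response, open(zip_path, "wb") as out_file:
    out_file.write(response.read())

# Extract the archive and rename the tab-separated data file
with zipfile.ZipFile(zip_path, "r") as zip_ref:
    zip_ref.extractall(extracted_path)
(Path(extracted_path) / "SMSSpamCollection").rename(data_file_path)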
import pandas as pd
df = pd.read_csv(data_file_path, sep="\t", header=None, names=["Label", "Text"])
df #A
The resulting data frame of the spam dataset is shown in figure 6.5. Let's examine the class label distribution:
print(df["Label"].value_counts())
Executing the previous code, we find that the data contains "ham" (i.e., not spam) far more frequently than "spam":
Label
ham 4825
spam 747
Name: count, dtype: int64
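The create_balanced_dataset function called next is not shown in this excerpt; a minimal sketch of the usual undersampling approach (keep all spam messages and randomly sample an equally sized subset of ham messages) might look like this:
def create_balanced_dataset(df):
    # Count the spam messages and sample an equally sized subset of ham messages
    num_spam = df[df["Label"] == "spam"].shape[0]
    ham_subset = df[df["Label"] == "ham"].sample(num_spam, random_state=123)
    # Combine the ham subset with all spam messages
    return pd.concat([ham_subset, df[df["Label"] == "spam"]])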
balanced_df = create_balanced_dataset(df)
print(balanced_df["Label"].value_counts())
After executing the previous code to balance the dataset, we can see that we now have equal amounts of spam and non-spam messages:
Label
ham 747
spam 747
Name: count, dtype: int64
Next, we convert the "string" class labels "ham" and "spam" into the integer class labels 0 and 1, respectively:
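The conversion code is not included in this excerpt; a minimal sketch using pandas might be:
# Map the string labels to integers: 0 for "ham", 1 for "spam"
balanced_df["Label"] = balanced_df["Label"].map({"ham": 0, "spam": 1})
Next, we split the dataset into training, validation, and test portions and save them as CSV files: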
#C
train_df = df[:train_end]
validation_df = df[train_end:validation_end]
test_df = df[validation_end:]
train_df.to_csv("train.csv", index=None)
validation_df.to_csv("validation.csv", index=None)
test_df.to_csv("test.csv", index=None)
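The shuffling and index computations that precede this split are not shown in this excerpt; under the common assumption of a 70/10/20 train/validation/test split, they might look like this:
# Hypothetical split fractions; the exact values are an assumption
train_frac, validation_frac = 0.7, 0.1

df = balanced_df.sample(frac=1, random_state=123).reset_index(drop=True)  # shuffle rows
train_end = int(len(df) * train_frac)
validation_end = train_end + int(len(df) * validation_frac)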
Figure 6.6 presumes that 50,256 is the token ID of the padding token
"<|endoftext|>". We can double-check that this is indeed the
correct token ID by encoding "<|endoftext|>" with the GPT-2
tokenizer from the tiktoken package that we used in previous
chapters:
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
print(tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))
import torch
from torch.utils.data import Dataset


class SpamDataset(Dataset):
    def __init__(self, csv_file, tokenizer, max_length=None, pad_token_id=50256):
self.data = pd.read_csv(csv_file)
#A
self.encoded_texts = [
tokenizer.encode(text) for text in self.data["Text"]
]
if max_length is None:
self.max_length = self._longest_encoded_length()
else:
self.max_length = max_length
#B
self.encoded_texts = [
encoded_text[:self.max_length]
for encoded_text in self.encoded_texts
]
#C
self.encoded_texts = [
encoded_text + [pad_token_id] * (self.max_length - len(encoded_text))
for encoded_text in self.encoded_texts
]
def __len__(self):
return len(self.data)
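    def __getitem__(self, index):
        # __getitem__ is required by PyTorch's DataLoader; it is not shown in
        # this excerpt, so this reconstruction is an assumption consistent with
        # how the loaders are used later (token ID inputs, integer label targets)
        encoded = self.encoded_texts[index]
        label = self.data.iloc[index]["Label"]
        return (
            torch.tensor(encoded, dtype=torch.long),
            torch.tensor(label, dtype=torch.long),
        )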
def _longest_encoded_length(self):
max_length = 0
for encoded_text in self.encoded_texts:
encoded_length = len(encoded_text)
if encoded_length > max_length:
max_length = encoded_length
return max_length
The SpamDataset class loads the data from the CSV files we created
earlier, tokenizes the text using the GPT-2 tokenizer from
tiktoken, and allows us to pad or truncate the sequences to a
uniform length determined by either the longest sequence or a
predefined maximum length. This ensures each input tensor is of the
same size, which is necessary to create the batches in the training
data loader we implement next. First, we instantiate the training
dataset and let it determine the maximum sequence length:
train_dataset = SpamDataset(
csv_file="train.csv",
max_length=None,
tokenizer=tokenizer
)
print(train_dataset.max_length)
The code outputs 120, showing that the longest sequence contains no
more than 120 tokens, a common length for text messages. It's worth
noting that the model can handle sequences of up to 1,024 tokens,
given its context length limit. If your dataset includes longer texts,
you can pass max_length=1024 when creating the training dataset
in the preceding code to ensure that the data does not exceed the
model's supported input (context) length.
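For example, such a call might look like this (only the max_length argument changes):
train_dataset = SpamDataset(
    csv_file="train.csv",
    max_length=1024,   # cap at the model's context length instead of the longest message
    tokenizer=tokenizer
)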
Next, we pad the validation and test sets to match the length of the
longest training sequence. It's important to note that any validation
and test set samples exceeding the length of the longest training
example are truncated using encoded_text[:self.max_length]
in the SpamDataset code we defined earlier. This truncation is
optional; you could also set max_length=None for both the validation
and test sets, provided there are no sequences exceeding 1,024
tokens in these sets.
val_dataset = SpamDataset(
csv_file="validation.csv",
max_length=train_dataset.max_length,
tokenizer=tokenizer
)
test_dataset = SpamDataset(
csv_file="test.csv",
max_length=train_dataset.max_length,
tokenizer=tokenizer
)
The following code creates the training, validation, and test set data
loaders that load the text messages and labels in batches of size 8, as
illustrated in figure 6.7:
from torch.utils.data import DataLoader

num_workers = 0 #A
batch_size = 8
torch.manual_seed(123)
train_loader = DataLoader(
dataset=train_dataset,
batch_size=batch_size,
shuffle=True,
num_workers=num_workers,
drop_last=True,
)
val_loader = DataLoader(
dataset=val_dataset,
batch_size=batch_size,
num_workers=num_workers,
drop_last=False,
)
test_loader = DataLoader(
dataset=test_dataset,
batch_size=batch_size,
num_workers=num_workers,
drop_last=False,
)
To ensure that the data loaders are working and are indeed returning
batches of the expected size, we iterate over the training loader and
then print the tensor dimensions of the last batch:
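The check itself is not included in this excerpt; a minimal sketch might be:
# Iterate through all batches; the loop variables keep the last batch
for input_batch, target_batch in train_loader:
    pass

print("Input batch dimensions:", input_batch.shape)
print("Label batch dimensions:", target_batch.shape)
With a batch size of 8 and a maximum length of 120 tokens, we would expect input batches of shape torch.Size([8, 120]) and label batches of shape torch.Size([8]).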
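The next code block references a BASE_CONFIG dictionary and pretrained params that are set up in code not shown in this excerpt; params would come from the GPT-2 weight-download utility used in chapter 5. A configuration consistent with the 124M-parameter GPT-2 architecture printed further below would be the following sketch:
BASE_CONFIG = {
    "vocab_size": 50257,     # vocabulary size
    "context_length": 1024,  # maximum context length
    "emb_dim": 768,          # embedding dimension
    "n_heads": 12,           # number of attention heads
    "n_layers": 12,          # number of transformer blocks
    "drop_rate": 0.0,        # dropout disabled
    "qkv_bias": True         # bias in the query/key/value projections
}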
model = GPTModel(BASE_CONFIG)
load_weights_into_gpt(model, params)
model.eval()
After loading the model weights into the GPTModel, we use the text
generation utility function from the previous chapters to ensure that
the model generates coherent text:
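The prompt used for this sanity check is not shown in this excerpt; a sketch with a generic prompt could look like this:
text_1 = "Every effort moves you"  # example prompt; the exact prompt is an assumption
token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(text_1, tokenizer),
    max_new_tokens=15,
    context_size=BASE_CONFIG["context_length"]
)
print(token_ids_to_text(token_ids, tokenizer))
Next, before any finetuning, we can also check whether the model can already classify spam messages when instructed to do so via a prompt: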
text_2 = (
"Is the following text 'spam'? Answer with 'yes' or 'no':"
" 'You are a winner you have been specially"
" selected to receive $1000 cash or a $2000 award.'"
)
token_ids = generate_text_simple(
model=model,
idx=text_to_token_ids(text_2, tokenizer),
max_new_tokens=23,
context_size=BASE_CONFIG["context_length"]
)
print(token_ids_to_text(token_ids, tokenizer))
Is the following text 'spam'? Answer with 'yes' or 'no': 'You are
The following text 'spam'? Answer with 'yes' or 'no': 'You are a w
Based on the output, it's apparent that the model struggles with
following instructions. This is expected, since it has only been
pretrained and not instruction-finetuned. Hence, we prepare the model
for classification-finetuning next.
Figure 6.9 This figure illustrates adapting a GPT model for spam
classification by altering its architecture. Initially, the model's
linear output layer mapped 768 hidden units to a vocabulary of
50,257 tokens. For spam detection, this layer is replaced with a
new output layer that maps the same 768 hidden units to just two
classes, representing "spam" and "not spam."
As shown in figure 6.9, we use the same model as in previous
chapters except for replacing the output layer. Printing the model
shows the original architecture with its vocabulary-sized output head:
GPTModel(
(tok_emb): Embedding(50257, 768)
(pos_emb): Embedding(1024, 768)
(drop_emb): Dropout(p=0.0, inplace=False)
(trf_blocks): Sequential(
...
(11): TransformerBlock(
(att): MultiHeadAttention(
      (W_query): Linear(in_features=768, out_features=768, bias=True)
      (W_key): Linear(in_features=768, out_features=768, bias=True)
      (W_value): Linear(in_features=768, out_features=768, bias=True)
      (out_proj): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(ff): FeedForward(
(layers): Sequential(
        (0): Linear(in_features=768, out_features=3072, bias=True)
        (1): GELU()
        (2): Linear(in_features=3072, out_features=768, bias=True)
)
)
(norm1): LayerNorm()
(norm2): LayerNorm()
(drop_resid): Dropout(p=0.0, inplace=False)
)
)
(final_norm): LayerNorm()
  (out_head): Linear(in_features=768, out_features=50257, bias=False)
)
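The code that freezes the pretrained weights and swaps in the classification head is not included in this excerpt; a sketch of the approach described in figure 6.9 might look like this (treating the choice to unfreeze the last transformer block and the final LayerNorm as an assumption):
# Freeze all pretrained parameters
for param in model.parameters():
    param.requires_grad = False

# Replace the vocabulary-sized output head with a two-class classification head
num_classes = 2
model.out_head = torch.nn.Linear(in_features=768, out_features=num_classes)

# Optionally make the last transformer block and the final LayerNorm trainable
for param in model.trf_blocks[-1].parameters():
    param.requires_grad = True
for param in model.final_norm.parameters():
    param.requires_grad = True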
Even though we added a new output layer and marked certain layers
as trainable or non-trainable, we can still use this model in a similar
way to previous chapters. For instance, we can feed it an example
text just as we did in earlier chapters:
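The encoding code is not shown in this excerpt; a sketch with a short four-token example prompt (the exact prompt is an assumption) might be:
inputs = tokenizer.encode("Do you have time")  # example prompt
inputs = torch.tensor(inputs).unsqueeze(0)     # add a batch dimension
print("Inputs:", inputs)
print("Inputs dimensions:", inputs.shape)      # shape: (batch_size, num_tokens)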
As the print output shows, the preceding code encodes the inputs
into a tensor consisting of 4 input tokens. We can then pass the
encoded token IDs to the model as usual:
with torch.no_grad():
outputs = model(inputs)
print("Outputs:\n", outputs)
print("Outputs dimensions:", outputs.shape) # shape: (batch_size,
Outputs:
tensor([[[-1.5854, 0.9904],
[-3.7235, 7.4548],
[-2.2661, 6.6049],
[-3.5983, 3.9902]]])
Outputs dimensions: torch.Size([1, 4, 2])
To extract the last output token, illustrated in figure 6.11, from the
output tensor, we use the following code:
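A one-line sketch of this extraction (indexing the last position along the token dimension) is:
print("Last output token:", outputs[:, -1, :])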
This prints the values of the last output token, which correspond to
the last row of the output tensor shown above: tensor([[-3.5983, 3.9902]]).
Having modified the model, the next section details the process of
transforming the last token into class label predictions and calculating
the model's initial prediction accuracy. Following this, we will
finetune the model for the spam classification task in the subsequent
section.
Exercise 6.3 Finetuning the first versus last token
Rather than finetuning the last output token, try finetuning the
first output token and observe the changes in predictive
performance when finetuning the model in later sections.
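One way to start (an assumption about where the change belongs) is to change the token position used when reading out the model's output wherever it appears, for example:
logits = model(input_batch)[:, -1, :]  # last-token readout used in this chapter
logits = model(input_batch)[:, 0, :]   # first-token readout for this exercise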
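The classification code referenced in the next paragraph is not included in this excerpt; a sketch that turns the last-token outputs into a class prediction via softmax might look like this:
probas = torch.softmax(outputs[:, -1, :], dim=-1)  # convert outputs to probabilities
label = torch.argmax(probas)                       # index of the most likely class
print("Class label:", label.item())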
In this case, the code returns 1, meaning the model predicts that the
input text is "spam." Using the softmax function here is optional
because the largest outputs directly correspond to the highest
probability scores, as mentioned in chapter 5. Hence, we can simplify
the code as follows, without using softmax:
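A corresponding sketch without the softmax step:
logits = outputs[:, -1, :]
label = torch.argmax(logits)  # argmax over the raw outputs yields the same class
print("Class label:", label.item())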
def calc_accuracy_loader(data_loader, model, device, num_batches=None):
    model.eval()
    correct_predictions, num_examples = 0, 0

    if num_batches is None:
        num_batches = len(data_loader)
    else:
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            input_batch = input_batch.to(device)
            target_batch = target_batch.to(device)
            with torch.no_grad():
                logits = model(input_batch)[:, -1, :] #A
            predicted_labels = torch.argmax(logits, dim=-1)
            num_examples += predicted_labels.shape[0]
            correct_predictions += (
                (predicted_labels == target_batch).sum().item()
            )
        else:
            break
    return correct_predictions / num_examples
torch.manual_seed(123)
train_accuracy = calc_accuracy_loader(train_loader, model, device, num_batches=10)
val_accuracy = calc_accuracy_loader(val_loader, model, device, num_batches=10)
test_accuracy = calc_accuracy_loader(test_loader, model, device, num_batches=10)
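The corresponding print statements are not shown in this excerpt; they might look like this:
print(f"Training accuracy: {train_accuracy*100:.2f}%")
print(f"Validation accuracy: {val_accuracy*100:.2f}%")
print(f"Test accuracy: {test_accuracy*100:.2f}%")
The loss functions and the beginning of the train_classifier_simple training function are omitted from this excerpt; the following fragment shows its periodic evaluation and accuracy-tracking steps.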
#F
if global_step % eval_freq == 0:
train_loss, val_loss = evaluate_model(
model, train_loader, val_loader, device, eval_iter)
train_losses.append(train_loss)
val_losses.append(val_loss)
print(f"Ep {epoch+1} (Step {global_step:06d}): "
f"Train loss {train_loss:.3f}, Val loss {val
#G
train_accuracy = calc_accuracy_loader(
train_loader, model, device, num_batches=eval_iter
)
val_accuracy = calc_accuracy_loader(
val_loader, model, device, num_batches=eval_iter
)
The evaluate_model function used in the preceding
train_classifier_simple is identical to the one we used in chapter
5:
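That function is not reproduced in this excerpt; a sketch consistent with the chapter 5 version (it relies on a calc_loss_loader helper from that chapter, which is assumed here) is:
def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    model.eval()  # disable dropout for stable loss estimates
    with torch.no_grad():
        train_loss = calc_loss_loader(
            train_loader, model, device, num_batches=eval_iter
        )
        val_loss = calc_loss_loader(
            val_loader, model, device, num_batches=eval_iter
        )
    model.train()
    return train_loss, val_loss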
import time
start_time = time.time()
torch.manual_seed(123)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.1)
num_epochs = 5
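# The call to train_classifier_simple that runs the finetuning and returns the
# loss/accuracy histories is not shown in this excerpt; a hypothetical
# invocation (argument names are assumptions) would go here, for example:
# train_losses, val_losses, train_accs, val_accs, examples_seen = \
#     train_classifier_simple(
#         model, train_loader, val_loader, optimizer, device,
#         num_epochs=num_epochs, eval_freq=50, eval_iter=5
#     )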
end_time = time.time()
execution_time_minutes = (end_time - start_time) / 60
print(f"Training completed in {execution_time_minutes:.2f} minutes
The output we see during the training is as follows:
import matplotlib.pyplot as plt

def plot_values(epochs_seen, examples_seen, train_values, val_values, label="loss"):
    fig, ax1 = plt.subplots()
    #A
    ax1.plot(epochs_seen, train_values, label=f"Training {label}")
    ax1.plot(epochs_seen, val_values, linestyle="-.", label=f"Validation {label}")
    ax1.set_xlabel("Epochs")
    ax1.set_ylabel(label.capitalize())
    ax1.legend()
    #B
    ax2 = ax1.twiny()
    ax2.plot(examples_seen, train_values, alpha=0)  # Invisible plot for aligning ticks
    ax2.set_xlabel("Examples seen")
    fig.tight_layout() #C
    plt.savefig(f"{label}-plot.pdf")
    plt.show()
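The call that produces the loss plot is not shown in this excerpt; a sketch consistent with the function above (assuming train_losses, val_losses, and examples_seen come from the training run) could be:
epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
examples_seen_tensor = torch.linspace(0, examples_seen, len(train_losses))
plot_values(epochs_tensor, examples_seen_tensor, train_losses, val_losses)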
The resulting loss curves are shown in the plot in figure 6.16.
Figure 6.16 This graph shows the model's training and validation
loss over the five training epochs. The training loss, represented
by the solid line, and the validation loss, represented by the
dashed line, both sharply decline in the first epoch and gradually
stabilize towards the fifth epoch. This pattern indicates good
learning progress and suggests that the model learned from the
training data while generalizing well to the unseen validation
data.
As we can see based on the sharp downward slope in figure 6.16, the
model is learning well from the training data, and there is little to no
indication of overfitting; that is, there is no noticeable gap between
the training and validation set losses.
Using the same plot_values function, let's now also plot the
classification accuracies:
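A sketch of this call, mirroring the loss plot above (train_accs and val_accs are assumed to come from the training run):
epochs_tensor = torch.linspace(0, num_epochs, len(train_accs))
examples_seen_tensor = torch.linspace(0, examples_seen, len(train_accs))
plot_values(
    epochs_tensor, examples_seen_tensor, train_accs, val_accs, label="accuracy"
)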
Figure 6.17 Both the training accuracy (solid line) and the
validation accuracy (dashed line) increase substantially in the
early epochs and then plateau, achieving almost perfect accuracy
scores of 1.0. The close proximity of the two lines throughout the
epochs suggests that the model does not overfit the training data
much.
The resulting accuracy values are as follows:
Finally, we can use the finetuned model to classify new texts, using the same padding and truncation conventions as the SpamDataset:
def classify_review(text, model, tokenizer, device, max_length=None, pad_token_id=50256):
    model.eval()
    input_ids = tokenizer.encode(text) #A
    supported_context_length = model.pos_emb.weight.shape[1]
    if max_length is None:
        max_length = supported_context_length
    input_ids = input_ids[:min(max_length, supported_context_length)]  # truncate if too long
    input_ids += [pad_token_id] * (max_length - len(input_ids))  # pad to max_length
    input_tensor = torch.tensor(input_ids, device=device).unsqueeze(0)  # add batch dimension
    with torch.no_grad(): #E
        logits = model(input_tensor)[:, -1, :] #F
    predicted_label = torch.argmax(logits, dim=-1).item()
    return "spam" if predicted_label == 1 else "not spam"

text_1 = (
    "You are a winner you have been specially"
    " selected to receive $1000 cash or a $2000 award."
)
print(classify_review(
    text_1, model, tokenizer, device, max_length=train_dataset.max_length
))
text_2 = (
"Hey, just wanted to check if we're still on"
" for dinner tonight? Let me know!"
)
print(classify_review(
text_2, model, tokenizer, device, max_length=train_dataset.max_length
))
Also, here, the model makes a correct prediction and returns a "not
spam" label.
Finally, let's save the model in case we want to reuse it later
without having to train it again, using the torch.save method we
introduced in the previous chapter.
torch.save(model.state_dict(), "review_classifier.pth")
model_state_dict = torch.load("review_classifier.pth")
model.load_state_dict(model_state_dict)
6.9 Summary
There are different strategies for finetuning LLMs,
including classification-finetuning (this chapter) and
instruction-finetuning (next chapter)
Classification-finetuning involves replacing the output
layer of an LLM with a small classification layer.
In the case of classifying text messages as "spam" or "not
spam," the new classification layer consists of only 2 output
nodes; in previous chapters, the number of output nodes
was equal to the number of unique tokens in the vocabulary,
namely, 50,257.
Instead of predicting the next token in the text as in
pretraining, classification-finetuning trains the model to
output a correct class label, for example, "spam" or "not
spam."
The model input for finetuning is text converted into token
IDs, similar to pretraining.
Before finetuning an LLM, we load the pretrained model as a
base model.
Evaluating a classification model involves calculating the
classification accuracy (the fraction or percentage of correct
predictions).
Finetuning a classification model uses the same cross
entropy loss function that is used for pretraining the LLM.