Generative AI
Lecture 1: Course Overview + AutoDiff + RNN-LMs
Matt Gormley
Jan. 17, 2024
1
Generative AI Full Course 2024
2
Artificial Intelligence
[Nested diagram: Artificial Intelligence ⊃ Machine Learning ⊃ Deep Learning]
The basic goal of AI is to develop intelligent machines.
• Perception
• Reasoning
• Control / Motion / Manipulation
• Planning
• Communication
• Creativity
• Learning
3
Artificial Intelligence
[Nested diagram: Artificial Intelligence ⊃ Machine Learning ⊃ Deep Learning]
The basic goal of AI is to develop intelligent machines.
• Perception
• Reasoning
• Control / Motion / Manipulation
• Planning
• Communication
• Creativity
• Learning

Q: What does Generative AI have to do with any of these goals?
A: It's making in-roads into all of them.
11
Artificial Intelligence
[Nested diagram: Artificial Intelligence ⊃ Machine Learning ⊃ Deep Learning]
The basic goal of AI is to develop intelligent machines.
• Perception
• Reasoning
• Control / Motion / Manipulation
• Planning
• Communication
• Creativity
• Learning

• Communication comprises the comprehension and generation of human language.
• Large language models (LLMs) excel at both.
• (Even though they are most often trained autoregressively, i.e. to generate a next word, given the previous ones.)
12
Artificial Intelligence
[Nested diagram: Artificial Intelligence ⊃ Machine Learning ⊃ Deep Learning]
The basic goal of AI is to develop intelligent machines.
• Perception
• Reasoning
• Control / Motion / Manipulation
• Planning
• Communication
• Creativity
• Learning

• The traditional way of learning in ML is via parameter estimation.
• But in-context learning (i.e. providing training examples as context at test time) shows that learning can also be done via inference.
13
Artificial Intelligence
[Nested diagram: Artificial Intelligence ⊃ Machine Learning ⊃ Deep Learning]
The basic goal of AI is to develop intelligent machines.
• Perception
• Reasoning
• Control / Motion / Manipulation
• Planning
• Communication
• Creativity
• Learning

• LLMs are also (unexpectedly) good at certain reasoning tasks.
• cf. Chain-of-Thought Prompting (an example of in-context learning).
14
Artificial Intelligence
[Nested diagram: Artificial Intelligence ⊃ Machine Learning ⊃ Deep Learning]
The basic goal of AI is to develop intelligent machines.
• Perception
• Reasoning
• Control / Motion / Manipulation
• Planning
• Communication
• Creativity
• Learning

• LLMs are already being used for grounded planning for embodied agents, cf. LLM-Planner.
15
Artificial Intelligence
[Nested diagram: Artificial Intelligence ⊃ Machine Learning ⊃ Deep Learning]
The basic goal of AI is to develop intelligent machines.
• Perception
• Reasoning
• Control / Motion / Manipulation
• Planning
• Communication
• Creativity
• Learning

• Text-to-image models [Midjourney's Discord server has 18 million members (1.7 million were online this morning)]
• Text-to-music models [MusicGen is capable of conditioning on text and audio samples]
16
“Deep Style” from https://deepdreamgenerator.com/#gallery
Artificial Intelligence
[Nested diagram: Artificial Intelligence ⊃ Machine Learning ⊃ Deep Learning]
The basic goal of AI is to develop intelligent machines.
• Perception
• Reasoning
• Control / Motion / Manipulation
• Planning
• Communication
• Creativity
• Learning

• Multimodal foundation models learn to answer questions about images (and text in images)
• Diffusion models can be used as zero-shot classifiers
17
Artificial Intelligence
[Nested diagram: Artificial Intelligence ⊃ Machine Learning ⊃ Deep Learning]
The basic goal of AI is to develop intelligent machines.
• Perception
• Reasoning
• Control / Motion / Manipulation
• Planning
• Communication
• Creativity
• Learning

• DayDreamer learns a generative model of experiences for RL, i.e. a World Model, without simulation
• Quadruped robot learns to walk in under 1 hour
18
Artificial Intelligence
[Nested diagram: Artificial Intelligence ⊃ Machine Learning ⊃ Deep Learning]
The basic goal of AI is to develop intelligent machines.
• Perception
• Reasoning
• Control / Motion / Manipulation
• Planning
• Communication
• Creativity
• Learning

Q: What does Generative AI have to do with any of these goals?
A: It's making in-roads into all of them.
19
EXAMPLES OF GENERATIVE AI
20
Text Generation
Prompt: Write a proof of the fact that there are infinitely many primes; do it in the style of a Shakespeare play through a dialogue between two parties arguing over the proof.

GPT-4: Enter FIRST PARTY and SECOND PARTY.

FIRST PARTY: Thou dost not understand, my friend, There simply is no upper end To primes, those numbers indivisible, No matter if it seems impossible.

SECOND PARTY: But how dost thou so surely know? To prove this truth, what canst thou show?

FIRST PARTY: Consider this, my doubtful peer, A clever proof that will make clear: Assume that there's a final prime, The largest one we'll see in time. Let's call this number "P" for short, And then, my friend, we will purport To build a number "Q" that's new, One larger than "P" and prime too.

SECOND PARTY: But how, my friend, will this be done? The proof is not yet fairly won.

FIRST PARTY: It's simple, friend, just wait and see, We'll multiply the primes, that's key. All primes from two to "P" combined, Will make this new "Q" we will find. But one thing more we must include, To make the proof more watertight, We'll add a one onto the end, And now the proof is in our sight.

SECOND PARTY: I see it now, it's crystal clear, For "Q" must be a prime, my dear. It can't divide by any prime, All smaller than itself in time. And since we chose "P" as the last, This proves that our assumption's past. There is no final prime, we see, The numbers go on endlessly.
21
Example from https://arxiv.org/pdf/2303.12712.pdf
Image Editing
A variety of tasks involve automatic editing of an image:
• Inpainting fills in the (pre-specified) missing pixels
• Colorization restores color to a greyscale image
• Uncropping creates a photo-realistic reconstruction of a missing side of an image
22
Figure from Saharia et al. (2022)
Text-to-Image Generation
• Given a text description, sample an image that depicts the prompt
• The following images are samples from SDXL with refinement
23
Figure from https://stablediffusionweb.com/
Music Generation
MusicGen
• A transformer decoder model over quantized units (discrete elements of a codebook of audio frames)
• Interleaves sounds by adjusting how codebooks attend to each other
• Permits conditioning on text and/or audio samples
24
Figure from https://arxiv.org/pdf/2306.05284.pdf
Code Generation
25
Example from https://arxiv.org/pdf/2303.12712.pdf
Video Generation
• Latent diffusion models use a low-dimensional latent space for efficiency
• Key question: how to generate multiple correlated frames?
• 'Align your Latents' inserts temporal convolution / attention between each spatial convolution / attention
• 'Preserve Your Own Correlation' includes temporally correlated noise
26
Figure from https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt
SCALING UP
27
Training Data for LLMs
The Pile:
• An open source dataset for training language models
• Comprised of 22 smaller datasets
• Favors high quality text
• 825 GB ≈ 1.2 trillion tokens
28
RLHF
• InstructGPT uses Reinforcement Learning from Human Feedback (RLHF) to fine-tune a pre-trained GPT model
• From the paper: "In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters."
29
Figure from https://arxiv.org/pdf/2203.02155.pdf
Memory Usage of LLMs
How to store a large language model in memory?
– full precision: 32-bit floats
– half precision: 16-bit floats
– Using half precision not only reduces memory, it also speeds up computation

Model              Megatron-LM    GPT-3
# parameters       8.3 billion    175 billion
full precision     30 GB          651 GB
half precision     15 GB          325 GB
GPU / TPU max memory (shown on the slide for comparison)
30
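To make the arithmetic concrete, here is a small Python sketch (my own illustration, not from the slides) that computes these storage requirements from the parameter count and the bytes per parameter:

def param_memory_gib(num_params, bytes_per_param):
    # Memory (in GiB) needed just to store the parameters.
    return num_params * bytes_per_param / 2**30

for name, n in [("Megatron-LM", 8.3e9), ("GPT-3", 175e9)]:
    full = param_memory_gib(n, 4)   # full precision: 32-bit floats = 4 bytes
    half = param_memory_gib(n, 2)   # half precision: 16-bit floats = 2 bytes
    print(f"{name}: ~{full:.0f} GiB at full precision, ~{half:.0f} GiB at half precision")

This prints roughly 31 / 15 GiB for Megatron-LM and 652 / 326 GiB for GPT-3, matching the table up to rounding; it ignores the additional memory needed for activations and optimizer state.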
Distributed Training: Model Parallel
There are a variety of different options for how to distribute the model computation / parameters across multiple devices:
• Matrix multiplication comprises most of the Transformer LM computation and can be divided along rows/columns of the respective matrices.
• The most natural division is by layer: each device computes a subset of the layers, and only that device stores the parameters and computation graph for those layers.
• A more efficient solution is to divide computation by token and layer. This requires careful division of work and is specific to the Transformer LM.
31
Figure from https://arxiv.org/pdf/2102.07988.pdf
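As a small illustration of the first option (a NumPy sketch with my own toy dimensions, not the Megatron-LM implementation), a linear layer whose weight matrix is split row-wise across two "devices" reproduces the unsplit matrix multiply exactly:

import numpy as np

rng = np.random.default_rng(0)
A, B = 8, 6                        # input and output dimensions
a = rng.normal(size=(A,))          # input activation
W = rng.normal(size=(B, A))        # full weight matrix

W0, W1 = np.split(W, 2, axis=0)    # each "device" holds half of the output rows
b0 = W0 @ a                        # partial output computed on device 0
b1 = W1 @ a                        # partial output computed on device 1
b_parallel = np.concatenate([b0, b1])

assert np.allclose(b_parallel, W @ a)   # same result as the single-device layer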
Cost to train
32
Figure from https://arxiv.org/pdf/2203.15556.pdf
Timeline: Language Modeling
[Timeline figure: n-grams (2000), RNN-LMs (2010), Transformer LMs (2017), ELMo / BERT / GPT (2018), GPT-2 / RoBERTa (2019), GPT-3 (2020), InstructGPT / LaMDA (2021), PaLM / ChatGPT / BLOOM (2022), Llama / GPT-4 / Falcon / Mistral (2023)]
33
Timeline: Image Generation
[Timeline figure: LeNet (1998), ImageNet (2009), Pascal VOC (2010), AlexNet (2012), VAEs (2013), VGG / R-CNN / GANs (2014), diffusion models / ResNet (2015), Transformer (2017), DDPM (2020), Vision Transformer / DALL-E / CLIP (2021), DALL-E 2 / Imagen / Stable Diffusion (2022), SDXL / SDXL Turbo (2023)]
34
Why learn the inner workings of GenAI? (a metaphor)
37
Figure from https://www.astonmartin.com/en/
Figure from https://daily.jstor.org/the-science-of-traffic/
Figure from https://earthobservatory.nasa.gov/images/149321/2021-continued-earths-warming-trend
40
Figure from https://www.energy.gov/eere/vehicles/fact-617-april-5-2010-changes-vehicles-capita-around-world
Figure from GHSA
41
Figure from https://www.businesswire.com/news/home/20210624005926/en/Strategy-Analytics-Half-the-World-Owns-a-Smartphone
43
Figure from https://www.npr.org/2024/01/16/1224913698/teslas-chicago-charging-extreme-cold
GENERATIVE AI IS PROBABILISTIC MODELING
45
GenAI is Probabilistic Modeling
p(xt+1 | x1, …, xt)
46
What if I want to model EVERY possible interaction? (RNN-LMs)
47
RNN Language Model
50
Syllabus Highlights
http://423.mlcourse.org
https://www.cs.cmu.edu/~mgormley/courses/10423/
http://623.mlcourse.org
51
Syllabus Highlights
• Grading: 40% homework, 10% quizzes, 20% exam, 25% project, 5% participation
• Exam: in-class exam, Wed, Mar. 27
• Homework: 5 assignments
  – 6 grace days for homework assignments
  – Late submissions: 75% day 1, 50% day 2, 25% day 3
  – No submissions accepted after 3 days w/o extension
  – Extension requests: for emergency situations, see syllabus
• Recitations: Fridays, same time/place as lecture (optional, interactive sessions)
• Readings: required, online PDFs, recommended for after lecture
• Technologies:
  – Piazza (discussion)
  – Gradescope (homework)
  – Google Forms (polls)
  – Zoom (livestream)
  – Panopto (video recordings)
• Academic Integrity:
  – Collaboration encouraged, but must be documented
  – Solutions must always be written independently
  – No re-use of found code / past assignments
  – Severe penalties (i.e., failure)
  – (Policies differ from 10-301/10-601)
• Office Hours: posted on Google Calendar on "Office Hours" page
52
Lectures
• You should ask lots of questions
  – Interrupting (by raising a hand) to ask your question is strongly encouraged
  – Asking questions later (or in real time) on Piazza is also great
• When I ask a question…
  – I want you to answer
  – Even if you don't answer, think it through as though I'm about to call on you
• Interaction improves learning (both in-class and at my office hours)
53
Prerequisites
What they are:
• Introductory machine learning (i.e. 10-301, 10-315, 10-601, 10-701).
• If you instead took an introduction to deep learning course (i.e. 11-485/11-685/11-785), that is also fine.

What is not required:
• Deep learning
• PyTorch

Depending on which prerequisite course you took and in which semester you took it, you may or may not have been exposed to deep learning and/or PyTorch. Either way is fine.
54
Homework
There will be 5 homework assignments during the semester. The assignments will consist of both conceptual and programming problems.

        Main Topic Area                 Implementation                                  Application             Type
HW0     PyTorch Primer                  image classifier + text classifier             vision + language       written + programming
HW1     Large Language Models           TransformerLM with sliding window attn.        char-level text gen     written + programming
HW2     Image Generation                GAN or diffusion model                          image infilling         written + programming
HW3     Adapters for LLMs               Llama + LoRA                                    code + chat             written + programming
HW4     Multimodal Foundation Models    text-to-image model                             vision + language       written + programming
HW623   (10-623 only)                   read / analyze a recent genAI research paper                           video presentation
55
Project
• Goals:
  – Explore a generative modeling technique of your choosing
  – Deeper understanding of methods in real-world application
  – Work in teams of 3 students
56
Textbooks
57
Where can I find…?
58
Where can I find…?
59
Where can I find…?
60
Reminders
• Homework 0: PyTorch + Weights & Biases
  – Out: Wed, Jan 17
  – Due: Wed, Jan 24 at 11:59pm
  – Two parts:
    1. written part to Gradescope
    2. programming part to Gradescope
  – Unique policy for this assignment: we will grant (essentially) any and all extension requests
62
Learning Objectives
You should be able to…
1. Differentiate between different mechanisms of learning such as parameter tuning and in-context learning.
2. Implement the foundational models underlying modern approaches to generative modeling, such as transformers and diffusion models.
3. Apply existing models to real-world generation problems for text, code, images, audio, and video.
4. Employ techniques for adapting foundation models to tasks such as fine-tuning, adapters, and in-context learning.
5. Enable methods for generative modeling to scale up to large datasets of text, code, or images.
6. Use existing generative models to solve real-world discriminative problems and for other everyday use cases.
7. Analyze the theoretical properties of foundation models at scale.
8. Identify potential pitfalls of generative modeling for different modalities.
9. Describe societal impacts of large-scale generative AI systems.
64
Q&A
65
MODULE-BASED AUTOMATIC
DIFFERENTIATION
66
Backpropagation
Automatic Differentiation – Reverse Mode (aka. Backpropagation)

Forward Computation
1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a directed acyclic graph, where each variable is a node (i.e. the "computation graph").
2. Visit each node in topological order.
   For variable u_i with inputs v_1, …, v_N:
   a. Compute u_i = g_i(v_1, …, v_N)
   b. Store the result at the node

Backward Computation (Version A)
1. Initialize dy/dy = 1.
2. Visit each node v_j in reverse topological order.
   Let u_1, …, u_M denote all the nodes with v_j as an input.
   Assuming that y = h(u) = h(u_1, …, u_M) and u = g(v), or equivalently u_i = g_i(v_1, …, v_j, …, v_N) for all i:
   a. We already know dy/du_i for all i
   b. Compute dy/dv_j = Σ_{i=1}^{M} (dy/du_i)(du_i/dv_j)
      (Choice of algorithm ensures computing du_i/dv_j is easy)
69
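As a concrete illustration of the two passes (my own toy example, not from the slides), take y = sin(x1·x2) + x1 and walk the computation graph forward and then backward:

import math

# Forward computation: evaluate y, storing the value at every node.
x1, x2 = 2.0, 3.0
a = x1 * x2            # node a = g(x1, x2)
b = math.sin(a)        # node b = g(a)
y = b + x1             # output node

# Backward computation: visit nodes in reverse topological order.
dy_dy = 1.0
dy_db = dy_dy * 1.0                  # y = b + x1
dy_da = dy_db * math.cos(a)          # b = sin(a)
dy_dx1 = dy_da * x2 + dy_dy * 1.0    # x1 feeds both a and y, so the two paths sum
dy_dx2 = dy_da * x1

print(dy_dx1, dy_dx2)  # matches x2*cos(x1*x2) + 1 and x1*cos(x1*x2)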
A Recipe for Machine Learning
1. Given training data:
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal:
4. Train with SGD: (take small steps opposite the gradient)

Gradients: Backpropagation can compute this gradient! And it's a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently!
70
Backpropagation: Abstract Picture
… Output → (F) Loss: J = −Σ_{k=1}^{K} y_k^* log(ŷ_k)
73
Module-based AutoDiff
• Key Idea:
  – componentize the computation of the neural-network into layers
  – each layer consolidates multiple real-valued nodes in the computation graph (a subset of them) into one vector-valued node (aka. a module)
• Each module is capable of two actions:
  1. Forward computation of output b = [b_1, …, b_B] given input a = [a_1, …, a_A] via some differentiable function f. That is, b = f(a).
  2. Backward computation of the gradient of the input g_a = ∇_a J = [∂J/∂a_1, …, ∂J/∂a_A] given the gradient of the output g_b = ∇_b J = [∂J/∂b_1, …, ∂J/∂b_B], where J is the final real-valued output of the entire computation graph. This is done via the chain rule ∂J/∂a_i = Σ_{j=1}^{B} (∂J/∂b_j)(∂b_j/∂a_i) for all i ∈ {1, …, A}.
[Diagram: a module with input a and its gradient g_a, and output b and its gradient g_b, flowing to the final loss J]
74
74
Module-based AutoDiff
Dimensions: input a ∈ R^A, output b ∈ R^B, gradient of input g_a ≜ ∇_a J ∈ R^A, and gradient of output g_b ≜ ∇_b J ∈ R^B.

Sigmoid Module: The sigmoid layer has only one input vector a. Below, σ is the sigmoid applied element-wise, and ⊙ is element-wise multiplication s.t. u ⊙ v = [u_1 v_1, …, u_M v_M].
procedure SIGMOID_FORWARD(a):
  b = σ(a)
  return b
procedure SIGMOID_BACKWARD(a, b, g_b):
  g_a = g_b ⊙ b ⊙ (1 − b)
  return g_a

Softmax Module: The softmax layer has only one input vector a. For any vector v ∈ R^D, diag(v) returns a D × D diagonal matrix whose diagonal entries are v_1, v_2, …, v_D and whose non-diagonal entries are zero.
procedure SOFTMAX_FORWARD(a):
  b = softmax(a)
  return b
procedure SOFTMAX_BACKWARD(a, b, g_b):
  g_a = g_b^T (diag(b) − b b^T)
  return g_a

Linear Module: The linear layer has two inputs: a vector a and parameters ω ∈ R^{B×A}. The output b is not used by LINEAR_BACKWARD, but we pass it in for consistency of form.
procedure LINEAR_FORWARD(a, ω):
  b = ω a
  return b
procedure LINEAR_BACKWARD(a, ω, b, g_b):
  g_ω = g_b a^T
  g_a = ω^T g_b
  return g_ω, g_a

Cross-Entropy Module: The cross-entropy layer has two inputs: a gold one-hot vector a and a predicted probability distribution â. Its output b ∈ R is a scalar. Below, ÷ is element-wise division. The output b is not used by CROSSENTROPY_BACKWARD, but we pass it in for consistency of form.
procedure CROSSENTROPY_FORWARD(a, â):
  b = −a^T log â
  return b
procedure CROSSENTROPY_BACKWARD(a, â, b, g_b):
  g_â = −g_b (a ÷ â)
  return g_â
75
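The same four modules fit in a few lines of NumPy (my own sketch, not the course's reference code; vectors are 1-D arrays and ω is a 2-D array):

import numpy as np

def linear_forward(a, w):              # b = ω a
    return w @ a

def linear_backward(a, w, b, g_b):     # g_ω = g_b a^T ,  g_a = ω^T g_b
    return np.outer(g_b, a), w.T @ g_b

def sigmoid_forward(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_backward(a, b, g_b):       # g_a = g_b ⊙ b ⊙ (1 − b)
    return g_b * b * (1.0 - b)

def softmax_forward(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def softmax_backward(a, b, g_b):       # (diag(b) − b b^T) is symmetric
    return (np.diag(b) - np.outer(b, b)) @ g_b

def crossentropy_forward(a, a_hat):    # b = −a^T log â
    return -a @ np.log(a_hat)

def crossentropy_backward(a, a_hat, b, g_b):
    return -g_b * (a / a_hat)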
Module-based AutoDiff

Algorithm 1 Forward Computation
procedure NNFORWARD(Training example (x, y), Parameters α, β):
  a = LINEAR_FORWARD(x, α)
  z = SIGMOID_FORWARD(a)
  b = LINEAR_FORWARD(z, β)
  ŷ = SOFTMAX_FORWARD(b)
  J = CROSSENTROPY_FORWARD(y, ŷ)
  o = object(x, a, z, b, ŷ, J)
  return intermediate quantities o

Algorithm 2 Backpropagation
procedure NNBACKWARD(Training example (x, y), Parameters α, β, Intermediates o):
  Place intermediate quantities x, a, z, b, ŷ, J in o in scope
  g_J = dJ/dJ = 1                                ▷ Base case
  g_ŷ = CROSSENTROPY_BACKWARD(y, ŷ, J, g_J)
  g_b = SOFTMAX_BACKWARD(b, ŷ, g_ŷ)
  g_β, g_z = LINEAR_BACKWARD(z, b, g_b)
  g_a = SIGMOID_BACKWARD(a, z, g_z)
  g_α, g_x = LINEAR_BACKWARD(x, a, g_a)          ▷ We discard g_x separately
  return parameter gradients g_α, g_β

Advantages of Module-based AutoDiff:
1. Easy to reuse / adapt for other models
2. Encapsulated layers are easier to optimize (e.g. implement in C++ or CUDA)
3. Easier to find bugs because we can run a finite-difference check on each layer
76
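Continuing the NumPy sketch above (still my own illustration, reusing the module functions defined there), Algorithms 1 and 2 compose directly, and advantage 3 can be demonstrated with a finite-difference check on one parameter:

import numpy as np

def nn_forward(x, y, alpha, beta):
    a = linear_forward(x, alpha)
    z = sigmoid_forward(a)
    b = linear_forward(z, beta)
    y_hat = softmax_forward(b)
    J = crossentropy_forward(y, y_hat)
    return dict(x=x, a=a, z=z, b=b, y_hat=y_hat, J=J)

def nn_backward(x, y, alpha, beta, o):
    g_J = 1.0                                                    # base case
    g_yhat = crossentropy_backward(y, o["y_hat"], o["J"], g_J)
    g_b = softmax_backward(o["b"], o["y_hat"], g_yhat)
    g_beta, g_z = linear_backward(o["z"], beta, o["b"], g_b)
    g_a = sigmoid_backward(o["a"], o["z"], g_z)
    g_alpha, g_x = linear_backward(x, alpha, o["a"], g_a)        # g_x is discarded
    return g_alpha, g_beta

rng = np.random.default_rng(1)
x, y = rng.normal(size=4), np.eye(3)[0]                 # one training example, one-hot label
alpha, beta = rng.normal(size=(5, 4)), rng.normal(size=(3, 5))
o = nn_forward(x, y, alpha, beta)
g_alpha, g_beta = nn_backward(x, y, alpha, beta, o)

eps = 1e-6                                              # finite-difference check on alpha[0, 0]
alpha_p, alpha_m = alpha.copy(), alpha.copy()
alpha_p[0, 0] += eps
alpha_m[0, 0] -= eps
numeric = (nn_forward(x, y, alpha_p, beta)["J"] - nn_forward(x, y, alpha_m, beta)["J"]) / (2 * eps)
assert abs(numeric - g_alpha[0, 0]) < 1e-5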
Module-based AutoDiff (OOP Version)
Object-Oriented Implementation:
– Let each module be an object
– Then let the control flow dictate the creation of the computation graph
– No longer need to implement NNBackward(·), just follow the computation graph in reverse topological order

class Sigmoid(Module)
  method forward(a)
    b = σ(a)
    return b
  method backward(a, b, gb)
    ga = gb ⊙ b ⊙ (1 − b)
    return ga

class Linear(Module)
  method forward(a, ω)
    b = ω a
    return b
  method backward(a, ω, b, gb)
    gω = gb a^T
    ga = ω^T gb
    return gω, ga
78
Module-based AutoDiff (OOP Version)

global tape = stack()

class NeuralNetwork(Module):
  method init()
    lin1_layer = Linear()
    sig_layer = Sigmoid()
    lin2_layer = Linear()
    soft_layer = Softmax()
    ce_layer = CrossEntropy()

  method forward(Tensor x, Tensor y, Tensor α, Tensor β)
    a = lin1_layer.apply_fwd(x, α)
    z = sig_layer.apply_fwd(a)
    b = lin2_layer.apply_fwd(z, β)
    ŷ = soft_layer.apply_fwd(b)
    J = ce_layer.apply_fwd(y, ŷ)
    return J.out_tensor

  method backward(Tensor x, Tensor y, Tensor α, Tensor β)
    tape_bwd()
    return lin1_layer.in_gradients[1], lin2_layer.in_gradients[1]

class Module:
  method init()
    out_tensor = null
    out_gradient = 1

  method apply_fwd(List in_modules)
    in_tensors = [x.out_tensor for x in in_modules]
    out_tensor = forward(in_tensors)
    tape.push(self)
    return self

  method apply_bwd()
    in_gradients = backward(in_tensors, out_tensor, out_gradient)
    for i in 1, …, len(in_modules):
      in_modules[i].out_gradient += in_gradients[i]
    return self

function tape_bwd():
  while len(tape) > 0
    m = tape.pop()
    m.apply_bwd()
79
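Here is a small runnable Python version of the tape idea (my own sketch: the method names mirror the pseudocode above, but the two example modules and all other details are illustrative):

import numpy as np

tape = []   # global tape: modules are pushed in the order their forward pass runs

class Tensor:
    # Wraps an input value so gradients can accumulate on it.
    def __init__(self, value):
        self.out_tensor = np.asarray(value, dtype=float)
        self.out_gradient = np.zeros_like(self.out_tensor)

class Multiply:
    # Example module: element-wise product of two inputs.
    def apply_fwd(self, x, y):
        self.in_modules = [x, y]
        self.out_tensor = x.out_tensor * y.out_tensor
        self.out_gradient = np.zeros_like(self.out_tensor)
        tape.append(self)
        return self
    def apply_bwd(self):
        x, y = self.in_modules
        x.out_gradient += self.out_gradient * y.out_tensor   # d(x*y)/dx = y
        y.out_gradient += self.out_gradient * x.out_tensor   # d(x*y)/dy = x

class Sum:
    # Example module: sum of all entries of its input.
    def apply_fwd(self, x):
        self.in_modules = [x]
        self.out_tensor = x.out_tensor.sum()
        self.out_gradient = 0.0
        tape.append(self)
        return self
    def apply_bwd(self):
        (x,) = self.in_modules
        x.out_gradient += self.out_gradient * np.ones_like(x.out_tensor)

def tape_bwd():
    tape[-1].out_gradient = 1.0     # seed dJ/dJ = 1 at the final module
    while tape:
        tape.pop().apply_bwd()      # pop = reverse topological order

# Usage: J = sum(a * b), so dJ/da = b and dJ/db = a.
a, b = Tensor([1.0, 2.0]), Tensor([3.0, 4.0])
J = Sum().apply_fwd(Multiply().apply_fwd(a, b))
tape_bwd()
print(a.out_gradient, b.out_gradient)   # [3. 4.] and [1. 2.]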
PyTorch
The same simple neural network we defined in pseudocode can also be defined in PyTorch.
81
Example adapted from https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html
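Since the slide's code appears only as an image in this transcript, here is a sketch of what the same two-layer network could look like in PyTorch (the class and dimension names are my own assumptions):

import torch
import torch.nn as nn

class NeuralNetwork(nn.Module):
    def __init__(self, num_features, num_hidden, num_classes):
        super().__init__()
        self.lin1 = nn.Linear(num_features, num_hidden)
        self.lin2 = nn.Linear(num_hidden, num_classes)

    def forward(self, x):
        a = self.lin1(x)
        z = torch.sigmoid(a)
        b = self.lin2(z)
        return b        # logits; nn.CrossEntropyLoss applies the softmax internally

model = NeuralNetwork(num_features=4, num_hidden=5, num_classes=3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(8, 4)                  # a batch of 8 examples
y = torch.randint(0, 3, (8,))          # integer class labels
J = loss_fn(model(x), y)
J.backward()                           # autograd plays its tape in reverse
print(model.lin1.weight.grad.shape)    # torch.Size([5, 4])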
PyTorch
Q: Why don't we call linear.forward() in PyTorch?
A: This is just syntactic sugar. There's a special method in Python, __call__, that allows you to define what happens when you treat an object as if it were a function.
82
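A tiny illustration of that mechanism (plain Python, not PyTorch's actual internals, which also run hooks before delegating to forward):

class Linear:
    def forward(self, a):
        return 2 * a            # stand-in for ω a

    def __call__(self, a):
        return self.forward(a)  # treating the object as a function calls forward()

linear = Linear()
print(linear(3))                # 6 — linear(3) invokes __call__, which calls forward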
PyTorch
Q: Why don't we pass in the parameters to a PyTorch Module?
A: In PyTorch, you store the parameters inside the Module and "mark" them as parameters that should contribute to the eventual gradient used by an optimizer.
83
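For example, a minimal sketch of that registration mechanism (the module and dimensions are made up for illustration):

import torch
import torch.nn as nn

class MyLinear(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        # nn.Parameter "marks" the tensor: it is registered on the module,
        # returned by .parameters(), and updated by the optimizer.
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim))

    def forward(self, a):
        return self.weight @ a

layer = MyLinear(4, 3)
optimizer = torch.optim.SGD(layer.parameters(), lr=0.1)
loss = layer(torch.randn(4)).sum()
loss.backward()
optimizer.step()                # uses the gradient stored on layer.weight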
BACKGROUND:
N-GRAM LANGUAGE MODELS
96
n-Gram Language Model
• Goal: Generate realistic looking sentences in a human language
• Key Idea: condition on the last n-1 words to sample the nth word
[Figure: six weighted dice, one per conditional distribution: p(· | START), p(· | START, The), p(· | The, bat), p(· | bat, made), p(· | made, noise), p(· | noise, at)]
97
The Chain Rule of Probability
Question: How can we define a probability distribution over a sequence of length T?
The bat made noise at night
 w1  w2   w3    w4   w5  w6
[Figure: the chain rule factorization p(w1, …, wT) = p(w1) p(w2 | w1) p(w3 | w1, w2) ⋯ p(wT | w1, …, wT−1), with each word conditioned on all previous words]
103
Learning an n-Gram Model
Question: How do we learn the probabilities for the n-Gram Model?
Answer: From data! Just count n-gram frequencies

…the cows eat grass…
…our cows eat hay daily…
…factory-farm cows eat corn…
…on an organic farm, cows eat hay and…
…do your cows eat grass or corn?...
…what do cows eat if they have…
…cows eat corn when there is no…
…which cows eat which foods depends…
…if cows eat grass…
…when cows eat corn their stomachs…
…should we let cows eat corn?...

p(wt | wt-2 = cows, wt-1 = eat):
  wt       p(· | ·, ·)
  corn     4/11
  grass    3/11
  hay      2/11
  if       1/11
  which    1/11
104
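A minimal sketch of that counting procedure (my own illustration; the toy corpus below stands in for real data):

from collections import Counter, defaultdict

corpus = "the cows eat grass and the cows eat hay and the cows eat corn".split()

# Estimate p(w_t | w_{t-2}, w_{t-1}) = count(w_{t-2}, w_{t-1}, w_t) / count(w_{t-2}, w_{t-1}).
trigram_counts = defaultdict(Counter)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    trigram_counts[(w1, w2)][w3] += 1

def p(w, context):
    counts = trigram_counts[context]
    return counts[w] / sum(counts.values())

print(p("grass", ("cows", "eat")))   # 1/3 in this toy corpus
print(p("hay", ("cows", "eat")))     # 1/3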
Sampling from a Language Model
Question: How do we sample from a Language Model?
Answer:
1. Treat each probability distribution like a (50k-sided) weighted die
2. Pick the die corresponding to p(wt | wt-2, wt-1)
3. Roll that die and generate whichever word wt lands face up
4. Repeat
[Figure: six weighted dice, one per conditional distribution: p(· | START), p(· | START, The), p(· | The, bat), p(· | bat, made), p(· | made, noise), p(· | noise, at)]
105
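Continuing the toy trigram sketch above (again an illustration, not course code), the "weighted die" is just a categorical draw over the counts:

import random

def sample_continuation(w1, w2, num_words=10):
    words = [w1, w2]
    for _ in range(num_words):
        counts = trigram_counts[(words[-2], words[-1])]   # pick the right die
        if not counts:                                    # unseen context: stop early
            break
        candidates, weights = zip(*counts.items())
        words.append(random.choices(candidates, weights=weights)[0])  # roll it
    return " ".join(words)

print(sample_continuation("the", "cows"))   # e.g. "the cows eat hay and the cows eat corn"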
Sampling from a Language Model
Question: How do we sample from a Language Model?
Answer:
1. Treat each probability distribution like a (50k-sided) weighted die
2. Pick the die corresponding to p(wt | wt-2, wt-1)
3. Roll that die and generate whichever word wt lands face up
4. Repeat

Training Data (Shakespeare):
I tell you, friends, most charitable care Have the patricians of you. For your wants, Your suffering in this dearth, you may as well Strike at the heaven with your staves as lift them Against the Roman state, whose course will on The way it takes, cracking ten thousand curbs Of more strong link asunder than can ever Appear in your impediment. For the dearth, The gods, not the patricians, make it, and Your knees to them, not arms, must help.

5-Gram Model:
Approacheth, denay. dungy Thither! Julius think: grant,--O Yead linens, sheep's Ancient, Agreed: Petrarch plaguy Resolved pear! observingly honourest adulteries wherever scabbard guess; affirmation--his monsieur; died. jealousy, chequins me. Daphne building. weakness: sun-rise, cannot stays carry't, unpurposed. prophet-like drink; back-return 'gainst surmise Bridget ships? wane; interim? She's striving wet;
106
RECURRENT NEURAL NETWORK (RNN)
LANGUAGE MODELS
107
Recurrent Neural Networks (RNNs)
inputs: x = (x1, x2, …, xT), xi ∈ R^I
hidden units: h = (h1, h2, …, hT), hi ∈ R^J
outputs: y = (y1, y2, …, yT), yi ∈ R^K
nonlinearity: H

Definition of the RNN: a recurrent neural network (RNN) computes the hidden vector sequence h = (h1, …, hT) and output vector sequence y = (y1, …, yT) by iterating the following equations from t = 1 to T:
  h_t = H(W_xh x_t + W_hh h_{t−1} + b_h)    (1)
  y_t = W_hy h_t + b_y                      (2)
where the W terms denote weight matrices (e.g. W_xh is the input-hidden weight matrix), the b terms denote bias vectors (e.g. b_h is the hidden bias vector) and H is the hidden layer function. H is usually an elementwise application of a sigmoid function. However we have found that the Long Short-Term Memory (LSTM) architecture [11], which uses purpose-built memory cells to store information, is better at finding and exploiting long range context. Fig. 1 illustrates a single LSTM memory cell. For the version of LSTM used in this paper [12] H is implemented by the following composite function:
  i_t = σ(W_xi x_t + W_hi h_{t−1} + W_ci c_{t−1} + b_i)    (3)

[Figure: RNN unrolled over 5 time steps, with inputs x1…x5, hidden states h1…h5, and outputs y1…y5]
108
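In code, the recurrence in Eqs. (1)–(2) is only a few lines; here is a NumPy sketch of the vanilla RNN with tanh as the nonlinearity H (dimensions chosen arbitrarily for illustration):

import numpy as np

I, J, K, T = 4, 8, 3, 5                    # input, hidden, output dims; sequence length
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(J, I))
W_hh = rng.normal(size=(J, J))
W_hy = rng.normal(size=(K, J))
b_h, b_y = np.zeros(J), np.zeros(K)

x = rng.normal(size=(T, I))                # input sequence x_1, ..., x_T
h = np.zeros(J)                            # h_0
ys = []
for t in range(T):
    h = np.tanh(W_xh @ x[t] + W_hh @ h + b_h)   # Eq. (1)
    ys.append(W_hy @ h + b_y)                   # Eq. (2)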
Recall… The Chain Rule of Probability
Question: How can we define a probability distribution over a sequence of length T?
The bat made noise at night
 w1  w2   w3    w4   w5  w6
h0 h1 h2 h3 h4 h5 h6
Key Idea:
(1) convert all previous words to a fixed length vector
(2) define distribution p(wt | fθ(wt-1, …, w1)) that conditions on the vector ht = fθ(wt-1, …, w1)
111
RNN Language Model
The
p(w1|h1)
h0
START
Key Idea:
(1) convert all previous words to a fixed length vector
(2) define distribution p(wt | fθ(wt-1, …, w1)) that conditions on
the vector ht = fθ(wt-1, …, w1) 112
RNN Language Model
bat
p(w2|h2)
h0 h1
START The
Key Idea:
(1) convert all previous words to a fixed length vector
(2) define distribution p(wt | fθ(wt-1, …, w1)) that conditions on
the vector ht = fθ(wt-1, …, w1) 113
RNN Language Model
made
p(w3|h3)
h0 h1 h2
Key Idea:
(1) convert all previous words to a fixed length vector
(2) define distribution p(wt | fθ(wt-1, …, w1)) that conditions on
the vector ht = fθ(wt-1, …, w1) 114
RNN Language Model
noise
p(w4|h4)
h0 h1 h2 h3
Key Idea:
(1) convert all previous words to a fixed length vector
(2) define distribution p(wt | fθ(wt-1, …, w1)) that conditions on
the vector ht = fθ(wt-1, …, w1) 115
RNN Language Model
at
p(w5|h5)
h0 h1 h2 h3 h4
Key Idea:
(1) convert all previous words to a fixed length vector
(2) define distribution p(wt | fθ(wt-1, …, w1)) that conditions on
the vector ht = fθ(wt-1, …, w1) 116
RNN Language Model
night
p(w6|h6)
h0 h1 h2 h3 h4 h5
Question: How can we create a distribution p(wt|ht) from ht?
Answer: pass ht through a linear layer followed by a softmax, i.e. p(wt | ht) = softmax(W ht + b)
Key Idea:
(1) convert all previous words to a fixed length vector
(2) define distribution p(wt | fθ(wt-1, …, w1)) that conditions on the vector ht = fθ(wt-1, …, w1)
117
RNN Language Model
END
p(w7|h7)
h0 h1 h2 h3 h4 h5 h6
Key Idea:
(1) convert all previous words to a fixed length vector
(2) define distribution p(wt | fθ(wt-1, …, w1)) that conditions on
the vector ht = fθ(wt-1, …, w1) 118
RNN Language Model
h0 h1 h2 h3 h4 h5 h6
p(w1, w2, w3, …, wT) = p(w1 | h1) p(w2 | h2) … p(wT | hT)
119
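Putting the pieces together, ancestral sampling from an RNN-LM can be sketched as follows (my own illustration: E is an assumed word-embedding matrix, and the softmax output layer matches the answer suggested above):

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def sample_rnn_lm(E, W_xh, W_hh, W_hy, b_h, b_y, start_id, end_id, max_len=20, seed=0):
    # Feed each sampled word back in as the next input (ancestral sampling).
    rng = np.random.default_rng(seed)
    h = np.zeros(W_hh.shape[0])                      # h_0
    w, words = start_id, []
    for _ in range(max_len):
        h = np.tanh(W_xh @ E[w] + W_hh @ h + b_h)    # h_t summarizes all previous words
        p = softmax(W_hy @ h + b_y)                  # p(w_t | h_t)
        w = rng.choice(len(p), p=p)                  # roll the (vocab-sized) weighted die
        if w == end_id:
            break
        words.append(w)
    return words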
Sampling from a Language Model
Question: How do we sample from a Language Model?
Answer:
1. Treat each probability distribution like a (50k-sided) weighted die
2. Pick the die corresponding to p(wt | wt-2, wt-1)
3. Roll that die and generate whichever word wt lands face up
4. Repeat
)
ise
e)
)
de
Th
no
at)
t)
ma
T)
ba
T,
,
de
ise
AR
AR
e,
t,
ma
no
ba
Th
ST
ST
·|
·|
·|
·|
·|
·|
p(
p(
p(
p(
p(
p(
The same approach to
START The bat made sampling
noise we atused for
nightan n-
Gram Language Model also
works here for an RNN
Language Model
120
Sampling from an RNN-LM
??                                                            ??
Which is the real Shakespeare?!

VIOLA: Why, Salisbury must find his flesh and thought That which I am not aps, not a man and in fire, To show the reining of the raven and the wars To grace my hand reproach within, and not a fair are hand, That Caesar and my goodly father's world; When I was heaven of presence and our fleets, We spare with hours, but cut thy council I am great, Murdered and by thy master's ready there My power to give thee but so much as hell: Some service in the noble bondman here, Would show him to her wine.

KING LEAR: O, if you were a feeble sight, the courtesy of your law, Your sight and several breath, will wear the gods With his heads, and my hands are wonder'd at the deeds, So drop upon your lordship's head, and your opinion Shall be against your honour.

CHARLES: Marry, do I, sir; and I came to acquaint you with a matter. I am given, sir, secretly to understand that your younger brother Orlando hath a disposition to come in disguised against me to try a fall. To-morrow, sir, I wrestle for my credit; and he that escapes me without some broken limb shall acquit him well. Your brother is but young and tender; and, for your love, I would be loath to foil him, as I must, for my own honour, if he come in: therefore, out of my love to you, I came hither to acquaint you withal, that either you might stay him from his intendment or brook such disgrace well as he shall run into, in that it is a thing of his own search and altogether against my will.

TOUCHSTONE: For my part, I had rather bear with you than bear you; yet I should bear no cross if I did bear you, for I think you have no money in your purse.
121
Example from http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Sampling from an RNN-LM
[The same two passages as the previous slide, shown side by side; the column headers on this slide read "Shakespeare's As You Like It" and "RNN-LM Sample".]
122
Example from http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Sampling from an RNN-LM
[The same two passages again; the column headers on this slide read "RNN-LM Sample" and "Shakespeare's As You Like It".]
123
Example from http://karpathy.github.io/2015/05/21/rnn-effectiveness/