
10-423/10-623 Generative AI

Machine Learning Department


School of Computer Science
Carnegie Mellon University

Course Overview
+ AutoDiff + RNN-LMs
Matt Gormley
Lecture 1
Jan. 17, 2024

1
Generative AI Full Course 2024

01. Course Overview + AutoDiff + RNN-LMs


02. Transformer Language Models
03. Learning Large Language Models (Pre-training, fine-tuning, decoding)
04. Pretraining vs. finetuning + Modern Transformers (RoPE, GQA, Longformer) + CNNs
05. Vision Transformers + Generative Adversarial Networks (GANs)
06. Generative Adversarial Networks (GANs)
07. Diffusion Models 01
08. Diffusion Models 02
09. Variational Autoencoders (VAEs)
10. In-context Learning
11. Parameter Efficient Fine-Tuning
12. Instruction Fine-tuning + Reinforcement Learning with Human Feedback (RLHF)

WHAT IS GENERATIVE AI?

2
Artificial Intelligence

The basic goal of AI is to develop intelligent machines.

This consists of many sub-goals:
• Perception
• Reasoning
• Control / Motion / Manipulation
• Planning
• Communication
• Creativity
• Learning

[Diagram: nested circles showing GenAI within Deep Learning within Machine Learning within Artificial Intelligence]

3
Artificial Intelligence

Q: What does Generative AI have to do with any of these goals?

A: It's making in-roads into all of them.

11
Artificial Intelligence: Communication

• Communication comprises the comprehension and generation of human language.
• Large language models (LLMs) excel at both.
• (Even though they are most often trained autoregressively, i.e. to generate a next word, given the previous ones.)

12
Artificial Intelligence: Learning

• The traditional way of learning in ML is via parameter estimation.
• But in-context learning (i.e. providing training examples as context at test time) shows that learning can also be done via inference.

13
Artificial Intelligence: Reasoning

• LLMs are also (unexpectedly) good at certain reasoning tasks.
• cf. Chain-of-Thought Prompting (an example of in-context learning).

14
Artificial Intelligence: Planning

• LLMs are already being used for grounded planning for embodied agents, cf. LLM-Planner.

15
Artificial Intelligence: Creativity

• Text-to-image models [Midjourney's Discord server has 18 million members (1.7 million were online this morning)]
• Text-to-music models [MusicGen is capable of conditioning on text and an audio sample]

16
"Deep Style" from https://deepdreamgenerator.com/#gallery
Artificial Intelligence: Perception

• Multimodal foundation models learn to answer questions about images (and text in images).
• Diffusion models can be used as zero-shot classifiers.

17
Artificial Intelligence: Control / Motion / Manipulation

• DayDreamer learns a generative model of experiences for RL, i.e. a World Model, without simulation.
• A quadruped robot learns to walk in under 1 hour.

18
EXAMPLE OF GENERATIVE AI

20
Text Generation
Prompt: Write a proof of the fact that there are infinitely many primes; do it in the style of a Shakespeare play through a dialogue between two parties arguing over the proof.

GPT-4: Enter FIRST PARTY and SECOND PARTY.

FIRST PARTY: Thou dost not understand, my friend, There simply is no upper end To primes, those numbers indivisible, No matter if it seems impossible.

SECOND PARTY: But how dost thou so surely know? To prove this truth, what canst thou show?

FIRST PARTY: Consider this, my doubtful peer, A clever proof that will make clear: Assume that there's a final prime, The largest one we'll see in time. Let's call this number "P" for short, And then, my friend, we will purport To build a number "Q" that's new, One larger than "P" and prime too.

SECOND PARTY: But how, my friend, will this be done? The proof is not yet fairly won.

FIRST PARTY: It's simple, friend, just wait and see, We'll multiply the primes, that's key. All primes from two to "P" combined, Will make this new "Q" we will find. But one thing more we must include, To make the proof more watertight, We'll add a one onto the end, And now the proof is in our sight.

SECOND PARTY: I see it now, it's crystal clear, For "Q" must be a prime, my dear. It can't divide by any prime, All smaller than itself in time. And since we chose "P" as the last, This proves that our assumption's past. There is no final prime, we see, The numbers go on endlessly.

21
Example from https://arxiv.org/pdf/2303.12712.pdf
Image Editing
A variety of tasks involve automatic editing of an image:
• Inpainting fills in the (pre-specified) missing pixels
• Colorization restores color to a greyscale image
• Uncropping creates a photo-realistic reconstruction of a missing side of an image

22
Figure from Saharia et al. (2022)
Text-to-Image Generation
• Given a text description, sample an image that depicts the prompt
• The following images are samples from SDXL with refinement

Prompt: close up headshot, futuristic old man, wild hair sly smile in front of gigantic UFO, dslr, sharp focus, dynamic composition, rule of thirds

23
Figure from https://stablediffusionweb.com/
Music Generation

MusicGen
• A transformer decoder model over quantized units (discrete elements of a codebook of audio frames)
• Interleaves sounds by adjusting how codebooks attend to each other
• Permits conditioning on text and/or audio samples

24
Figure from https://arxiv.org/pdf/2306.05284.pdf
Code Generation

25
Example from https://arxiv.org/pdf/2303.12712.pdf
Video Generation
• Latent diffusion models use a low-dimensional latent space for efficiency
• Key question: how to generate multiple correlated frames?
• 'Align your latents' inserts temporal convolution / attention between each spatial convolution / attention
• 'Preserve Your Own Correlation' includes temporally correlated noise

26
Figure from https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt
SCALING UP

27
Training Data for LLMs

The Pile:
• An open source dataset for training language models
• Comprised of 22 smaller datasets
• Favors high quality text
• 825 GB ≈ 1.2 trillion tokens

28
RLHF
• InstructGPT uses Reinforcement Learning with Human Feedback (RLHF) to fine-tune a pre-trained GPT model
• From the paper: "In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters."

29
Figure from https://arxiv.org/pdf/2203.02155.pdf
Memory Usage of LLMs
How to store a large language model in memory?
– full precision: 32-bit floats
– half precision: 16-bit floats
– Using half precision not only reduces memory, it also speeds up GPU computation
– "Peak float16 matrix multiplication and convolution performance is 16x faster than peak float32 performance on A100 GPUs." (from the PyTorch docs)

Model             Megatron-LM    GPT-3
# parameters      8.3 billion    175 billion
full precision    30 GB          651 GB
half precision    15 GB          325 GB

GPU / TPU             Max Memory
TPU v2                16 GB
TPU v3/v4             32 GB
Tesla V100 GPU        32 GB
NVIDIA RTX A6000      48 GB
Tesla A100 GPU        80 GB

30
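The table's numbers follow directly from parameter count × bytes per parameter. A minimal sketch of that arithmetic (the only assumption is that the table's "GB" means 2^30 bytes):

    # Rough memory footprint of the model weights alone (no activations,
    # gradients, or optimizer state), matching the table above.
    BYTES_PER_PARAM = {"full precision (fp32)": 4, "half precision (fp16)": 2}

    def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
        """Memory needed to store the weights, in GB (here: 2**30 bytes)."""
        return num_params * bytes_per_param / 2**30

    for name, n in [("Megatron-LM", 8.3e9), ("GPT-3", 175e9)]:
        for precision, nbytes in BYTES_PER_PARAM.items():
            print(f"{name:12s} {precision:22s} {weight_memory_gb(n, nbytes):7.1f} GB")

Running this matches the 30 / 15 GB and 651 / 325 GB figures in the table up to rounding.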
Distributed Training: Model Parallel

There are a variety of different options for how to distribute the model computation / parameters across multiple devices:
• Matrix multiplication comprises most Transformer LM computation and can be divided along rows/columns of the respective matrices.
• The most natural division is by layer: each device computes a subset of the layers, and only that device stores the parameters and computation graph for those layers.
• A more efficient solution is to divide computation by token and layer. This requires careful division of work and is specific to the Transformer LM.
31
Figure from https://arxiv.org/pdf/2102.07988.pdf
Cost to train

32
Figure from https://arxiv.org/pdf/2203.15556.pdf
Timeline: Language Modeling

[Timeline figure, 2000–2023: n-grams, RNN-LMs, Transformer LMs, ELMo, BERT, GPT, GPT-2, RoBERTa, GPT-3, InstructGPT, LaMDA, PaLM, ChatGPT, BLOOM, Llama, GPT-4, Falcon, Mistral]

33
Timeline: Image Generation

[Timeline figure, 1998–2023: LeNet, ImageNet, Pascal VOC, AlexNet, VAEs, VGG, R-CNN, GANs, diffusion models, ResNet, Transformer, DDPM, Vision Transformer, DALL-E, CLIP, DALL-E 2, Imagen, Stable Diffusion, SDXL, SDXL Turbo]

34
Why learn the inner-workings of GenAI?

(a metaphor)
37
Figure from https://www.astonmartin.com/en/
Figure from https://daily.jstor.org/the-science-of-traffic/
Figure from https://earthobservatory.nasa.gov/images/149321/2021-continued-earths-warming-trend

40
Figure from https://www.energy.gov/eere/vehicles/fact-617-april-5-2010-changes-vehicles-capita-around-world
Figure from GHSA

41
Figure from https://www.businesswire.com/news/home/20210624005926/en/Strategy-Analytics-Half-the-World-Owns-a-Smartphone
43
Figure from https://www.npr.org/2024/01/16/1224913698/teslas-chicago-charging-extreme-cold
GENERATIVE AI IS PROBABILISTIC MODELING

45
GenAI is Probabilistic Modeling

p(xt+1 | x1 , . . . , xt )

46
What if I want to model EVERY possible interaction?

…or at least the interactions of the current variable with all those that came before it…

(RNN-LMs)
47
RNN Language Model

RNN Language Model:

p(w1, w2, w3, … , w6) =


The p(w1)
The bat p(w2 | fθ(w1))
The bat made p(w3 | fθ(w2, w1))
The bat made noise p(w4 | fθ(w3, w2, w1))
The bat made noise at p(w5 | fθ(w4, w3, w2, w1))
The bat made noise at night p(w6 | fθ(w5, w4, w3, w2, w1))
Key Idea:
(1) convert all previous words to a fixed length vector
(2) define distribution p(wt | fθ(wt-1, …, w1)) that conditions on
the vector 48
Topics
• Generative models of text
  – RNN LMs / Autodiff
  – Transformer LMs
  – Pre-training, fine-tuning, evaluation, decoding
• Generative models of images
  – CNNs / Transformers for vision
  – GANs, Conditional GANs
  – Diffusion models
  – VAEs / Evaluation
• Applying and adapting foundation models
  – Reinforcement learning with human feedback (RLHF)
  – Parameter-efficient fine tuning
  – In-context learning for text
  – In-context learning for vision
• Multimodal foundation models
  – Text-to-image generation
  – Aligning multimodal representations (CLIP)
  – Visual-language foundation models
• Scaling up
  – Efficient decoding strategies
  – Distributed training / multi-GPU or TPU
  – Scaling laws and data
• What can go wrong?
  – Safety/bias/fairness, Hallucinations, Adversarial attacks (e.g., prompt injection)
  – Cheating – how to watermark, Legal issues, e.g., copyright, ...
  – Drift in performance, Data contamination, Lack of ground truth
• Advanced Topics
  – Normalizing flows
  – Audio understanding and synthesis
  – Video synthesis
49
SYLLABUS HIGHLIGHTS

50
Syllabus Highlights

The syllabus is located on the course webpage:

http://423.mlcourse.org
https://www.cs.cmu.edu/~mgormley/courses/10423/
http://623.mlcourse.org

The course policies are required reading.

51
Syllabus Highlights
• Grading: 40% homework, 10% quizzes, 20% exam, 25% project, 5% participation
• Exam: in-class exam, Wed, Mar. 27
• Homework: 5 assignments
  – 6 grace days for homework assignments
  – Late submissions: 75% day 1, 50% day 2, 25% day 3
  – No submissions accepted after 3 days w/o extension
  – Extension requests: for emergency situations, see syllabus
• Recitations: Fridays, same time/place as lecture (optional, interactive sessions)
• Readings: required, online PDFs, recommended for after lecture
• Technologies: Piazza (discussion), Gradescope (homework), Google Forms (polls), Zoom (livestream), Panopto (video recordings)
• Academic Integrity:
  – Collaboration encouraged, but must be documented
  – Solutions must always be written independently
  – No re-use of found code / past assignments
  – Severe penalties (i.e., failure)
  – (Policies differ from 10-301/10-601)
• Office Hours: posted on Google Calendar on "Office Hours" page

52
Lectures
• You should ask lots of questions
– Interrupting (by raising a hand) to ask your question is strongly
encouraged
– Asking questions later (or in real time) on Piazza is also great
• When I ask a question…
– I want you to answer
– Even if you don’t answer, think it through as though I’m about to
call on you
• Interaction improves learning (both in-class and at my office
hours)

53
Prerequisites
What they are:
• Introductory machine learning (i.e. 10-301, 10-315, 10-601, 10-701)
• If you instead took an introduction to deep learning course (i.e. 11-485/11-685/11-785), that is also fine

What is not required:
• Deep learning
• PyTorch

Depending on which prerequisite course you took and in which semester you took it, you may or may not have been exposed to deep learning and/or PyTorch. Either way is fine.

54
Homework
There will be 5 homework assignments during the semester. The assignments will consist of both conceptual and programming problems.

Assignments (main topic area: implementation; application; type):
HW0 (PyTorch Primer): image classifier + text classifier; vision + language; written + programming
HW1 (Large Language Models): TransformerLM with sliding window attn.; char-level text gen; written + programming
HW2 (Image Generation): GAN or diffusion model; image infilling; written + programming
HW3 (Adapters for LLMs): Llama + LoRA; code + chat; written + programming
HW4 (Multimodal Foundation Models): text-to-image model; vision + language; written + programming
HW623 (10-623 only): read / analyze a recent genAI research paper; video presentation

55
Project
• Goals:
– Explore a generative
modeling technique of your
choosing
– Deeper understanding of
methods in real-world
application
– Work in teams of 3 students

56
Textbooks

…do not exist for this course.

Instead, we will be directing your reading time to current research papers.

57
Where can I find…?

58
Reminders
• Homework 0: PyTorch + Weights & Biases
– Out: Wed, Jan 17
– Due: Wed, Jan 24 at 11:59pm
– Two parts:
1. written part to Gradescope
2. programming part to Gradescope
– unique policy for this assignment: we will grant (essentially) any
and all extension requests

62
Learning Objectives
You should be able to…
1. Differentiate between different mechanisms of learning such as parameter tuning and
in-context learning.
2. Implement the foundational models underlying modern approaches to generative
modeling, such as transformers and diffusion models.
3. Apply existing models to real-world generation problems for text, code, images, audio,
and video.
4. Employ techniques for adapting foundation models to tasks such as fine-tuning,
adapters, and in-context learning.
5. Enable methods for generative modeling to scale-up to large datasets of text, code, or
images.
6. Use existing generative models to solve real-world discriminative problems and for
other everyday use cases.
7. Analyze the theoretical properties of foundation models at scale.
8. Identify potential pitfalls of generative modeling for different modalities.
9. Describe societal impacts of large-scale generative AI systems.
64
Q&A
65
MODULE-BASED AUTOMATIC
DIFFERENTIATION

66
Backpropagation
Automatic Differentiation – Reverse Mode (aka. Backpropagation)
Forward Computation
1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a
directed acyclic graph, where each variable is a node (i.e. the “computation
graph”)
2. Visit each node in topological order.
For variable ui with inputs v1,…, vN
a. Compute ui = gi(v1,…, vN)
b. Store the result at the node
Backward Computation (Version A)
1. Initialize dy/dy = 1.
2. Visit each node vj in reverse topological order.
Let u1,…, uM denote all the nodes with vj as an input
Assuming that y = h(u) = h(u1,…, uM)
and u = g(v) or equivalently ui = gi(v1,…, vj,…, vN) for all i
a. We already know dy/dui for all i
b. Compute dy/dvj = Σi (dy/dui)(dui/dvj)
(Choice of algorithm ensures computing (dui/dvj) is easy)

Return partial derivatives dy/dui for all variables 67


Backpropagation
Automatic Differentiation – Reverse Mode (aka. Backpropagation)
Forward Computation
1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a
directed acyclic graph, where each variable is a node (i.e. the “computation
graph”)
2. Visit each node in topological order.
For variable ui with inputs v1,…, vN
a. Compute ui = gi(v1,…, vN)
b. Store the result at the node
Backward Computation (Version B)
1. Initialize all partial derivatives dy/duj to 0 and dy/dy = 1.
2. Visit each node in reverse topological order.
For variable ui = gi(v1,…, vN)
a. We already know dy/dui
b. Increment dy/dvj by (dy/dui)(dui/dvj)
(Choice of algorithm ensures computing (dui/dvj) is easy)

Return partial derivatives dy/dui for all variables 68
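As a tiny worked illustration of Version B (not from the slides), consider y = x1*x2 + sin(x1); the backward pass initializes every partial to 0 and increments dy/dvj once per outgoing edge:

    import math

    # Forward pass: evaluate y = x1*x2 + sin(x1), storing every intermediate node.
    x1, x2 = 2.0, 3.0
    u1 = x1 * x2        # u1 = g1(x1, x2)
    u2 = math.sin(x1)   # u2 = g2(x1)
    y = u1 + u2

    # Backward pass (Version B): dy/dy = 1, all other partials start at 0;
    # visit nodes in reverse topological order, incrementing dy/dvj by
    # (dy/dui)(dui/dvj) for every edge vj -> ui.
    dy_dy, dy_du1, dy_du2, dy_dx1, dy_dx2 = 1.0, 0.0, 0.0, 0.0, 0.0

    # node y = u1 + u2
    dy_du1 += dy_dy * 1.0
    dy_du2 += dy_dy * 1.0
    # node u2 = sin(x1)
    dy_dx1 += dy_du2 * math.cos(x1)
    # node u1 = x1 * x2
    dy_dx1 += dy_du1 * x2   # x1 feeds two nodes, so its partial accumulates twice
    dy_dx2 += dy_du1 * x1

    print(dy_dx1, dy_dx2)   # x2 + cos(x1) ≈ 2.584, and x1 = 2.0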


Backpropagation
Why is the backpropagation algorithm efficient?
1. Reuses computation from the forward pass in the backward pass
2. Reuses partial derivatives throughout the backward pass (but
only if the algorithm reuses shared computation in the forward
pass)

(Key idea: partial derivatives in the backward pass should be


thought of as variables stored for reuse)

69
A Recipe for Machine Learning

1. Given training data
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal
4. Train with SGD: (take small steps opposite the gradient)

Gradients: Backpropagation can compute this gradient! And it's a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently!

70
Backpropagation: Abstract Picture

Network, from input to loss:
(A) Input: given xi, ∀i
(B) Hidden (linear): aj = Σi=0..M αji xi, ∀j
(C) Hidden (nonlinear): zj = σ(aj), ∀j
(D) Output (linear): bk = Σj=0..D βkj zj, ∀k
(E) Output (softmax): ŷk = exp(bk) / Σl=1..K exp(bl)
(F) Loss: J = −Σk=1..K yk log(ŷk)

Forward:
1. a = αx
2. z = σ(a)
3. b = βz
4. ŷ = softmax(b)
5. J = −yT log ŷ

Backward:
6. gŷ = −y ÷ ŷ
7. gb = gŷT (diag(ŷ) − ŷŷT)
8. gβ = gbT zT
9. gz = βT gbT
10. ga = gz ⊙ z ⊙ (1 − z)
11. gα = ga xT

71
Backpropagation: Procedural Method

Algorithm 1 Forward Computation
1: procedure NNFORWARD(Training example (x, y), Params α, β)
2:   a = αx
3:   z = σ(a)
4:   b = βz
5:   ŷ = softmax(b)
6:   J = −yT log ŷ
7:   o = object(x, a, z, b, ŷ, J)
8:   return intermediate quantities o

Algorithm 2 Backpropagation
1: procedure NNBACKWARD(Training example (x, y), Params α, β, Intermediates o)
2:   Place intermediate quantities x, a, z, b, ŷ, J in o in scope
3:   gŷ = −y ÷ ŷ
4:   gb = gŷT (diag(ŷ) − ŷŷT)
5:   gβ = gbT zT
6:   gz = βT gbT
7:   ga = gz ⊙ z ⊙ (1 − z)
8:   gα = ga xT
9:   return parameter gradients gα, gβ

Drawbacks of the Procedural Method:
1. Hard to reuse / adapt for other models
2. (Possibly) harder to make individual steps more efficient
3. Hard to find the source of error if a finite-difference check reports an error (since it tells you only that there is an error somewhere in those 17 lines of code)

72
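To make the two procedures concrete, here is a minimal NumPy sketch (a hypothetical implementation written for this note, not the course's reference code); it follows steps 1–11 of the abstract picture, with α, β as weight matrices and y as a one-hot label:

    import numpy as np

    def nn_forward(x, y, alpha, beta):
        """Forward computation; returns all intermediate quantities."""
        a = alpha @ x                          # (B) hidden, linear
        z = 1.0 / (1.0 + np.exp(-a))           # (C) hidden, sigmoid
        b = beta @ z                           # (D) output, linear
        y_hat = np.exp(b) / np.exp(b).sum()    # (E) softmax
        J = -y @ np.log(y_hat)                 # (F) cross-entropy loss
        return {"x": x, "y": y, "a": a, "z": z, "b": b, "y_hat": y_hat, "J": J}

    def nn_backward(o, alpha, beta):
        """Backpropagation; returns gradients of J w.r.t. alpha and beta."""
        x, y, z, y_hat = o["x"], o["y"], o["z"], o["y_hat"]
        g_yhat = -y / y_hat                                        # step 6
        g_b = g_yhat @ (np.diag(y_hat) - np.outer(y_hat, y_hat))   # step 7
        g_beta = np.outer(g_b, z)                                  # step 8
        g_z = beta.T @ g_b                                         # step 9
        g_a = g_z * z * (1 - z)                                    # step 10
        g_alpha = np.outer(g_a, x)                                 # step 11
        return g_alpha, g_beta

    # Tiny usage example: 2 inputs, 3 hidden units, 2 classes.
    rng = np.random.default_rng(0)
    alpha, beta = rng.normal(size=(3, 2)), rng.normal(size=(2, 3))
    x, y = rng.normal(size=2), np.array([1.0, 0.0])
    o = nn_forward(x, y, alpha, beta)
    print(o["J"], [g.shape for g in nn_backward(o, alpha, beta)])

These twenty-odd lines are exactly the kind of monolithic code the drawbacks above refer to: a finite-difference check can tell you the gradient is wrong, but not which line is at fault.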
Module-based AutoDiff
Module-based automatic differentiation (AD / Autodiff) is a technique that has
long been used to develop libraries for deep learning
• Dynamic neural network packages allow a specification of the computation
graph dynamically at runtime
– PyTorch http://pytorch.org
– Torch http://torch.ch
– DyNet https://dynet.readthedocs.io
– TensorFlow with Eager Execution https://www.tensorflow.org
• Static neural network packages require a static specification of a
computation graph which is subsequently compiled into code
– TensorFlow with Graph Execution https://www.tensorflow.org
– Aesara (and Theano) https://aesara.readthedocs.io
– (These libraries are also module-based, but herein by “module-based AD” we mean the
dynamic approach)

73
Module-based AutoDiff
• Key Idea:
  – componentize the computation of the neural network into layers
  – each layer consolidates multiple real-valued nodes in the computation graph (a subset of them) into one vector-valued node (aka. a module)
• Each module is capable of two actions:
  1. Forward computation of output b = [b1, …, bB] given input a = [a1, …, aA] via some differentiable function f. That is, b = f(a).
  2. Backward computation of the gradient of the input ga = ∇a J = [∂J/∂a1, …, ∂J/∂aA] given the gradient of the output gb = ∇b J = [∂J/∂b1, …, ∂J/∂bB], where J is the final real-valued output of the entire computation graph. This is done via the chain rule ∂J/∂ai = Σj=1..B (∂J/∂bj)(dbj/dai) for all i ∈ {1, …, A}.

74
Module-based AutoDiff

Dimensions: input a ∈ R^A, output b ∈ R^B, gradient of input ga ≜ ∇a J ∈ R^A, and gradient of output gb ≜ ∇b J ∈ R^B.

Sigmoid Module: The sigmoid layer has only one input vector a. Below σ is the sigmoid applied element-wise, and ⊙ is element-wise multiplication s.t. u ⊙ v = [u1 v1, …, uM vM].
1: procedure SIGMOIDFORWARD(a)
2:   b = σ(a)
3:   return b
4: procedure SIGMOIDBACKWARD(a, b, gb)
5:   ga = gb ⊙ b ⊙ (1 − b)
6:   return ga

Softmax Module: The softmax layer has only one input vector a. For any vector v ∈ R^D, diag(v) returns a D × D diagonal matrix whose diagonal entries are v1, v2, …, vD and whose non-diagonal entries are zero.
1: procedure SOFTMAXFORWARD(a)
2:   b = softmax(a)
3:   return b
4: procedure SOFTMAXBACKWARD(a, b, gb)
5:   ga = gbT (diag(b) − bbT)
6:   return ga

Linear Module: The linear layer has two inputs: a vector a and parameters ω ∈ R^{B×A}. The output b is not used by LINEARBACKWARD, but we pass it in for consistency of form.
1: procedure LINEARFORWARD(a, ω)
2:   b = ωa
3:   return b
4: procedure LINEARBACKWARD(a, ω, b, gb)
5:   gω = gb aT
6:   ga = ωT gb
7:   return gω, ga

Cross-Entropy Module: The cross-entropy layer has two inputs: a gold one-hot vector a and a predicted probability distribution â. Its output b ∈ R is a scalar. Below ÷ is element-wise division. The output b is not used by CROSSENTROPYBACKWARD, but we pass it in for consistency of form.
1: procedure CROSSENTROPYFORWARD(a, â)
2:   b = −aT log â
3:   return b
4: procedure CROSSENTROPYBACKWARD(a, â, b, gb)
5:   gâ = −gb (a ÷ â)
6:   return gâ

75
Module-based AutoDiff

Algorithm 1 Forward Computation
1: procedure NNFORWARD(Training example (x, y), Parameters α, β)
2:   a = LINEARFORWARD(x, α)
3:   z = SIGMOIDFORWARD(a)
4:   b = LINEARFORWARD(z, β)
5:   ŷ = SOFTMAXFORWARD(b)
6:   J = CROSSENTROPYFORWARD(y, ŷ)
7:   o = object(x, a, z, b, ŷ, J)
8:   return intermediate quantities o

Algorithm 2 Backpropagation
1: procedure NNBACKWARD(Training example (x, y), Parameters α, β, Intermediates o)
2:   Place intermediate quantities x, a, z, b, ŷ, J in o in scope
3:   gJ = dJ/dJ = 1    ▹ Base case
4:   gŷ = CROSSENTROPYBACKWARD(y, ŷ, J, gJ)
5:   gb = SOFTMAXBACKWARD(b, ŷ, gŷ)
6:   gβ, gz = LINEARBACKWARD(z, b, gb)
7:   ga = SIGMOIDBACKWARD(a, z, gz)
8:   gα, gx = LINEARBACKWARD(x, a, ga)    ▹ We discard gx
9:   return parameter gradients gα, gβ

Advantages of Module-based AutoDiff:
1. Easy to reuse / adapt for other models
2. Encapsulated layers are easier to optimize (e.g. implement in C++ or CUDA)
3. Easier to find bugs because we can run a finite-difference check on each layer separately

76
Module-based AutoDiff (OOP Version)
Object-Oriented Implementation:
– Let each module be an object
– Then allow the control flow to dictate the creation of the computation graph
– No longer need to implement NNBackward(·), just follow the computation graph in reverse topological order

class Sigmoid(Module)
  method forward(a)
    b = σ(a)
    return b
  method backward(a, b, gb)
    ga = gb ⊙ b ⊙ (1 − b)
    return ga

class Linear(Module)
  method forward(a, ω)
    b = ωa
    return b
  method backward(a, ω, b, gb)
    gω = gb aT
    ga = ωT gb
    return gω, ga

class Softmax(Module)
  method forward(a)
    b = softmax(a)
    return b
  method backward(a, b, gb)
    ga = gbT (diag(b) − bbT)
    return ga

class CrossEntropy(Module)
  method forward(a, â)
    b = −aT log â
    return b
  method backward(a, â, b, gb)
    gâ = −gb (a ÷ â)
    return gâ

77
Module-based AutoDiff (OOP Version)

class NeuralNetwork(Module):

  method init()
    lin1_layer = Linear()
    sig_layer = Sigmoid()
    lin2_layer = Linear()
    soft_layer = Softmax()
    ce_layer = CrossEntropy()

  method forward(Tensor x, Tensor y, Tensor α, Tensor β)
    a = lin1_layer.apply_fwd(x, α)
    z = sig_layer.apply_fwd(a)
    b = lin2_layer.apply_fwd(z, β)
    ŷ = soft_layer.apply_fwd(b)
    J = ce_layer.apply_fwd(y, ŷ)
    return J.out_tensor

  method backward(Tensor x, Tensor y, Tensor α, Tensor β)
    tape_bwd()
    return lin1_layer.in_gradients[1], lin2_layer.in_gradients[1]

78
Module-based AutoDiff (OOP Version)

global tape = stack()

class Module:

  method init()
    out_tensor = null
    out_gradient = 1

  method apply_fwd(List in_modules)
    in_tensors = [x.out_tensor for x in in_modules]
    out_tensor = forward(in_tensors)
    tape.push(self)
    return self

  method apply_bwd():
    in_gradients = backward(in_tensors, out_tensor, out_gradient)
    for i in 1, ..., len(in_modules):
      in_modules[i].out_gradient += in_gradients[i]
    return self

function tape_bwd():
  while len(tape) > 0
    m = tape.pop()
    m.apply_bwd()

79
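The tape-based pseudocode above can be exercised end-to-end with a self-contained Python toy (an illustration written for this note, not the course's starter code): every module pushes itself onto a global tape during the forward pass, and tape_bwd() pops modules in reverse order while accumulating gradients into their inputs.

    import math

    tape = []  # global tape: modules in the order their forward pass ran

    class Module:
        def apply_fwd(self, *inputs):
            self.inputs = inputs                       # parent Modules
            self.out = self.forward(*[m.out for m in inputs])
            self.grad = 0.0                            # dJ/d(out), accumulated later
            tape.append(self)
            return self

        def apply_bwd(self):
            for parent, g in zip(self.inputs, self.backward(self.grad)):
                parent.grad += g                       # increment rule ("Version B")

    class Constant(Module):
        def __init__(self, value):
            self.out, self.grad, self.inputs = value, 0.0, ()
        def backward(self, g):
            return ()

    class Mul(Module):
        def forward(self, a, b): return a * b
        def backward(self, g):
            a, b = (m.out for m in self.inputs)
            return (g * b, g * a)

    class Sin(Module):
        def forward(self, a): return math.sin(a)
        def backward(self, g): return (g * math.cos(self.inputs[0].out),)

    class Add(Module):
        def forward(self, a, b): return a + b
        def backward(self, g): return (g, g)

    def tape_bwd():
        tape[-1].grad = 1.0        # base case: dJ/dJ = 1
        for m in reversed(tape):
            m.apply_bwd()

    # y = x1*x2 + sin(x1), the same toy function as before
    x1, x2 = Constant(2.0), Constant(3.0)
    y = Add().apply_fwd(Mul().apply_fwd(x1, x2), Sin().apply_fwd(x1))
    tape_bwd()
    print(x1.grad, x2.grad)        # x2 + cos(x1) ≈ 2.584, and 2.0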
PyTorch
The same simple neural network we defined in pseudocode can also be defined in PyTorch.

81
Example adapted from https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html
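The PyTorch code on this slide appears only as a screenshot; a rough sketch of what such a definition typically looks like (a reconstruction following the pseudocode network above, not the exact code from the slide or the linked tutorial):

    import torch
    from torch import nn

    class NeuralNetwork(nn.Module):
        """Linear -> Sigmoid -> Linear, trained with softmax + cross-entropy."""
        def __init__(self, num_features: int, num_hidden: int, num_classes: int):
            super().__init__()
            self.lin1 = nn.Linear(num_features, num_hidden)
            self.sig = nn.Sigmoid()
            self.lin2 = nn.Linear(num_hidden, num_classes)

        def forward(self, x):
            # nn.CrossEntropyLoss applies the softmax internally, so forward()
            # returns the pre-softmax scores b.
            return self.lin2(self.sig(self.lin1(x)))

    model = NeuralNetwork(num_features=2, num_hidden=3, num_classes=2)
    loss_fn = nn.CrossEntropyLoss()

    x = torch.randn(1, 2)                  # one training example
    y = torch.tensor([0])                  # gold label
    J = loss_fn(model(x), y)               # forward pass records the tape
    J.backward()                           # reverse-mode autodiff
    print(model.lin1.weight.grad.shape)    # gradient w.r.t. the first linear layer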
PyTorch
Q: Why don’t we call linear.forward() in PyTorch?
This is just syntactic sugar. There’s a special method in Python
A: __call__ that allows you to define what happens when you treat
an object as if it were a function.

In other words, running the following:


linear(x)
is equivalent to running:
linear.__call__(x)
which in PyTorch is (nearly) the same as running:
linear.forward(x)

This is because PyTorch defines every Module’s __call__ method


to be something like this:
def __call__(self):
self.forward()

82
PyTorch
Q: Why don’t we pass in the parameters to a PyTorch Module?

A: This just makes your code cleaner.

In PyTorch, you store the parameters inside the Module and “mark”
them as parameters that should contribute to the eventual gradient
used by an optimizer

83
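For example (a small illustration, not from the slides), a hand-rolled linear layer registers its weights with nn.Parameter, which is what lets model.parameters() hand them to an optimizer:

    import torch
    from torch import nn

    class MyLinear(nn.Module):
        def __init__(self, in_features: int, out_features: int):
            super().__init__()
            # nn.Parameter marks the tensor as a learnable parameter of the Module,
            # so it appears in .parameters() and receives a .grad after backward().
            self.weight = nn.Parameter(torch.randn(out_features, in_features))

        def forward(self, x):
            return x @ self.weight.T

    layer = MyLinear(4, 2)
    optimizer = torch.optim.SGD(layer.parameters(), lr=0.1)  # finds self.weight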
BACKGROUND:
N-GRAM LANGUAGE MODELS

96
n-Gram Language Model
• Goal: Generate realistic looking sentences in a human language
• Key Idea: condition on the last n-1 words to sample the nth word

[Figure: each word of "START The bat made noise at night" is sampled from a conditional distribution over the preceding words: p(· | START), p(· | START, The), p(· | The, bat), p(· | bat, made), p(· | made, noise), p(· | noise, at).]

97
The Chain Rule of Probability
Question: How can we define a probability distribution over a
sequence of length T?
The bat made noise at night

w1 w2 w3 w4 w5 w6

p(w1, w2, w3, … , w6) =


The p(w1)
The bat p(w2 | w1)
The bat made p(w3 | w2, w1)
The bat made noise p(w4 | w3, w2, w1)
The bat made noise at p(w5 | w4, w3, w2, w1)
The bat made noise at night p(w6 | w5, w4, w3, w2, w1) 98
The Chain Rule of Probability
Question: How can we define a probability distribution over a
sequence of length T?
The bat made noise at night

w1 w2 w3 w4 w5 w6

Chain rule of probability:

p(w1, w2, w3, … , w6) =

The                              p(w1)
The bat                          p(w2 | w1)
The bat made                     p(w3 | w2, w1)
The bat made noise               p(w4 | w3, w2, w1)
The bat made noise at            p(w5 | w4, w3, w2, w1)
The bat made noise at night      p(w6 | w5, w4, w3, w2, w1)

Note: This is called the chain rule because it is always true for every probability distribution.

99
n-Gram Language Model
Question: How can we define a probability distribution over a
sequence of length T?
The bat made noise at night

w1 w2 w3 w4 w5 w6

n-Gram Model (n=2)

p(w1, w2, w3, … , w6) =


The p(w1)
The bat p(w2 | w1)
bat made p(w3 | w2)
made noise p(w4 | w3)
noise at p(w5 | w4)
at night p(w6 | w5) 100
n-Gram Language Model
Question: How can we define a probability distribution over a
sequence of length T?
The bat made noise at night

w1 w2 w3 w4 w5 w6

n-Gram Model (n=3)

p(w1, w2, w3, … , w6) =


The p(w1)
The bat p(w2 | w1)
The bat made p(w3 | w2, w1)
bat made noise p(w4 | w3, w2)
made noise at p(w5 | w4, w3)
noise at night p(w6 | w5, w4) 101
n-Gram Language Model
Question: How can we define a probability distribution over a
sequence of length T?
The bat made noise at night

w1 w2 w3 w4 w5 w6

n-Gram Model (n=3)

p(w1, w2, w3, … , w6) =


The                  p(w1)
The bat              p(w2 | w1)
The bat made         p(w3 | w2, w1)
bat made noise       p(w4 | w3, w2)
made noise at        p(w5 | w4, w3)
noise at night       p(w6 | w5, w4)

Note: This is called a model because we made some assumptions about how many previous words to condition on (i.e. only n-1 words).

102
Learning an n-Gram Model
Question: How do we learn the probabilities for the n-Gram
Model?

p(wt | wt-2 = The, wt-1 = bat)
  ate      0.015
  …
  flies    0.046
  …
  zebra    0.000

p(wt | wt-2 = made, wt-1 = noise)
  at         0.020
  …
  pollution  0.030
  …
  zebra      0.000

p(wt | wt-2 = cows, wt-1 = eat)
  corn    0.420
  …
  grass   0.510
  …
  zebra   0.000
103
Learning an n-Gram Model
Question: How do we learn the probabilities for the n-Gram
Model?
Answer: From data! Just count n-gram frequencies
p(wt | wt-2 = cows, wt-1 = eat)
  corn   4/11
  grass  3/11
  hay    2/11
  if     1/11
  which  1/11

Training snippets:
…the cows eat grass…
…our cows eat hay daily…
…factory-farm cows eat corn…
…on an organic farm, cows eat hay and…
…do your cows eat grass or corn?...
…what do cows eat if they have…
…cows eat corn when there is no…
…which cows eat which foods depends…
…if cows eat grass…
…when cows eat corn their stomachs…
…should we let cows eat corn?...
104
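A minimal sketch of this counting procedure (illustrative code written for this note, not from the slides): estimate p(wt | wt-2, wt-1) by normalizing trigram counts, producing fractions of exactly the kind shown above, here on a four-sentence toy corpus:

    from collections import Counter, defaultdict

    def train_trigram_lm(sentences):
        """Estimate p(w_t | w_{t-2}, w_{t-1}) from raw trigram counts."""
        counts = defaultdict(Counter)
        for sent in sentences:
            words = ["START", "START"] + sent.split() + ["END"]
            for w2, w1, w in zip(words, words[1:], words[2:]):
                counts[(w2, w1)][w] += 1
        # normalize each context's counts into a probability distribution
        return {ctx: {w: c / sum(ctr.values()) for w, c in ctr.items()}
                for ctx, ctr in counts.items()}

    corpus = ["the cows eat grass", "our cows eat hay daily",
              "factory-farm cows eat corn", "if cows eat grass"]
    lm = train_trigram_lm(corpus)
    print(lm[("cows", "eat")])   # {'grass': 0.5, 'hay': 0.25, 'corn': 0.25}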
Sampling from a Language Model
Question: How do we sample from a Language Model?
Answer:
1. Treat each probability distribution like a (50k-sided) weighted die
2. Pick the die corresponding to p(wt | wt-2, wt-1)
3. Roll that die and generate whichever word wt lands face up
4. Repeat

[Figure: a row of weighted dice, one per conditional distribution p(· | START), p(· | START, The), p(· | The, bat), p(· | bat, made), p(· | made, noise), p(· | noise, at), generating "START The bat made noise at night" one word at a time.]

105
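Continuing the counting sketch from the previous section (still illustrative), the "pick the die and roll it" loop is just sampling from the estimated conditional distribution and feeding the sampled word back in as context:

    import random

    def sample_sentence(lm, max_len=20):
        """Autoregressively roll the weighted die p(. | w_{t-2}, w_{t-1})."""
        words = ["START", "START"]
        while len(words) < max_len + 2:
            dist = lm[(words[-2], words[-1])]   # 2. pick the die for this context
            wt, = random.choices(list(dist), weights=list(dist.values()))  # 3. roll it
            if wt == "END":
                break
            words.append(wt)                    # 4. repeat with the new context
        return " ".join(words[2:])

    print(sample_sentence(lm))   # assumes `lm` from the trigram-counting sketch above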
Sampling from a Language Model
Question: How do we sample from a Language Model?
Answer:
1. Treat each probability distribution like a (50k-sided) weighted die
2. Pick the die corresponding to p(wt | wt-2, wt-1)
3. Roll that die and generate whichever word wt lands face up
4. Repeat
Training Data (Shakespeare):
I tell you, friends, most charitable care
Have the patricians of you. For your wants,
Your suffering in this dearth, you may as well
Strike at the heaven with your staves as lift them
Against the Roman state, whose course will on
The way it takes, cracking ten thousand curbs
Of more strong link asunder than can ever
Appear in your impediment. For the dearth,
The gods, not the patricians, make it, and
Your knees to them, not arms, must help.

5-Gram Model sample:
Approacheth, denay. dungy Thither! Julius think: grant,--O Yead linens, sheep's Ancient, Agreed: Petrarch plaguy Resolved pear! observingly honourest adulteries wherever scabbard guess; affirmation--his monsieur; died. jealousy, chequins me. Daphne building. weakness: sun-rise, cannot stays carry't, unpurposed. prophet-like drink; back-return 'gainst surmise Bridget ships? wane; interim? She's striving wet;

106
RECURRENT NEURAL NETWORK (RNN)
LANGUAGE MODELS

107
Recurrent Neural Networks (RNNs)

inputs: x = (x1, x2, …, xT), xi ∈ R^I
hidden units: h = (h1, h2, …, hT), hi ∈ R^J
outputs: y = (y1, y2, …, yT), yi ∈ R^K
nonlinearity: H

Definition of the RNN: a recurrent neural network (RNN) computes the hidden vector sequence h = (h1, …, hT) and output vector sequence y = (y1, …, yT) by iterating the following equations from t = 1 to T:

  ht = H(Wxh xt + Whh ht-1 + bh)    (1)
  yt = Why ht + by                  (2)

where the W terms denote weight matrices (e.g. Wxh is the input-hidden weight matrix), the b terms denote bias vectors (e.g. bh is the hidden bias vector) and H is the hidden layer function. H is usually an elementwise application of a sigmoid function. However we have found that the Long Short-Term Memory (LSTM) architecture [11], which uses purpose-built memory cells to store information, is better at finding and exploiting long range context. Fig. 1 illustrates a single LSTM memory cell. For the version of LSTM used in this paper [12] H is implemented by the following composite function:

  it = σ(Wxi xt + Whi ht-1 + Wci ct-1 + bi)    (3)

[Figure: the RNN unrolled over five time steps, x1…x5 → h1…h5 → y1…y5.]

108
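Equations (1)–(2) translate almost line-for-line into code. A minimal NumPy sketch (assuming H = tanh rather than the LSTM variant quoted above; an illustration, not the paper's implementation):

    import numpy as np

    def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
        """Iterate eqs. (1)-(2) from t = 1 to T; returns hidden states and outputs."""
        h = np.zeros(W_hh.shape[0])                     # h_0
        hs, ys = [], []
        for x_t in xs:
            h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)    # eq. (1), with H = tanh
            hs.append(h)
            ys.append(W_hy @ h + b_y)                   # eq. (2)
        return hs, ys

    # Toy dimensions: I=4 inputs, J=8 hidden units, K=5 outputs, T=3 time steps.
    rng = np.random.default_rng(0)
    I, J, K, T = 4, 8, 5, 3
    params = (rng.normal(size=(J, I)), rng.normal(size=(J, J)),
              rng.normal(size=(K, J)), np.zeros(J), np.zeros(K))
    hs, ys = rnn_forward([rng.normal(size=I) for _ in range(T)], *params)
    print(len(hs), ys[0].shape)   # 3 time steps, one length-5 output per step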
Recall…

The Chain Rule of Probability
Question: How can we define a probability distribution over a sequence of length T?

The bat made noise at night
w1  w2  w3   w4    w5 w6

Chain rule of probability:

p(w1, w2, w3, … , w6) =

The                              p(w1)
The bat                          p(w2 | w1)
The bat made                     p(w3 | w2, w1)
The bat made noise               p(w4 | w3, w2, w1)
The bat made noise at            p(w5 | w4, w3, w2, w1)
The bat made noise at night      p(w6 | w5, w4, w3, w2, w1)

Note: This is called the chain rule because it is always true for every probability distribution.

109
RNN Language Model

RNN Language Model:

p(w1, w2, w3, … , w6) =


The p(w1)
The bat p(w2 | fθ(w1))
The bat made p(w3 | fθ(w2, w1))
The bat made noise p(w4 | fθ(w3, w2, w1))
The bat made noise at p(w5 | fθ(w4, w3, w2, w1))
The bat made noise at night p(w6 | fθ(w5, w4, w3, w2, w1))
Key Idea:
(1) convert all previous words to a fixed length vector
(2) define distribution p(wt | fθ(wt-1, …, w1)) that conditions on
the vector 110
RNN Language Model

The bat made noise at night END

p(w1|h1) p(w2|h2) p(w3|h3) p(w4|h4) p(w5|h5) p(w6|h6) p(w7|h7)

h0 h1 h2 h3 h4 h5 h6

START The bat made noise at night

Key Idea:
(1) convert all previous words to a fixed length vector
(2) define distribution p(wt | fθ(wt-1, …, w1)) that conditions on
the vector ht = fθ(wt-1, …, w1) 111
RNN Language Model

The

p(w1|h1)

h0

START

Key Idea:
(1) convert all previous words to a fixed length vector
(2) define distribution p(wt | fθ(wt-1, …, w1)) that conditions on
the vector ht = fθ(wt-1, …, w1) 112
RNN Language Model

bat

p(w2|h2)

h0 h1

START The

Key Idea:
(1) convert all previous words to a fixed length vector
(2) define distribution p(wt | fθ(wt-1, …, w1)) that conditions on
the vector ht = fθ(wt-1, …, w1) 113
RNN Language Model

made

p(w3|h3)

h0 h1 h2

START The bat

Key Idea:
(1) convert all previous words to a fixed length vector
(2) define distribution p(wt | fθ(wt-1, …, w1)) that conditions on
the vector ht = fθ(wt-1, …, w1) 114
RNN Language Model

noise

p(w4|h4)

h0 h1 h2 h3

START The bat made

Key Idea:
(1) convert all previous words to a fixed length vector
(2) define distribution p(wt | fθ(wt-1, …, w1)) that conditions on
the vector ht = fθ(wt-1, …, w1) 115
RNN Language Model

at

p(w5|h5)

h0 h1 h2 h3 h4

START The bat made noise

Key Idea:
(1) convert all previous words to a fixed length vector
(2) define distribution p(wt | fθ(wt-1, …, w1)) that conditions on
the vector ht = fθ(wt-1, …, w1) 116
RNN Language Model
Question: How can we create a distribution
p(wt|ht) from ht? night

Answer:
p(w6|h6)

h0 h1 h2 h3 h4 h5

START The bat made noise at

Key Idea:
(1) convert all previous words to a fixed length vector
(2) define distribution p(wt | fθ(wt-1, …, w1)) that conditions on
the vector ht = fθ(wt-1, …, w1) 117
RNN Language Model

END

p(w7|h7)

h0 h1 h2 h3 h4 h5 h6

START The bat made noise at night

Key Idea:
(1) convert all previous words to a fixed length vector
(2) define distribution p(wt | fθ(wt-1, …, w1)) that conditions on
the vector ht = fθ(wt-1, …, w1) 118
RNN Language Model

The bat made noise at night END

p(w1|h1) p(w2|h2) p(w3|h3) p(w4|h4) p(w5|h5) p(w6|h6) p(w7|h7)

h0 h1 h2 h3 h4 h5 h6

START The bat made noise at night

p(w1, w2, w3, … , wT) = p(w1 | h1) p(w2 | h2) … p(wT | hT)

119
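Putting the two steps of the key idea together, here is a hedged PyTorch sketch of an RNN language model and of sampling from it (a toy illustration; the layer sizes and the use of nn.Embedding/nn.RNN are choices made for this note, not the lecture's reference implementation):

    import torch
    from torch import nn

    class RNNLM(nn.Module):
        def __init__(self, vocab_size: int, dim: int = 64):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, dim)
            self.rnn = nn.RNN(dim, dim, batch_first=True)  # h_t = f_theta(w_{t-1}, ..., w_1)
            self.out = nn.Linear(dim, vocab_size)          # logits for p(w_t | h_t)

        def forward(self, tokens, h=None):
            states, h = self.rnn(self.emb(tokens), h)
            return self.out(states), h

    @torch.no_grad()
    def sample(model, start_id: int, max_len: int = 10):
        """Sample p(w_t | h_t) one step at a time, feeding each word back in."""
        tokens, h = [start_id], None
        for _ in range(max_len):
            logits, h = model(torch.tensor([[tokens[-1]]]), h)
            probs = logits[0, -1].softmax(dim=-1)
            tokens.append(torch.multinomial(probs, 1).item())
        return tokens

    model = RNNLM(vocab_size=100)
    print(sample(model, start_id=0))   # untrained, so the sample is random token ids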
Sampling from a Language Model
Question: How do we sample from a Language Model?
Answer:
1. Treat each probability distribution like a (50k-sided) weighted die
2. Pick the die corresponding to p(wt | wt-2, wt-1)
3. Roll that die and generate whichever word wt lands face up
4. Repeat

[Figure: the same row of weighted dice as before, one per conditional distribution p(· | wt-2, wt-1), generating "START The bat made noise at night" one word at a time.]

The same approach to sampling we used for an n-Gram Language Model also works here for an RNN Language Model.

120
Sampling from an RNN-LM
?? vs. ??: Which is the real Shakespeare?!

VIOLA: Why, Salisbury must find his flesh and thought That which I am not aps, not a man and in fire, To show the reining of the raven and the wars To grace my hand reproach within, and not a fair are hand, That Caesar and my goodly father's world; When I was heaven of presence and our fleets, We spare with hours, but cut thy council I am great, Murdered and by thy master's ready there My power to give thee but so much as hell: Some service in the noble bondman here, Would show him to her wine.

KING LEAR: O, if you were a feeble sight, the courtesy of your law, Your sight and several breath, will wear the gods With his heads, and my hands are wonder'd at the deeds, So drop upon your lordship's head, and your opinion Shall be against your honour.

CHARLES: Marry, do I, sir; and I came to acquaint you with a matter. I am given, sir, secretly to understand that your younger brother Orlando hath a disposition to come in disguised against me to try a fall. To-morrow, sir, I wrestle for my credit; and he that escapes me without some broken limb shall acquit him well. Your brother is but young and tender; and, for your love, I would be loath to foil him, as I must, for my own honour, if he come in: therefore, out of my love to you, I came hither to acquaint you withal, that either you might stay him from his intendment or brook such disgrace well as he shall run into, in that it is a thing of his own search and altogether against my will.

TOUCHSTONE: For my part, I had rather bear with you than bear you; yet I should bear no cross if I did bear you, for I think you have no money in your purse.

121
Example from http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Sampling from an RNN-LM

[Same two passages as the previous slide, here labeled "Shakespeare's As You Like It" (left) and "RNN-LM Sample" (right).]

122
Example from http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Sampling from an RNN-LM

[And again with the labels swapped: "RNN-LM Sample" (left) and "Shakespeare's As You Like It" (right).]

123
Example from http://karpathy.github.io/2015/05/21/rnn-effectiveness/
