DLAI4 Networks GANs
Contents
1 Introduction
2 Setup
3 Density estimation
4 Adversarial models
[email protected] 1 of 7
Deep Learning and Artificial Intelligence Epiphany 2024
1 Introduction
In this lecture we will look at the use of networks to generate data rather than summarise it. In particular, we will study the generative adversarial network (GAN): an intuitive method which has gained widespread modern use.
The essential idea of a GAN is to train, simultaneously, a model which produces simulated samples typical of the training data, and a model which tries to distinguish the true training data from the simulated data. The two models (usually both neural networks) can be thought of as being in competition with each other.
2 Setup
Suppose we have some data samples X1, X2, ..., Xn which are identically distributed according to a given distribution; that is, Xi ∼ X. Let us say that X has a density, which we will call fdata (everything we do will work equally well if X has a mass function instead).
Essentially, we want to be able to produce more samples Xi′ ∼ X. We will say that these samples are distributed as Xi′ ∼ fmodel, where fmodel is a distribution we will try to make equal to fdata.
In this lecture we will slightly abuse notation for brevity: we will sometimes write X ∼ f to mean that X has the PDF f. In our case, we will write X1, X2, ..., Xn ∼ fdata and X1′, X2′, ..., Xn′ ∼ fmodel.
3 Density estimation
We want to try and find a distribution fmodel which is similar to fdata. This is tantamount to estimating the distribution given by fdata. One standard way we might do this is parametrically, by assuming a statistical model for fmodel: that is, assuming that it comes from a family of distributions indexed by a parameter θ, so fmodel(x) = fmodel(x; θ).
We can then choose θ by maximum likelihood: that is, choose θ = θ∗ where θ∗ is defined as
\[
\theta^{*} = \arg\max_{\theta} \prod_{i=1}^{n} f_{\mathrm{model}}(X_i; \theta) = \arg\max_{\theta} \sum_{i=1}^{n} \ln\left[ f_{\mathrm{model}}(X_i; \theta) \right]
\]
We may also estimate densities non-parametrically; for instance, by kernel density estimates.
Given a specification of a probability density function it is (moderately) straightforward to simulate samples
from that distribution, at least in low dimensions.
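As a concrete illustration, here is a minimal sketch of the parametric route in Python, assuming a univariate Gaussian family for fmodel; the data-generating parameters and sample sizes are invented purely for the example.

import numpy as np

# Toy training data X_1, ..., X_n; the true distribution is hidden from the model.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=1000)

# Parametric estimation: assume f_model(x; theta) is Gaussian, theta = (mu, sigma).
# For this family the maximum-likelihood estimates have a closed form.
mu_hat = data.mean()
sigma_hat = data.std()             # np.std uses the 1/n estimator, i.e. the MLE

# Once f_model is specified, simulating new samples X'_i ~ f_model is straightforward.
new_samples = rng.normal(loc=mu_hat, scale=sigma_hat, size=10)
print(mu_hat, sigma_hat, new_samples[:3])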
4 Adversarial models
We will consider a ‘game’ between two players A and B. There is a ‘payoff’ V, which can be observed by both players; A wants it to be high and B wants it to be low. For example:
1. In a chess game between A and B, V may be the probability that A will win (with respect to some underlying probability model).
2. In a market, V is the price of some object: A (the seller) wants it to be high and B (the buyer) wants it to be low.
3. In a football game, the payoff V is team A’s score minus team B’s score.
Each player controls only some of the variables which affect V . Suppose A controls variables we will term x and
B controls variables we term y.
1. In a chess game, x is the movements of A’s pieces, and y the movements of B’s pieces.
2. In a market, x is the set of things A (the seller) controls, e.g. advertising and price, and y is B’s (the buyer’s) decision to buy.
3. In a football match, V is determined by the position of the ball, which is in turn determined by the positions
and actions of the players. The variables x describe the positions and actions of A’s players, and the variables
y describe the positions and actions of B’s players.
[email protected] 2 of 7
Deep Learning and Artificial Intelligence Epiphany 2024
In the case of an adversarial network, our two players are statistical models, which in our case are neural
networks.
• Player A is the generator, trying to generate new samples Xi′ ∼ X. It controls the variables x, which are the
parameters of a generative neural network.
• Player B is the discriminator, trying to tell the samples Xi′ apart from Xi . It controls the variables y, which are the parameters of a standard neural network.
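To make the two players concrete, here is a minimal sketch in PyTorch of what the generator and discriminator might look like; the layer sizes, code dimension and data dimension are illustrative assumptions, not part of the lecture.

import torch
import torch.nn as nn

CODE_DIM, DATA_DIM = 8, 2          # illustrative sizes for the code Z and the data X

# Player A: the generator G(Z, theta_g), mapping a random code Z to a sample X'.
generator = nn.Sequential(
    nn.Linear(CODE_DIM, 32), nn.ReLU(),
    nn.Linear(32, DATA_DIM),
)

# Player B: the discriminator D(X, theta_d), outputting a probability in (0, 1)
# used to classify its input as training data or generated data.
discriminator = nn.Sequential(
    nn.Linear(DATA_DIM, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Sigmoid(),
)

z = torch.randn(5, CODE_DIM)       # Z drawn from a simple 'code' distribution
fake = generator(z)                # X' = G(Z, theta_g)
print(discriminator(fake).shape)   # -> torch.Size([5, 1])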
In this case, the objective V is the error the discriminator makes in distinguishing the values Xi from the values Xi′. Player A wants this to be as large as possible (the generated samples cannot be told apart from the training data), while player B wants it minimised (perfect discrimination).
We are on the side of the generator: we want to be able to generate samples which match the distribution of the data. The objective we want to maximise is now
\[
\arg\max_{x} \left[ \min_{y} V(x, y) \right] \tag{1}
\]
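As a toy illustration of objective (1), suppose x and y can each take only three values, so that V is just a small table; the payoff numbers below are invented for the example.

import numpy as np

# Payoff table: rows are A's choices of x, columns are B's choices of y.
V = np.array([[3.0, 1.0, 4.0],
              [2.0, 5.0, 2.0],
              [0.0, 6.0, 1.0]])

worst_case = V.min(axis=1)         # for each x, B replies by minimising over y
x_star = int(worst_case.argmax())  # A then picks the x with the best worst case
print(x_star, worst_case[x_star])  # -> 1 2.0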
An important idea is that we will assess the discriminator by log-loss: that is, we will use the objective
\[
C(Y, \hat{Y}) = Y \ln(\hat{Y}) + (1 - Y) \ln(1 - \hat{Y}) \tag{2}
\]
where Y is an indicator variable which is 1 if the sample is in the generated data and 0 if it is in the training data, and Ŷ is the discriminator’s estimate of Y.
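A small sketch of evaluating objective (2) in Python; the predicted probabilities and the clipping constant are illustrative assumptions.

import numpy as np

def log_loss_objective(y, y_hat, eps=1e-12):
    """Objective (2): Y ln(Y_hat) + (1 - Y) ln(1 - Y_hat), averaged over samples."""
    y_hat = np.clip(y_hat, eps, 1 - eps)       # avoid ln(0)
    return np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1.0, 1.0, 0.0, 0.0])             # 1 = generated sample, 0 = training sample
y_hat = np.array([0.9, 0.7, 0.2, 0.1])         # discriminator's estimates of Y
print(log_loss_objective(y, y_hat))            # near 0 when discrimination is good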
[email protected] 3 of 7
Deep Learning and Artificial Intelligence Epiphany 2024
[email protected] 4 of 7
Deep Learning and Artificial Intelligence Epiphany 2024
5.3 Training
To train a GAN, we first need to evaluate the gradient of V with respect to θ(d) and θ(g). We do this as follows:
\[
\begin{aligned}
\nabla_{\theta^{(d)}} V(G, D)
&= \nabla_{\theta^{(d)}} \left\{ \mathbb{E}_{X \sim f_{\mathrm{data}}}\left[ \ln D(X, \theta^{(d)}) \right] + \mathbb{E}_{X \sim f_{\mathrm{model}}}\left[ \ln\left( 1 - D(X, \theta^{(d)}) \right) \right] \right\} \\
&= \mathbb{E}_{X \sim f_{\mathrm{data}}}\left[ \nabla_{\theta^{(d)}} \ln D(X, \theta^{(d)}) \right] + \mathbb{E}_{X \sim f_{\mathrm{model}}}\left[ \nabla_{\theta^{(d)}} \ln\left( 1 - D(X, \theta^{(d)}) \right) \right] \\
&= \mathbb{E}_{X \sim f_{\mathrm{data}}}\left[ \frac{1}{D(X, \theta^{(d)})} \frac{\partial D(X, \theta^{(d)})}{\partial \theta^{(d)}} \right] - \mathbb{E}_{X \sim f_{\mathrm{model}}}\left[ \frac{1}{1 - D(X, \theta^{(d)})} \frac{\partial D(X, \theta^{(d)})}{\partial \theta^{(d)}} \right]
\end{aligned}
\tag{8}
\]
which, given samples X1, X2, ..., XN ∼ fdata and X1′, X2′, ..., XN′ ∼ fmodel, we can estimate without bias as
\[
\nabla_{\theta^{(d)}} V(G, D) \approx \frac{1}{N} \sum_{i=1}^{N} \left. \frac{1}{D(x, \theta^{(d)})} \frac{\partial D(x, \theta^{(d)})}{\partial \theta^{(d)}} \right|_{x = X_i} - \frac{1}{N} \sum_{i=1}^{N} \left. \frac{1}{1 - D(x, \theta^{(d)})} \frac{\partial D(x, \theta^{(d)})}{\partial \theta^{(d)}} \right|_{x = X_i'}
\]
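In practice this estimate is rarely coded by hand: differentiating a Monte-Carlo estimate of V with automatic differentiation gives the same quantity. Below is a minimal PyTorch sketch; the network architecture and the stand-in samples are assumptions made only for illustration.

import torch
import torch.nn as nn

torch.manual_seed(0)
D = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

real = torch.randn(64, 2) + 2.0    # stand-in for X_1, ..., X_N ~ f_data
fake = torch.randn(64, 2)          # stand-in for X'_1, ..., X'_N ~ f_model

# Monte-Carlo estimate of V(G, D) as a function of theta^(d):
#   (1/N) sum ln D(X_i) + (1/N) sum ln(1 - D(X'_i))
v_hat = torch.log(D(real)).mean() + torch.log(1 - D(fake)).mean()

# Differentiating this estimate reproduces the two-sum estimator above;
# the 1/D and 1/(1 - D) factors come from differentiating the logarithms.
grads = torch.autograd.grad(v_hat, list(D.parameters()))
print([g.shape for g in grads])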
To evaluate the gradient with respect to θ(g) , we proceed as follows, using the chain rule:
\[
\begin{aligned}
\nabla_{\theta^{(g)}} V(G, D)
&= \nabla_{\theta^{(g)}} \left\{ \mathbb{E}_{X \sim f_{\mathrm{data}}}\left[ \ln D(X, \theta^{(d)}) \right] + \mathbb{E}_{X \sim f_{\mathrm{model}}}\left[ \ln\left( 1 - D(X, \theta^{(d)}) \right) \right] \right\} \\
&= \nabla_{\theta^{(g)}} \left\{ \mathbb{E}_{X \sim f_{\mathrm{data}}}\left[ \ln D(X, \theta^{(d)}) \right] + \mathbb{E}_{Z \sim f_{\mathrm{code}}}\left[ \ln\left( 1 - D\left( G(Z, \theta^{(g)}), \theta^{(d)} \right) \right) \right] \right\} \\
&= \mathbb{E}_{X \sim f_{\mathrm{data}}}\left[ \nabla_{\theta^{(g)}} \ln D(X, \theta^{(d)}) \right] + \mathbb{E}_{Z \sim f_{\mathrm{code}}}\left[ \nabla_{\theta^{(g)}} \ln\left( 1 - D\left( G(Z, \theta^{(g)}), \theta^{(d)} \right) \right) \right] \\
&= \mathbb{E}_{X \sim f_{\mathrm{data}}}\left[ 0 \right] + \mathbb{E}_{Z \sim f_{\mathrm{code}}}\left[ \nabla_{\theta^{(g)}} \ln\left( 1 - D\left( G(Z, \theta^{(g)}), \theta^{(d)} \right) \right) \right] \\
&= \mathbb{E}_{Z \sim f_{\mathrm{code}}}\left[ \nabla_{\theta^{(g)}} \ln\left( 1 - D\left( G(Z, \theta^{(g)}), \theta^{(d)} \right) \right) \right] \\
&= \mathbb{E}_{Z \sim f_{\mathrm{code}}}\left[ -\frac{1}{1 - D\left( G(Z, \theta^{(g)}), \theta^{(d)} \right)} \left. \frac{\partial D(x, \theta^{(d)})}{\partial x} \right|_{x = G(Z, \theta^{(g)})} \frac{\partial G(Z, \theta^{(g)})}{\partial \theta^{(g)}} \right]
\end{aligned}
\]
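The generator gradient can be estimated in the same way, by differentiating through the composition D(G(Z, θ(g)), θ(d)); a minimal PyTorch sketch follows, again with assumed network shapes and an assumed code distribution.

import torch
import torch.nn as nn

torch.manual_seed(0)
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
D = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

z = torch.randn(64, 8)             # Z_1, ..., Z_N ~ f_code

# Only the second term of V depends on theta^(g):
#   (1/N) sum ln(1 - D(G(Z_i, theta_g), theta_d))
v_g = torch.log(1 - D(G(z))).mean()

# Autograd applies the chain rule through D and then G, i.e. the
# -(1/(1 - D)) * dD/dx * dG/dtheta_g product in the last line above.
grads = torch.autograd.grad(v_g, list(G.parameters()))
print([g.shape for g in grads])

In a full training loop the two players would alternate gradient steps on their own parameters, each using its own Monte-Carlo estimate of the gradient of V.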
[email protected] 5 of 7
Deep Learning and Artificial Intelligence Epiphany 2024
[email protected] 6 of 7
Deep Learning and Artificial Intelligence Epiphany 2024
Exercises
1. If X = G(Z, θ(g)), what can be said about the σ-algebras generated by X and Z?
\[
\operatorname{var}_{X \sim f_{\mathrm{data}}}\left[ \frac{1}{D(X, \theta^{(d)})} \frac{\partial D(X, \theta^{(d)})}{\partial \theta^{(d)}} \right] = V_1
\qquad
\operatorname{var}_{X \sim f_{\mathrm{model}}}\left[ \frac{1}{1 - D(X, \theta^{(d)})} \frac{\partial D(X, \theta^{(d)})}{\partial \theta^{(d)}} \right] = V_2
\]
[email protected] 7 of 7