
Deep Learning and Artificial Intelligence Epiphany 2024

Lecture 15- Generative Adversarial Networks


James Liley

Reading list and references

[Calin, 2020, Chapter 19]


[Zhang et al., 2021, Chapter 20]

Contents

1 Introduction
2 Setup
3 Density estimation
4 Adversarial models
5 Generative adversarial networks
  5.1 Loss function
  5.2 Optimal discriminator
  5.3 Training
  5.4 Mode collapse

[email protected] 1 of 7

1 Introduction
In this lecture we will look at the use of networks to generate data rather than summarise it. In particular, we will
look at the idea of the generative adversarial network (GAN): an intuitive method which has gained widespread
modern use.
The essential idea of a GAN is to simultaneously train a model to produce simulated samples typical of the training
data, and a model which tries to differentiate the true training data from the simulated samples. The two
models (usually both neural networks) can be thought of as being in competition with each other.

2 Setup
Suppose we have some data samples X1 , X2 , . . . , Xn which are identically distributed according to a given distribution; that is, Xi ∼ X. Let us say that X has a density, which we will call fdata (everything we will do will work
equally well if X has a mass function instead).
Essentially, we want to be able to produce more samples Xi′ ∼ X. We will generally say that the samples Xi′ have density fmodel , which we will try to make equal to fdata .
In this lecture, we will slightly abuse notation for brevity: we will sometimes write X ∼ f , by which we mean that
X has the PDF f . In our case, we will write X1 , X2 , . . . , Xn ∼ fdata and Xi′ ∼ fmodel .

3 Density estimation
We want to try and find a distribution fmodel which is similar to fdata . This is tantamount to estimating the
distribution given by fdata . One standard way we might do this is parametrically, by assuming a statistical model
for fmodel : that is, assuming that it comes from a family of distributions indexed by a parameter θ, so fmodel (x) =
fmodel (x; θ).
We can then choose θ by a maximum-likelihood estimate: that is, choose θ = θ∗ where θ∗ is defined as:

θ∗ = arg max_θ ∏_{i=1}^{n} fmodel (Xi ; θ) = arg max_θ Σ_{i=1}^{n} ln [fmodel (Xi ; θ)]

We may also estimate densities non-parametrically; for instance, by kernel density estimates.
Given a specification of a probability density function it is (moderately) straightforward to simulate samples
from that distribution, at least in low dimensions.
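As a concrete (and entirely hypothetical) illustration of the parametric route, the sketch below fits a Gaussian family fmodel (x; θ), θ = (µ, σ), to samples by maximum likelihood — for a Gaussian, θ∗ has a closed form — and then simulates new samples from the fitted density. The names f_model, mu_star and sigma_star are ours, not from the lecture.

```python
import math
import random

random.seed(0)

# Training samples X_1, ..., X_n; here f_data is (secretly) N(2, 1).
data = [random.gauss(2.0, 1.0) for _ in range(10_000)]

# Maximum-likelihood estimate theta* = (mu*, sigma*) for the Gaussian family:
# mu* is the sample mean, sigma*^2 the maximum-likelihood (biased) variance.
n = len(data)
mu_star = sum(data) / n
sigma_star = math.sqrt(sum((x - mu_star) ** 2 for x in data) / n)

def f_model(x):
    """Fitted density f_model(x; theta*)."""
    return math.exp(-0.5 * ((x - mu_star) / sigma_star) ** 2) / (
        sigma_star * math.sqrt(2.0 * math.pi)
    )

# With an explicit fitted density, simulating new samples X'_i ~ f_model is easy.
new_samples = [random.gauss(mu_star, sigma_star) for _ in range(5)]
```

With 10,000 samples, mu_star and sigma_star land close to the true values (2 and 1); the GAN approach of the following sections avoids writing down any density at all.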

4 Adversarial models
We will consider a ‘game’ between two players A and B. There is a ‘payoff’ V , which can be observed by both
players, and A wants it to be high and B wants it to be low. For example

1. In a chess game between A and B, V may be the probability that A will win (according to some probability
space)
2. In a market, V is the price of some object, and A (the seller) wants it to be high and B (the buyer) wants it
to be low.

3. In a football game, the payoff V is team A’s score minus team B’s score.
Each player controls only some of the variables which affect V . Suppose A controls variables we will term x and
B controls variables we term y.
1. In a chess game, x is the movements of A’s pieces, and y the movements of B’s

2. In a market, x is the set of things the seller (A) controls, e.g. advertising and price, and y is the buyer's decision whether to buy.
3. In a football match, V is determined by the position of the ball, which is in turn determined by the positions
and actions of the players. The variables x describe the positions and actions of A’s players, and the variables
y describe the positions and actions of B’s players.

[email protected] 2 of 7
Deep Learning and Artificial Intelligence Epiphany 2024

In the case of an adversarial network, our two players are statistical models, which in our case are neural
networks.
• Player A is the generator, trying to generate new samples Xi′ ∼ X. It controls the variables x, which are the
parameters of a generative neural network.
• Player B is the discriminator, trying to tell the samples Xi′ apart from the Xi . It controls the variables y, which
are the parameters of a standard neural network.
In this case, the objective V is the negative of the error in distinguishing the values Xi from the values Xi′ . Player
A wants this to be as large as possible (e.g. close to 0), while player B wants it minimised (e.g., wants a large
discrimination error).
We are on the side of the generator: we want to be able to generate samples which the distribution. The
objective we want to maximise is now: h i
arg max min V (x, y) (1)
y x

5 Generative adversarial networks


A generative adversarial network implements this idea. Formally, our generative network is a function
G(z, θ(g) ) : Z → X
which maps a latent space Z to the same space X from which the data comes. To use the network, we generate
values z in Z randomly according to a probability density fcode (z), which remains fixed.
As above, we denote the distribution of simulated samples as fmodel (x). If Z has PDF fZ = fcode and

X = G(Z, θ(g) )

then X has PDF fX = fmodel . We generally have that Z has smaller dimension than X.
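This construction — a fixed code distribution pushed through a deterministic map — is also how classical simulation methods work. As a toy sketch (our own example, not from the lecture): take fcode = Uniform(0, 1) and let G be the inverse CDF of the Exponential(1) distribution, so that X = G(Z) has fmodel = Exponential(1).

```python
import math
import random

random.seed(0)

def g(z):
    # Deterministic map G : Z -> X. With Z ~ Uniform(0, 1) (the fixed f_code),
    # this inverse-CDF map gives X ~ Exponential(1), i.e. f_model(x) = e^(-x).
    return -math.log(1.0 - z)

samples = [g(random.random()) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)  # Exponential(1) has mean 1
```

In a GAN, G is a neural network with parameters θ(g) rather than a known inverse CDF, but the sampling mechanism — fixed fcode , learned map — is identical.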
The discriminator network is a function
D(x, θ(d) ) : X → [0, 1]
which takes a sample and returns a value between 0 and 1, where
• D(x, θ(d) ) = 1 if the discriminator considers that x comes from the training data
• D(x, θ(d) ) = 0 if the discriminator considers that x comes from the generated data
We illustrate a GAN as follows:

Figure 1: Diagram of GAN; reproduced from Calin [2020]

An important idea is that we will assess the discriminator by log-loss: that is, we will use the objective

C(Y, Ŷ ) = Y ln(Ŷ ) + (1 − Y ) ln(1 − Ŷ )    (2)

where Y is an indicator variable which is 1 if the sample is in the training data and 0 if it is in the generated
data. Note that C is the negative of the usual log-loss, so the discriminator will aim to make it large.

[email protected] 3 of 7
Deep Learning and Artificial Intelligence Epiphany 2024

5.1 Loss function


Suppose we have a random variable W which, with probability p, takes a value Xi with PDF fdata and, with
probability (1 − p), takes a value Xi′ with PDF fmodel . We let Y be the indicator variable of whether the sample comes
from fdata (in which case Y = 1) or fmodel (in which case Y = 0). Then we have:

P (Y = 1) = p
fW |Y =1 (x) = fdata (x)
fW |Y =0 (x) = fmodel (x)    (3)
The quantity we want the discriminator to maximise can be written as:

V (G, D) = E_W { C(Y, Ŷ ) }
         = E_W { C(Y, D(W, θ(d) )) }
         = E_Y [ E_{W |Y } { C(Y, D(W, θ(d) )) } ]
         = P (Y = 1) E_{W |Y =1} { C(Y, D(W, θ(d) )) } + P (Y = 0) E_{W |Y =0} { C(Y, D(W, θ(d) )) }
         = p E_{X∼fdata} { C(1, D(X, θ(d) )) } + (1 − p) E_{X∼fmodel} { C(0, D(X, θ(d) )) }
         = p E_{X∼fdata} { ln [D(X, θ(d) )] } + (1 − p) E_{X∼fmodel} { ln [1 − D(X, θ(d) )] }    (4)

and we usually take p = 1/2 and ignore the common factor of 1/2, to get:

V (G, D) = E_{X∼fdata} { ln [D(X, θ(d) )] } + E_{X∼fmodel} { ln [1 − D(X, θ(d) )] }
         = E_{X∼fdata} { ln [D(X, θ(d) )] } + E_{Z∼fcode} { ln [1 − D(G(Z, θ(g) ), θ(d) )] }
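Both expectations in the last expression can be estimated by Monte Carlo from samples alone, which is what makes this objective trainable. A sketch with hypothetical stand-ins for D and G (a logistic unit and a linear map; neither comes from the lecture):

```python
import math
import random

random.seed(0)

d = lambda x: 1.0 / (1.0 + math.exp(-x))  # stand-in discriminator D(x, theta_d)
g = lambda z: 0.5 * z                     # stand-in generator G(z, theta_g)

n = 50_000
real = [random.gauss(1.0, 1.0) for _ in range(n)]  # X_i ~ f_data (here N(1, 1))
code = [random.gauss(0.0, 1.0) for _ in range(n)]  # Z_i ~ f_code = N(0, 1)

# V(G, D) ~= (1/n) sum_i ln D(X_i) + (1/n) sum_i ln(1 - D(G(Z_i)))
v_hat = (
    sum(math.log(d(x)) for x in real) / n
    + sum(math.log(1.0 - d(g(z))) for z in code) / n
)
```

Both log terms are negative, so the estimate is below 0; a discriminator that is confident and correct on both sets of samples pushes it toward 0.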

5.2 Optimal discriminator


We now show a theorem which should look somewhat familiar (Proposition 19.4.3 of Calin [2020]):

Theorem 1. For fixed θ(g) , the discriminator which maximises V (G, D) is given by:

D(x) = fdata (x) / (fdata (x) + fmodel (x))    (5)

Proof. We will omit the arguments θ(d) and θ(g) , since they are not needed.
Since both fdata and fmodel are defined on the same space X , we write:

V (G, D) = ∫_X ( ln [D(x)] fdata (x) + ln [1 − D(x)] fmodel (x) ) dx

which can be considered a functional of D(x); we write

V (G, D) = ∫_X L(x, D(x)) dx    (6)

where the ‘Lagrangian’ L(x, D(x)) is given by

L(x, D(x)) = ln [D(x)] fdata (x) + ln [1 − D(x)] fmodel (x)

For this to be maximised, we must have

0 = ∂L/∂D = fdata (x)/D(x) − fmodel (x)/(1 − D(x))

from which

D(x) = fdata (x) / (fdata (x) + fmodel (x))    (7)

Of course, we don’t know what fdata is - that’s the whole objective!
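Although fdata is unknown in practice, we can sanity-check Theorem 1 numerically when both densities are chosen by hand. The sketch below (our own example: fdata = N(0, 1), fmodel = N(1, 1)) approximates V (G, D) by a Riemann sum and confirms that D∗ scores higher than two alternative discriminators.

```python
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

f_data = lambda x: gauss_pdf(x, 0.0, 1.0)   # hand-picked "true" density
f_model = lambda x: gauss_pdf(x, 1.0, 1.0)  # hand-picked generator density

def d_star(x):
    # The optimal discriminator of Theorem 1.
    return f_data(x) / (f_data(x) + f_model(x))

def v(d, lo=-8.0, hi=9.0, n=20_000):
    # Riemann-sum approximation of
    # V(G, D) = integral of ln[D(x)] f_data(x) + ln[1 - D(x)] f_model(x) dx.
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        total += (math.log(d(x)) * f_data(x) + math.log(1.0 - d(x)) * f_model(x)) * h
    return total

v_opt = v(d_star)
v_constant = v(lambda x: 0.5)                          # always guesses 50/50
v_shifted = v(lambda x: min(0.99, d_star(x) + 0.05))   # a perturbed D*
```

v_opt exceeds both alternatives, as the theorem predicts; the constant discriminator attains V = ln(1/2) + ln(1/2) ≈ −1.386.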

[email protected] 4 of 7
Deep Learning and Artificial Intelligence Epiphany 2024

5.3 Training
To train a GAN, we first need to evaluate the gradient of V with respect to θ(d) and θ(g) . For the discriminator:

∇_{θ(d)} V (G, D) = ∇_{θ(d)} ( E_{X∼fdata} { ln [D(X, θ(d) )] } + E_{X∼fmodel} { ln [1 − D(X, θ(d) )] } )
                  = E_{X∼fdata} { ∇_{θ(d)} ln [D(X, θ(d) )] } + E_{X∼fmodel} { ∇_{θ(d)} ln [1 − D(X, θ(d) )] }
                  = E_{X∼fdata} { (1 / D(X, θ(d) )) ∂D(X, θ(d) )/∂θ(d) } − E_{X∼fmodel} { (1 / (1 − D(X, θ(d) ))) ∂D(X, θ(d) )/∂θ(d) }    (8)

which, given samples X1 , X2 , . . . , XN ∼ fdata and X1′ , X2′ , . . . , XN′ ∼ fmodel , we can make an unbiased estimate of as:

∇_{θ(d)} V (G, D) ≈ (1/N ) Σ_{i=1}^{N} (1 / D(Xi , θ(d) )) ∂D(x, θ(d) )/∂θ(d) |_{x=Xi} − (1/N ) Σ_{i=1}^{N} (1 / (1 − D(Xi′ , θ(d) ))) ∂D(x, θ(d) )/∂θ(d) |_{x=Xi′}

since, denoting X = {X1 , X2 , . . . , XN } and X′ = {X1′ , X2′ , . . . , XN′ }:

E_{X,X′} { (1/N ) Σ_{i=1}^{N} (1 / D(Xi , θ(d) )) ∂D(x, θ(d) )/∂θ(d) |_{x=Xi} − (1/N ) Σ_{i=1}^{N} (1 / (1 − D(Xi′ , θ(d) ))) ∂D(x, θ(d) )/∂θ(d) |_{x=Xi′} }
  = (1/N ) Σ_{i=1}^{N} E_{X∼fdata} { (1 / D(X, θ(d) )) ∂D(X, θ(d) )/∂θ(d) } − (1/N ) Σ_{i=1}^{N} E_{X∼fmodel} { (1 / (1 − D(X, θ(d) ))) ∂D(X, θ(d) )/∂θ(d) }
  = (N/N ) E_{X∼fdata} { (1 / D(X, θ(d) )) ∂D(X, θ(d) )/∂θ(d) } − (N/N ) E_{X∼fmodel} { (1 / (1 − D(X, θ(d) ))) ∂D(X, θ(d) )/∂θ(d) }
  = E_{X∼fdata} { (1 / D(X, θ(d) )) ∂D(X, θ(d) )/∂θ(d) } − E_{X∼fmodel} { (1 / (1 − D(X, θ(d) ))) ∂D(X, θ(d) )/∂θ(d) }
  = ∇_{θ(d)} V (G, D)

To evaluate the gradient with respect to θ(g) , we proceed as follows, using the chain rule:

∇_{θ(g)} V (G, D) = ∇_{θ(g)} ( E_{X∼fdata} { ln [D(X, θ(d) )] } + E_{X∼fmodel} { ln [1 − D(X, θ(d) )] } )
                  = ∇_{θ(g)} ( E_{X∼fdata} { ln [D(X, θ(d) )] } + E_{Z∼fcode} { ln [1 − D(G(Z, θ(g) ), θ(d) )] } )
                  = E_{X∼fdata} { 0 } + E_{Z∼fcode} { ∇_{θ(g)} ln [1 − D(G(Z, θ(g) ), θ(d) )] }
                  = E_{Z∼fcode} { ∇_{θ(g)} ln [1 − D(G(Z, θ(g) ), θ(d) )] }
                  = E_{Z∼fcode} { −(1 / (1 − D(G(Z, θ(g) ), θ(d) ))) ∂D(x, θ(d) )/∂x |_{x=G(Z,θ(g) )} ∂G(Z, θ(g) )/∂θ(g) }

which can be approximated, given (only) samples Z1′ , Z2′ , . . . , ZN′ ∼ fcode , as:

∇_{θ(g)} V (G, D) ≈ (1/N ) Σ_{i=1}^{N} −(1 / (1 − D(G(Zi′ , θ(g) ), θ(d) ))) ∂D(x, θ(d) )/∂x |_{x=G(Zi′ ,θ(g) )} ∂G(Z, θ(g) )/∂θ(g) |_{Z=Zi′}

which, as above, is unbiased.


In practice, when training, we tend to go from a situation where the generator is useless and the discriminator
is fairly good to a situation where the discriminator is useless and the generator is fairly good.
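To make the training loop concrete, here is a deliberately tiny sketch, entirely our own construction rather than anything from the lecture: fdata = N(3, 1), a one-parameter generator G(z; θ(g) ) = θ(g) + z with fcode = N(0, 1), and a logistic discriminator D(x; θ(d) ) = σ(a + bx). Each iteration takes a few ascent steps on the discriminator gradient estimate above, then one descent step on the generator estimate. A real GAN replaces both models with neural networks and uses automatic differentiation.

```python
import math
import random

random.seed(1)

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

mu = 0.0          # generator parameter theta_g: G(z) = mu + z
a, b = 0.0, 0.0   # discriminator parameters theta_d: D(x) = sigmoid(a + b*x)
lr, batch, d_steps = 0.05, 128, 3

for step in range(1500):
    # A few ascent steps for the discriminator on its unbiased gradient estimate.
    for _ in range(d_steps):
        real = [random.gauss(3.0, 1.0) for _ in range(batch)]
        fake = [mu + random.gauss(0.0, 1.0) for _ in range(batch)]
        # d/da ln D = 1 - D on real samples; d/da ln(1 - D) = -D on fake samples
        # (the gradients in b carry an extra factor of x).
        grad_a = sum(1.0 - sigmoid(a + b * x) for x in real) / batch \
               - sum(sigmoid(a + b * x) for x in fake) / batch
        grad_b = sum((1.0 - sigmoid(a + b * x)) * x for x in real) / batch \
               - sum(sigmoid(a + b * x) * x for x in fake) / batch
        a += lr * grad_a
        b += lr * grad_b

    # One descent step for the generator:
    # d/d(mu) ln(1 - D(G(z))) = -D(G(z)) * b, since dG/d(mu) = 1.
    fake = [mu + random.gauss(0.0, 1.0) for _ in range(batch)]
    grad_mu = sum(-sigmoid(a + b * x) * b for x in fake) / batch
    mu -= lr * grad_mu

# mu should end near the data mean of 3, at which point the discriminator
# can no longer separate real from fake and b shrinks toward 0.
```

The several-discriminator-steps-per-generator-step schedule keeps the discriminator close to its best response, which in this toy problem makes the generator's descent behave like descent on a smooth divergence between fmodel and fdata.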

[email protected] 5 of 7
Deep Learning and Artificial Intelligence Epiphany 2024

5.4 Mode collapse


A common problem with GANs is the phenomenon of ‘mode collapse’. In this setting, the generator learns to
accurately simulate only a small part of the distribution fdata . The canonical example is training a GAN on the
MNIST dataset (which we looked at in practicals) to see if the generator can draw realistic-looking digits, and
finding that the generator learns only to draw realistic-looking 7s. In this way, the generator has missed the ‘modes’
of the training data corresponding to all the other digits.
This can arise due to differing rates of training between the discriminator and generator. If the generator can
learn quickly while the discriminator cannot, then the generator may learn to accurately reproduce 7s while
the discriminator is not yet capable of penalising this. By the time the generator has improved, the learning rate of the
discriminator has slowed, and it can no longer make large enough changes to improve.

[email protected] 6 of 7
Deep Learning and Artificial Intelligence Epiphany 2024

Exercises

1. If X = G(Z, θ(g) ), what can be said about the σ-algebras generated by X and Z?

2. Give the unconditional PDF of W , defined in (3).

3. Assume G(·, θ(g) ) is invertible. Express fmodel through the functions G(·, θ(g) ) and fcode .

4. From the proof of Theorem 1, show that the given value of D(x), as a stationary point of V (G, D), must
be a maximum.

5. We have a free choice of the parameter p in equation (4).


(a) Restate the value of ∇_{θ(d)} V (G, D) in (8) with p retained.

(b) Give an unbiased estimator of your new version of ∇_{θ(d)} V (G, D) (you may assume p is known).

(c) Suppose that you happen to know that:

var_{X∼fdata} { (1 / D(X, θ(d) )) ∂D(X, θ(d) )/∂θ(d) } = V1
var_{X∼fmodel} { (1 / (1 − D(X, θ(d) ))) ∂D(X, θ(d) )/∂θ(d) } = V2

What values might you want to choose for p?

References
Ovidiu Calin. Deep Learning Architectures: A Mathematical Approach. Springer, 2020.

Aston Zhang, Zachary C Lipton, Mu Li, and Alexander J Smola. Dive into deep learning. arXiv preprint
arXiv:2106.11342, 2021. URL https://d2l.ai/index.html.

[email protected] 7 of 7

You might also like