lec_12_generative_adversarial_networks
12.1
Generative Adversarial Networks
Recap: Latent Variable Models
LVMs map between observation space x ∈ R^D and latent space z ∈ R^Q:
(fw : x ↦ z)    gw : z ↦ x̂
I One latent variable gets associated with each data point in the training set
I The latent vectors are smaller than the observations (Q < D) ⇒ compression
I Models are linear or non-linear, deterministic or stochastic, with/without encoder
A little taxonomy:

                         Deterministic                    Probabilistic
Linear                   Principal Component Analysis     Probabilistic PCA
Non-Linear w/ Encoder    Autoencoder                      Variational Autoencoder
Non-Linear w/o Encoder                                    Generative Adversarial Networks
Generative Models
I The term generative model refers to any model that takes a dataset drawn from
pdata and learns a probability distribution pmodel to represent pdata
I In some cases, the model estimates pmodel explicitly and therefore allows for
evaluating the (approximate) likelihood/density pmodel (x) of a sample x
I In other cases, the model is only able to generate samples from pmodel
I GANs are prominent examples of this family of implicit models
I They provide a framework for training models without explicit likelihood
Generative Adversarial Networks
Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, Bengio: Generative Adversarial Networks. NIPS, 2014. 8
Generative Adversarial Networks
D and G play the following two-player minimax game with value function V (D, G):

min_G max_D V (D, G) = Ex∼pdata [log D(x)] + Ez∼p(z) [log(1 − D(G(z)))]

We train D to assign probability one to samples from pdata and zero to samples from
pmodel , and G to fool D such that it assigns probability one to samples from pmodel .
Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, Bengio: Generative Adversarial Networks. NIPS, 2014. 9
Generative Adversarial Networks
[Diagram: the generator network G maps latent samples z to fake samples x̂;
the discriminator network D receives real samples x and fake samples x̂ and
outputs the probability that its input is real]
I Theoretical analysis shows that this minimax game recovers pmodel = pdata
if G and D are given enough capacity and assuming that D∗ can be reached
I In practice, however, we must use iterative numerical optimization, and optimizing
D in the inner loop to completion is computationally prohibitive and would lead to
overfitting on finite datasets
I Therefore, we resort to alternating optimization:
I k steps of optimizing D (typically k ∈ {1, . . . , 5})
I 1 step of optimizing G (using a small enough learning rate)
I This way, we maintain D near its optimal solution as long as G changes slowly
Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, Bengio: Generative Adversarial Networks. NIPS, 2014. 11
Algorithm
While not converged do
1. For k steps do
   1.1 Draw B training samples {x1 , . . . , xB } from pdata (x)
   1.2 Draw B latent samples {z1 , . . . , zB } from p(z)
   1.3 Update the discriminator D by ascending its stochastic gradient:
       ∇wD 1/B ∑b=1..B [log D(xb ) + log(1 − D(G(zb )))]
2. Draw B latent samples {z1 , . . . , zB } from p(z)
3. Update the generator G by descending its stochastic gradient:
       ∇wG 1/B ∑b=1..B log(1 − D(G(zb )))
Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, Bengio: Generative Adversarial Networks. NIPS, 2014. 12
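As a concrete illustration, here is a minimal PyTorch sketch of this alternating optimization. All names (G, D, dataloader, latent_dim) and hyperparameters are assumptions for illustration, not part of the original algorithm; D is assumed to output probabilities, and the generator update uses the non-saturating loss discussed on the next slide.

```python
import torch

def train_gan(G, D, dataloader, latent_dim, k=1, epochs=10, device="cpu"):
    """Alternating GAN training: k discriminator steps per generator step."""
    opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
    opt_G = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
    for _ in range(epochs):
        for x in dataloader:                    # x: batch drawn from pdata
            x = x.to(device)
            B = x.size(0)
            for _ in range(k):                  # k steps of optimizing D
                z = torch.randn(B, latent_dim, device=device)
                # ascend log D(x) + log(1 - D(G(z))) = descend its negation
                loss_D = -(torch.log(D(x) + 1e-8).mean()
                           + torch.log(1 - D(G(z).detach()) + 1e-8).mean())
                opt_D.zero_grad(); loss_D.backward(); opt_D.step()
            # 1 step of optimizing G (non-saturating "gradient trick" loss)
            z = torch.randn(B, latent_dim, device=device)
            loss_G = -torch.log(D(G(z)) + 1e-8).mean()
            opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```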
The Gradient Trick
Early in training, when G is poor, D can reject generated samples with high
confidence, so log(1 − D(G(z))) saturates and provides vanishing gradients.
Instead of training G to minimize log(1 − D(G(z))), we can train it to maximize
log D(G(z)): this yields the same fixed point but much stronger gradients early in learning.
Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, Bengio: Generative Adversarial Networks. NIPS, 2014. 13
Expressiveness
Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, Bengio: Generative Adversarial Networks. NIPS, 2014. 14
1D Example
A simple example with linear generator:

x ∼ N (µ, σ)
z ∼ N (0, 1)
G(z) = w0^G + w1^G z
D(x) = σ(w0^D + w1^D x + w2^D x²)

I Here, we consider the data distribution and the prior as two different 1D
Gaussians and initialize G(z) = z and D(x) = σ(x), i.e., pmodel (x) = N (x | 0, 1)
I The goal is to learn wG and wD such that pmodel (x) = pdata (x) = N (x | µ, σ)
I Remark: The x² term is needed to provide gradients for the second moment
Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, Bengio: Generative Adversarial Networks. NIPS, 2014. 15
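A runnable sketch of this toy problem under the setup above (batch size, learning rates, and iteration count are assumptions):

```python
import torch

mu, sigma = 4.0, 0.5                                    # pdata = N(mu, sigma)
wG = torch.tensor([0.0, 1.0], requires_grad=True)       # init G(z) = z
wD = torch.tensor([0.0, 1.0, 0.0], requires_grad=True)  # init D(x) = sigmoid(x)

G = lambda z: wG[0] + wG[1] * z
D = lambda x: torch.sigmoid(wD[0] + wD[1] * x + wD[2] * x**2)

opt_D = torch.optim.Adam([wD], lr=1e-2)
opt_G = torch.optim.Adam([wG], lr=1e-3)         # small enough G learning rate
for it in range(2500):
    x = mu + sigma * torch.randn(64)            # samples from pdata
    z = torch.randn(64)                         # samples from p(z) = N(0, 1)
    loss_D = -(torch.log(D(x)) + torch.log(1 - D(G(z).detach()))).mean()
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    z = torch.randn(64)
    loss_G = -torch.log(D(G(z))).mean()         # non-saturating generator loss
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
# at convergence, wG ≈ (mu, sigma), i.e. pmodel(x) ≈ N(x | mu, sigma)
```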
1D Example
[Plot sequence at iterations 0, 500, 1000, 1500, 2000, 2500, plus a second run at
iterations 0 and 4000: the densities pdata(x), pmodel(x), and p(z) over x, with the
discriminator output D(x) overlaid]
Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, Bengio: Generative Adversarial Networks. NIPS, 2014.
Theoretical Results
Generative Adversarial Networks
D and G play the following two-player minimax game with value function V (D, G):

min_G max_D V (D, G) = Ex∼pdata [log D(x)] + Ez∼p(z) [log(1 − D(G(z)))]

We train D to assign probability one to samples from pdata and zero to samples from
pmodel , and G to fool D such that it assigns probability one to samples from pmodel .
Optimal Discriminator
Proposition 1. For any given generator G, the optimal discriminator D is:

D∗G (x) = pdata (x) / (pdata (x) + pmodel (x))

Proof. The training criterion for the discriminator D is to maximize (wrt. D):

V (D, G) = ∫x pdata (x) log(D(x)) dx + ∫z p(z) log(1 − D(G(z))) dz
         = ∫x pdata (x) log(D(x)) + pmodel (x) log(1 − D(x)) dx

For any (a, b) ∈ R² \ {(0, 0)}, the function y ↦ a log(y) + b log(1 − y) achieves its
maximum in [0, 1] at a/(a + b). The discriminator D does not need to be defined outside
Supp(pdata ) ∪ Supp(pmodel ), where pdata = 0 and pmodel = 0, concluding the proof.
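A quick numerical sanity check of this pointwise argument (illustrative; the two Gaussians playing the role of pdata and pmodel are arbitrary choices):

```python
import numpy as np
from scipy.stats import norm

y = np.linspace(1e-6, 1 - 1e-6, 200001)
for x in np.linspace(-5.0, 5.0, 11):
    a = norm.pdf(x, loc=1.0, scale=1.0)       # pdata(x)
    b = norm.pdf(x, loc=0.0, scale=2.0)       # pmodel(x)
    # maximizer of a*log(y) + b*log(1 - y) over [0, 1] should be a/(a + b)
    y_star = y[np.argmax(a * np.log(y) + b * np.log(1 - y))]
    assert abs(y_star - a / (a + b)) < 1e-4   # matches D*_G(x)
```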
Global Optimality
Theorem 1. The global minimum of the virtual training criterion

V (D∗G , G) = Ex∼pdata [log D∗G (x)] + Ex∼pmodel [log(1 − D∗G (x))]
            = Ex∼pdata [log (pdata (x) / (pdata (x) + pmodel (x)))]
              + Ex∼pmodel [log (pmodel (x) / (pdata (x) + pmodel (x)))]

is achieved for pmodel = pdata , where D∗G = 1/2 and V (D∗G , G) = − log 4 ≈ −1.386.
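The key step behind this result (spelled out in the original paper, not on the slide) is to add and subtract log 4 and identify the Jensen-Shannon divergence:

```latex
V(D_G^*, G)
  = -\log 4
  + \mathrm{KL}\left(p_{\mathrm{data}} \,\middle\|\, \frac{p_{\mathrm{data}} + p_{\mathrm{model}}}{2}\right)
  + \mathrm{KL}\left(p_{\mathrm{model}} \,\middle\|\, \frac{p_{\mathrm{data}} + p_{\mathrm{model}}}{2}\right)
  = -\log 4 + 2 \, \mathrm{JSD}(p_{\mathrm{data}} \,\|\, p_{\mathrm{model}})
```

Since the JSD is non-negative and zero exactly when the two distributions coincide, the global minimum − log 4 is attained precisely at pmodel = pdata.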
Proposition 2. If G and D have enough capacity, and at each update step the
discriminator D is allowed to reach D = D∗G , and pmodel is updated so as to improve

V (D∗G , pmodel ) = Ex∼pdata [log D∗G (x)] + Ex∼pmodel [log(1 − D∗G (x))]

then pmodel converges to pdata .

Proof. As a function of pmodel , the criterion satisfies

V (D∗G , pmodel ) ∝ sup_D ∫x pmodel (x) log(1 − D(x)) dx

The argument of the supremum is convex in pmodel . The supremum doesn't
change convexity, thus V (D∗G , pmodel ) is also convex in pmodel with global optimum
pmodel = pdata , as shown in Theorem 1.
Assumptions
Remarks on Assumptions:
I The theoretical results above make very strong assumptions:
I The generator G and discriminator D have enough capacity
I The discriminator D reaches its optimum D∗G at every outer iteration
I We directly optimize pmodel instead of its parameters w
I In practice, G and D have finite capacity, D is optimized for only k steps, and
using a neural network to define G introduces critical points in parameter space
I Thus, in practice, GANs often do not converge to pdata and might oscillate
I However, neural networks work well in practice, and balancing the updates of G
and D keeps D close to D∗G in order to backpropagate meaningful gradients to G
1D Example
[Plot sequence at iterations 0, 500, 1000, 1500, 2000, 2500: pdata(x), pmodel(x),
and p(z) over x, with D(x) and the optimal discriminator D∗G(x) overlaid; the value
function decreases as V(G, D∗G) = −0.939, −1.296, −1.363, −1.381, −1.386, −1.386,
approaching the optimum − log 4 ≈ −1.386 from Theorem 1]
Empirical Results
Visualization of Samples from the Model
[Sample grids: CIFAR-10 (ConvNet) and CIFAR-10 (MLP)]
I The rightmost column shows the nearest training example of the neighboring
generated sample, demonstrating that the model has not memorized the training set
Likelihood Estimates
I For GANs as for VAEs, the likelihood of test samples cannot be computed
I However, a rough performance estimate can be obtained by fitting a Gaussian
Parzen window to the samples generated with G and reporting the log-likelihood
27
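A possible implementation of such an estimate (a sketch; the function name is an assumption, and in the paper the bandwidth σ is chosen by cross-validation on a validation set):

```python
import numpy as np
from scipy.special import logsumexp

def parzen_log_likelihood(x_test, x_gen, sigma):
    """Mean log-likelihood of x_test [M, D] under a Gaussian Parzen window
    with bandwidth sigma centered on generated samples x_gen [N, D]."""
    N, D = x_gen.shape
    # log p(x) = logsumexp_n(-||x - g_n||^2 / (2 sigma^2))
    #            - log N - (D / 2) log(2 pi sigma^2)
    sq_dists = ((x_test[:, None, :] - x_gen[None, :, :]) ** 2).sum(-1)
    log_const = np.log(N) + 0.5 * D * np.log(2 * np.pi * sigma**2)
    return float(np.mean(logsumexp(-sq_dists / (2 * sigma**2), axis=1)
                         - log_const))
```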
Latent Space Interpolations
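The slide shows images decoded along straight lines between latent codes; a minimal sketch of how such interpolations are produced (G and the latent codes are illustrative assumptions):

```python
import torch

def latent_interpolation(G, z0, z1, steps=8):
    """Decode images along the straight line between latent codes z0 and z1."""
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * z0.dim()))
    zs = (1 - alphas) * z0 + alphas * z1   # [steps, ...latent shape]
    with torch.no_grad():
        return G(zs)                       # images morphing from G(z0) to G(z1)
```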
Mode Collapse
One major failure mode of GANs is mode collapse, where the generator learns to
produce high-quality samples with very low variability, covering only a fraction of pdata :
[Figure (aiden.nibali.org): a bimodal temperature distribution with modes at the
South Pole and Alice Springs; the generator collapses onto one mode, e.g., the
discriminator can't distinguish Antarctic temperatures but learns that all Australian
temperatures are real, so the generator jumps to that mode, and the cycle repeats]
Mode Collapse
Strategies for avoiding mode collapse:
I Encourage diversity: Minibatch discrimination allows the discriminator to
compare samples across a batch to determine whether the batch is real or fake
I Anticipate counterplay: Look into the future, e.g., via unrolling the discriminator,
and anticipate the opponent's response when updating the generator parameters
I Experience replay: Hopping back and forth between modes can be reduced by
showing old fake samples to the discriminator once in a while (see the sketch below)
I Multiple GANs: Train multiple GANs and hope that together they cover all modes
I Optimization objective: Wasserstein GANs, gradient penalties, . . .
https://aiden.nibali.org/blog/2017-01-18-mode-collapse-gans/
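A sketch of the experience replay idea mentioned above (the ring-buffer design, capacity, and mixing fraction are assumptions):

```python
import random
import torch

class FakeSampleBuffer:
    """Keeps old generator samples and mixes them into D's fake batches."""
    def __init__(self, capacity=1024):
        self.buffer, self.capacity = [], capacity

    def push(self, fakes):
        for f in fakes.detach().cpu():
            if len(self.buffer) < self.capacity:
                self.buffer.append(f)
            else:  # overwrite a random old entry once the buffer is full
                self.buffer[random.randrange(self.capacity)] = f

    def mix(self, fakes, frac=0.5):
        """Replace a fraction of the current fake batch with stored samples."""
        k = min(int(frac * fakes.size(0)), len(self.buffer))
        if k == 0:
            return fakes
        old = torch.stack(random.sample(self.buffer, k)).to(fakes.device)
        out = fakes.clone()
        out[:k] = old
        return out
```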
Unrolled Generative Adversarial Networks
I Top: Unrolled GAN with 10 unrolling steps. Note that unrolled GANs require
backpropagating the generator gradient through the unrolled optimization.
I Bottom: Vanilla GAN. The generator cycles through the modes of the data
distribution, never converges to a fixed distribution, and only ever assigns
significant probability mass to a single data mode at once.
Metz, Poole, Pfau and Sohl-Dickstein: Unrolled Generative Adversarial Networks. ICLR, 2017. 31
Discussion
Advantages:
I A wide variety of functions and distributions can be modeled (flexibility)
I Only backpropagation is required for training the model (no Markov chain sampling)
I No approximation to the likelihood is required (unlike the variational bound in VAEs)
I Samples are often more realistic than those of VAEs (though VAEs are improving as well)
Disadvantages:
I No explicit representation of pmodel
I The likelihood of a sample cannot be evaluated
I The discriminator and generator must be balanced well during training
to ensure convergence to pdata and to avoid mode collapse
I Many tricks required: https://towardsdatascience.com/10-lessons
12.2
GAN Developments
DCGAN
Radford, Metz and Chintala: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. ICLR, 2016. 35
DCGAN Samples
I Bedroom samples from a DCGAN look much better than those of a vanilla GAN
Radford, Metz and Chintala: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. ICLR, 2016. 36
DCGAN Interpolations
I Interpolations between two random latent codes morph TVs into windows etc.
Radford, Metz and Chintala: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. ICLR, 2016. 37
DCGAN Arithmetics
Radford, Metz and Chintala: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. ICLR, 2016. 38
GAN Explosion
I https://github.com/hindupuravinash/the-gan-zoo
I https://github.com/soumith/ganhacks
Fréchet Inception Distance
I Evaluating the performance of generative models without exact likelihood is hard
I The Fréchet inception distance (FID) is a metric used to assess the quality of
images created by the generator of a generative adversarial network (GAN)
I The FID compares the distribution of generated images with the distribution of
real images based on deep features of a pre-trained Inception v3 network
I The FID metric is the Fréchet distance between two multivariate Gaussian
distributions: N (µm , Σm ), fit to the Inception v3 features of the images
generated by the GAN, and N (µd , Σd ), fit to the Inception v3 features
computed over the training set:

FID = ‖µm − µd ‖² + tr(Σm + Σd − 2 (Σm Σd )^(1/2))
Heusel, Ramsauer, Unterthiner, Nessler and Hochreiter: GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. NIPS, 2017. 40
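A possible implementation, assuming the Inception v3 features have already been extracted into [N, D] arrays (function and argument names are illustrative):

```python
import numpy as np
from scipy import linalg

def fid(feats_gen, feats_real):
    """Frechet distance between Gaussians fit to two feature sets [N, D]."""
    mu_m, mu_d = feats_gen.mean(axis=0), feats_real.mean(axis=0)
    sigma_m = np.cov(feats_gen, rowvar=False)
    sigma_d = np.cov(feats_real, rowvar=False)
    covmean = linalg.sqrtm(sigma_m @ sigma_d)   # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real                  # discard numerical noise
    return float(((mu_m - mu_d) ** 2).sum()
                 + np.trace(sigma_m + sigma_d - 2.0 * covmean))
```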
Fréchet Inception Distance
I The FID measures image fidelity but cannot measure or prevent mode collapse
Heusel, Ramsauer, Unterthiner, Nessler and Hochreiter: GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. NIPS, 2017. 41
Wasserstein GAN
I Low-dimensional manifolds in high-dimensional space often have little overlap
I The discriminator of a vanilla GAN saturates if the supports do not overlap
I WGAN uses the Earth Mover's distance, which can handle such scenarios
Arjovsky, Chintala and Bottou: Wasserstein GAN. Arxiv, 2017. 42
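For comparison with the vanilla losses above, a sketch of the WGAN critic and generator objectives, with the weight clipping of the original paper (the clip value 0.01 is the paper's default; names are assumptions):

```python
import torch

def critic_loss(D, x_real, x_fake):
    # maximize E[D(x_real)] - E[D(x_fake)]  ->  minimize the negation
    return -(D(x_real).mean() - D(x_fake).mean())

def generator_loss(D, x_fake):
    return -D(x_fake).mean()

def clip_critic_weights(D, c=0.01):
    # enforce the Lipschitz constraint via weight clipping (WGAN, 2017)
    with torch.no_grad():
        for p in D.parameters():
            p.clamp_(-c, c)
```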
Gradient Penalties and Convergence
Mescheder, Geiger and Nowozin: Which Training Methods for GANs do actually Converge? ICML, 2018. 43
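The R1 regularizer analyzed in this paper penalizes the discriminator's gradient on real data; a minimal sketch (gamma is a hyperparameter, and D is assumed to return a scalar per sample):

```python
import torch

def r1_penalty(D, x_real, gamma=10.0):
    """R1 regularizer: (gamma / 2) * E[ ||grad_x D(x)||^2 ] on real data."""
    x_real = x_real.detach().requires_grad_(True)
    (grad,) = torch.autograd.grad(D(x_real).sum(), x_real, create_graph=True)
    return 0.5 * gamma * grad.pow(2).flatten(start_dim=1).sum(dim=1).mean()
```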
Gradient Penalties and Convergence
Mescheder, Geiger and Nowozin: Which Training Methods for GANs do actually Converge? ICML, 2018. 44
CycleGAN
Zhu, Park, Isola and Efros: Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. ICCV, 2017. 45
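The core addition of CycleGAN to the adversarial losses is a cycle-consistency term; a sketch using the paper's L1 formulation, with generators G: X→Y and F: Y→X (lam = 10 is the paper's weighting):

```python
import torch

def cycle_consistency_loss(G, F, x, y, lam=10.0):
    """L1 cycle loss: F(G(x)) should recover x and G(F(y)) should recover y."""
    return lam * ((F(G(x)) - x).abs().mean() + (G(F(y)) - y).abs().mean())
```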
CycleGAN
Zhu, Park, Isola and Efros: Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. ICCV, 2017. 46
Progressive Growing of GANs
I Grow the generator and discriminator resolution by adding layers during training
Karras, Aila, Laine and Lehtinen: Progressive Growing of GANs for Improved Quality, Stability, and Variation. ICLR, 2018. 47
BigGAN
Brock, Donahue and Simonyan: Large Scale GAN Training for High Fidelity Natural Image Synthesis. ICLR, 2019. 48
BigGAN
Brock, Donahue and Simonyan: Large Scale GAN Training for High Fidelity Natural Image Synthesis. ICLR, 2019. 49
StyleGAN / StyleGAN2
https://youtu.be/c-NJtV9Jvp0
Karras, Laine, Aittala, Hellsten, Lehtinen and Aila: Analyzing and Improving the Image Quality of StyleGAN. CVPR, 2020. 50
StyleGAN / StyleGAN2
http://thispersondoesnotexist.com/
Karras, Laine, Aittala, Hellsten, Lehtinen and Aila: Analyzing and Improving the Image Quality of StyleGAN. CVPR, 2020. 52
12.3
Research at AVG
Intelligent systems interact with a 3D world
3D Reconstruction
3D Representations
Key Idea:
I Do not represent the 3D shape explicitly
I Instead, represent the surface implicitly
as the decision boundary of a non-linear classifier:
[Diagram: (3D location, condition, e.g., an image) → occupancy probability]
Mescheder, Oechsle, Niemeyer, Nowozin and Geiger: Occupancy Networks: Learning 3D Reconstruction in Function Space. CVPR, 2019. 58
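A simplified sketch of this idea (the actual paper uses conditional batch normalization and ResNet blocks; this plain MLP and its dimensions are assumptions):

```python
import torch
import torch.nn as nn

class OccupancyNetwork(nn.Module):
    """Classify 3D points as inside/outside, conditioned on a code c."""
    def __init__(self, c_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + c_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, p, c):
        # p: [B, N, 3] query locations, c: [B, c_dim] condition (e.g. image code)
        c = c[:, None, :].expand(-1, p.size(1), -1)
        logits = self.net(torch.cat([p, c], dim=-1)).squeeze(-1)
        return torch.sigmoid(logits)   # occupancy probability in [0, 1]
```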
Occupancy Networks
Mescheder, Oechsle, Niemeyer, Nowozin and Geiger: Occupancy Networks: Learning 3D Reconstruction in Function Space. CVPR, 2019. 59
Occupancy Flow
Niemeyer, Mescheder, Oechsle and Geiger: Occupancy Flow: 4D Reconstruction by Learning Particle Dynamics. ICCV, 2019. 60
Conditional Surface Light Fields
[Video: manipulating the illumination of the reconstructed 3D geometry]
Oechsle, Niemeyer, Mescheder, Strauss and Geiger: Learning Implicit Surface Light Fields. 3DV, 2020. 61
Convolutional Occupancy Networks
Room-Level Reconstruction
Peng, Niemeyer, Mescheder, Pollefeys and Geiger: Convolutional Occupancy Networks. ECCV, 2020. 62
Generative Radiance Fields
[Diagram: ray sampling and 3D point sampling feed a conditional radiance field;
volume rendering along the sampled rays yields a predicted patch, which a patch
discriminator compares against real patches]
Schwarz, Liao, Niemeyer, Geiger: GRAF: Generative Radiance Fields for 3D-Aware Image Synthesis. NeurIPS, 2020 63
Neural Parts
Paschalidou, Katharopoulos, Geiger and Fidler: Neural Parts: Learning Expressive 3D Shape Abstractions with Invertible Neural Networks, 2021. 64
Counterfactual Generative Networks
[Diagram: CGN architecture built from BigGAN and U2-Net components together with a cGAN]
Sauer and Geiger: Counterfactual Generative Networks. ICLR, 2021.
Label Efficient Visual Abstractions
Behl, Chitta, Prakash, Ohn-Bar and Geiger: Label Efficient Visual Abstractions for Autonomous Driving. IROS, 2020. 66
Neural Attention Fields for End-to-End Autonomous Driving
Chitta, Prakash and Geiger: Neural Attention Fields for End-to-End Autonomous Driving. 2021. 67
Visit our Research Blog
http://autonomousvision.github.io
Thank you for listening!
Good luck for the exam!