0% found this document useful (0 votes)

105 views17 pages

Image-to-Image Translation With Conditional Adversarial Networks

This paper proposes using conditional adversarial networks (cGANs) as a general solution for image-to-image translation problems. cGANs learn a mapping from input images to output images as well as a loss function to train the mapping. The same approach can be applied to many tasks that traditionally require different loss formulations, such as synthesizing photos from labels, reconstructing objects from edges, and colorizing images. The paper demonstrates this approach is effective on a variety of image-to-image translation problems by training cGANs on different datasets while keeping the same architecture and objective.

Uploaded by

Velez

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

105 views17 pages

Image-to-Image Translation With Conditional Adversarial Networks

Uploaded by

Velez

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 17

Image-to-Image Translation with Conditional Adversarial Networks

Phillip Isola Jun-Yan Zhu Tinghui Zhou Alexei A. Efros

Berkeley AI Research (BAIR) Laboratory, UC Berkeley

{isola,junyanz,tinghuiz,efros}@eecs.berkeley.edu
Labels to Street Scene Labels to Facade BW to Color
arXiv:1611.07004v3 [cs.CV] 26 Nov 2018

input output
Aerial to Map

input output input output

Day to Night Edges to Photo

input output input output input output

Figure 1: Many problems in image processing, graphics, and vision involve translating an input image into a corresponding output image.
These problems are often treated with application-specific algorithms, even though the setting is always the same: map pixels to pixels.
Conditional adversarial nets are a general-purpose solution that appears to work well on a wide variety of these problems. Here we show
results of the method on several. In each case we use the same architecture and objective, and simply train on different data.

Abstract 1. Introduction
Many problems in image processing, computer graphics,
and computer vision can be posed as “translating” an input
We investigate conditional adversarial networks as a image into a corresponding output image. Just as a concept
general-purpose solution to image-to-image translation may be expressed in either English or French, a scene may
problems. These networks not only learn the mapping from be rendered as an RGB image, a gradient field, an edge map,
input image to output image, but also learn a loss func- a semantic label map, etc. In analogy to automatic language
tion to train this mapping. This makes it possible to apply translation, we define automatic image-to-image translation
the same generic approach to problems that traditionally as the task of translating one possible representation of a
would require very different loss formulations. We demon- scene into another, given sufficient training data (see Figure
strate that this approach is effective at synthesizing photos 1). Traditionally, each of these tasks has been tackled with
from label maps, reconstructing objects from edge maps, separate, special-purpose machinery (e.g., [16, 25, 20, 9,
and colorizing images, among other tasks. Indeed, since the 11, 53, 33, 39, 18, 58, 62]), despite the fact that the setting
release of the pix2pix software associated with this pa- is always the same: predict pixels from pixels. Our goal in
per, a large number of internet users (many of them artists) this paper is to develop a common framework for all these
have posted their own experiments with our system, further problems.
demonstrating its wide applicability and ease of adoption The community has already taken significant steps in this
without the need for parameter tweaking. As a commu- direction, with convolutional neural nets (CNNs) becoming
nity, we no longer hand-engineer our mapping functions, the common workhorse behind a wide variety of image pre-
and this work suggests we can achieve reasonable results diction problems. CNNs learn to minimize a loss function –
without hand-engineering our loss functions either. an objective that scores the quality of results – and although
the learning process is automatic, a lot of manual effort still

1
goes into designing effective losses. In other words, we still x G(x) y
G
have to tell the CNN what we wish it to minimize. But, just
D D
like King Midas, we must be careful what we wish for! If
fake real
we take a naive approach and ask the CNN to minimize the
Euclidean distance between predicted and ground truth pix-
els, it will tend to produce blurry results [43, 62]. This is x x
because Euclidean distance is minimized by averaging all
Figure 2: Training a conditional GAN to map edges→photo. The
plausible outputs, which causes blurring. Coming up with
discriminator, D, learns to classify between fake (synthesized by
loss functions that force the CNN to do what we really want the generator) and real {edge, photo} tuples. The generator, G,
– e.g., output sharp, realistic images – is an open problem learns to fool the discriminator. Unlike an unconditional GAN,
and generally requires expert knowledge. both the generator and discriminator observe the input edge map.
It would be highly desirable if we could instead specify
only a high-level goal, like “make the output indistinguish- large body of literature has considered losses of this kind,
able from reality”, and then automatically learn a loss func- with methods including conditional random fields [10], the
tion appropriate for satisfying this goal. Fortunately, this is SSIM metric [56], feature matching [15], nonparametric
exactly what is done by the recently proposed Generative losses [37], the convolutional pseudo-prior [57], and losses
Adversarial Networks (GANs) [24, 13, 44, 52, 63]. GANs based on matching covariance statistics [30]. The condi-
learn a loss that tries to classify if the output image is real tional GAN is different in that the loss is learned, and can, in
or fake, while simultaneously training a generative model theory, penalize any possible structure that differs between
to minimize this loss. Blurry images will not be tolerated output and target.
since they look obviously fake. Because GANs learn a loss Conditional GANs We are not the first to apply GANs
that adapts to the data, they can be applied to a multitude of in the conditional setting. Prior and concurrent works have
tasks that traditionally would require very different kinds of conditioned GANs on discrete labels [41, 23, 13], text [46],
loss functions. and, indeed, images. The image-conditional models have
In this paper, we explore GANs in the conditional set- tackled image prediction from a normal map [55], future
ting. Just as GANs learn a generative model of data, condi- frame prediction [40], product photo generation [59], and
tional GANs (cGANs) learn a conditional generative model image generation from sparse annotations [31, 48] (c.f. [47]
[24]. This makes cGANs suitable for image-to-image trans- for an autoregressive approach to the same problem). Sev-
lation tasks, where we condition on an input image and general other papers have also used GANs for image-to-image
erate a corresponding output image. mappings, but only applied the GAN unconditionally, re-
GANs have been vigorously studied in the last two lying on other terms (such as L2 regression) to force the
years and many of the techniques we explore in this pa- output to be conditioned on the input. These papers have
per have been previously proposed. Nonetheless, ear- achieved impressive results on inpainting [43], future state
lier papers have focused on specific applications, and prediction [64], image manipulation guided by user con-
it has remained unclear how effective image-conditional straints [65], style transfer [38], and superresolution [36].
GANs can be as a general-purpose solution for image-to- Each of the methods was tailored for a specific applica-
image translation. Our primary contribution is to demon- tion. Our framework differs in that nothing is application-
strate that on a wide variety of problems, conditional specific. This makes our setup considerably simpler than
GANs produce reasonable results. Our second contri- most others.
bution is to present a simple framework sufficient to Our method also differs from the prior works in several
achieve good results, and to analyze the effects of sev- architectural choices for the generator and discriminator.
eral important architectural choices. Code is available at Unlike past work, for our generator we use a “U-Net”-based
https://github.com/phillipi/pix2pix. architecture [50], and for our discriminator we use a convo-
lutional “PatchGAN” classifier, which only penalizes struc-
2. Related work ture at the scale of image patches. A similar PatchGAN ar-
chitecture was previously proposed in [38] to capture local
Structured losses for image modeling Image-to-image style statistics. Here we show that this approach is effective
translation problems are often formulated as per-pixel clas- on a wider range of problems, and we investigate the effect
sification or regression (e.g., [39, 58, 28, 35, 62]). These of changing the patch size.
formulations treat the output space as “unstructured” in the
sense that each output pixel is considered conditionally in- 3. Method
dependent from all others given the input image. Condi-
tional GANs instead learn a structured loss. Structured GANs are generative models that learn a mapping from
losses penalize the joint configuration of the output. A random noise vector z to output image y, G : z → y [24]. In
U-Net
contrast, conditional GANs learn a mapping from observed Encoder-decoder

image x and random noise vector z, to y, G : {x, z} → y.

The generator G is trained to produce outputs that cannot be
distinguished from “real” images by an adversarially trained x y x y
discriminator, D, which is trained to do as well as possible
at detecting the generator’s “fakes”. This training procedure
is diagrammed in Figure 2.
Figure 3: Two choices for the architecture of the generator. The
3.1. Objective “U-Net” [50] is an encoder-decoder with skip connections be-
The objective of a conditional GAN can be expressed as tween mirrored layers in the encoder and decoder stacks.

LcGAN (G, D) =Ex,y [log D(x, y)]+

3.2. Network architectures
Ex,z [log(1 − D(x, G(x, z))], (1)
We adapt our generator and discriminator architectures
where G tries to minimize this objective against an ad- from those in [44]. Both generator and discriminator use
versarial D that tries to maximize it, i.e. G∗ = modules of the form convolution-BatchNorm-ReLu [29].
arg minG maxD LcGAN (G, D). Details of the architecture are provided in the supplemen-
To test the importance of conditioning the discriminator, tal materials online, with key features discussed below.
we also compare to an unconditional variant in which the
discriminator does not observe x: 3.2.1 Generator with skips
LGAN (G, D) =Ey [log D(y)]+ A defining feature of image-to-image translation problems
Ex,z [log(1 − D(G(x, z))]. (2) is that they map a high resolution input grid to a high resolu-
tion output grid. In addition, for the problems we consider,
Previous approaches have found it beneficial to mix the the input and output differ in surface appearance, but both
GAN objective with a more traditional loss, such as L2 dis- are renderings of the same underlying structure. Therefore,
tance [43]. The discriminator’s job remains unchanged, but structure in the input is roughly aligned with structure in the
the generator is tasked to not only fool the discriminator but output. We design the generator architecture around these
also to be near the ground truth output in an L2 sense. We considerations.
also explore this option, using L1 distance rather than L2 as Many previous solutions [43, 55, 30, 64, 59] to problems
L1 encourages less blurring: in this area have used an encoder-decoder network [26]. In
such a network, the input is passed through a series of lay-
LL1 (G) = Ex,y,z [ky − G(x, z)k1 ]. (3) ers that progressively downsample, until a bottleneck layer,
at which point the process is reversed. Such a network re-
Our final objective is quires that all information flow pass through all the layers,
including the bottleneck. For many image translation prob-
G∗ = arg min max LcGAN (G, D) + λLL1 (G). (4) lems, there is a great deal of low-level information shared
G D
between the input and output, and it would be desirable to
Without z, the net could still learn a mapping from x shuttle this information directly across the net. For exam-
to y, but would produce deterministic outputs, and there- ple, in the case of image colorization, the input and output
fore fail to match any distribution other than a delta func- share the location of prominent edges.
tion. Past conditional GANs have acknowledged this and To give the generator a means to circumvent the bottle-
provided Gaussian noise z as an input to the generator, in neck for information like this, we add skip connections, fol-
addition to x (e.g., [55]). In initial experiments, we did not lowing the general shape of a “U-Net” [50]. Specifically, we
find this strategy effective – the generator simply learned add skip connections between each layer i and layer n − i,
to ignore the noise – which is consistent with Mathieu et where n is the total number of layers. Each skip connec-
al. [40]. Instead, for our final models, we provide noise tion simply concatenates all channels at layer i with those
only in the form of dropout, applied on several layers of our at layer n − i.
generator at both training and test time. Despite the dropout
noise, we observe only minor stochasticity in the output of
3.2.2 Markovian discriminator (PatchGAN)
our nets. Designing conditional GANs that produce highly
stochastic output, and thereby capture the full entropy of the It is well known that the L2 loss – and L1, see Fig-
conditional distributions they model, is an important ques- ure 4 – produces blurry results on image generation prob-
tion left open by the present work. lems [34]. Although these losses fail to encourage high-
frequency crispness, in many cases they nonetheless accu- 4. Experiments
rately capture the low frequencies. For problems where this
is the case, we do not need an entirely new framework to To explore the generality of conditional GANs, we test
enforce correctness at the low frequencies. L1 will already the method on a variety of tasks and datasets, including both
do. graphics tasks, like photo generation, and vision tasks, like
semantic segmentation:
This motivates restricting the GAN discriminator to only
model high-frequency structure, relying on an L1 term to • Semantic labels↔photo, trained on the Cityscapes
force low-frequency correctness (Eqn. 4). In order to model dataset [12].
high-frequencies, it is sufficient to restrict our attention to • Architectural labels→photo, trained on CMP Facades
the structure in local image patches. Therefore, we design [45].
a discriminator architecture – which we term a PatchGAN • Map↔aerial photo, trained on data scraped from
– that only penalizes structure at the scale of patches. This Google Maps.
discriminator tries to classify if each N × N patch in an im- • BW→color photos, trained on [51].
age is real or fake. We run this discriminator convolution- • Edges→photo, trained on data from [65] and [60]; bi-
ally across the image, averaging all responses to provide the nary edges generated using the HED edge detector [58]
ultimate output of D. plus postprocessing.
In Section 4.4, we demonstrate that N can be much • Sketch→photo: tests edges→photo models on human-
smaller than the full size of the image and still produce drawn sketches from [19].
high quality results. This is advantageous because a smaller • Day→night, trained on [33].
PatchGAN has fewer parameters, runs faster, and can be • Thermal→color photos, trained on data from [27].
applied to arbitrarily large images. • Photo with missing pixels→inpainted photo, trained
on Paris StreetView from [14].
Such a discriminator effectively models the image as a
Markov random field, assuming independence between pix- Details of training on each of these datasets are provided
els separated by more than a patch diameter. This connec- in the supplemental materials online. In all cases, the in-
tion was previously explored in [38], and is also the com- put and output are simply 1-3 channel images. Qualita-
mon assumption in models of texture [17, 21] and style tive results are shown in Figures 8, 9, 11, 10, 13, 14, 15,
[16, 25, 22, 37]. Therefore, our PatchGAN can be under- 16, 17, 18, 19, 20. Several failure cases are highlighted
stood as a form of texture/style loss. in Figure 21. More comprehensive results are available at
https://phillipi.github.io/pix2pix/.
Data requirements and speed We note that decent re-
3.3. Optimization and inference sults can often be obtained even on small datasets. Our fa-
cade training set consists of just 400 images (see results in
To optimize our networks, we follow the standard ap- Figure 14), and the day to night training set consists of only
proach from [24]: we alternate between one gradient de- 91 unique webcams (see results in Figure 15). On datasets
scent step on D, then one step on G. As suggested in of this size, training can be very fast: for example, the re-
the original GAN paper, rather than training G to mini- sults shown in Figure 14 took less than two hours of training
mize log(1 − D(x, G(x, z)), we instead train to maximize on a single Pascal Titan X GPU. At test time, all models run
log D(x, G(x, z)) [24]. In addition, we divide the objec- in well under a second on this GPU.
tive by 2 while optimizing D, which slows down the rate at
which D learns relative to G. We use minibatch SGD and 4.1. Evaluation metrics
apply the Adam solver [32], with a learning rate of 0.0002,
Evaluating the quality of synthesized images is an open
and momentum parameters β1 = 0.5, β2 = 0.999.
and difficult problem [52]. Traditional metrics such as per-
At inference time, we run the generator net in exactly pixel mean-squared error do not assess joint statistics of the
the same manner as during the training phase. This differs result, and therefore do not measure the very structure that
from the usual protocol in that we apply dropout at test time, structured losses aim to capture.
and we apply batch normalization [29] using the statistics of To more holistically evaluate the visual quality of our re-
the test batch, rather than aggregated statistics of the train- sults, we employ two tactics. First, we run “real vs. fake”
ing batch. This approach to batch normalization, when the perceptual studies on Amazon Mechanical Turk (AMT).
batch size is set to 1, has been termed “instance normal- For graphics problems like colorization and photo gener-
ization” and has been demonstrated to be effective at im- ation, plausibility to a human observer is often the ultimate
age generation tasks [54]. In our experiments, we use batch goal. Therefore, we test our map generation, aerial photo
sizes between 1 and 10 depending on the experiment. generation, and image colorization using this approach.
Input Ground truth L1 cGAN L1 + cGAN

Figure 4: Different losses induce different quality of results. Each column shows results trained under a different loss. Please see
https://phillipi.github.io/pix2pix/ for additional examples.

Second, we measure whether or not our synthesized lution images, but exploited fully-convolutional translation
cityscapes are realistic enough that off-the-shelf recognition (described above) to test on 512 × 512 images, which were
system can recognize the objects in them. This metric is then downsampled and presented to Turkers at 256 × 256
similar to the “inception score” from [52], the object detec- resolution. For colorization, we trained and tested on
tion evaluation in [55], and the “semantic interpretability” 256 × 256 resolution images and presented the results to
measures in [62] and [42]. Turkers at this same resolution.
“FCN-score” While quantitative evaluation of genera-
AMT perceptual studies For our AMT experiments, we tive models is known to be challenging, recent works [52,
followed the protocol from [62]: Turkers were presented 55, 62, 42] have tried using pre-trained semantic classifiers
with a series of trials that pitted a “real” image against a to measure the discriminability of the generated stimuli as a
“fake” image generated by our algorithm. On each trial, pseudo-metric. The intuition is that if the generated images
each image appeared for 1 second, after which the images are realistic, classifiers trained on real images will be able
disappeared and Turkers were given unlimited time to re- to classify the synthesized image correctly as well. To this
spond as to which was fake. The first 10 images of each end, we adopt the popular FCN-8s [39] architecture for se-
session were practice and Turkers were given feedback. No mantic segmentation, and train it on the cityscapes dataset.
feedback was provided on the 40 trials of the main experi- We then score synthesized photos by the classification accu-
ment. Each session tested just one algorithm at a time, and racy against the labels these photos were synthesized from.
Turkers were not allowed to complete more than one ses-
sion. ∼ 50 Turkers evaluated each algorithm. Unlike [62], 4.2. Analysis of the objective function
we did not include vigilance trials. For our colorization ex-
periments, the real and fake images were generated from the Which components of the objective in Eqn. 4 are impor-
same grayscale input. For map↔aerial photo, the real and tant? We run ablation studies to isolate the effect of the L1
fake images were not generated from the same input, in or- term, the GAN term, and to compare using a discriminator
der to make the task more difficult and avoid floor-level re- conditioned on the input (cGAN, Eqn. 1) against using an
sults. For map↔aerial photo, we trained on 256 × 256 reso- unconditional discriminator (GAN, Eqn. 2).
L1 L1+cGAN
that the output look realistic. This variant results in poor
Encoder-decoder

performance; examining the results reveals that the gener-

ator collapsed into producing nearly the exact same output
regardless of input photograph. Clearly, it is important, in
this case, that the loss measure the quality of the match be-
tween input and output, and indeed cGAN performs much
better than GAN. Note, however, that adding an L1 term
U-Net

also encourages that the output respect the input, since the
L1 loss penalizes the distance between ground truth out-
puts, which correctly match the input, and synthesized out-
puts, which may not. Correspondingly, L1+GAN is also
Figure 5: Adding skip connections to an encoder-decoder to create effective at creating realistic renderings that respect the in-
a “U-Net” results in much higher quality results. put label maps. Combining all terms, L1+cGAN, performs
Loss Per-pixel acc. Per-class acc. Class IOU similarly well.
L1 0.42 0.15 0.11 Colorfulness A striking effect of conditional GANs is
GAN 0.22 0.05 0.01
cGAN 0.57 0.22 0.16
that they produce sharp images, hallucinating spatial struc-
L1+GAN 0.64 0.20 0.15 ture even where it does not exist in the input label map. One
L1+cGAN 0.66 0.23 0.17 might imagine cGANs have a similar effect on “sharpening”
Ground truth 0.80 0.26 0.21
in the spectral dimension – i.e. making images more color-
Table 1: FCN-scores for different losses, evaluated on Cityscapes ful. Just as L1 will incentivize a blur when it is uncertain
labels↔photos.
where exactly to locate an edge, it will also incentivize an
average, grayish color when it is uncertain which of sev-
Loss Per-pixel acc. Per-class acc. Class IOU eral plausible color values a pixel should take on. Specially,
Encoder-decoder (L1) 0.35 0.12 0.08
Encoder-decoder (L1+cGAN) 0.29 0.09 0.05 L1 will be minimized by choosing the median of the condi-
U-net (L1) 0.48 0.18 0.13 tional probability density function over possible colors. An
U-net (L1+cGAN) 0.55 0.20 0.14 adversarial loss, on the other hand, can in principle become
Table 2: FCN-scores for different generator architectures (and ob- aware that grayish outputs are unrealistic, and encourage
jectives), evaluated on Cityscapes labels↔photos. (U-net (L1- matching the true color distribution [24]. In Figure 7, we
cGAN) scores differ from those reported in other tables since batch investigate whether our cGANs actually achieve this effect
size was 10 for this experiment and 1 for other tables, and random on the Cityscapes dataset. The plots show the marginal dis-
variation between training runs.)
tributions over output color values in Lab color space. The
Discriminator ground truth distributions are shown with a dotted line. It
receptive field Per-pixel acc. Per-class acc. Class IOU is apparent that L1 leads to a narrower distribution than the
1×1 0.39 0.15 0.10
16×16 0.65 0.21 0.17 ground truth, confirming the hypothesis that L1 encourages
70×70 0.66 0.23 0.17 average, grayish colors. Using a cGAN, on the other hand,
286×286 0.42 0.16 0.11 pushes the output distribution closer to the ground truth.
Table 3: FCN-scores for different receptive field sizes of the dis-
criminator, evaluated on Cityscapes labels→photos. Note that in- 4.3. Analysis of the generator architecture
put images are 256 × 256 pixels and larger receptive fields are
padded with zeros. A U-Net architecture allows low-level information to
shortcut across the network. Does this lead to better results?
Figure 4 shows the qualitative effects of these variations Figure 5 and Table 2 compare the U-Net against an encoder-
on two labels→photo problems. L1 alone leads to reason- decoder on cityscape generation. The encoder-decoder is
able but blurry results. The cGAN alone (setting λ = 0 in created simply by severing the skip connections in the U-
Eqn. 4) gives much sharper results but introduces visual ar- Net. The encoder-decoder is unable to learn to generate
tifacts on certain applications. Adding both terms together realistic images in our experiments. The advantages of the
(with λ = 100) reduces these artifacts. U-Net appear not to be specific to conditional GANs: when
We quantify these observations using the FCN-score on both U-Net and encoder-decoder are trained with an L1 loss,
the cityscapes labels→photo task (Table 1): the GAN-based the U-Net again achieves the superior results.
objectives achieve higher scores, indicating that the synthe-
sized images include more recognizable structure. We also 4.4. From PixelGANs to PatchGANs to ImageGANs
test the effect of removing conditioning from the discrimi-
nator (labeled as GAN). In this case, the loss does not pe- We test the effect of varying the patch size N of our dis-
nalize mismatch between the input and output; it only cares criminator receptive fields, from a 1 × 1 “PixelGAN” to a
L1 1×1 16×16 70×70 286×286

L LL L LL
−1 −1

CVPRFigure 6: Patch size variations. Uncertainty in the output manifests itself differently for different loss functions. Uncertain regions become
−1 −1
CVPR CVPR
−1−1
CVPR −3 −3 −3
CVPR
CVPR
#385 blurry and desaturated under L1. The 1x1 PixelGAN encourages greater color diversity but has no effect on spatial statistics. The 16x16 #385
−3−3 −3
#385
#385 #385
#385
−5 −5 −5
PatchGAN−5creates
−5
locally sharp CVPR
CVPR
CVPR results,
2016
2016 but also leads
Submission
2016Submission
Submission to CONFIDENTIAL
#385.
#385. tiling
#385. artifacts
CONFIDENTIAL
−5
CONFIDENTIAL beyond
REVIEW
REVIEW
REVIEWtheCOPY.
scale
COPY.
COPY. it DO
DO can
DO observe.
NOT
NOT The 70×70 PatchGAN forces
DISTRIBUTE.
DISTRIBUTE.
NOT DISTRIBUTE.
−7 −7
outputs that−7−7are sharp, even ifL1L1cGANincorrect, in both the spatial and spectral (colorfulness)L1cGAN dimensions. The full 286×286 ImageGAN produces
−7 −7
L1
L1 L1
cGAN cGAN
cGAN cGAN
results that−9are visually similar to the 70×70 PatchGAN, but somewhat −9 −9
lower qualityL1+cGAN according to our FCN-score metric (Table 3). Please
−9 L1+cGAN −9 L1+cGAN
−9 L1+cGAN
L1+cGAN L1+cGAN
L1+pixelcGAN
L1+pixelcGAN L1+pixelcGAN
L1+pixelcGAN
L1+pixelcGAN L1+pixelcGAN
648
648 see https://phillipi.github.io/pix2pix/
648 −11
−11
−11
0
0 0 20
2020 40
40 40 60
60 60
Ground
Ground
80
80 80
truth
truth
Ground truth
100
100100
for additional
−11−11
0
−11
0 examples.
20 20 40 40 60 60
Ground
Ground
80
truthtruth
Ground
80
truth
100 100 702702
702
0 20 40 60 80 100
649
649
649 L LL b bb a aa b bb a aa
703703
703
650
650
650 −1
−1−1
−1
−1−1 −1−1
−1 −1 −1
−1 −1 −1
−1
704704
704
651
651
−3
−3 −3−3 −3 −3 −3 −3 Histogram
Histogram
Histogram intersection
intersection
intersection 705705
651 705
−3 −3 −3
Histogram intersection
−3−3 −3 −3
(L)
(L)

(a)

against ground truth

log P (a)

against ground truth

(b)
log P (b)
P(L)

logPP(a)

against ground truth

logPP(b)
652
652
652
−5
−5
−5
−5
−5
−5
−5−5
−5
−5 −5
−5
−5 −5
−5
Loss
Loss
against
L LL a aa
ground truth
b bb 706706
706
logPP

Loss
Loss L a b
log

log
log

653
653 707707
−7 −7 −7 −7
log

−7
−7 −7−7
653
−7
−7
−7 −7
L1L1
−7 −7 −7
L1L1L1 0.81 0.81 0.69
0.81 0.690.69
0.69 0.70
0.70 0.70 707
L1
−9 cGAN
L1 0.81 0.70
cGAN 0.87
cGAN 0.87
0.870.74 0.74 0.840.84
0.74
cGAN
cGAN −9 −9 −9 −9
654
654 708708
−9
−9 −9−9
654
−9
−9
−9 −9 cGAN
L1+cGAN
L1+cGAN
L1+cGAN
−9 −9
cGAN 0.87 0.74 0.84
0.84 708
−11 −11
L1+pixelcGAN
L1+pixelcGAN
L1+pixelcGAN −11 −11−11 −11 −11 L1+cGAN
L1+cGAN
L1+cGAN 0.86 0.86
0.86 0.840.84
0.84 0.820.82
0.82
655 709709
−11 −11
655 −11 Ground
Groundtruth
truth
655 −11
00
−11
7070
20 9090
40 110
110
6060
Ground truth
130
130
8080130 100 150
150
−11
7070 90 90 110110 130130
−11
70 70 90 90 110110 130 130 150 150
−11
70 70 L1+cGAN
70 PixelGAN
PixelGAN
90 90 110 110 0.86
0.83 130 130 0.84
110 0.83 0.68 0.68 0.78 0.82
0.78 709
PixelGAN 0.83 0.68
20 40 100 150
0 7020 90
40
L
110
60 80 100 70 90
aa
110 130 70 90
b bb
110 130 150 90 130
0.78
656
656 LL bb
a(b)a
a a PixelGAN 0.83 0.68 0.78 710710
656 (a)
(a)b
(a) (b)
(b) (c)(c)
(c) (d)(d) 710
657 Figure
Figure
−1
5: (a) distribution
5: Color
Color distributionmatching
matching
−1−1
(b)ofofthe
property
property thecGAN,
cGAN,tested (c)
testedononCityscapes.
Cityscapes.(c.f.
(c.f.Figure 1 of
Figure thethe
1 of original
original(d)GAN
GAN paper [14]).
paper Note
[14]). Note 711711
657
657
−1
−1 −1 (d)
Figure 5: Color distribution matching property of the cGAN, tested on Cityscapes. (c.f. Figure 1 of the original GAN paper [14]). Note 711
658
658 that
that−3 the histogram intersection
−3
that−3 the histogram intersection
the histogram intersection
−3−3
scores
scores
scores arearedominated
are dominated
dominated byby
bydifferences
differences
differences inin the
in thehigh
the high
highprobability
probability
probability region,
region, which
which areareimperceptible
imperceptible in inthe
inthe plots,
plots, 712712
658 Figure Figure region,
1 of thewhich are imperceptible the plots, 712
−3
7:
which Color
show distribution
log probability matching
and property of the cGAN, tested on Cityscapes. (c.f. original GAN paper [24]). Note
which show log probability and −5therefore
therefore emphasize
emphasize differences
differences in the
in thelowlow probability
probability regions.
regions.
−5 −5−5
659
659
659 that thewhich
−5
−5
show log probability and therefore emphasize differences in the low probability regions. 713713
713
−7 histogram intersection scores −7−7 are dominated by differences in the high probability region, which are imperceptible in the plots,
660
660
−7
−7 −7 714714
660 which show log probability
L1 and therefore emphasize
1x1 differences in the low probability regions.70x70
16x16 256x256 714
661
661
−9
−9
−9 L1
L1
−9−9
−9 1x1
1x1 16x16
16x16 70x70
70x70 256x256
256x256 715715
661 −11 −11
715
662
662
−11
−11
−11
−11 716716
662 716
70
70 9090 110
110 130
130 150
150 7070 9090 110110 130130
70 90 110 130 150 70 90 110 130
663 1 Photo → Map Map → Photo 717717
663 full 286 × 286 “ImageGAN” . Figure 6 shows qualitative
663 717
664 Loss % Turkers labeled real % Turkers labeled real 718718
664 results of this analysis and Table 3 quantifies the effects us-
664
L1 2.8% ± 1.0% 0.8% ± 0.3% 718
665
665 719719
665 ing the FCN-score. Note that elsewhere in this paper, unless L1+cGAN 6.1% ± 1.3% 18.9% ± 2.5% 719
666 Figure
Figure 6:
6: Patch size variations. Uncertainty ininthe output manifests itself differently forfordifferent loss functions. Uncertain regions become 720720
666 Figureall
666 specified, 6: Patch
Patch size
experiments size variations.
variations.
use L1. 70Uncertainty
Uncertainty
× 701x1 inthe
PatchGANs, theoutput
outputandmanifests
manifests
for greater itself differently
itselfTable
differently
4: AMT fordifferent
different
“real
loss
lossfunctions.
vsnofake” functions.
test on
Uncertain
Uncertainregions
maps↔aerial regions
photos.
become
become 720
667 blurry
blurry and
and desaturated
desaturated under
under L1. The
The 1x1 PixelGAN
PixelGAN encourages
encourages greater color
color diversity
diversity but buthashas noeffect
effectonon spatial
spatialstatistics.
statistics. TheThe 16x16
16x16 721721
667 blurry
667 this PatchGAN and
section allcreates desaturated
experiments under L1.
use anresults, The
L1+cGAN 1x1 PixelGAN
loss. encourages greater color diversity but has no effect on spatial statistics. The 16x16 721
668 PatchGAN creates locally
locally sharp
sharp results, but
but also
also leads
leads tototiling
tiling artifacts
artifacts beyond
beyond thethescale
Method scale it it
can canobserve.
observe. % The The
Turkers70x70
70x70 PatchGAN
labeled PatchGAN
real forces
forces 722722
668 PatchGAN creates locally sharp results, but also leads to tiling artifacts beyond the scale it can observe. The 70x70 PatchGAN forces
668 The
outputs
outputsPixelGAN
that are has even
sharp, no effect on spatial
if incorrect, in both sharpness
the spatial andbutspectral (coforfulness) dimensions.
L2 regression fromThe [62] full 256x256
16.3% ± ImageGAN
2.4% produces 722
669
669 outputs that that are are sharp,
sharp, even even ifif incorrect,
incorrect, in inboth
boththe thespatial
spatialand andspectral
spectral(coforfulness)
(coforfulness) dimensions.
dimensions. The
Thefull 256x256
full27.8%
256x256 ImageGAN
ImageGANproduces produces 723723
669 doesresults
increase
results that are the colorfulness
visually similar to oftothe
the results
70x70 (quantified
PatchGAN, insomewhat
but somewhat lower quality according
Zhang et al. to to
2016 ourour
[62] FCN-score metric
± 2.7%(Table 2).2). 723
670
670 results that that are are visually
visually similarsimilar to the
the 70x70
70x70PatchGAN,
PatchGAN,but butsomewhat lower
lowerquality
quality according
Ours according to our FCN-score
FCN-score metric
22.5% ±metric
(Table
1.6% (Table 2). 724724
670 Figure 7). For example, the bus in Figure 6 is painted gray 724
671
671 725725
671 when the net is trained with an L1 loss, but becomes red Table 5: AMT “real vs fake” test on colorization. 725
672 Classification
Classification Ours
Ours Input
Input Ground
Ground truth L1L1 cGAN 726726
672
672 with theL2PixelGAN
673 L2 [44]
[44] Classification
(rebal.)
loss. Color
(rebal.) [44]
[44] (L1
histogram
(L1 Ours
++cGAN)
cGAN) Ground
matching
Ground istrutha
truth Input Groundtruth truth L1 cGAN
cGAN 727
726
673 L2 [44] (rebal.)
673 common problem in image processing [49], and PixelGANs
[44] (L1 + cGAN) Ground truth 727
727
674 generator convolutionally, on larger images than those on 728
674
674 728
728
675 may be a promising lightweight solution. which it was trained. We test this on the map↔aerial photo 729
675
675 729
729
676 Using a 16×16 PatchGAN is sufficient to promote sharp task. After training a generator on 256×256 images, we test 730
676
676 730
730
677 outputs, and achieves good FCN-scores, but also leads to
677
it on 512 × 512 images. The results in Figure 8 demonstrate 731
731
677 732 731
678
678
tiling artifacts. The 70 × 70 PatchGAN alleviates these the effectiveness
Figure 8: Applying ofathis approach.
conditional GAN to semantic segmentation. 732
678 Figure
Figure 8:
8: Applying
Applyingasharp aconditional
conditional GAN totosemantic segmentation. 733 732
679
679
artifacts and achieves slightly better scores. Scaling be- The cGAN produces images GAN that looksemantic
at glance segmentation.
like the 733
679 The cGAN produces sharp images that look atat glance like
like the
680 yond this, to the full 286 × 286 ImageGAN, does not ap- 4.5.
groundPerceptual
The truth, but invalidation
cGAN produces sharp images that look
fact include many small, hallucinated objects. glance the 734 733
680
680 ground truth, but in fact include many small, hallucinated objects. 734
681 pear to improve the visual quality of the results, and in ground truth, but in fact include many small, hallucinated objects. 735 734
681
681 We validate the perceptual realism of our results on the 735
682 fact gets a considerably lower FCN-score (Table 3). This 736 735
682
682 tasks of map↔aerial photograph and grayscale→color. Re- 736
683 may be because the ImageGAN has many more parameters nearly discrete, rather than “images”, with their continuous- 737 736
683 sults of
nearly our AMT rather
discrete, experiment for map↔photo arecontinuous-
given in 737
683
684 and greater depth than the 70 × 70 PatchGAN, and may be
684
nearly
valued discrete,
variation. ratherthan
Although
“images”,
thancGANs“images”, with
withtheir
achieve their
some continuous-
success, 738 737
738
684 Table 4.
valued
valued The aerial
variation.
variation. photos
Although generated
cGANs by our
achieve method
somefooled
some success, 739 738
685
685 harder to train. they are far from theAlthough best available cGANs methodachieve for solving success,
this 739
685 participants
they are on
far from
18.9% the ofbesttrials, significantly
available method above
for the
solving L1this 740 739
686
686 Fully-convolutional translation An advantage of the problem: simply using L1 regression gets better scores thanthis
they are far from the best available method for solving 740
686
687 baseline,
problem:
problem:
using
which
a cGAN,
simply
simply produces
as shown
using
usinginL1 L1 blurry
regressionresults
regression
Table 4. We argue
getsand
gets better nearly
better
that for
scores
scoresneverthan
visionthan 741 740
687
687
PatchGAN is that a fixed-size patch discriminator can be fooledusing aacGAN,
participants. as shown
In in
contrast, Table in 4.4.
theWe argue
photo→map that for vision
direc- 741
688 using
problems, cGAN,
the goal as shown
(i.e. in Table
predicting We
output argue
close that
to for
ground vision 742 741
688 applied
688
689 Figure to 7:arbitrarily
Colorization large images.
results We mayGANs
of conditional also versus
applythe theL2 tion problems,
our
problems,method the
the goal
only
goal (i.e.
fooled
(i.e. predicting
participants
predicting output
output on close
6.1%
close toto ground
of tri-
ground
742
743 742
689 Figure 7:
Figure 7: from
regression Colorization
Colorization results
[44] andresults the full of conditional
of method
conditional GANs
GANs versus
(classification versus the
withthe L2 truth)
re-L2 als,truth) may be less ambiguous than graphics tasks, and re- 743
689
690 regression from [44] and may
this was
truth) maylosses
construction be less
not
be less ambiguous
significantly
likeambiguous
L1 are mostly than graphics
graphics tasks, and re-
different
than sufficient.than tasks,
the and
perfor- re- 744 743
[44] and the full
sizemethod (classificationdepth ofwith col-re-
1 We achieve this variation in patch by
690 regression
balancing) from
from [46]. and
The the
cGANs full canadjusting
method produce the
(classification
compelling the
with reconstruction losses like L1 are mostly sufficient. 744
690
691 balancing)
GANorizations
discriminator. from Details[46]. of The
this cGANs
process, andcanthe produce
discriminator compelling
architec- col- mance of the
construction L1 baseline
losses like (based
L1 are on
mostly bootstrap
sufficient. test). This 745 744
691 balancing)(first fromtwo [46].rows), Thebut cGANshave acan common
producefailure mode of
compelling col- 745
of may be because minor structural errors are more visible
691 tures, orizations
692 are provided in thetwo in the 746 745
692 orizationsa(first
producing grayscale
(first two orsupplemental
rows), but
but have
desaturated
rows), have materials
result (lastonline.
aa common
common failure
row).failure modemode of 746
692
693 producing a grayscale or desaturated result (last row). 747 746
producing a grayscale or desaturated result (last row).
693
693
694 4. Conclusion 748747
747
694
694
695 To begin to test this, we train a cGAN (with/without L1
4.
4. Conclusion
Conclusion 749748
748
695
696
695 loss)To
Toonbegin to
to test
cityscape
begin this,
this, we
we train
testphoto!labels.train aaFigure
cGAN
cGAN8(with/without
shows qualita-
(with/without L1
L1 The results in this paper suggest that conditional adver- 750749
749
696
697
696 loss)
tive on
results,cityscape
and photo!labels.
quantitative Figure
classification 8 shows
accuracies
loss) on cityscape photo!labels. Figure 8 shows qualita- qualita-
are re- sarialThe results
networks
The in
in this
resultsare athis paper
paper suggest
promising that
approach
suggest conditional
thatfor adver-
many image-
conditional adver- 751750
750
697
698
697 tive
tive results,
ported in Table
results, and
and4.quantitative
Interestingly,
quantitative classification
cGANs, trained
classification accuracies
without
accuracies are re-
the
are re- sarial
sarial networks
to-image translation
networks are aa promising
aretasks, approach
especially
promising for
formany
those involving
approach image-
highly
many image- 752751
751
698
699
698 ported
L1 loss,in
ported areTable
in able 4.
Table to Interestingly,
4. solve cGANs,
this problem
Interestingly, trained
trainedwithout
at a reasonable
cGANs, degree
without the
the to-image
structured translation
to-imagegraphical
translation tasks,
tasks,especially
outputs. those
thoseinvolving
These networks
especially learn a highly
involving loss
highly 753752
752
Map to aerial photo Aerial photo to map

input output input output

Figure 8: Example results on Google Maps at 512x512 resolution (model was trained on images at 256 × 256 resolution, and run convo-
lutionally on the larger images at test time). Contrast adjusted for clarity.
Classification Ours Input Ground truth L1 cGAN
L2 [62] (rebal.) [62] (L1 + cGAN) Ground truth

Figure 10: Applying a conditional GAN to semantic segmenta-

tion. The cGAN produces sharp images that look at glance like
the ground truth, but in fact include many small, hallucinated ob-
jects.

ble 5). We also tested the results of [62] and a variant of

their method that used an L2 loss (see [62] for details). The
Figure 9: Colorization results of conditional GANs versus the L2 conditional GAN scored similarly to the L2 variant of [62]
regression from [62] and the full method (classification with re- (difference insignificant by bootstrap test), but fell short of
balancing) from [64]. The cGANs can produce compelling col-
[62]’s full method, which fooled participants on 27.8% of
orizations (first two rows), but have a common failure mode of
producing a grayscale or desaturated result (last row).
trials in our experiment. We note that their method was
specifically engineered to do well on colorization.
in maps, which have rigid geometry, than in aerial pho- 4.6. Semantic segmentation
tographs, which are more chaotic.
We trained colorization on ImageNet [51], and tested Conditional GANs appear to be effective on problems
on the test split introduced by [62, 35]. Our method, with where the output is highly detailed or photographic, as is
L1+cGAN loss, fooled participants on 22.5% of trials (Ta- common in image processing and graphics tasks. What
by Kaihu Chen by Jack Qiao by Mario Klingemann

#fotogenerator

sketch by Ivy Tsai by Bertrand Gondouin by Brannon Dorsey sketch by Yann LeCun
Figure 11: Example applications developed by online community based on our pix2pix codebase: #edges2cats [3] by Christopher Hesse,
Background removal [6] by Kaihu Chen, Palette generation [5] by Jack Qiao, Sketch → Portrait [7] by Mario Klingemann, Sketch→
Pokemon [1] by Bertrand Gondouin, “Do As I Do” pose transfer [2] by Brannon Dorsey, and #fotogenerator by Bosman et al. [4].

Loss Per-pixel acc. Per-class acc. Class IOU

L1 0.86 0.42 0.35
cGAN 0.74 0.28 0.22
L1+cGAN 0.83 0.36 0.29
Table 6: Performance of photo→labels on cityscapes.

about vision problems, like semantic segmentation, where

the output is instead less complex than the input?
To begin to test this, we train a cGAN (with/without L1
loss) on cityscape photo→labels. Figure 10 shows qualita- Figure 12: Learning to see: Gloomy Sunday: An interactive artis-
tive results, and quantitative classification accuracies are re- tic demo developed by Memo Akten [8] based on our pix2pix
ported in Table 6. Interestingly, cGANs, trained without the codebase. Please click the image to play the video in a browser.
L1 loss, are able to solve this problem at a reasonable degree
the pix2pix code we released. Nonetheless, they demon-
of accuracy. To our knowledge, this is the first demonstra-
strate the promise of our approach as a generic commodity
tion of GANs successfully generating “labels”, which are
tool for image-to-image translation problems.
nearly discrete, rather than “images”, with their continuous-
valued variation2 . Although cGANs achieve some success, 5. Conclusion
they are far from the best available method for solving this
problem: simply using L1 regression gets better scores than The results in this paper suggest that conditional adver-
using a cGAN, as shown in Table 6. We argue that for vi- sarial networks are a promising approach for many image-
sion problems, the goal (i.e. predicting output close to the to-image translation tasks, especially those involving highly
ground truth) may be less ambiguous than graphics tasks, structured graphical outputs. These networks learn a loss
and reconstruction losses like L1 are mostly sufficient. adapted to the task and data at hand, which makes them ap-
plicable in a wide variety of settings.
4.7. Community-driven Research
Acknowledgments: We thank Richard Zhang, Deepak
Since the initial release of the paper and our pix2pix Pathak, and Shubham Tulsiani for helpful discussions, Sain-
codebase, the Twitter community, including computer vi- ing Xie for help with the HED edge detector, and the online
sion and graphics practitioners as well as visual artists, have community for exploring many applications and suggesting
successfully applied our framework to a variety of novel improvements. Thanks to Christopher Hesse, Memo Ak-
image-to-image translation tasks, far beyond the scope of ten, Kaihu Chen, Jack Qiao, Mario Klingemann, Brannon
the original paper. Figure 11 and Figure 12 show just a Dorsey, Gerda Bosman, Ivy Tsai, and Yann LeCun for al-
few examples from the #pix2pix hashtag, including Back- lowing the use of their creations in Figure 11 and Figure 12.
ground removal, Palette generation, Sketch → Portrait, This work was supported in part by NSF SMA-1514512,
Sketch→Pokemon, ”Do as I Do” pose transfer, Learning NGA NURI, IARPA via Air Force Research Laboratory, In-
to see: Gloomy Sunday, as well as the bizarrely popular tel Corp, Berkeley Deep Drive, and hardware donations by
#edges2cats and #fotogenerator. Note that these applica- Nvidia. J.-Y.Z. is supported by the Facebook Graduate Fel-
tions are creative projects, were not obtained in controlled, lowship. Disclaimer: The views and conclusions contained
scientific conditions, and may rely on some modifications to herein are those of the authors and should not be interpreted
2 Note that the label maps we train on are not exactly discrete valued, as necessarily representing the official policies or endorse-
as they are resized from the original maps using bilinear interpolation and ments, either expressed or implied, of IARPA, AFRL or the
saved as jpeg images, with some compression artifacts. U.S. Government.
Input Ground truth Output Input Ground truth Output

Figure 13: Example results of our method on Cityscapes labels→photo, compared to ground truth.

Input Ground truth Output Input Ground truth Output

Figure 14: Example results of our method on facades labels→photo, compared to ground truth.
Input Ground truth Output Input Ground truth Output

Figure 15: Example results of our method on day→night, compared to ground truth.

Input Ground truth Output Input Ground truth Output

Figure 16: Example results of our method on automatically detected edges→handbags, compared to ground truth.
Input Ground truth Output Input Ground truth Output

Figure 17: Example results of our method on automatically detected edges→shoes, compared to ground truth.

Input Output Input Output Input Output

Figure 18: Additional results of the edges→photo models applied to human-drawn sketches from [19]. Note that the models were trained
on automatically detected edges, but generalize to human drawings
Figure 19: Example results on photo inpainting, compared to [43], on the Paris StreetView dataset [14]. This experiment demonstrates that
the U-net architecture can be effective even when the predicted pixels are not geometrically aligned with the information in the input – the
information used to fill in the central hole has to be found in the periphery of these photos.

Input Ground truth Output Input Ground truth Output

Figure 20: Example results on translating thermal images to RGB photos, on the dataset from [27].

Day Night Labels Facade Labels Street scene

Edges Shoe Edges Handbag Sketch Shoe Sketch Handbag

Figure 21: Example failure cases. Each pair of images shows input on the left and output on the right. These examples are selected as some
of the worst results on our tasks. Common failures include artifacts in regions where the input image is sparse, and difficulty in handling
unusual inputs. Please see https://phillipi.github.io/pix2pix/ for more comprehensive results.
References [21] L. A. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis
using convolutional neural networks. In NIPS, 2015. 4
[1] Bertrand gondouin. https://twitter.com/
[22] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer
bgondouin/status/818571935529377792.
using convolutional neural networks. CVPR, 2016. 4
Accessed, 2017-04-21. 9
[23] J. Gauthier. Conditional generative adversarial nets for
[2] Brannon dorsey. https://twitter.com/
convolutional face generation. Class Project for Stanford
brannondorsey/status/806283494041223168.
CS231N: Convolutional Neural Networks for Visual Recog-
Accessed, 2017-04-21. 9
nition, Winter semester, (5):2, 2014. 2
[3] Christopher hesse. https://affinelayer.com/
[24] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu,
pixsrv/. Accessed: 2017-04-21. 9
D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Gen-
[4] Gerda bosman, tom kenter, rolf jagerman, and daan gosman.
erative adversarial nets. In NIPS, 2014. 2, 4, 6, 7
https://dekennisvannu.nl/site/artikel/
[25] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H.
Help-ons-kunstmatige-intelligentie-testen/
Salesin. Image analogies. In SIGGRAPH, 2001. 1, 4
9163. Accessed: 2017-08-31. 9
[5] Jack qiao. http://colormind.io/blog/. Accessed: [26] G. E. Hinton and R. R. Salakhutdinov. Reducing the
2017-04-21. 9 dimensionality of data with neural networks. Science,
313(5786):504–507, 2006. 3
[6] Kaihu chen. http://www.terraai.org/
imageops/index.html. Accessed, 2017-04-21. [27] S. Hwang, J. Park, N. Kim, Y. Choi, and I. So Kweon. Mul-
9 tispectral pedestrian detection: Benchmark dataset and base-
line. In CVPR, 2015. 4, 13, 16
[7] Mario klingemann. https://twitter.com/
quasimondo/status/826065030944870400. [28] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Let there be
Accessed, 2017-04-21. 9 Color!: Joint End-to-end Learning of Global and Local Im-
[8] Memo akten. https://vimeo.com/260612034. Ac- age Priors for Automatic Image Colorization with Simulta-
cessed, 2018-11-07. 9 neous Classification. ACM Transactions on Graphics (TOG),
35(4), 2016. 2
[9] A. Buades, B. Coll, and J.-M. Morel. A non-local algorithm
for image denoising. In CVPR, 2005. 1 [29] S. Ioffe and C. Szegedy. Batch normalization: Accelerating
[10] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and deep network training by reducing internal covariate shift. In
A. L. Yuille. Semantic image segmentation with deep con- ICML, 2015. 3, 4
volutional nets and fully connected crfs. In ICLR, 2015. 2 [30] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for
[11] T. Chen, M.-M. Cheng, P. Tan, A. Shamir, and S.-M. Hu. real-time style transfer and super-resolution. In ECCV, 2016.
Sketch2photo: internet image montage. ACM Transactions 2, 3
on Graphics (TOG), 28(5):124, 2009. 1 [31] L. Karacan, Z. Akata, A. Erdem, and E. Erdem. Learning
[12] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, to generate images of outdoor scenes from attributes and se-
R. Benenson, U. Franke, S. Roth, and B. Schiele. The mantic layouts. arXiv preprint arXiv:1612.00215, 2016. 2
cityscapes dataset for semantic urban scene understanding. [32] D. Kingma and J. Ba. Adam: A method for stochastic opti-
In CVPR, 2016. 4, 16 mization. ICLR, 2015. 4
[13] E. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep gen- [33] P.-Y. Laffont, Z. Ren, X. Tao, C. Qian, and J. Hays. Transient
erative image models using a laplacian pyramid of adversar- attributes for high-level understanding and editing of outdoor
ial networks. In NIPS, 2015. 2 scenes. ACM Transactions on Graphics (TOG), 33(4):149,
[14] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. Efros. What 2014. 1, 4, 16
makes paris look like paris? ACM Transactions on Graphics, [34] A. B. L. Larsen, S. K. Sønderby, and O. Winther. Autoen-
31(4), 2012. 4, 13, 17 coding beyond pixels using a learned similarity metric. In
[15] A. Dosovitskiy and T. Brox. Generating images with per- ICML, 2016. 3
ceptual similarity metrics based on deep networks. In NIPS, [35] G. Larsson, M. Maire, and G. Shakhnarovich. Learning rep-
2016. 2 resentations for automatic colorization. ECCV, 2016. 2, 8,
[16] A. A. Efros and W. T. Freeman. Image quilting for texture 16
synthesis and transfer. In SIGGRAPH, 2001. 1, 4 [36] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Aitken, A. Te-
[17] A. A. Efros and T. K. Leung. Texture synthesis by non- jani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single im-
parametric sampling. In ICCV, 1999. 4 age super-resolution using a generative adversarial network.
[18] D. Eigen and R. Fergus. Predicting depth, surface normals In CVPR, 2017. 2
and semantic labels with a common multi-scale convolu- [37] C. Li and M. Wand. Combining markov random fields and
tional architecture. In ICCV, 2015. 1 convolutional neural networks for image synthesis. CVPR,
[19] M. Eitz, J. Hays, and M. Alexa. How do humans sketch 2016. 2, 4
objects? In SIGGRAPH, 2012. 4, 12 [38] C. Li and M. Wand. Precomputed real-time texture synthe-
[20] R. Fergus, B. Singh, A. Hertzmann, S. T. Roweis, and W. T. sis with markovian generative adversarial networks. ECCV,
Freeman. Removing camera shake from a single photo- 2016. 2, 4
graph. ACM Transactions on Graphics (TOG), 25(3):787– [39] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional
794, 2006. 1 networks for semantic segmentation. In CVPR, 2015. 1, 2, 5
[40] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale [58] S. Xie and Z. Tu. Holistically-nested edge detection. In
video prediction beyond mean square error. ICLR, 2016. 2, ICCV, 2015. 1, 2, 4
3 [59] D. Yoo, N. Kim, S. Park, A. S. Paek, and I. S. Kweon. Pixel-
[41] M. Mirza and S. Osindero. Conditional generative adversar- level domain transfer. ECCV, 2016. 2, 3
ial nets. arXiv preprint arXiv:1411.1784, 2014. 2 [60] A. Yu and K. Grauman. Fine-Grained Visual Comparisons
[42] A. Owens, P. Isola, J. McDermott, A. Torralba, E. H. Adel- with Local Learning. In CVPR, 2014. 4
son, and W. T. Freeman. Visually indicated sounds. In [61] A. Yu and K. Grauman. Fine-grained visual comparisons
CVPR, 2016. 5 with local learning. In CVPR, 2014. 16
[43] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. [62] R. Zhang, P. Isola, and A. A. Efros. Colorful image coloriza-
Efros. Context encoders: Feature learning by inpainting. In tion. ECCV, 2016. 1, 2, 5, 7, 8, 16
CVPR, 2016. 2, 3, 13, 17 [63] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative
[44] A. Radford, L. Metz, and S. Chintala. Unsupervised repre- adversarial network. In ICLR, 2017. 2
sentation learning with deep convolutional generative adver- [64] Y. Zhou and T. L. Berg. Learning temporal transformations
sarial networks. In ICLR, 2016. 2, 3, 16 from time-lapse videos. In ECCV, 2016. 2, 3, 8
[45] R. Š. Radim Tyleček. Spatial pattern templates for recogni- [65] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros.
tion of objects with regular structure. In German Conference Generative visual manipulation on the natural image mani-
on Pattern Recognition, 2013. 4, 16 fold. In ECCV, 2016. 2, 4, 16
[46] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and
H. Lee. Generative adversarial text to image synthesis. In
ICML, 2016. 2
[47] S. Reed, A. van den Oord, N. Kalchbrenner, V. Bapst,
M. Botvinick, and N. de Freitas. Generating interpretable
images with controllable structure. In ICLR Workshop, 2017.
2
[48] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and
H. Lee. Learning what and where to draw. In NIPS, 2016. 2
[49] E. Reinhard, M. Ashikhmin, B. Gooch, and P. Shirley. Color
transfer between images. IEEE Computer Graphics and Ap-
plications, 21:34–41, 2001. 7
[50] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolu-
tional networks for biomedical image segmentation. In MIC-
CAI, 2015. 2, 3
[51] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh,
S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein,
et al. Imagenet large scale visual recognition challenge.
International Journal of Computer Vision, 115(3):211–252,
2015. 4, 8, 16
[52] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Rad-
ford, and X. Chen. Improved techniques for training gans. In
NIPS, 2016. 2, 4, 5
[53] Y. Shih, S. Paris, F. Durand, and W. T. Freeman. Data-driven
hallucination of different times of day from a single outdoor
photo. ACM Transactions on Graphics (TOG), 32(6):200,
2013. 1
[54] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normal-
ization: The missing ingredient for fast stylization. arXiv
preprint arXiv:1607.08022, 2016. 4
[55] X. Wang and A. Gupta. Generative image modeling using
style and structure adversarial networks. In ECCV, 2016. 2,
3, 5
[56] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli.
Image quality assessment: from error visibility to struc-
tural similarity. IEEE Transactions on Image Processing,
13(4):600–612, 2004. 2
[57] S. Xie, X. Huang, and Z. Tu. Top-down learning for struc-
tured labeling with convolutional pseudoprior. In ECCV,
2015. 2
6. Appendix 286 × 286 discriminator:
C64-C128-C256-C512-C512-C512
6.1. Network architectures
We adapt our network architectures from those
in [44]. Code for the models is available at 6.2. Training details
https://github.com/phillipi/pix2pix. Random jitter was applied by resizing the 256×256 input
Let Ck denote a Convolution-BatchNorm-ReLU layer images to 286 × 286 and then randomly cropping back to
with k filters. CDk denotes a Convolution-BatchNorm- size 256 × 256.
Dropout-ReLU layer with a dropout rate of 50%. All convo- All networks were trained from scratch. Weights were
lutions are 4 × 4 spatial filters applied with stride 2. Convo- initialized from a Gaussian distribution with mean 0 and
lutions in the encoder, and in the discriminator, downsample standard deviation 0.02.
by a factor of 2, whereas in the decoder they upsample by a Cityscapes labels→photo 2975 training images from
factor of 2. the Cityscapes training set [12], trained for 200 epochs, with
random jitter and mirroring. We used the Cityscapes val-
6.1.1 Generator architectures idation set for testing. To compare the U-net against an
encoder-decoder, we used a batch size of 10, whereas for
The encoder-decoder architecture consists of:
the objective function experiments we used batch size 1.
encoder:
We find that batch size 1 produces better results for the U-
C64-C128-C256-C512-C512-C512-C512-C512
net, but is inappropriate for the encoder-decoder. This is
decoder:
because we apply batchnorm on all layers of our network,
CD512-CD512-CD512-C512-C256-C128-C64
and for batch size 1 this operation zeros the activations on
After the last layer in the decoder, a convolution is ap-
the bottleneck layer. The U-net can skip over the bottleneck,
plied to map to the number of output channels (3 in general,
but the encoder-decoder cannot, and so the encoder-decoder
except in colorization, where it is 2), followed by a Tanh
requires a batch size greater than 1. Note, an alternative
function. As an exception to the above notation, Batch-
strategy is to remove batchnorm from the bottleneck layer.
Norm is not applied to the first C64 layer in the encoder.
See errata for more details.
All ReLUs in the encoder are leaky, with slope 0.2, while
Architectural labels→photo 400 training images from
ReLUs in the decoder are not leaky.
[45], trained for 200 epochs, batch size 1, with random jitter
The U-Net architecture is identical except with skip con-
and mirroring. Data were split into train and test randomly.
nections between each layer i in the encoder and layer n − i
in the decoder, where n is the total number of layers. The Maps↔aerial photograph 1096 training images
skip connections concatenate activations from layer i to scraped from Google Maps, trained for 200 epochs, batch
layer n − i. This changes the number of channels in the size 1, with random jitter and mirroring. Images were
decoder: sampled from in and around New York City. Data were
U-Net decoder: then split into train and test about the median latitude of the
CD512-CD1024-CD1024-C1024-C1024-C512 sampling region (with a buffer region added to ensure that
-C256-C128 no training pixel appeared in the test set).
BW→color 1.2 million training images (Imagenet train-
ing set [51]), trained for ∼ 6 epochs, batch size 4, with only
6.1.2 Discriminator architectures
mirroring, no random jitter. Tested on subset of Imagenet
The 70 × 70 discriminator architecture is: val set, following protocol of [62] and [35].
C64-C128-C256-C512 Edges→shoes 50k training images from UT Zappos50K
After the last layer, a convolution is applied to map to dataset [61] trained for 15 epochs, batch size 4. Data were
a 1-dimensional output, followed by a Sigmoid function. split into train and test randomly.
As an exception to the above notation, BatchNorm is not Edges→Handbag 137K Amazon Handbag images from
applied to the first C64 layer. All ReLUs are leaky, with [65], trained for 15 epochs, batch size 4. Data were split into
slope 0.2. train and test randomly.
All other discriminators follow the same basic architec- Day→night 17823 training images extracted from 91
ture, with depth varied to modify the receptive field size: webcams, from [33] trained for 17 epochs, batch size 4,
1 × 1 discriminator: with random jitter and mirroring. We use 91 webcams as
C64-C128 (note, in this special case, all convolutions are training, and 10 webcams for test.
1 × 1 spatial filters) Thermal→color photos 36609 training images from set
16 × 16 discriminator: 00–05 of [27], trained for 10 epochs, batch size 4. Images
C64-C128 from set 06-11 are used for testing.
Photo with missing pixels→inpainted photo 14900
training images from [14], trained for 25 epochs, batch size
4, and tested on 100 held out images following the split of
[43].
6.3. Errata
For all experiments reported in this paper with batch
size 1, the activations of the bottleneck layer are zeroed by
the batchnorm operation, effectively making the innermost
layer skipped. This issue can be fixed by removing batch-
norm from this layer, as has been done in the public code.
We observe little difference with this change and therefore
leave the experiments as is in the paper.

6.4. Change log

arXiv v2 Reran generator architecture comparisons
(Section 4.3) with batch size equal to 10 rather than
1, so that bottleneck layer is not zeroed (see Errata).
Reran FCN-scores with minor details cleaned up (re-
sults saved losslessly as pngs, removed unecessary
downsampling). FCN-scores computed using scripts at
https://github.com/phillipi/pix2pix/tree/
master/scripts/eval cityscapes, commit
d7e7b8b. Updated several figures and text. Added addi-
tional results on thermal→color photos and inpainting, as
well as community contributions.
arXiv v3 Added additional results on community contri-
butions. Fixed minor typos.

Pix2Pix GANs Presentation
No ratings yet
Pix2Pix GANs Presentation
25 pages
lecture16 GAN cont
No ratings yet
lecture16 GAN cont
35 pages
06 cGAN
No ratings yet
06 cGAN
45 pages
Few-Shot Image Generation via Style Adaptation and Content Preservation
No ratings yet
Few-Shot Image Generation via Style Adaptation and Content Preservation
12 pages
3rd unit Notes
No ratings yet
3rd unit Notes
16 pages
DiffIT
No ratings yet
DiffIT
22 pages
Addison-Wesley - Franklin G.F., Powell J.D., Workman M.L. - Digital Control of Dynamic Systems, 3E
100% (2)
Addison-Wesley - Franklin G.F., Powell J.D., Workman M.L. - Digital Control of Dynamic Systems, 3E
382 pages
CycleGAN_CVPR2017
No ratings yet
CycleGAN_CVPR2017
18 pages
Ext 1 - Parametric - HSC Questions
No ratings yet
Ext 1 - Parametric - HSC Questions
8 pages
s41095-023-0371-3
No ratings yet
s41095-023-0371-3
13 pages
BATCH 16 (1)
No ratings yet
BATCH 16 (1)
24 pages
Breaking the Dilemma of Medical Image-to-image
No ratings yet
Breaking the Dilemma of Medical Image-to-image
18 pages
B8 Fake+Image+Detection+Writeup+Paper AL
No ratings yet
B8 Fake+Image+Detection+Writeup+Paper AL
6 pages
Contextual Loss ECCV 2018
No ratings yet
Contextual Loss ECCV 2018
16 pages
2022_The effect of loss function on conditional generative adversarial networks
No ratings yet
2022_The effect of loss function on conditional generative adversarial networks
12 pages
2-7
No ratings yet
2-7
6 pages
2102.04699
No ratings yet
2102.04699
9 pages
AI resubmtion
No ratings yet
AI resubmtion
18 pages
Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks
No ratings yet
Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks
10 pages
1032c868
No ratings yet
1032c868
9 pages
N-19906
No ratings yet
N-19906
8 pages
Wang High-Resolution Image Synthesis CVPR 2018 Paper
No ratings yet
Wang High-Resolution Image Synthesis CVPR 2018 Paper
10 pages
UVCGAN UNetVision Transformer Cycle-consistent GAN for Unpaired
No ratings yet
UVCGAN UNetVision Transformer Cycle-consistent GAN for Unpaired
17 pages
Adding Conditional Control To Text-to-Image Diffusion Models
No ratings yet
Adding Conditional Control To Text-to-Image Diffusion Models
12 pages
Contrastive Learning For Unpaired Image-to-Image Translation
No ratings yet
Contrastive Learning For Unpaired Image-to-Image Translation
29 pages
CycleGAN - Learning To Translate Images (Without Paired Training Data) - by Sarah Wolf - Towards Data Science
No ratings yet
CycleGAN - Learning To Translate Images (Without Paired Training Data) - by Sarah Wolf - Towards Data Science
9 pages
Learning by Competing Competitive Multigenerator Based Adversaial Learning
No ratings yet
Learning by Competing Competitive Multigenerator Based Adversaial Learning
14 pages
Image-to-Image Difussion Models
No ratings yet
Image-to-Image Difussion Models
29 pages
Image Disentanglement and Uncooperative Re-Entanglement For High-Fidelity Image-to-Image Translation
No ratings yet
Image Disentanglement and Uncooperative Re-Entanglement For High-Fidelity Image-to-Image Translation
12 pages
2302.05543
No ratings yet
2302.05543
8 pages
Evening MidTerm Exam Time Table Fall-2023
No ratings yet
Evening MidTerm Exam Time Table Fall-2023
7 pages
Generative Adversarial Networks For Visible To Infrared Video Con
No ratings yet
Generative Adversarial Networks For Visible To Infrared Video Con
14 pages
Invertible Conditional GANs For Image Editing
No ratings yet
Invertible Conditional GANs For Image Editing
9 pages
Conditional GAN: Deep Image Processing Seminar
No ratings yet
Conditional GAN: Deep Image Processing Seminar
61 pages
Control Net
No ratings yet
Control Net
12 pages
Kim2019 Article LatentTransformationsNeuralNet
No ratings yet
Kim2019 Article LatentTransformationsNeuralNet
15 pages
Combining Transformations LESSON
No ratings yet
Combining Transformations LESSON
5 pages
ardizzone2019 - Conditional Coupling Layers
No ratings yet
ardizzone2019 - Conditional Coupling Layers
11 pages
Frank Gabel Eml2018 Report
No ratings yet
Frank Gabel Eml2018 Report
15 pages
Art2Real - Unfolding The Reality of Artworks
No ratings yet
Art2Real - Unfolding The Reality of Artworks
11 pages
Chapter 1
No ratings yet
Chapter 1
27 pages
lata2019
No ratings yet
lata2019
4 pages
Project Work: Final-ISA (Review 4)
No ratings yet
Project Work: Final-ISA (Review 4)
29 pages
Unpaired Image To Image Translation CycleGAn
No ratings yet
Unpaired Image To Image Translation CycleGAn
18 pages
Lesson 5 Probability and Probabilty Distributions
No ratings yet
Lesson 5 Probability and Probabilty Distributions
29 pages
Bassano Vacchini PDF
No ratings yet
Bassano Vacchini PDF
151 pages
Report 16
No ratings yet
Report 16
9 pages
Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks
No ratings yet
Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks
18 pages
Practical Lab Sessions 2018
No ratings yet
Practical Lab Sessions 2018
25 pages
Wang CNN-Generated Images Are Surprisingly Easy To Spot... For Now CVPR 2020 Paper
No ratings yet
Wang CNN-Generated Images Are Surprisingly Easy To Spot... For Now CVPR 2020 Paper
10 pages
Sketch2face: Conditional Generative Adversarial Networks For Transforming Face Sketches Into Photorealistic Images
No ratings yet
Sketch2face: Conditional Generative Adversarial Networks For Transforming Face Sketches Into Photorealistic Images
9 pages
Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks
No ratings yet
Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks
20 pages
CNN-generated Images Are Surprisingly Easy To Spot... For Now
No ratings yet
CNN-generated Images Are Surprisingly Easy To Spot... For Now
13 pages
Rau J Statistical Mechanics in A Nutshell
No ratings yet
Rau J Statistical Mechanics in A Nutshell
23 pages
Sketch Image Translation
No ratings yet
Sketch Image Translation
7 pages
Marvin Macan
No ratings yet
Marvin Macan
3 pages
Kadhim 2021 J. Phys. Conf. Ser. 1963 012104
No ratings yet
Kadhim 2021 J. Phys. Conf. Ser. 1963 012104
9 pages
Photographic Text-to-Image Synthesis With A Hierarchically-Nested Adversarial Network
No ratings yet
Photographic Text-to-Image Synthesis With A Hierarchically-Nested Adversarial Network
10 pages
The Six Fronts of The Generative Adversarial Networks
No ratings yet
The Six Fronts of The Generative Adversarial Networks
11 pages
Mean Sea Level, GPS, and The Geoid
No ratings yet
Mean Sea Level, GPS, and The Geoid
9 pages
08 1.5 Context - Practical Systems 4-31 PDF
No ratings yet
08 1.5 Context - Practical Systems 4-31 PDF
40 pages
Automatic Colorization With Deep Convolutional Generative Adversarial Networks
No ratings yet
Automatic Colorization With Deep Convolutional Generative Adversarial Networks
8 pages
Image-to-Image Translation With Conditional Adversarial Networks
No ratings yet
Image-to-Image Translation With Conditional Adversarial Networks
17 pages
A Survey of Image Synthesis and Editing With Generative Adversarial Networks PDF
No ratings yet
A Survey of Image Synthesis and Editing With Generative Adversarial Networks PDF
15 pages
Image-to-Image Translation With Conditional Adversarial Networks (Review)
No ratings yet
Image-to-Image Translation With Conditional Adversarial Networks (Review)
3 pages
Duncan 'Shadow' Louca: Hero Hoard
No ratings yet
Duncan 'Shadow' Louca: Hero Hoard
3 pages
Parametric Equations For T-Butt Weld Toe Stress Intensity Factors
No ratings yet
Parametric Equations For T-Butt Weld Toe Stress Intensity Factors
12 pages
Keys For Encrypted Files
No ratings yet
Keys For Encrypted Files
2 pages
Generative Adversarial Networks For Image and Video Synthesis: Algorithms and Applications
No ratings yet
Generative Adversarial Networks For Image and Video Synthesis: Algorithms and Applications
24 pages
Q2 DLP Week 5
No ratings yet
Q2 DLP Week 5
12 pages
Lab 5 PDF
No ratings yet
Lab 5 PDF
3 pages
Perceptual Adversarial Networks For Image-to-Image Transformation
No ratings yet
Perceptual Adversarial Networks For Image-to-Image Transformation
13 pages
Document Query
No ratings yet
Document Query
5 pages
Solutions of Some 1 Order, 1 Degree D.E.: Differential Equations
No ratings yet
Solutions of Some 1 Order, 1 Degree D.E.: Differential Equations
23 pages
Stepwise Optimization Modeling of Asphalt Concrete Mix Design
No ratings yet
Stepwise Optimization Modeling of Asphalt Concrete Mix Design
5 pages
Sr. No Title Published Problem Statement Methodology Dataset Dataset Avail-Ability
No ratings yet
Sr. No Title Published Problem Statement Methodology Dataset Dataset Avail-Ability
2 pages
A55 Quadratic Equation Solving and Algebra Questions
No ratings yet
A55 Quadratic Equation Solving and Algebra Questions
70 pages
Alt 2600 Hack Faq
100% (1)
Alt 2600 Hack Faq
107 pages
Alt 2600 Hack Faq
100% (1)
Alt 2600 Hack Faq
107 pages
BTP Report On Text To Image Synthesis
No ratings yet
BTP Report On Text To Image Synthesis
62 pages
Tutorial - 5 and 6
100% (1)
Tutorial - 5 and 6
2 pages
Conditional Image To Image Translation
No ratings yet
Conditional Image To Image Translation
9 pages
Digital Computer Fundamentals
No ratings yet
Digital Computer Fundamentals
7 pages
String
No ratings yet
String
35 pages
Deep Generative Adversarial Networks For Image-To
No ratings yet
Deep Generative Adversarial Networks For Image-To
26 pages
Sparse Coding and Dictionary Learning For Image Analysis: Francis Bach, Julien Mairal, Jean Ponce and Guillermo Sapiro
No ratings yet
Sparse Coding and Dictionary Learning For Image Analysis: Francis Bach, Julien Mairal, Jean Ponce and Guillermo Sapiro
21 pages
FEM ANSYS Modal Analysis PDF
No ratings yet
FEM ANSYS Modal Analysis PDF
12 pages
Newton's Law 2
No ratings yet
Newton's Law 2
1 page
10 - Recursive
No ratings yet
10 - Recursive
22 pages
NEW SAT Teachers Guide
100% (1)
NEW SAT Teachers Guide
262 pages
MTAP Grade4 Regional Finals 2005G4
86% (14)
MTAP Grade4 Regional Finals 2005G4
2 pages
Kinamatics
No ratings yet
Kinamatics
10 pages
Book
100% (1)
Book
212 pages
CRM BOL Z Creation Final
100% (1)
CRM BOL Z Creation Final
14 pages
Global Illumination: Advancing Vision: Insights into Global Illumination
From Everand
Global Illumination: Advancing Vision: Insights into Global Illumination
Fouad Sabry
No ratings yet

Image-to-Image Translation With Conditional Adversarial Networks

Uploaded by

Image-to-Image Translation With Conditional Adversarial Networks

Uploaded by

Image-to-Image Translation with Conditional Adversarial Networks

Phillip Isola Jun-Yan Zhu Tinghui Zhou Alexei A. Efros

Berkeley AI Research (BAIR) Laboratory, UC Berkeley

input output input output

input output input output input output

image x and random noise vector z, to y, G : {x, z} → y.

LcGAN (G, D) =Ex,y [log D(x, y)]+

performance; examining the results reveals that the gener-

against ground truth

against ground truth

against ground truth

input output input output

Figure 10: Applying a conditional GAN to semantic segmenta-

ble 5). We also tested the results of [62] and a variant of

Loss Per-pixel acc. Per-class acc. Class IOU

about vision problems, like semantic segmentation, where

Input Ground truth Output Input Ground truth Output

Input Ground truth Output Input Ground truth Output

Input Output Input Output Input Output

Input Ground truth Output Input Ground truth Output

Day Night Labels Facade Labels Street scene

Edges Shoe Edges Handbag Sketch Shoe Sketch Handbag

6.4. Change log

You might also like