
arXiv:2403.12036v1 [cs.CV] 18 Mar 2024

One-Step Image Translation with
Text-to-Image Models
Gaurav Parmar1 Taesung Park2 Srinivasa Narasimhan1 Jun-Yan Zhu1

Carnegie Mellon University1 Adobe Research2

Abstract. In this work, we address two limitations of existing conditional diffusion models: their slow inference speed due to the iterative denoising process and their reliance on paired data for model fine-tuning.
To tackle these issues, we introduce a general method for adapting a
single-step diffusion model to new tasks and domains through adversar-
ial learning objectives. Specifically, we consolidate various modules of
the vanilla latent diffusion model into a single end-to-end generator net-
work with small trainable weights, enhancing its ability to preserve the
input image structure while reducing overfitting. We demonstrate that,
for unpaired settings, our model CycleGAN-Turbo outperforms existing
GAN-based and diffusion-based methods for various scene translation
tasks, such as day-to-night conversion and adding/removing weather ef-
fects like fog, snow, and rain. We extend our method to paired settings,
where our model pix2pix-Turbo is on par with recent works like Control-
Net for Sketch2Photo and Edge2Image, but with a single-step inference.
This work suggests that single-step diffusion models can serve as strong
backbones for a range of GAN learning objectives. Our code and models
are available at https://github.com/GaParmar/img2img-turbo.

1 Introduction

Conditional diffusion models [5, 38, 48, 73] have empowered users to generate
images based on both spatial conditioning and text prompts, enabling various
image synthesis applications that demand precise user controls over scene layout,
user sketches, and human poses. Despite their huge success, these models face
two primary challenges. First, the iterative nature of diffusion models makes
inference slow, limiting real-time applications, such as interactive Sketch2Photo.
Second, model training often requires curating large-scale paired datasets, posing
significant costs for many applications, while being infeasible for others [77].
In this work, we introduce a one-step image-to-image translation method
applicable to both paired and unpaired settings. Our method achieves visually
appealing results comparable to existing conditional diffusion models, while re-
ducing the number of inference steps to 1. More importantly, our method can be
trained without image pairs. Our key idea is to efficiently adapt a pre-trained
text-conditional one-step diffusion model, such as SD-Turbo [54], to new domains
and tasks via adversarial learning objectives.
Unfortunately, directly applying standard diffusion adapters like Control-
Net [73] to the one-step setting proved less effective in our experiments. Unlike

[Figure 1 examples: Day↔Night, Clear↔Rainy, and Canny/Sketch inputs with prompts such as “... colorful”, “... transparent”, “… photo, garden, cloudy”, “… city, cloudy”.]

Fig. 1: We present a general method for adapting a single-step diffusion model, such
as SD-Turbo [54], to new tasks and domains through adversarial learning. This enables
us to leverage the internal knowledge of pre-trained diffusion models while achieving
efficient inference (e.g., 0.3 seconds for a 512×512 image). Our single-step image-to-image
translation models, called CycleGAN-Turbo and pix2pix-Turbo, can synthesize realistic
outputs for unpaired (top) and paired settings (bottom), respectively, on various tasks.

traditional diffusion models, we observe that the noise map directly influences the
output structure in the one-step model. Consequently, feeding both noise maps
and input conditioning through additional adapter branches results in conflicting
information for the network. Especially for unpaired cases, this strategy leads to
the original network being disregarded by the end of training. Moreover, many
visual details in the input image are lost during image-to-image translation, due
to imperfect reconstruction by the multi-stage pipeline (Encoder-UNet-Decoder)
of the SD-Turbo model. This loss of detail is particularly noticeable and crucial
when the input is a real image, such as in day-to-night translation.
To tackle these challenges, we propose a new generator architecture that
leverages SD-Turbo weights while preserving the input image structure. First,
we feed the conditioning information directly to the noise encoder branch of
the UNet. This enables the network to adapt to new controls directly, avoiding
conflicts between the noise map and the input control. Second, we consolidate
the three separate modules, Encoder, UNet, and Decoder, into a single end-to-
end trainable architecture. For this, we employ LoRA [17] to adapt the original
network to new controls and domains, reducing overfitting and fine-tuning time.
Finally, to preserve the high-frequency details of the input, we incorporate skip
connections between the encoder and decoder via zero-conv [73]. Our architec-
ture is versatile, serving as a plug-and-play model for conditional GAN learning
objectives such as CycleGAN and pix2pix [19, 77]. To our knowledge, our work
is the first to achieve one-step image translation with a text-to-image model.

We primarily focus on the harder unpaired translation tasks, such as converting from day to night and vice versa and adding/removing weather effects to/from images. We show that our model CycleGAN-Turbo significantly outperforms both existing GAN-based and diffusion-based methods in terms of
distribution matching and input structure preservation, while achieving greater
efficiency than diffusion-based methods. We include an extensive ablation study
regarding each design choice of our method.
To demonstrate the versatility of our architecture, we also perform exper-
iments for paired settings, such as Edge2Image or Sketch2Photo. Our model
called pix2pix-Turbo achieves visually comparable results with recent condi-
tional diffusion models, while reducing the number of inference steps to 1. We
can generate diverse outputs by interpolating between noise maps used in pre-
trained model and our model’s encoder outputs. In summary, our work suggests
that one-step pre-trained text-to-image models can serve as a strong and versa-
tile backbone for many downstream image synthesis tasks.

2 Related Work
Image-to-Image translation. Recent advances in generative models have en-
abled many image-to-image translation applications. Paired image translation
methods [19, 41, 51, 65, 75, 79] map an image from a source domain to a target
domain, using a combination of reconstruction [20,74] and adversarial losses [13].
More recently, various conditional diffusion models have emerged, integrating
text and spatial conditions for image translation tasks [2, 5, 28, 38, 48, 64, 73].
These methods often build upon pre-trained text-to-image models. For instance,
works like GLIGEN [28], T2I-Adapter [38], and ControlNet [73] introduce effec-
tive fine-tuning techniques using adapters such as gated transformer layers or
zero-convolution layers. However, the model training still requires a large num-
ber of training pairs. In contrast, our approach can leverage large-scale diffusion
models without image pairs, with significantly faster inference speed.
In many cases where paired input and output images are unavailable, several
techniques have been proposed, including cycle consistency [24, 70, 77], shared
intermediate latent space [18, 27, 29], content preservation loss [56, 60], and con-
trastive learning [14, 40]. Recent works [52, 59, 67] have also explored diffusion
models for unpaired translation tasks. However, these GAN-based or diffusion-
based methods typically require training from scratch on new domains. Instead,
we introduce the first unpaired learning method leveraging pre-trained diffusion
models, demonstrating better results than existing methods.
Text-to-Image models. Large-scale text-conditioned models [3,11,21,39,46,49]
have significantly improved image quality and diversity through training on
internet-scale datasets [6, 55]. Several works [15, 35, 37, 42, 62] have proposed
zero-shot methods for editing real images with pre-trained text-to-image mod-
els. For example, SDEdit [35] edits real images by adding noise to the in-
put image and subsequently denoises with a pre-trained model according to
the text prompt. Prompt-to-Prompt works further manipulate or preserve fea-
tures in cross-attention and self-attention layers during the image editing pro-
cess [8, 9, 12, 15, 42, 44, 62]. Others fine-tune the networks or text embeddings

[Figure 2 diagram: input x → output y with the prompt “driving in the night”, LoRA adapters in each module, and first-stage skip connections through zero-convs.]

Fig. 2: Our generator architecture. We tightly integrate three separate modules in the original latent diffusion models into a single end-to-end network with small trainable
weights. This architecture allows us to translate the input image x to the output y,
while retaining the input scene structure. We use LoRA adapters [17] in each module,
introduce skip connections and Zero-Convs [73] between input and output, and retrain
the first layer of the U-Net. Blue boxes indicate trainable layers. Semi-transparent
layers are frozen. The same generator can be used for various GAN objectives.

for the input image before image editing [23, 37] or employ more precise inver-
sion methods [57,63]. Despite their impressive results, they frequently encounter
difficulties in complex scenes with many objects. Our work can be viewed as aug-
menting these methods with paired or unpaired data from new domains/tasks.
One-step generative models. To expedite diffusion model inference, recent
works focus on reducing the number of sampling steps using fast ODE solvers [22,
32], or distilling slow multistep teacher models into fast few-step student mod-
els [36, 50]. Regressing directly from noise to images often produces blurry re-
sults [33, 76]. To address this, various distillation methods use consistency model train-
ing [34, 58], adversarial learning [54, 69], variational score distillation [66, 71],
Rectified Flow [30, 31], and their combinations [54]. Other methods directly use
GANs for text-to-image synthesis [21, 53]. Different from these works that focus
on one-step text-to-image synthesis, we present the first one-step conditional
model that uses both text and conditioning images. Our method beats the base-
line that directly uses the original ControlNet with one-step distilled models.
3 Method
We start with a one-step pre-trained text-to-image model capable of generating
realistic images. However, our goal is to translate an input real image from
a source domain to a target domain, such as converting a day driving image
to night. In Section 3.1, we explore different conditioning methods for adding
structure to our model and the corresponding challenges. Next, in Section 3.2,
we investigate the common issue of detail loss (e.g., text, hands, street signs)
that plagues latent-space models [47] and propose a solution to address it. We
then discuss our unpaired image translation method in Section 3.3, with further
extensions to paired settings and stochastic generation (Section 3.4).

3.1 Adding Conditioning Input


To convert a text-to-image model into an image translation model, we first need
to find an effective way to incorporate the input image x into the model.

[Figure 3 diagram: (left) input noise → output image; (right) input noise plus a Condition Encoder taking a condition image, with feature maps (a), (b), and (c).]

Fig. 3: (Left) The one-step model learns to map the input noise to the output image.
Note that the features of SD2.1-Turbo form a coherent layout (a) from the noise map.
(Right) Unfortunately, adding condition encoder branches [38,73] causes conflicts, since
features (b) from the new branch represent a different layout compared to the original
feature (a). This conflict deteriorates the downstream feature (c) in the SD-Turbo
Decoder, affecting the output quality. The feature maps are visualized with PCA.

Conflicts between noise and conditional input. One common strategy for incorporating conditional input into diffusion models is introducing extra adapter branches [38, 73], as shown in Figure 3. Concretely, we initialize a second encoder, labeled as the Condition Encoder, either with the weights of the Stable Diffusion Encoder [73] or using a lightweight network with randomly initialized weights [38]. This Condition Encoder takes the input image x, and outputs
feature maps at multiple resolutions to the pre-trained Stable Diffusion model
through residual connections. This method has yielded remarkable outcomes for
controlling diffusion models. Nonetheless, as illustrated in Figure 3, using two
encoders (U-Net Encoder and Condition Encoder) to process a noise map and
an input image presents challenges in the context of one-step models. Unlike
multi-step diffusion models, the noise map in the one-step model directly con-
trols the layout and pose of generated images, often contradicting the structure
of the input image. Hence, the decoder receives two sets of residual features, each
representing distinct structures, making the training process more challenging.
Direct conditioning input. Figure 3 also illustrates that the structure of the
generated image by the pre-trained model is significantly influenced by the noise
map z. Based on this insight, we propose that the conditioning input should be
fed to the network directly. Figure 7 and Table 4 additionally show that using
direct conditioning achieves better results than using an additional encoder. To
allow the backbone model to adapt to new conditioning, we add several LoRA
weights [17] to various layers in the U-Net (see Figure 2).
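To make this concrete, the following is a minimal, self-contained sketch of a LoRA-augmented linear layer of the kind added to the U-Net; the rank, scaling, and the wrapped layer are illustrative assumptions rather than the exact configuration used here.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a small trainable low-rank update (LoRA [17])."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)      # keep pre-trained weights frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Low-rank factors; the up-projection starts at zero, so the adapted
        # layer initially behaves exactly like the pre-trained one.
        self.lora_down = nn.Linear(base.in_features, rank, bias=False)
        self.lora_up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.normal_(self.lora_down.weight, std=1.0 / rank)
        nn.init.zeros_(self.lora_up.weight)
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_up(self.lora_down(x))

# Example: wrap one (hypothetical) projection layer of a U-Net attention block.
proj = nn.Linear(320, 320)                 # stands in for a pre-trained projection
adapted = LoRALinear(proj, rank=8)
out = adapted(torch.randn(1, 77, 320))
print(out.shape)                           # torch.Size([1, 77, 320])
```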

3.2 Preserving Input Details


A key challenge that prevents the use of latent diffusion models (LDM [47]) in
multi-object and complex scenes is the lack of detail preservation.
Why details are lost. The image encoder of Latent Diffusion Models (LDMs)
compresses input images spatially by a factor of 8 while increasing the channel
count from 3 to 4. This design speeds up the training and inference of diffu-
sion models. However, it may not be ideal for image translation tasks, which

[Figure 4 panels: Input, Without Skip Connections, With Skip Connections (Ours).]

Fig. 4: Skip Connections help retain details. We visualize the outputs of our
day-to-night models trained with and without skip connections. It is clearly seen that
adding skip connections preserves the details of the input daytime image. The zoomed
in crops of the night images are gamma-adjusted by 1.5 for easier visualization.

require preserving fine details of the input image. We illustrate this issue in Fig-
ure 4, where we take an input daytime driving image (left) and translate it to a
corresponding nighttime driving image with an architecture that does not use skip con-
nections (middle). Observe that fine-grained details, such as text, street signs,
and cars in the distance, are not preserved. In contrast, employing an architec-
ture that incorporates skip connections (right) results in a translated image that
significantly better retains these intricate details.
Connecting first stage encoder and decoder. To capture fine-grained vi-
sual details of the input image, we add skip connections between the Encoder
and Decoder networks (see Figure 2). Specifically, we extract four intermediate
activations following each downsampling block within the encoder, process them
via a 1×1 zero-convolution layer [73], and then feed them into the corresponding
upsampling block in the decoder. This method ensures the retention of intricate
details throughout the image translation process.
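The sketch below illustrates this wiring with a toy encoder/decoder standing in for the SD-Turbo first stage; the channel counts and block structure are placeholder assumptions, but the pattern (save an activation after each downsampling block, pass it through a zero-initialized 1×1 convolution, and add it before the matching upsampling block) follows the description above.

```python
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution initialized to zero, so each skip path starts as a no-op [73]."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class SkipWrapper(nn.Module):
    """Toy first-stage encoder/decoder with zero-conv skip connections.

    The real model reuses the SD-Turbo VAE blocks; channel sizes here are placeholders.
    """

    def __init__(self, channels=(64, 128, 256, 512)):
        super().__init__()
        chans = (3,) + channels
        self.down = nn.ModuleList(
            nn.Conv2d(chans[i], chans[i + 1], 3, stride=2, padding=1) for i in range(4)
        )
        self.up = nn.ModuleList(
            nn.ConvTranspose2d(chans[i + 1], chans[i], 4, stride=2, padding=1)
            for i in reversed(range(4))
        )
        self.skips = nn.ModuleList(zero_conv(c) for c in channels)

    def forward(self, x):
        feats = []
        for block in self.down:
            x = torch.relu(block(x))
            feats.append(x)                          # activation after each downsampling block
        # ... the U-Net would operate on the latent here ...
        for i, block in enumerate(self.up):
            skip = self.skips[3 - i](feats[3 - i])   # zero-conv'ed encoder activation
            x = torch.relu(block(x + skip))          # inject before the matching upsampling block
        return x

y = SkipWrapper()(torch.randn(1, 3, 64, 64))
print(y.shape)  # torch.Size([1, 3, 64, 64])
```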

3.3 Unpaired Training

We use Stable Diffusion Turbo (v2.1) with one-step inference as the base network
for all of our experiments. Here we show that our generator can be used in a
modified CycleGAN formulation [77] for unpaired translation. Concretely, we
aim to convert images from a source domain X ⊂ R^{H×W×3} to some desired target domain Y ⊂ R^{H×W×3}, given an unpaired dataset X = {x ∈ X}, Y = {y ∈ Y}. Our method includes two translation functions G(x, cY): X → Y and G(y, cX): Y → X. Both translations use the same network G as described in Section 3.1
and Section 3.2, but different captions cX and cY that correspond to the task.
For example, in the day → night translation task, cX is “Driving in the day” and cY is “Driving in the night”. As depicted in Figure 2, we keep most layers
frozen and only train the first convolutional layer and the added LoRA adapters.
Cycle consistency with perceptual loss. The cycle consistency loss Lcycle
enforces that for each source image x, the two translation functions should bring
it back to itself. We denote by Lrec a combination of the L1 difference and LPIPS [74]. Please refer to Appendix D for the weighting.

\mathcal{L}_{\text{cycle}} = \mathbb{E}_{x}\left[\mathcal{L}_{\text{rec}}\big(G(G(x, c_Y), c_X),\, x\big)\right] + \mathbb{E}_{y}\left[\mathcal{L}_{\text{rec}}\big(G(G(y, c_X), c_Y),\, y\big)\right]    (1)
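A minimal sketch of this loss using the off-the-shelf lpips package; the L1/LPIPS weighting is a placeholder (the actual weights are in Appendix D), and G, cX, cY refer to the translation network and captions defined above.

```python
import torch
import lpips  # pip install lpips; LPIPS perceptual metric [74]

lpips_fn = lpips.LPIPS(net="vgg")   # expects images scaled to [-1, 1]

def rec_loss(pred, target, w_l1=1.0, w_lpips=1.0):
    # L_rec: weighted sum of an L1 term and a perceptual LPIPS term
    # (the actual weights are hyperparameters; see Appendix D).
    return w_l1 * (pred - target).abs().mean() + w_lpips * lpips_fn(pred, target).mean()

def cycle_loss(G, x, y, c_X, c_Y):
    # Eq. (1): translate to the other domain and back, then compare with the input.
    loss_x = rec_loss(G(G(x, c_Y), c_X), x)
    loss_y = rec_loss(G(G(y, c_X), c_Y), y)
    return loss_x + loss_y
```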

Adversarial loss. We use an adversarial loss [13] for both domains to encourage
the translated outputs to match the corresponding target domains. We use two
adversarial discriminators, DX and DY , that aim to classify real images from
the translated images for the corresponding domains. Both discriminators use
the CLIP model as a backbone, following the recommendations of Vision-Aided
GAN [26]. The adversarial loss can be defined as:

\mathcal{L}_{\text{GAN}} = \mathbb{E}_{y}\left[\log D_Y(y)\right] + \mathbb{E}_{x}\left[\log\big(1 - D_Y(G(x, c_Y))\big)\right] + \mathbb{E}_{x}\left[\log D_X(x)\right] + \mathbb{E}_{y}\left[\log\big(1 - D_X(G(y, c_X))\big)\right]    (2)
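The sketch below writes Equation 2 in its binary cross-entropy form; DX and DY are assumed to return real/fake logits (the paper builds them on a CLIP backbone following Vision-Aided GAN [26], which is not shown here).

```python
import torch
import torch.nn.functional as F

def d_loss(D, real, fake):
    # Discriminator side of Eq. (2): real images should be classified as 1,
    # translated (fake) images as 0.
    real_logits = D(real)
    fake_logits = D(fake.detach())
    return (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
            + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))

def g_adv_loss(D_X, D_Y, G, x, y, c_X, c_Y):
    # Generator side: translated outputs should fool the discriminator of their target domain.
    fake_y_logits = D_Y(G(x, c_Y))
    fake_x_logits = D_X(G(y, c_X))
    return (F.binary_cross_entropy_with_logits(fake_y_logits, torch.ones_like(fake_y_logits))
            + F.binary_cross_entropy_with_logits(fake_x_logits, torch.ones_like(fake_x_logits)))
```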

Full objective. The complete training objective comprises three losses: the cycle consistency loss Lcycle, the adversarial loss LGAN, and the identity regularization loss Lidt = Ey[Lrec(G(y, cY), y)] + Ex[Lrec(G(x, cX), x)]. The losses are weighted by λidt and λGAN as follows:

\arg\min_{G}\; \mathcal{L}_{\text{cycle}} + \lambda_{\text{idt}}\,\mathcal{L}_{\text{idt}} + \lambda_{\text{GAN}}\,\mathcal{L}_{\text{GAN}}    (3)
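Reusing the rec_loss, cycle_loss, and g_adv_loss sketches above, a single generator update for Equation 3 could look like the following; the loss weights are placeholders (see Appendix D for the actual values).

```python
def generator_step(G, D_X, D_Y, opt_G, x, y, c_X, c_Y, lam_idt=1.0, lam_gan=0.5):
    # One generator update for Eq. (3); lam_idt and lam_gan are placeholder weights.
    loss_cycle = cycle_loss(G, x, y, c_X, c_Y)
    loss_idt = rec_loss(G(y, c_Y), y) + rec_loss(G(x, c_X), x)   # identity regularization
    loss_gan = g_adv_loss(D_X, D_Y, G, x, y, c_X, c_Y)           # adversarial term
    loss = loss_cycle + lam_idt * loss_idt + lam_gan * loss_gan

    opt_G.zero_grad()
    loss.backward()
    opt_G.step()
    return loss.item()
```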

3.4 Extensions
While our primary focus is on unpaired learning, we also demonstrate two ex-
tensions to learn other types of GAN objectives, such as learning from paired
data and generating stochastic outputs.
Paired training. We adapt our translation network G to paired settings, such
as converting edges or sketches to images. We refer to the paired version of
our method as pix2pix-Turbo. In the paired setting, we aim to learn a single
translation function G(x, c): X → Y , where X is the source domain (e.g., input
sketch), Y is the target domain (e.g., output image), and c is the input caption.
For the paired training objective, we use (1) a reconstruction loss combining a perceptual loss and a pixel-space reconstruction loss, (2) a GAN loss, similar to the loss in Equation 2 but only for the target domain, and (3) a CLIP text-image alignment loss LCLIP [45]. Please find more details in Appendix D.
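A schematic of this paired objective is sketched below; clip_image_features and clip_text_features stand in for frozen CLIP encoders (hypothetical helpers), rec_loss is the reconstruction sketch from Section 3.3, and the loss weights are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def paired_loss(G, D_Y, x, y, c, clip_image_features, clip_text_features,
                w_rec=1.0, w_gan=0.5, w_clip=5.0):
    """Sketch of the pix2pix-Turbo objective: reconstruction + GAN + CLIP alignment."""
    fake_y = G(x, c)

    # (1) Reconstruction against the ground-truth paired output.
    loss_rec = rec_loss(fake_y, y)

    # (2) GAN loss on the target domain only (cf. Eq. (2)).
    logits = D_Y(fake_y)
    loss_gan = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))

    # (3) CLIP text-image alignment: encourage the output to match the caption c.
    img_f = F.normalize(clip_image_features(fake_y), dim=-1)
    txt_f = F.normalize(clip_text_features(c), dim=-1)
    loss_clip = 1.0 - (img_f * txt_f).sum(dim=-1).mean()

    return w_rec * loss_rec + w_gan * loss_gan + w_clip * loss_clip
```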
Generating diverse outputs. Generating stochastic outputs is important in many image translation tasks, e.g., sketch-to-image generation. However, enabling a one-step model to generate diverse outputs is challenging, as it needs to make use of additional input noise, which often gets ignored [18, 78]. We propose generating diverse outputs by interpolating the features and model weights toward the pretrained model, which already produces diverse outputs. Concretely, given an interpolation coefficient γ, we make the following three changes. First, we combine the Gaussian noise and the encoder output. Our generator G(x, z, γ) now takes three inputs: the input image x, a noise map z, and the coefficient

[Figure 5 grid: columns Input, CUT (GAN based), Instruct-pix2pix (diffusion based), CycleGAN-Turbo (ours); rows Horse to Zebra, Zebra to Horse, Summer to Winter, Winter to Summer.]
Fig. 5: Comparison to baselines on 256 × 256 datasets. We compare our un-
paired method to CUT [40] and Instruct-pix2pix [5], the best-performing GAN-based
and diffusion-based methods, respectively. CUT outputs images that often contain severe image artifacts, whereas Instruct-pix2pix fails to preserve the input image structure.

γ. The updated function G(x, z, γ) first combines the noise z and the encoder
output: γ Genc (x) + (1 − γ) z. We then feed the combined signal to the U-Net.
Second, we also scale the LoRA adapter weights and outputs of the skip
connections according to θ = θ0 + γ · ∆θ, where θ0 and ∆θ denote the original
weights and newly added weights, respectively.
Finally, we scale the reconstruction loss according to the coefficient γ.

\mathcal{L}_{\text{diverse}} = \mathbb{E}_{x, y, z, \gamma}\left[\, \gamma\, \mathcal{L}_{\text{rec}}\big(G(x, z, \gamma),\, y\big)\right]    (4)

Notably, γ = 0 corresponds to the default stochastic behavior of the pre-trained model, in which case the reconstruction loss is not enforced. γ = 1 corresponds to the deterministic translation described in Sections 3.3 and 3.4. We
finetune our image translation models with varying interpolation coefficients.
Figure 9 shows that such a finetuning enables our model to generate diverse
outputs by sampling different noises during inference time.
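The three changes can be summarized in the following sketch; G_enc and unet_decode are schematic stand-ins for the model's encoder and the rest of the network (with the γ-scaled LoRA and skip weights assumed to be handled inside them), and rec_loss is the reconstruction sketch from Section 3.3.

```python
import torch

def diverse_forward(G_enc, unet_decode, x, z, gamma):
    # (1) Blend the encoder output with Gaussian noise in latent space:
    # gamma = 1 uses only the encoder output (deterministic translation),
    # gamma = 0 uses only noise (the pre-trained stochastic behavior).
    latent = gamma * G_enc(x) + (1.0 - gamma) * z
    # (2) LoRA adapters and skip outputs are assumed to be scaled inside
    # unet_decode as theta = theta_0 + gamma * delta_theta.
    return unet_decode(latent)

def diverse_loss(G_enc, unet_decode, x, y, gamma):
    # (3) Eq. (4): the reconstruction loss is scaled by gamma, so no
    # reconstruction is enforced when gamma = 0.
    z = torch.randn_like(G_enc(x))
    pred = diverse_forward(G_enc, unet_decode, x, z, gamma)
    return gamma * rec_loss(pred, y)
```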

4 Experiments

We conduct extensive experiments on several image translation tasks, organized into three main categories. First, we compare our method to several prior
GAN-based and diffusion model image translation methods, demonstrating bet-
ter quantitative and qualitative results. Second, we analyze the effectiveness of
every component of our unpaired method, CycleGAN-Turbo, by incorporating
them one at a time in Section 4.2. Finally, we show how our method works
on paired settings and generates diverse outputs in Section 4.3. Please find the
code, models, and interactive demos on our GitHub page https://github.com/GaParmar/img2img-turbo.

[Figure 6 grid: columns Input, CycleGAN (GAN based), Instruct-pix2pix (diffusion based), CycleGAN-Turbo (ours); rows Day to Night, Night to Day, Foggy to Clear.]

Fig. 6: Comparison to baselines on driving datasets (512 × 512). We compare our unpaired translation method to CycleGAN [77] and Instruct-pix2pix [5], the best
performing GAN-based and diffusion methods for this dataset. CycleGAN does not
use existing text-to-image models and, as a result, generates artifacts in the outputs,
e.g., the sky regions in the day-to-night translation. In contrast, Instruct-pix2pix uses
a large text-to-image model but does not use the unpaired dataset. As a result, the Instruct-pix2pix outputs look unnatural and vastly different from the images in our datasets.

Training details. For the unpaired models on the driving datasets, the trainable parameters total 330 MB, including the LoRA weights, the zero-conv layers, and the first conv layer of the U-Net. Please find the hyperparameters and architecture
details in Appendix D.
Datasets. We conduct unpaired translation experiments on two commonly used
datasets (Horse ↔ Zebra and Yosemite Summer ↔ Winter), and two higher
resolution driving datasets (day ↔ night and clear ↔ foggy from BDD100k [72]
and DENSE [4]). For the first two datasets, we follow CycleGAN [77] and load
286×286 images and use random 256×256 crops when training. During inference,
we directly apply translation at 256 × 256. For driving datasets, we resize all
images to 512 × 512 during both training and inference. For evaluation, we use
the corresponding validation sets.
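For reference, this preprocessing can be written with standard torchvision transforms; the horizontal flip and the [-1, 1] normalization are common-practice assumptions rather than details stated above.

```python
from torchvision import transforms

# Training-time preprocessing for the 256x256 datasets (Horse<->Zebra, Summer<->Winter):
# load at 286x286, then take random 256x256 crops.
train_transform = transforms.Compose([
    transforms.Resize((286, 286)),
    transforms.RandomCrop(256),
    transforms.RandomHorizontalFlip(),                     # common augmentation (assumption)
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),   # map images to [-1, 1]
])

# Driving datasets are resized to 512x512 for both training and inference.
driving_transform = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),
])
```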

Table 1: Evaluation on standard CycleGAN datasets (256 × 256). Comparison to prior GAN-based and diffusion-based methods on standard CycleGAN datasets
using FID to measure image quality and distribution alignment and DINO-Struct. to
measure structure preservation. Our method achieves the lowest DINO-Struct. across
all tasks and the lowest FID on all tasks except Horse → Zebra, while being magnitudes
faster than diffusion-based models. Cycle-Diffusion obtains a slightly better FID but
at the cost of large increase in DINO Struct., resulting in poor translation overall.
(Each task column reports FID ↓ / DINO Struct. ↓.)

Method               | Inference time | Horse → Zebra | Zebra → Horse | Summer → Winter | Winter → Summer
CycleGAN [77]        | 0.01s          | 74.9 / 3.2    | 133.8 / 2.6   | 62.9 / 2.6      | 66.1 / 2.3
CUT [40]             | 0.01s          | 43.9 / 6.6    | 186.7 / 2.5   | 72.1 / 2.1      | 68.5 / 2.1
SDEdit [35]          | 1.56s          | 77.2 / 4.0    | 198.5 / 4.6   | 66.1 / 2.1      | 76.9 / 2.1
Plug&Play [62]       | 7.57s          | 57.3 / 5.2    | 152.4 / 3.8   | 67.3 / 2.8      | 73.3 / 2.6
Pix2Pix-Zero [42]    | 14.75s         | 81.5 / 8.0    | 147.4 / 7.8   | 68.0 / 3.0      | 93.4 / 4.3
Cycle-Diffusion [67] | 3.72s          | 38.6 / 6.0    | 132.5 / 5.8   | 64.1 / 3.6      | 70.3 / 3.6
DDIB [59]            | 4.37s          | 44.4 / 13.1   | 163.3 / 11.1  | 90.8 / 7.2      | 88.9 / 6.8
InstructPix2Pix [5]  | 3.86s          | 51.0 / 6.8    | 141.5 / 7.0   | 68.3 / 3.7      | 85.6 / 4.4
CycleGAN-Turbo       | 0.13s          | 41.0 / 2.1    | 127.5 / 1.8   | 56.3 / 0.6      | 60.7 / 0.6

Evaluation Protocol. An effective image translation method must satisfy two key criteria: (1) matching the data distribution of the target domain and (2)
preserving the structure of the input image in the translated output. We eval-
uate the distribution matching using FID [16], following the clean-FID’s imple-
mentation [43]. We assess adherence to the second criterion with DINO-Struct-
Dist [61], which measures the structure similarity of two images in feature space.
We report all DINO Structure scores multiplied by 100. A lower FID score in-
dicates a closer match to the reference target distribution and greater realism,
while a lower DINO-Struct-Dist suggests a more accurate preservation of the
input structure in the translated image. A low FID score with a high DINO-
Struct-Dist indicates that a method is not able to adhere to the input structure.
A low DINO-Struct-Dist but a high FID suggests that a method barely alters the
input image. It is crucial to consider both of these scores together. Additionally,
we compare the inference runtime of all methods in Tables 1 and 2 on an Nvidia
RTX A6000 GPU and include a human perceptual study.
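FID numbers of the kind reported in Tables 1 and 2 can be computed with the clean-fid package [43], as in the sketch below; the directory paths are placeholders, and the DINO structure distance (which requires extracting ViT features) is omitted.

```python
from cleanfid import fid  # pip install clean-fid [43]

# Compare translated outputs against the real target-domain validation images.
# The directory paths are placeholders for wherever the images are stored.
score = fid.compute_fid("outputs/day2night", "data/val_night")
print(f"FID: {score:.1f}")  # lower is better: closer match to the target distribution
```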
4.1 Comparison to Unpaired Methods
We compare CycleGAN-Turbo to prior GAN-based unpaired image translation
methods, zero-shot image editing methods, and diffusion models trained for im-
age editing using their publicly available code. Qualitatively, Figures 5 and 6
reveal that existing methods, both GAN-based and diffusion-based, struggle to
achieve the right balance between output realism and structural preservation.
Comparison to GAN-based methods. We compare our method to two unpaired GAN models, CycleGAN [77] and CUT [40]. We train these baseline mod-
els with default hyperparameters on all datasets for 100,000 steps and choose
the best checkpoint. Tables 1 and 2 show quantitative comparisons on eight
unpaired translation tasks. CycleGAN and CUT demonstrate effective perfor-
mance, achieving low FID and DINO-Structure scores on simpler, object-centric

Table 2: Comparison on 512 × 512 driving datasets. Our method outperforms all GAN-based and diffusion-based baselines on all driving datasets. InstructPix2pix
gets a slightly lower DINO-Struct for Day → Night, but a much higher FID, thus not
matching the target distribution well. Plug&Play has similar results for Night → Day.

(Each task column reports FID ↓ / DINO Struct. ↓.)

Method               | Inference time | Day → Night | Night → Day | Clear → Foggy | Foggy → Clear
CycleGAN [77]        | 0.02s          | 36.3 / 3.6  | 92.3 / 4.9  | 153.3 / 3.6   | 177.3 / 3.9
CUT [40]             | 0.03s          | 40.7 / 3.5  | 98.5 / 3.8  | 152.6 / 3.4   | 163.9 / 4.8
SDEdit [35]          | 3.10s          | 111.7 / 3.4 | 116.1 / 4.1 | 185.3 / 3.1   | 209.8 / 4.7
Plug&Play [62]       | 19.67s         | 80.8 / 2.9  | 121.3 / 2.8 | 179.6 / 3.6   | 193.5 / 3.5
Pix2Pix-Zero [42]    | 43.28s         | 81.3 / 4.7  | 188.6 / 5.8 | 209.3 / 5.5   | 367.2 / 13.0
Cycle-Diffusion [67] | 11.38s         | 101.1 / 3.1 | 110.7 / 3.7 | 178.1 / 3.6   | 185.8 / 3.1
DDIB [59]            | 11.93s         | 172.6 / 9.1 | 190.5 / 7.8 | 257.0 / 13.0  | 286.0 / 7.2
InstructPix2Pix [5]  | 11.41s         | 80.7 / 2.1  | 89.4 / 6.2  | 170.8 / 7.6   | 233.9 / 4.8
CycleGAN-Turbo       | 0.29s          | 31.3 / 3.0  | 45.2 / 3.8  | 137.0 / 1.4   | 147.7 / 2.4

datasets, such as horse → zebra (Figure 13). Our method slightly outperforms
these in terms of both FID and DINO-structure distance metrics. However, for
more complex scenes, such as night → day, CycleGAN and CUT get significantly
higher FID scores than our method, often hallucinating undesirable artifacts
(Figure 15).
Comparison to diffusion-based editing methods. Next, we compare our
method to several diffusion-based methods in Tables 1 and 2. First, we consider
recent zero-shot image translation methods, including SDEdit [35], Plug-and-
Play [62], pix2pix-zero [42], CycleDiffusion [67], and DDIB [59] that use a pre-
trained text-to-image diffusion model and translate the images through different
text prompts. Note that the original DDIB implementation involves training two
separate domain-specific diffusion models from scratch. To improve its perfor-
mance and have a fair comparison, we replace the domain-specific models with
a pre-trained text-to-image model. We also compare to Instruct-pix2pix [5], a
conditional diffusion model trained for text-based image editing.
As shown in Table 1 and Figure 14, on object-centric datasets such as horse
→ zebra, these methods can generate realistic zebras but struggle to precisely
match the object poses, as indicated by consistently large DINO-structure scores.
On driving datasets, those editing methods perform noticeably worse due to
three reasons: (1) the models struggle to generate complex scenes containing
multiple objects, (2) these methods (except Instruct-pix2pix) need to first invert
the images to a noise map, introducing potential artifacts, and (3) the pre-
trained models cannot synthesize street view images similar to the one captured
by the driving datasets. Table 2 and Figure 16 show that across all four driving
translation tasks, these methods output poor quality images, reflected by a high
FID score, and do not adhere to input image structure, reflected in high DINO-
Structure distance values.
Human Preference Study. Next, we conduct a human preference study on
Amazon Mechanical Turk (AMT) to evaluate the quality of images produced
by the different methods. We use the complete validation set from the relevant

Table 3: Human Preference Evaluation. We conduct a study that asks users to pick images that look more like the target domain. We rate every image in the validation set with 3 different users. Our method is preferred across all datasets, with the exception
of Clear to Foggy.

Method               | Day → Night | Night → Day | Clear → Foggy | Foggy → Clear
CycleGAN [77]        | 45.9%       | 37.4%       | 45.4%         | 26.7%
Ours                 | 54.1%       | 62.6%       | 54.6%         | 73.3%
InstructPix2Pix [5]  | 25.1%       | 29.1%       | 69.4%         | 13.3%
Ours                 | 74.9%       | 70.9%       | 30.6%         | 86.7%

Table 4: Ablation with Horse to Zebra. The values in parentheses reflect the rela-
tive change compared to our final method. First, Conf. A trains the unpaired translation
model with randomly initialized weights and suffers from a large FID increase. Next,
Conf. B, C, and D try different input types and show that direct input achieves the
best performance. Finally, our method adds skip connections to Conf. D and shows
an improvement in structure preservation. Ablation on other tasks is shown in Ap-
pendix A.

(Each task column reports FID ↓ / DINO Struct. ↓; relative changes vs. our final method in parentheses.)

Method  | Input Type   | Skip | Pre-trained | Horse → Zebra               | Zebra → Horse
Conf. A | Direct Input | ✗    | ✗           | 128.6 (+214%) / 5.2 (+148%) | 167.1 (+31%) / 4.6 (+156%)
Conf. B | ControlNet   | ✗    | ✓           | 41.2 (+0%) / 7.3 (+248%)    | 99.4 (-22%) / 8.6 (+378%)
Conf. C | T2I-Adapter  | ✗    | ✓           | 55.4 (+35%) / 4.7 (+124%)   | 135.4 (+6%) / 4.8 (+167%)
Conf. D | Direct Input | ✗    | ✓           | 40.1 (-2%) / 4.4 (+110%)    | 116.2 (-9%) / 3.0 (+67%)
Ours    | Direct Input | ✓    | ✓           | 41.0 / 2.1                  | 127.5 / 1.8

datasets, with each comparison independently evaluated by three unique users. We present the outputs of two models side by side and ask users to choose which one follows the target prompt more accurately, with unlimited time. For instance,
we collect 1,500 comparisons for the Day to Night translation task with 500
validation images. The prompt presented to the users is: “Which image looks
more like a real picture of a driving scene taken in the night?”
Table 3 compares our method to CycleGAN [77], the best performing GAN-
based method, and Instruct-Pix2Pix [5], the best performing diffusion-based
method. Our method outperforms the two baselines across all datasets, except
for the Clear to Foggy translation task. In this case, users favor InstructPix2Pix’s
results, as it outputs more artistic fog images. However, InstructPix2Pix fails to
preserve the input structure, as indicated by its high DINO-Struct score (7.6)
compared to ours (1.4). Moreover, its results substantially diverge from the target
fog dataset, reflected by a high FID score (170.8) compared to ours (137.0), as
noted in Table 2.
4.2 Ablation Study
Here, we show the effectiveness of our algorithmic designs through an extensive
ablation study in Table 4 and Figure 7.

[Figure 7 grid: columns Foggy to Clear, Horse to Zebra, Zebra to Horse; rows Input, Config A, Config B, Config C, Config D, Ours.]

Fig. 7: Ablating individual components. Our final formulation achieves the best
content preservation and realism, compared to other design choices described in Table 4.

Using pre-trained weights. First, we assess the impact of using a pre-trained network. In Table 4 Config A, we train an unpaired model on the Horse ↔ Zebra
dataset but with randomly initialized weights rather than pre-trained weights.
Without leveraging the prior from the pre-trained text-to-image model, the out-
put images look unnatural, as shown in Figure 7 Config A. This observation is
corroborated by a large increase in FID across both tasks in Table 4.
Different ways of adding conditioning inputs. Next, we compare three ways
of adding structure input to the model. Config B uses a ControlNet Encoder [73],
Config C uses the T2I-Adapter [38], and finally, Config D directly feeds the input
image to the base network without any additional branches. Config B obtains a
comparable FID to Config D. However, it also has a significantly higher DINO-
Structure distance, indicating that the ControlNet encoder struggles to match
the input’s structure. This is also observed in Figure 7; Config B (third row)
consistently changes the scene structure and hallucinates new objects, such as
partial buildings in the case of driving scenes and unnatural zebra patterns
for the horse-to-zebra translation. Config C uses a lightweight T2I-Adapter to
learn the structure; it achieves worse FID and DINO-Struct scores, producing output images with several artifacts and poor structure preservation.
Skip Connections and trainable encoder and decoder. Finally, we can
see the effects of skip connections by comparing Config D to our final method
CycleGAN-Turbo in Table 4 and Figure 7. Across all tasks, adding skip con-

[Figure 8 columns: Canny input, LCM-ControlNet (1 step), SD-Turbo ControlNet (1 step), Ours (1 step), ControlNet (100 steps).]

Fig. 8: Comparison on paired edge-to-image task (512 × 512). Our method (runtime: 0.29s) achieves higher realism than existing one-step methods and is com-
petitive with the 100-step ControlNet (runtime: 18.85s).

nections and training the encoder and decoder jointly can significantly improve
structure preservation, albeit at the cost of a small increase in FID.
Additional results. Please see Appendices A and C for additional ablation stud-
ies on other datasets, the effect of model training with varying numbers of train-
ing images, and the role of encoder-decoder fine-tuning.

4.3 Extensions
Paired translation. We train Edge2Photo and Sketch2Photo models on a
community-collected dataset of 300K artistic images [1]. We extract Canny
edges [7] and HED contours [68]. As our method and baselines use different
datasets, we show visual comparisons instead of conducting FID evaluation.
More details on training data and preprocessing are included in Appendix D.
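Extracting Canny conditioning maps of this kind is straightforward with OpenCV; the thresholds below are illustrative, and randomizing them during training is an assumption rather than something stated here.

```python
import cv2
import numpy as np

def canny_condition(image_path: str, low: int = 100, high: int = 200) -> np.ndarray:
    """Compute a Canny edge map [7] to use as the conditioning input."""
    img = cv2.imread(image_path)                      # BGR, uint8
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, low, high)                # single-channel edge map in {0, 255}
    return cv2.cvtColor(edges, cv2.COLOR_GRAY2RGB)    # replicate to 3 channels for the model
```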
We compare our paired method pix2pix-Turbo to existing one-step and
multi-step translation methods in Figure 8, including two one-step baselines that
use Latent Consistency Models [34] and SD-Turbo [54] with
a ControlNet adapter. While these approaches can produce results in one step,
their image quality degrades. Next, we compare it to the vanilla ControlNet,
which uses Stable Diffusion with 100 steps. We additionally use classifier-free
guidance and a long descriptive negative prompt for the 100-step ControlNet
baseline. This approach can generate more pleasing outputs compared to the
one-step baselines, as shown in Figure 8. Our method generates compelling out-

[Figure 9 examples: input sketches paired with the prompts “cat selfie in a park, photograph, high quality”, “cat clouds, painting”, “turtle, cyberpunk style, in the city”, and “colorful turtle cartoon, blurry mountain background”.]

Fig. 9: Generating diverse outputs. By varying the input noise map, our method
can generate diverse outputs from the same input conditioning. Moreover, the output
style can be controlled by changing the text conditioning.

puts with only one forward pass, without negative prompting or classifier-free
guidance.
Generating diverse outputs. Finally, in Figure 9, we show that our method
can be used to generate diverse outputs as described in Section 3.4. Given the
same input sketch and user prompt, we can sample different noise maps and
generate diverse multi-modal outputs, such as cats in different styles, variations
in the background, and turtles with different shell patterns.

5 Discussion and Limitations


Our work suggests that one-step pre-trained models can serve as a strong and
versatile backbone model for many downstream image synthesis tasks. Adapting
these models to new tasks and domains can be achieved through various GAN
objectives, without the need for multi-step diffusion training. Our model training
only requires a small number of additional trainable parameters.
Limitations. Although our model can produce visually appealing results with a
single step, it does have limitations. First, we cannot specify the strength of the
guidance, as our backbone model SD-Turbo does not use classifier-free guidance.
Guided distillation [36] could be a promising solution to enable guidance control.
Second, our method does not support negative prompts, a convenient way of reducing artifacts. Third, model training with the cycle-consistency loss and high-capacity generators is memory-intensive. Exploring one-sided methods [40] for
higher-resolution image synthesis is a meaningful next step.
Acknowledgments. We thank Anurag Ghosh, Nupur Kumari, Sheng-Yu Wang,
Muyang Li, Sean Liu, Or Patashnik, George Cazenavette, Phillip Isola, and
Alyosha Efros for fruitful discussions and valuable feedback on our manuscript.
This work was partly supported by GM Research Israel, NSF IIS-2239076, the
Packard Fellowship, and Adobe Research.

References
1. Midjourney v5 dataset. https://huggingface.co/datasets/wanng/midjourney-v5-202304-clean (2023) 14, 28

2. Avrahami, O., Hayes, T., Gafni, O., Gupta, S., Taigman, Y., Parikh, D., Lischinski,
D., Fried, O., Yin, X.: Spatext: Spatio-textual representation for controllable image
generation. In: IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) (2023) 3
3. Balaji, Y., Nah, S., Huang, X., Vahdat, A., Song, J., Kreis, K., Aittala, M., Aila,
T., Laine, S., Catanzaro, B., Karras, T., Liu, M.Y.: ediff-i: Text-to-image diffusion
models with ensemble of expert denoisers. arXiv preprint arXiv:2211.01324 (2022)
3
4. Bijelic, M., Gruber, T., Mannan, F., Kraus, F., Ritter, W., Dietmayer, K., Heide,
F.: Seeing through fog without seeing fog: Deep multimodal sensor fusion in unseen
adverse weather. In: IEEE Conference on Computer Vision and Pattern Recogni-
tion (CVPR) (June 2020) 9, 27
5. Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow image
editing instructions. In: IEEE Conference on Computer Vision and Pattern Recog-
nition (CVPR) (2023) 1, 3, 8, 9, 10, 11, 12
6. Byeon, M., Park, B., Kim, H., Lee, S., Baek, W., Kim, S.: Coyo-700m: Image-text
pair dataset. https://github.com/kakaobrain/coyo-dataset (2022) 3
7. Canny, J.: A computational approach to edge detection. IEEE Transactions on
Pattern Analysis and Machine Intelligence (TPAMI) PAMI-8(6), 679–698 (1986)
14, 28
8. Cao, M., Wang, X., Qi, Z., Shan, Y., Qie, X., Zheng, Y.: Masactrl: Tuning-free
mutual self-attention control for consistent image synthesis and editing. In: IEEE
International Conference on Computer Vision (ICCV) (2023) 3
9. Chefer, H., Alaluf, Y., Vinker, Y., Wolf, L., Cohen-Or, D.: Attend-and-excite:
Attention-based semantic guidance for text-to-image diffusion models. ACM Trans-
actions on Graphics (TOG) 42(4), 1–10 (2023) 3
10. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A Large-
Scale Hierarchical Image Database. In: IEEE Conference on Computer Vision and
Pattern Recognition (CVPR) (2009) 27
11. Gafni, O., Polyak, A., Ashual, O., Sheynin, S., Parikh, D., Taigman, Y.: Make-
a-scene: Scene-based text-to-image generation with human priors. In: European
Conference on Computer Vision (ECCV). pp. 89–106. Springer (2022) 3
12. Ge, S., Park, T., Zhu, J.Y., Huang, J.B.: Expressive text-to-image generation with
rich text. In: IEEE International Conference on Computer Vision (ICCV) (2023)
3
13. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair,
S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Neural Information
Processing Systems (NeurIPS) (2014) 3, 7
14. Han, J., Shoeiby, M., Petersson, L., Armin, M.A.: Dual contrastive learning for un-
supervised image-to-image translation. In: IEEE Conference on Computer Vision
and Pattern Recognition (CVPR). pp. 746–755 (2021) 3
15. Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.:
Prompt-to-prompt image editing with cross attention control. In: International
Conference on Learning Representations (ICLR) (2022) 3
16. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained
by a two time-scale update rule converge to a local nash equilibrium. Conference
on Neural Information Processing Systems (NeurIPS) 30 (2017) 10
17. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.:
Lora: Low-rank adaptation of large language models. In: International Conference
on Learning Representations (ICLR) (2022) 2, 4, 5

18. Huang, X., Liu, M.Y., Belongie, S., Kautz, J.: Multimodal unsupervised image-to-
image translation. In: European Conference on Computer Vision (ECCV) (2018)
3, 7
19. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with condi-
tional adversarial networks. In: IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) (2017) 2, 3
20. Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and
super-resolution. In: European Conference on Computer Vision (ECCV) (2016) 3
21. Kang, M., Zhu, J.Y., Zhang, R., Park, J., Shechtman, E., Paris, S., Park, T.: Scaling
up gans for text-to-image synthesis. In: IEEE Conference on Computer Vision and
Pattern Recognition (CVPR) (2023) 3, 4
22. Karras, T., Aittala, M., Aila, T., Laine, S.: Elucidating the design space of diffusion-
based generative models. In: Conference on Neural Information Processing Systems
(NeurIPS) (2022) 4
23. Kawar, B., Zada, S., Lang, O., Tov, O., Chang, H., Dekel, T., Mosseri, I., Irani, M.:
Imagic: Text-based real image editing with diffusion models. In: IEEE Conference
on Computer Vision and Pattern Recognition (CVPR) (2023) 4
24. Kim, T., Cha, M., Kim, H., Lee, J.K., Kim, J.: Learning to discover cross-domain
relations with generative adversarial networks. In: International Conference on
Machine Learning (ICML) (2017) 3
25. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980 (2014) 27
26. Kumari, N., Zhang, R., Shechtman, E., Zhu, J.Y.: Ensembling off-the-shelf models
for gan training. In: IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) (June 2022) 7
27. Lee, H.Y., Tseng, H.Y., Huang, J.B., Singh, M., Yang, M.H.: Diverse image-to-
image translation via disentangled representations. In: European Conference on
Computer Vision (ECCV) (2018) 3
28. Li, Y., Liu, H., Wu, Q., Mu, F., Yang, J., Gao, J., Li, C., Lee, Y.J.: Gligen: Open-
set grounded text-to-image generation. In: IEEE Conference on Computer Vision
and Pattern Recognition (CVPR) (2023) 3
29. Liu, M.Y., Breuel, T., Kautz, J.: Unsupervised image-to-image translation net-
works. In: Neural Information Processing Systems (NeurIPS) (2017) 3
30. Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer
data with rectified flow. arXiv preprint arXiv:2209.03003 (2022) 4
31. Liu, X., Zhang, X., Ma, J., Peng, J., et al.: Instaflow: One step is enough for high-
quality diffusion-based text-to-image generation. In: International Conference on
Learning Representations (ICLR) (2023) 4
32. Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: Dpm-solver: A fast ode solver
for diffusion probabilistic model sampling in around 10 steps. Conference on Neural
Information Processing Systems (NeurIPS) 35, 5775–5787 (2022) 4
33. Luhman, E., Luhman, T.: Knowledge distillation in iterative generative models for
improved sampling speed. arXiv preprint arXiv:2101.02388 (2021) 4
34. Luo, S., Tan, Y., Huang, L., Li, J., Zhao, H.: Latent consistency mod-
els: Synthesizing high-resolution images with few-step inference. arXiv preprint
arXiv:2310.04378 (2023) 4, 14
35. Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.Y., Ermon, S.: SDEdit: Guided
image synthesis and editing with stochastic differential equations. In: International
Conference on Learning Representations (ICLR) (2022) 3, 10, 11

36. Meng, C., Rombach, R., Gao, R., Kingma, D., Ermon, S., Ho, J., Salimans, T.: On
distillation of guided diffusion models. In: IEEE Conference on Computer Vision
and Pattern Recognition (CVPR) (2023) 4, 15
37. Mokady, R., Hertz, A., Aberman, K., Pritch, Y., Cohen-Or, D.: Null-text inversion
for editing real images using guided diffusion models. In: IEEE Conference on
Computer Vision and Pattern Recognition (CVPR). pp. 6038–6047 (2023) 3, 4
38. Mou, C., Wang, X., Xie, L., Zhang, J., Qi, Z., Shan, Y., Qie, X.: T2i-adapter:
Learning adapters to dig out more controllable ability for text-to-image diffusion
models. arXiv preprint arXiv:2302.08453 (2023) 1, 3, 5, 13, 21
39. Nichol, A.Q., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., Mcgrew, B.,
Sutskever, I., Chen, M.: Glide: Towards photorealistic image generation and edit-
ing with text-guided diffusion models. In: International Conference on Machine
Learning (ICML) (2022) 3
40. Park, T., Efros, A.A., Zhang, R., Zhu, J.Y.: Contrastive learning for unpaired
image-to-image translation. In: European Conference on Computer Vision (ECCV)
(2020) 3, 8, 10, 11, 15
41. Park, T., Liu, M.Y., Wang, T.C., Zhu, J.Y.: Semantic image synthesis with
spatially-adaptive normalization. In: IEEE Conference on Computer Vision and
Pattern Recognition (CVPR) (2019) 3
42. Parmar, G., Kumar Singh, K., Zhang, R., Li, Y., Lu, J., Zhu, J.Y.: Zero-shot
image-to-image translation. In: ACM SIGGRAPH (2023) 3, 10, 11
43. Parmar, G., Zhang, R., Zhu, J.Y.: On aliased resizing and surprising subtleties in
gan evaluation. In: IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) (2022) 10
44. Patashnik, O., Garibi, D., Azuri, I., Averbuch-Elor, H., Cohen-Or, D.: Localizing
object-level shape variations with text-to-image diffusion models. arXiv preprint
arXiv:2303.11306 (2023) 3
45. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G.,
Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from
natural language supervision. In: International Conference on Machine Learning
(ICML) (2021) 7
46. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-
conditional image generation with clip latents. arXiv preprint arXiv:2204.06125
(2022) 3
47. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution
image synthesis with latent diffusion models. In: IEEE Conference on Computer
Vision and Pattern Recognition (CVPR) (2022) 4, 5
48. Saharia, C., Chan, W., Chang, H., Lee, C., Ho, J., Salimans, T., Fleet, D., Norouzi,
M.: Palette: Image-to-image diffusion models. In: ACM SIGGRAPH. pp. 1–10
(2022) 1, 3
49. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour,
K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-
to-image diffusion models with deep language understanding. Conference on Neural
Information Processing Systems (NeurIPS) (2022) 3
50. Salimans, T., Ho, J.: Progressive distillation for fast sampling of diffusion models.
In: International Conference on Learning Representations (ICLR) (2022) 4
51. Sangkloy, P., Lu, J., Fang, C., Yu, F., Hays, J.: Scribbler: Controlling deep image
synthesis with sketch and color. In: IEEE Conference on Computer Vision and
Pattern Recognition (CVPR) (2017) 3

52. Sasaki, H., Willcocks, C.G., Breckon, T.P.: Unit-ddpm: Unpaired image transla-
tion with denoising diffusion probabilistic models. arXiv preprint arXiv:2104.05358
(2021) 3
53. Sauer, A., Karras, T., Laine, S., Geiger, A., Aila, T.: Stylegan-t: Unlocking the
power of gans for fast large-scale text-to-image synthesis. In: International Confer-
ence on Machine Learning (ICML) (2023) 4
54. Sauer, A., Lorenz, D., Blattmann, A., Rombach, R.: Adversarial diffusion distilla-
tion. arXiv preprint arXiv:2311.17042 (2023) 1, 2, 4, 14
55. Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M.,
Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: Laion-5b: An open large-
scale dataset for training next generation image-text models. Conference on Neural
Information Processing Systems (NeurIPS) 35, 25278–25294 (2022) 3
56. Shrivastava, A., Pfister, T., Tuzel, O., Susskind, J., Wang, W., Webb, R.: Learning
from simulated and unsupervised images through adversarial training. In: IEEE
Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 3
57. Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: International
Conference on Learning Representations (ICLR) (2020) 4
58. Song, Y., Dhariwal, P., Chen, M., Sutskever, I.: Consistency models. In: Interna-
tional Conference on Machine Learning (ICML) (2023) 4
59. Su, X., Song, J., Meng, C., Ermon, S.: Dual diffusion implicit bridges for image-
to-image translation. In: International Conference on Learning Representations
(ICLR) (2023) 3, 10, 11
60. Taigman, Y., Polyak, A., Wolf, L.: Unsupervised cross-domain image generation.
In: International Conference on Learning Representations (ICLR) (2017) 3
61. Tumanyan, N., Bar-Tal, O., Bagon, S., Dekel, T.: Splicing vit features for seman-
tic appearance transfer. In: IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) (2022) 10
62. Tumanyan, N., Geyer, M., Bagon, S., Dekel, T.: Plug-and-play diffusion features for
text-driven image-to-image translation. In: IEEE Conference on Computer Vision
and Pattern Recognition (CVPR) (2023) 3, 10, 11
63. Wallace, B., Gokul, A., Naik, N.: Edict: Exact diffusion inversion via coupled trans-
formations. In: IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) (2023) 4
64. Wang, T., Zhang, T., Zhang, B., Ouyang, H., Chen, D., Chen, Q., Wen,
F.: Pretraining is all you need for image-to-image translation. arXiv preprint
arXiv:2205.12952 (2022) 3
65. Wang, T.C., Liu, M.Y., Zhu, J.Y., Tao, A., Kautz, J., Catanzaro, B.: High-
resolution image synthesis and semantic manipulation with conditional gans. In:
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018) 3
66. Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., Zhu, J.: Prolificdreamer:
High-fidelity and diverse text-to-3d generation with variational score distillation.
Conference on Neural Information Processing Systems (NeurIPS) 36 (2024) 4
67. Wu, C.H., la Torre, F.D.: A latent space of stochastic diffusion models for zero-
shot image editing and guidance. In: IEEE International Conference on Computer
Vision (ICCV) (2023) 3, 10, 11
68. Xie, S., Tu, Z.: Holistically-nested edge detection. In: IEEE International Confer-
ence on Computer Vision (ICCV) (2015) 14
69. Xu, Y., Zhao, Y., Xiao, Z., Hou, T.: Ufogen: You forward once large scale text-to-
image generation via diffusion gans. arXiv preprint arXiv:2311.09257 (2023) 4

70. Yi, Z., Zhang, H., Tan, P., Gong, M.: Dualgan: Unsupervised dual learning for
image-to-image translation. In: IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) (2017) 3
71. Yin, T., Gharbi, M., Zhang, R., Shechtman, E., Durand, F., Freeman, W.T., Park,
T.: One-step diffusion with distribution matching distillation. In: IEEE Conference
on Computer Vision and Pattern Recognition (CVPR) (2024) 4
72. Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F., Madhavan, V., Darrell,
T.: Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In:
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020) 9,
27
73. Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image
diffusion models. In: IEEE International Conference on Computer Vision (ICCV)
(2023) 1, 2, 3, 4, 5, 6, 13, 21, 28
74. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effec-
tiveness of deep features as a perceptual metric. In: IEEE Conference on Computer
Vision and Pattern Recognition (CVPR) (2018) 3, 7
75. Zhao, S., Cui, J., Sheng, Y., Dong, Y., Liang, X., Chang, E.I., Xu, Y.: Large scale
image completion via co-modulated generative adversarial networks. In: Interna-
tional Conference on Learning Representations (ICLR) (2021) 3
76. Zheng, H., Nie, W., Vahdat, A., Azizzadenesheli, K., Anandkumar, A.: Fast sam-
pling of diffusion models via operator learning. In: International Conference on
Machine Learning (ICML). PMLR (2023) 4
77. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation
using cycle-consistent adversarial networks. In: IEEE International Conference on
Computer Vision (ICCV) (2017) 1, 2, 3, 6, 9, 10, 11, 12, 27
78. Zhu, J.Y., Zhang, R., Pathak, D., Darrell, T., Efros, A.A., Wang, O., Shechtman,
E.: Toward multimodal image-to-image translation. Conference on Neural Infor-
mation Processing Systems (NeurIPS) 30 (2017) 7
79. Zhu, P., Abdal, R., Qin, Y., Wonka, P.: Sean: Image synthesis with semantic region-
adaptive normalization. In: IEEE Conference on Computer Vision and Pattern
Recognition (CVPR). pp. 5104–5113 (2020) 3
Table 5: Ablation with Day to Night. The values in parentheses reflect the relative change compared to our final method. First, Conf. A trains the unpaired translation model with randomly initialized weights and suffers from a large FID increase. Next, Conf. B, C, and D try different input types and show that direct input achieves the best performance. Finally, our method adds skip connections to Conf. D and shows an improvement in both distribution matching and structure preservation.

                                                  Day → Night                     Night → Day
Method    Input Type     Skip   Pre-trained   FID ↓          DINO Struct. ↓   FID ↓           DINO Struct. ↓
Conf. A   Direct Input   ✗      ✗             86.3 (+176%)   4.4 (+47%)       105.8 (+134%)   5.3 (+39%)
Conf. B   ControlNet     ✗      ✓             35.8 (+14%)    5.4 (+80%)       48.7 (+8%)      5.5 (+45%)
Conf. C   T2I-Adapter    ✗      ✓             34.2 (+9%)     4.2 (+40%)       54.6 (+21%)     6.4 (+68%)
Conf. D   Direct Input   ✗      ✓             33.5 (+7%)     4.0 (+33%)       48.5 (+7%)      4.9 (+29%)
Ours      Direct Input   ✓      ✓             31.3           3.0              45.2            3.8

Appendix
We start with Section A, which provides additional ablation results on more datasets. Section B follows with comparisons to all GAN-based and diffusion-based baselines. Section C presents additional analysis of the Condition Encoder conflict, the effect of varying the dataset size, and the role of encoder-decoder finetuning. Finally, Section D provides the hyperparameters and training details.

A Additional Ablation Study

Table 3 in the main paper shows the results of an ablation study on the Horse to Zebra translation. We show more qualitative ablation results on this dataset in Figure 10. Next, we perform the same ablation on the Day to Night translation, qualitatively in Figures 11 and 12 and quantitatively in Table 5. As in the main paper, we compare to four variants: (1) Config A uses randomly initialized weights rather than pre-trained weights, (2) Config B uses a ControlNet Encoder [73], (3) Config C uses the T2I-Adapter [38], and (4) Config D directly feeds the input image to the base network without skip connections.
Our full method outperforms all other variants in terms of distribution matching (FID) and structure preservation (DINO Structure Distance).
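For reference, the distribution-matching metric can be computed with the clean-fid library as in the short sketch below; this is an illustrative snippet rather than our exact evaluation script, and the folder paths are placeholders. The DINO Structure Distance follows the ViT self-similarity formulation of [61] and is not reproduced here.

    # Minimal FID sketch using the clean-fid library; the folder paths are
    # placeholders for directories of translated outputs and real target images.
    from cleanfid import fid

    score = fid.compute_fid("outputs/day2night", "data/night_val")
    print(f"FID: {score:.1f}")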

B Additional Baseline Comparisons

Figures 5 and 6 in the main paper show a comparison of our method with the
best-performing GAN baseline and the best-performing diffusion-based baseline.
Here, we show additional qualitative comparisons with all GAN-based baselines in Figures 13 and 15, and with all diffusion-based baselines in Figures 14 and 16. Our method consistently produces more realistic outputs while retaining the structure of the input images.
[Figure 10: image grid omitted. Columns: Horse to Zebra, Zebra to Horse. Rows: Input, Config A, Config B, Config C, Config D, Ours.]

Fig. 10: Ablating individual components. Additional ablation results on the Horse ↔ Zebra dataset. Our final method, shown in the bottom row, achieves the best translation results.

[Figure 11: image grid omitted. Rows: Input, Config A, Config B, Config C, Config D, Ours.]

Fig. 11: Ablating individual components. Additional ablation results on the Day → Night translation. Our method, shown in the bottom row, generates the most convincing translations with the best detail preservation. Please zoom in to see the differences.
[Figure 12: image grid omitted. Rows: Input, Config A, Config B, Config C, Config D, Ours.]

Fig. 12: Ablating individual components. Additional ablation results on the Night → Day translation. Our method, shown in the bottom row, generates the most convincing translations with the best detail preservation. Please zoom in to see the differences.

[Figure 13: image grid omitted. Rows: Input, CycleGAN, CUT, Ours.]

Fig. 13: Comparison to GAN-based baselines. Additional comparison to CycleGAN and CUT on the Horse ↔ Zebra translation task.

Table 6: Training with a different number of input images.

                                  Day → Night                Night → Day
# Day Images   # Night Images   FID ↓   DINO Struct. ↓   FID ↓   DINO Struct. ↓
10             10               42.4    3.0              65.6    4.0
100            100              31.8    3.3              47.4    3.8
1,000          1,000            31.2    3.4              47.4    3.8
36,728         27,971           31.3    3.0              45.2    3.8

C Additional Analysis
Conflict with Condition Encoder. Figure 3 in the main paper illustrates the conflicting features that arise when the conditioning image is added through a separate encoder. Here, we show that using such a Condition Encoder causes the features of the original network to be largely ignored. In Figure 17, we show outputs generated with different noise maps but the same condition image. The different noise maps produce perceptually similar output images, indicating that the original SD-Turbo Encoder features have been ignored.
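The diagnostic in Figure 17 amounts to fixing the condition image, resampling the noise map, and checking how much the outputs change. A minimal sketch is given below; translate_with_condition_encoder and load_condition_image are hypothetical placeholders for the Condition-Encoder variant of the model and an image loader, and the 4×64×64 latent shape is an assumption for 512×512 inputs. Perceptual similarity between outputs can be quantified with LPIPS [74].

    # Sketch: same condition image, different noise maps.
    # translate_with_condition_encoder() and load_condition_image() are
    # hypothetical placeholders, not part of the released code.
    import itertools
    import torch
    import lpips

    condition = load_condition_image("example_day.png")
    outputs = []
    for seed in range(4):
        g = torch.Generator().manual_seed(seed)
        noise = torch.randn(1, 4, 64, 64, generator=g)  # assumed latent shape
        outputs.append(translate_with_condition_encoder(condition, noise))

    # Low pairwise LPIPS between outputs indicates the noise maps are ignored.
    loss_fn = lpips.LPIPS(net="alex")
    dists = [loss_fn(a, b).item() for a, b in itertools.combinations(outputs, 2)]
    print(f"mean pairwise LPIPS: {sum(dists) / len(dists):.3f}")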
Varying the Dataset Size. Next, we evaluate the efficacy of our method across datasets of different sizes. We use the Day to Night translation dataset, which comprises 36,728 Day images and 27,971 Night images. To understand the impact of dataset size on performance, we train three additional models on progressively reduced subsets of the original dataset: 1,000 images, 100 images, and finally 10 images. Table 6 shows that reducing the number of training images results in a slight increase in FID, while structure preservation remains largely unchanged across all settings. This suggests that our model can be trained on small datasets.
[Figure 14: image grid omitted. Rows: Input, SDEdit, Plug&Play, pix2pix-zero, CycleDiffusion, DDIB, InstructPix2Pix, Ours.]

Fig. 14: Comparison to Diffusion-based baselines. Additional comparison to diffusion-based baselines on the Horse ↔ Zebra translation task.


Role of Skip Connections. We additionally evaluate the role of skip connections by considering a baseline that finetunes the VAE Encoder and Decoder without adding skip connections. Figure 18 shows that this baseline fails to preserve fine details such as text and street signs.
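To make the comparison in Figure 18 concrete, the sketch below shows one way of wiring skip connections between encoder and decoder feature maps through zero-initialized 1×1 convolutions, so that the pretrained decoder is unchanged at the start of finetuning. The block structure, channel list, and zero-initialization are illustrative assumptions, not the exact layers of the SD VAE.

    # Illustrative encoder-decoder skip connections via zero-initialized 1x1
    # convolutions (placeholder blocks; not the exact SD VAE implementation).
    import torch
    import torch.nn as nn

    class SkipAutoencoder(nn.Module):
        def __init__(self, encoder_blocks, decoder_blocks, skip_channels):
            super().__init__()
            self.encoder_blocks = nn.ModuleList(encoder_blocks)
            self.decoder_blocks = nn.ModuleList(decoder_blocks)
            # Zero-initialized 1x1 convs: the skips contribute nothing at init.
            self.skip_convs = nn.ModuleList(
                nn.Conv2d(c, c, kernel_size=1) for c in skip_channels
            )
            for conv in self.skip_convs:
                nn.init.zeros_(conv.weight)
                nn.init.zeros_(conv.bias)

        def forward(self, x):
            skips = []
            for block in self.encoder_blocks:
                x = block(x)
                skips.append(x)
            # The decoder consumes encoder activations in reverse (deepest first).
            for block, skip, conv in zip(
                self.decoder_blocks, reversed(skips), reversed(self.skip_convs)
            ):
                x = block(x + conv(skip))
            return x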
[Figure 15: image grid omitted. Rows: Input, CycleGAN, CUT, Ours.]

Fig. 15: Comparison to GAN-based baselines. Additional comparison to CycleGAN and CUT on the Day → Night translation task.

D Training Details

Unpaired translation. For all unpaired translation evaluations, we use the four datasets listed below. For the Day and Night datasets, we use 500 images from the corresponding validation sets at test time. The validation set for Foggy images comprises 50 images from the DENSE dataset.

– Horse ↔ Zebra: Following CycleGAN [77], we use the 939 images from the wild horse class and the 1,177 images from the zebra class in ImageNet [10].
– Yosemite Winter ↔ Summer: We use the 854 winter and 1,273 summer photos of Yosemite collected from Flickr by CycleGAN [77].
– Day ↔ Night: We use the Day and Night subsets of the BDD100k dataset [72] for this task.
– Clear ↔ Foggy: We use daytime clear images from BDD100k (12,454 images) and 572 foggy images from the ‘dense-fog’ split of the DENSE dataset [4].

For all unpaired translation experiments, we use the Adam solver [25] with a learning rate of 1e-6, a batch size of 8, λidt = 1, and λGAN = 0.5.
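For concreteness, this configuration corresponds roughly to the generator update sketched below; the generator, discriminator, data loader, and loss helpers are placeholders rather than our released training script, the discriminator update is omitted, and λidt is assumed to weight an identity-style regularization following the CycleGAN convention.

    # Sketch of the unpaired (CycleGAN-style) generator update; all modules and
    # loss helpers below are placeholders, not the released training script.
    import torch

    lambda_idt, lambda_gan = 1.0, 0.5
    opt_gen = torch.optim.Adam(generator.parameters(), lr=1e-6)

    for batch in dataloader:                       # batch size 8
        real_a, real_b = batch["a"], batch["b"]
        fake_b = generator(real_a, direction="a2b")
        rec_a = generator(fake_b, direction="b2a")
        loss_g = (
            cycle_loss(rec_a, real_a)              # cycle consistency
            + lambda_idt * identity_loss(generator(real_b, direction="a2b"), real_b)
            + lambda_gan * gan_loss(disc_b(fake_b), target_is_real=True)
        )
        opt_gen.zero_grad()
        loss_g.backward()
        opt_gen.step()
        # Discriminator updates omitted for brevity.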
Paired translation. The training objective for the paired translation consists of three losses, as described in Section 3.4 of the main paper: a reconstruction loss Lrec (L2 and LPIPS), a GAN loss LGAN, and a CLIP text-image alignment loss LCLIP. The full learning objective is shown below, with λGAN = 0.4 and λCLIP = 4.
\arg\min_{G} \; \mathcal{L}_{\text{rec}} + \lambda_{\text{CLIP}}\,\mathcal{L}_{\text{CLIP}} + \lambda_{\text{GAN}}\,\mathcal{L}_{\text{GAN}}.    (5)
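A minimal sketch of how the three terms in Eq. (5) can be combined is shown below; lpips provides the perceptual part of Lrec, while gan_loss and clip_loss are placeholder callables standing in for the discriminator and CLIP text-image similarity terms, and the VGG backbone for LPIPS is an assumption.

    # Sketch of the paired objective in Eq. (5); gan_loss() and clip_loss() are
    # placeholder callables for the adversarial and CLIP alignment terms.
    import torch.nn.functional as F
    import lpips

    lambda_gan, lambda_clip = 0.4, 4.0
    perceptual = lpips.LPIPS(net="vgg")  # assumed backbone

    def paired_objective(output, target, prompt):
        # Reconstruction: L2 plus LPIPS between the generated and target image.
        l_rec = F.mse_loss(output, target) + perceptual(output, target).mean()
        return (
            l_rec
            + lambda_clip * clip_loss(output, prompt)
            + lambda_gan * gan_loss(output)
        )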
[Figure 16: image grid omitted. Rows: Input, SDEdit, Plug&Play, pix2pix-zero, CycleDiffusion, DDIB, InstructPix2Pix, Ours.]

Fig. 16: Comparison to Diffusion-based baselines. Additional comparison to several diffusion-based baselines on the Day → Night translation task.

We train our paired method pix2pix-Turbo for two tasks: Edge2Image and
Sketch2Image. Both tasks use the same community-collected dataset of artistic
images [1] and follow the pre-processing of ControlNet [73].
– Edge2Image. We use a Canny edge detector [7] with random thresholds at training time (see the preprocessing sketch after this list). We train with the Adam optimizer with a learning rate of 1e-5 for 7,500 steps and a batch size of 40.
– Sketch2Image. We generate synthetic sketches by first applying a HED detector [68] and then data augmentations such as random thresholds, non-maximal suppression, and random morphological transformations. Our Sketch2Image model is initialized with the Edge2Image model and fine-tuned for 5,000 steps with the same learning rate, batch size, and optimizer.
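The edge and sketch preprocessing above can be approximated with the sketch below; the threshold ranges and kernel sizes are illustrative guesses rather than the exact training values, run_hed is a hypothetical stand-in for a HED detector [68], and the non-maximal suppression step is omitted.

    # Preprocessing sketch for Edge2Image and Sketch2Image conditioning inputs.
    # Threshold ranges are illustrative; run_hed() is a hypothetical HED wrapper.
    import random
    import cv2
    import numpy as np

    def random_canny(image_bgr: np.ndarray) -> np.ndarray:
        # Canny edges with low/high thresholds resampled per training example.
        gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
        low = random.randint(50, 150)
        high = random.randint(low + 50, 300)
        return cv2.Canny(gray, low, high)

    def random_sketch(image_bgr: np.ndarray) -> np.ndarray:
        edges = run_hed(image_bgr)                     # hypothetical HED detector
        thresh = random.randint(30, 120)               # random binarization threshold
        _, edges = cv2.threshold(edges, thresh, 255, cv2.THRESH_BINARY)
        # Random morphological transformation (dilate or erode with a small kernel).
        k = random.choice([1, 2, 3])
        kernel = np.ones((k, k), np.uint8)
        op = random.choice([cv2.dilate, cv2.erode])
        return op(edges, kernel, iterations=1)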
[Figure 17: images omitted. Panels: Input; outputs with different random seeds using a Condition Encoder.]

Fig. 17: Different outputs with the same input image and different noise maps. We observe that the noise maps do not alter the image structure, suggesting that the noise maps have been largely ignored.

[Figure 18: images omitted. Columns: Input; Finetune VAE without skip connections; Finetune VAE with skip connections (ours).]

Fig. 18: Finetuning encoder-decoder without skip connections. Here we finetune the Encoder and Decoder of the VAE without adding skip connections (middle column). Without skip connections, the method struggles to retain important details such as the text “ON RED” on the street sign in the top row image, the text on the store sign, and the pedestrian crossing sign in the bottom row image. In contrast, our method, with skip connections, better preserves these details.
