Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing
Bingyan Liu1,2, Chengyu Wang2*, Tingfeng Cao1,2, Kui Jia3*, Jun Huang2
1 South China University of Technology, 2 Alibaba Group, 3 School of Data Science, The Chinese University of Hong Kong, Shenzhen
{eeliubingyan, setingfengcao}@mail.scut.edu.cn, {chengyu.wcy, huangjun.hj}@alibaba-inc.com, [email protected]
… and self-attention of the original image in the attention layers and then injects them into the target image generation process. Among these methods, attention layers play a crucial role in controlling the image layout and the relationship between the generated image and the input prompt. However, inappropriate modifications to attention layers can yield varied editing outcomes and even lead to editing failures. For example, as depicted in Figure 1, editing authentic images on cross-attention layers can result in editing failures: converting a man into a robot or changing the color of a car to red fails. Moreover, some operations in the above-mentioned methods can be revised and optimized.

In our paper, we explore attention map modification to gain comprehensive insights into the underlying mechanisms of TIE using diffusion-based models. Specifically, we focus on the attribution of TIE and ask the fundamental question: how does the modification of attention layers contribute to diffusion-based TIE? To answer this question, we carefully construct new datasets and meticulously investigate the impact of modifying the attention maps on the resulting images. This is accomplished by probe analysis [3, 16] and systematic exploration of attention map modification within different blocks of the diffusion model. We find that (1) editing cross-attention maps in diffusion models is optional for image editing: replacing or refining cross-attention maps between the source and target image generation processes is dispensable and can result in failed image editing. (2) The cross-attention map is not only a weight measure of the conditional prompt at the corresponding positions in the generated image, but also contains the semantic features of the conditional token. Therefore, replacing the target image's cross-attention map with the source image's map may yield unexpected outcomes. (3) Self-attention maps are crucial to the success of the TIE task, as they reflect the associations between image features and retain the spatial information of the image. Based on our findings, we propose a simplified and effective algorithm called Free-Prompt-Editing (FPE). FPE performs image editing by replacing the self-attention map in specific attention layers during denoising, without needing a source prompt, which is beneficial for real image editing scenarios. The contributions of our paper are as follows:
• We conduct a comprehensive analysis of how attention layers impact image editing results in diffusion models and answer why TIE methods based on cross-attention map replacement can lead to unstable results.
• We design experiments to prove that cross-attention maps not only serve as the weight of the corresponding token on the corresponding pixel but also contain the characteristic information of the token. In contrast, self-attention is crucial in ensuring that the edited image retains the original image's layout information and shape details.
• Based on our experimental findings, we simplify currently popular tuning-free image editing methods and propose FPE, making the image editing process simpler and more effective. In experimental tests over multiple datasets, FPE outperforms current popular methods.

Overall, our paper contributes to the understanding of attention maps in Stable Diffusion and provides a practical solution for overcoming the limitations of inaccurate TIE.

2. Related Works

Text-guided Image Editing (TIE) [39] is a crucial task involving the modification of an input image according to requirements expressed by texts. These approaches can be broadly categorized into two groups: tuning-free methods and fine-tuning based methods.

2.1. Tuning-free Methods

Tuning-free TIE methods aim to control the generated image in the denoising process. To achieve this goal, SDEdit [17] uses the given guidance image as the initial noise in the denoising step, which leads to impressive results. Other methods operate in the feature space of diffusion models to achieve successful editing results. One notable example is P2P [10], which discovers that manipulating cross-attention layers allows for controlling the relationship between the spatial layout of the image and each word in the text. Null-text inversion [18] further employs an optimization method to reconstruct the guidance image and utilizes P2P for real image editing. DiffEdit [5] automatically generates a mask by comparing different text prompts to help guide the areas of the image that need editing. PnP [34] focuses on spatial features and self-affinities to control the generated image's structure without restricting interaction with the text. Additionally, MasaCtrl [2] converts self-attention in diffusion models into a mutual and mask-guided self-attention strategy, enabling pose transformation. In this paper, we aim to provide in-depth insights into the attention layers of diffusion models and further propose a more streamlined tuning-free TIE approach.

2.2. Fine-tuning Based Methods

The core idea of fine-tuning-based TIE methods is to synthesize ideal new images by fine-tuning the model over the knowledge of domain-specific data [8, 12, 13, 27] or by introducing additional guidance information [1, 19, 40]. DreamBooth [27] fine-tunes all the parameters in the diffusion model while keeping the text transformer frozen and utilizes generated images as the regularization dataset. Textual Inversion [8] optimizes a new word embedding token for each concept. Imagic [13] learns the approximate text embedding of the input image through tuning and then edits the posture of the object in the image by interpolating between the approximate text embedding and the target text embedding. ControlNet [40] and T2I-Adapter [19] allow users to guide the generated images through input images by tuning additional network modules.
InstructPix2Pix [1] fully fine-tunes the diffusion model by constructing image-text-image triples in the form of instructions, enabling users to edit authentic images using instruction prompts, such as "turn a man into a cyborg". In contrast to these works, our method focuses on tuning-free techniques without the fine-tuning process.

[Figure 2. Cross-attention layer: queries Q_cross come from the noisy image, keys K_cross and values V_cross come from the prompt, producing attention maps M_cross and the output features. Self-attention layer: queries Q_self come from the noisy image.]

… \ell_v(P_{emb}). Intuitively, each cell of the cross-attention map, denoted as M_{ij}, determines the weight attributed to the value of the j-th token at the spatial feature i of the image. The cross-attention map enables the diffusion model to locate/align the tokens of the prompt in the image area.

3.2. Self-Attention in Stable Diffusion

As depicted in Figure 2, unlike cross-attention, the self-attention layer receives the keys matrix K_self and the query matrix Q_self from the noisy-image features \phi(z_t) through learned projections:

Q_{self} = \bar{\ell}_q(\phi(z_t)), \quad K_{self} = \bar{\ell}_K(\phi(z_t)),   (3)

M_{self} = \mathrm{Softmax}\!\left( \frac{Q_{self} K_{self}^{\top}}{\sqrt{d_{self}}} \right).   (4)
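For concreteness, here is a minimal PyTorch sketch of how the attention maps above are formed; the helper function and the toy shapes are illustrative assumptions, not the authors' implementation.

```python
import torch

def attention_map(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """M = Softmax(Q K^T / sqrt(d)), as in Eq. (4).

    q: (batch*heads, N_q, d) queries projected from the noisy-image features.
    k: (batch*heads, N_k, d) keys projected from the image features
       (self-attention) or from the prompt embedding (cross-attention).
    """
    d = q.shape[-1]
    scores = torch.einsum("bid,bjd->bij", q, k) / d ** 0.5
    return scores.softmax(dim=-1)  # each row sums to 1 over the keys

# Toy shapes: a 16x16 latent grid (256 spatial locations), 77 prompt tokens, head dim 64.
q = torch.randn(8, 256, 64)
m_self = attention_map(q, torch.randn(8, 256, 64))   # (8, 256, 256): image-to-image affinities
m_cross = attention_map(q, torch.randn(8, 77, 64))   # (8, 256, 77): cell (i, j) weights token j at location i
```

The two maps differ only in where the keys come from, which is why the self-attention map encodes spatial affinities within the image while the cross-attention map encodes how strongly each prompt token attends to each image location.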
… word in the prompt has a corresponding attention map associated with the image, indicating that the information related …

Class  | Layer 3 | Layer 6 | Layer 9 | Layer 10 | Layer 12 | Layer 14 | Layer 16 | Avg.
white  | 0.97    | 1.00    | 0.94    | 0.97     | 0.97     | 0.61     | 0.85     | 0.90
orange | 0.97    | 1.00    | 0.94    | 0.92     | 0.89     | 0.94     | 0.83     | 0.93
yellow | 0.96    | 0.77    | 1.00    | 0.98     | 1.00     | 0.36     | 0.68     | 0.82
red    | 0.97    | 0.97    | 0.93    | 0.85     | 0.70     | 0.23     | 0.65     | 0.76

[Figure: cross-attention and self-attention maps (Top-0 to Top-5) for source images generated from prompts such as "a rabbit standing …" and "a … car".]
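The per-class, per-layer numbers above are the kind of accuracies a probe classifier [3, 16] produces when trained to predict an attribute (here, a color word) from attention-map features extracted at a single layer. Below is a minimal sketch; the random features and the logistic-regression probe are illustrative stand-ins, not the paper's exact protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative only: X_layer would hold flattened cross-attention maps for the
# color token collected at one U-Net layer (n_samples x n_features), and y the
# color class (e.g., white / orange / yellow / red). Random data stands in here.
rng = np.random.default_rng(0)
X_layer = rng.normal(size=(400, 16 * 16))
y = rng.integers(0, 4, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_layer, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")  # one per-layer accuracy, as in the table
```

A high probe accuracy at a layer indicates that the attribute can be linearly decoded from that layer's attention maps, i.e., the maps carry semantic information about the token rather than only spatial weights.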
… all the structural information from the original image but …

[Figure: panels labeled Source image, All token, − "a", − "a, car", − "a, blue, car", − all, and Direct generation.]
Algorithm 1 Free-Prompt-Editing for a generated image.
Input: P_src: a source prompt; P_dst: a target prompt; S: a random seed.
Output: I_src: the source image; I_dst: the edited image.
1: z_T ∼ N(0, 1), a unit Gaussian random value sampled with random seed S;
2: z*_T ← z_T;
3: for t = T, T−1, …, 1 do
4:    z_{t−1}, M_self ← DM(z_t, P_src, t);
5:    z*_{t−1} ← DM(z*_t, P_dst, t){M*_self ← M_self};
6: end for
7: Return (I_src ← Decoder(z_0), I_dst ← Decoder(z*_0));
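To make the control flow concrete, the following is a minimal PyTorch-style sketch of Algorithm 1 under stated assumptions; it is not the authors' implementation. The helpers dm_step and decode are hypothetical placeholders for one Stable Diffusion denoising step (which would expose its self-attention maps and optionally accept injected ones) and the VAE decoder.

```python
import torch

def dm_step(z, prompt, t, injected_self_attn=None):
    """Hypothetical wrapper around one U-Net denoising step DM(z_t, P, t).

    Returns (z_{t-1}, self_attention_maps). When injected_self_attn is given,
    the self-attention maps in the chosen layers are replaced by it, which is
    the core operation of Free-Prompt-Editing (step 5 of Algorithm 1).
    """
    raise NotImplementedError("plug in a Stable Diffusion denoising step here")

def decode(z):
    """Hypothetical VAE decoder mapping a latent z_0 to an image."""
    raise NotImplementedError("plug in the VAE decoder here")

def free_prompt_editing(p_src: str, p_dst: str, seed: int,
                        num_steps: int = 50, latent_shape=(1, 4, 64, 64)):
    g = torch.Generator().manual_seed(seed)
    z = torch.randn(latent_shape, generator=g)   # z_T, shared by both branches
    z_star = z.clone()                           # z*_T <- z_T
    for t in range(num_steps, 0, -1):
        z, m_self = dm_step(z, p_src, t)                       # source branch
        z_star, _ = dm_step(z_star, p_dst, t,
                            injected_self_attn=m_self)         # target branch, M*_self <- M_self
    return decode(z), decode(z_star)             # (I_src, I_dst)
```

For real images, the same loop would start from a latent obtained by reconstructing (inverting) the input image rather than from random noise, since Algorithm 1 is stated for a generated source image.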
… for generated images and one for real images. The generated-image datasets include Car-fake-edit and ImageNet-fake-edit, where Car-fake-edit contains 756 prompt pairs and ImageNet-fake-edit contains 1182 prompt pairs sampled from FlexIT [4] and ImageNet [28]. The real-image datasets include Car-real-edit, sampled from the Stanford Cars (CARS196) dataset [15] and containing 3321 image-prompt pairs, and ImageNet-real-edit, which contains 1092 pairs. For more details, see Section 7.2 in the Supplementary Material. In addition, we also use the benchmarks constructed by PnP [34], which contain two datasets: Wild-TI2I and ImageNet-R-TI2I. For generated images, Wild-TI2I contains 70 prompt pairs and ImageNet-R-TI2I contains 150 pairs; for real images, Wild-TI2I contains 78 image-prompt pairs and ImageNet-R-TI2I includes 150 pairs.

[Figure 6. Results of our method on image-text pairs from Wild-TI2I and ImageNet-R-TI2I. Example prompts include "a graffiti of a goldfish", "a photo of a goldfish", "an embroidery of a goldfish", "a photo of a skyscraper", "a photo of a tower in snow", "a white wedding cake", "a photorealistic image of a mountain full of trees", "a photorealistic image of a snowy mountain", and "a polygonal illustration of a mountain".]

We utilize CLIP Score (CS) and CLIP Directional Similarity (CDS) [9, 23] to quantitatively analyze and compare our method with currently popular image editing algorithms. The underlying model for our experiments is Stable Diffusion 1.5 (https://huggingface.co/runwayml/stable-diffusion-v1-5). The experimental results of the comparative methods are produced using the publicly released code from their original papers with unified random seeds.
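For reference, below is a minimal sketch of how CS and CDS can be computed with an off-the-shelf CLIP checkpoint; the checkpoint choice (openai/clip-vit-base-patch32), the x100 scaling of CS, and the helper names are assumptions for illustration rather than the paper's exact evaluation code.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def image_emb(img: Image.Image) -> torch.Tensor:
    feats = model.get_image_features(**processor(images=img, return_tensors="pt"))
    return torch.nn.functional.normalize(feats, dim=-1)

@torch.no_grad()
def text_emb(text: str) -> torch.Tensor:
    feats = model.get_text_features(**processor(text=[text], return_tensors="pt", padding=True))
    return torch.nn.functional.normalize(feats, dim=-1)

def clip_score(edited: Image.Image, target_prompt: str) -> float:
    # CS: cosine similarity between the edited image and the target prompt (scaled by 100).
    return 100.0 * (image_emb(edited) @ text_emb(target_prompt).T).item()

def clip_directional_similarity(src_img, edited_img, src_prompt, dst_prompt) -> float:
    # CDS: cosine similarity between the image-edit direction and the text-edit direction.
    d_img = torch.nn.functional.normalize(image_emb(edited_img) - image_emb(src_img), dim=-1)
    d_txt = torch.nn.functional.normalize(text_emb(dst_prompt) - text_emb(src_prompt), dim=-1)
    return (d_img @ d_txt.T).item()
```

CDS compares the direction of change in image space with the direction of change in text space, so it rewards edits that preserve the source image while moving toward the target prompt.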
5.2. Image Editing Results

We evaluate our method through quantitative and qualitative analyses. As illustrated in Figure 6, we showcase the editing outcomes of our method, demonstrating that it successfully transforms various attributes, styles, scenes, and categories of the original images.

5.2.1 Comparison to Prior/Concurrent Work

In this section, we compare our work with state-of-the-art image editing methods, including (i) P2P [10] (with null-text inversion [18] for the real-image scene), (ii) PnP [34], (iii) SDEdit [17] under two noise levels (0.5 and 0.75), (iv) DiffEdit [5], (v) MasaCtrl [2], (vi) Pix2pixzero [22], (vii) Shape-guided [21], and (viii) InstructPix2Pix [1]. We further present image editing results using other Stable Diffusion-based models to demonstrate the universality of our method, including Realistic-V2 (https://huggingface.co/SG161222/Realistic_Vision_V2.0), Deliberate (https://huggingface.co/XpucT/Deliberate), and Anything-V4 (https://huggingface.co/xyn-ai/anything-v4.0).

Comparison to P2P. We first compare our method with P2P [10] for synthetic image editing scenes and P2P combined with null-text inversion [18] for real image scenes, both denoted as P2P. The experimental results are shown in Figure 7 and Table 4.
In Figure 7, it is evident that when performing a color transformation on a real image by modifying the cross-attention map, the editing fails: the editing results of P2P for the car color tend to replicate the color (white) of the original image. Regarding the category-conversion results for generated images, we observe that while P2P can accurately transform different animals, the edited results still retain appearances of sheep. This leads to an incomplete conversion for patterned animals such as giraffes, leopards, and tigers. Unlike P2P, our method operates only at the self-attention layers and is not susceptible to editing failures caused by modifications to the cross-attention map.

Dataset            | CS↑ (P2P) | CS↑ (Ours) | CDS↑ (P2P) | CDS↑ (Ours)
Car-fake-edit      | 25.96     | 26.02      | 0.2451     | 0.2659
Car-real-edit      | 24.64     | 24.85      | 0.2288     | 0.2605
ImageNet-fake-edit | 27.42     | 27.80      | 0.2401     | 0.2560
ImageNet-real-edit | 26.17     | 26.35      | 0.2426     | 0.2468

Table 4. Quantitative experimental results over Car-fake-edit, ImageNet-fake-edit, Car-real-edit and ImageNet-real-edit.

… step method that leads to significant computational overhead; editing a single image in a generated-image editing scenario takes approximately 335.65 seconds. In contrast, our method only requires around 6.30 seconds on an A100 GPU with 40 GB of memory, as Table 5 indicates.

Table 5 presents the quantitative experimental results of different editing algorithms on the Wild-TI2I and ImageNet-R-TI2I benchmarks. From Table 5, it is evident that our method outperforms all others in terms of the CDS metric. This indicates that our method excels in preserving the spatial structure of the original image and performing editing according to the requirements of the target prompt, yielding superior results. Meanwhile, our method achieves a good balance between time consumption and effectiveness, as demonstrated in Table 5.
[Figure 8. Comparison to prior works. Left to right: source image, target prompt, our result, P2P [10], PnP [34], SDEdit [17] with two noising levels, DiffEdit [5], Pix2pixzero [22], Shape-guided [21], MasaCtrl [2], and InstructPix2Pix [1] (a fine-tuning based method). Example prompts from the Wild-TI2I and ImageNet-R-TI2I generation and real splits include "a photo of a poodle", "a photo of rubber ducks walking on street", "an embroidery of a penguin", "a photo of a jeep", "a photo of a silver robot walking on the moon", and "a bronze horse in a museum".]

[Table 5. Quantitative experimental results over the Wild-TI2I and ImageNet-R-TI2I benchmarks, including real and generated guidance images. CS: CLIP Score [23]; CDS: CLIP Directional Similarity [9, 23]; Editing Time: seconds per image.]
… without complex operations, it still has some limitations. Firstly, our method is constrained by the generative capabilities of the TIS model: our editing method will fail if the generative model cannot produce images consistent with the target prompt description. When editing real images, the original image must first be reconstructed. Some detailed information, especially facial details, may be lost during the reconstruction process, primarily due to the limitations of the VQ autoencoder [14]. Optimizing the VQ autoencoder is beyond the scope of this paper, as our objective is to provide a simple and universal editing framework. Addressing these challenges will be part of our future work.

6. Conclusion

In this work, we utilized probe analysis and conducted experiments to elucidate the following insights on TIS models: the cross-attention map carries the semantic information of the prompt, which leads to the ineffectiveness of image editing methods that rely on it. On the contrary, the self-attention map captures the spatial structural information of the original image, playing an essential role in preserving the image's inherent structure during editing. Based on our comprehensive analysis and empirical evidence, we have streamlined current image editing algorithms and proposed an innovative image editing approach. Our approach does not require additional tuning or the alignment of target and source prompts to achieve effective object or background editing in images. In extensive experiments across multiple datasets, our simplified method has outperformed existing image editing algorithms. Furthermore, our algorithm can be seamlessly adapted to other TIS models.

Acknowledgements. This work is partially supported by Alibaba Cloud through the Research Talent Program with South China University of Technology, and the Program for Guangdong Introducing Innovative and Entrepreneurial Teams (No. 2017ZT07X183).
References

[1] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023. 1, 2, 3, 6, 7, 8
[2] Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. MasaCtrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 22560–22570, 2023. 1, 2, 6, 7, 8
[3] Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286. Association for Computational Linguistics, 2019. 2, 3
[4] Guillaume Couairon, Asya Grechka, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. FlexIT: Towards flexible semantic image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18270–18279, 2022. 6, 2
[5] Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. DiffEdit: Diffusion-based semantic image editing with mask guidance. In The Eleventh International Conference on Learning Representations, 2023. 1, 2, 6, 7, 8
[6] Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023. 1
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics, 2019. 5
[8] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022. 2
[9] Rinon Gal, Or Patashnik, Haggai Maron, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. StyleGAN-NADA: CLIP-guided domain adaptation of image generators. ACM Transactions on Graphics (TOG), 41(4):1–13, 2022. 6, 8
[10] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-Prompt image editing with cross-attention control. In The Eleventh International Conference on Learning Representations, 2023. 1, 2, 5, 6, 7, 8
[11] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020. 1
[12] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021. 2
[13] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6007–6017, 2023. 2
[14] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014. 1, 8
[15] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 554–561, 2013. 6, 1
[16] Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1073–1094. Association for Computational Linguistics, 2019. 2, 3
[17] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022. 1, 2, 6, 7, 8
[18] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6038–6047, 2023. 1, 2, 6, 4, 5
[19] Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023. 2
[20] OpenAI. Improving image generation with better captions. https://cdn.openai.com/papers/dall-e-3.pdf, 2023. 1
[21] Dong Huk Park, Grace Luo, Clayton Toste, Samaneh Azadi, Xihui Liu, Maka Karalashvili, Anna Rohrbach, and Trevor Darrell. Shape-guided diffusion with inside-outside attention. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4198–4207, 2024. 1, 6, 7, 8
[22] Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image translation. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1–11, 2023. 1, 6, 7, 8
[23] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 1, 5, 6, 8
[24] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020. 1
[25] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022. 1
[26] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 1, 5
[27] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023. 2
[28] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. 6, 2
[29] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022. 1
[30] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021. 1
[31] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022. 1
[32] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015. 1
[33] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021. 5
[34] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1921–1930, 2023. 1, 2, 6, 7, 8
[35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017. 5
[36] Michael E. Wall, Andreas Rechtsteiner, and Luis M. Rocha. Singular value decomposition and principal component analysis. In A Practical Approach to Microarray Data Analysis, pages 91–109. Springer, 2003. 4
[37] Chengyu Wang, Zhongjie Duan, Bingyan Liu, Xinyi Zou, Cen Chen, Kui Jia, and Jun Huang. PAI-Diffusion: Constructing and serving a family of open Chinese diffusion models for text-to-image synthesis on the cloud. arXiv preprint arXiv:2309.05534, 2023. 1
[38] Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 2022. 1
[39] Fangneng Zhan, Yingchen Yu, Rongliang Wu, Jiahui Zhang, Shijian Lu, Lingjie Liu, Adam Kortylewski, Christian Theobalt, and Eric Xing. Multimodal image synthesis and editing: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023. 2
[40] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023. 2