Liu, "Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing", CVPR 2024.

This paper analyzes the roles of cross and self-attention mechanisms in Stable Diffusion for text-guided image editing. It reveals that cross-attention maps often lead to editing failures due to their object attribution information, while self-attention maps are essential for preserving geometric details. The authors propose a simplified tuning-free method called Free-Prompt-Editing (FPE) that improves image editing outcomes by modifying only self-attention maps during the denoising process.

This CVPR paper is the Open Access version, provided by the Computer Vision Foundation. Except for this watermark, it is identical to the accepted version; the final published version of the proceedings is available on IEEE Xplore.

Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing

Bingyan Liu¹,², Chengyu Wang²*, Tingfeng Cao¹,², Kui Jia³*, Jun Huang²
¹South China University of Technology, ²Alibaba Group,
³School of Data Science, The Chinese University of Hong Kong, Shenzhen
{eeliubingyan, setingfengcao}@mail.scut.edu.cn, {chengyu.wcy,huangjun.hj}@alibaba-inc.com, [email protected]

Abstract

Deep Text-to-Image Synthesis (TIS) models such as Stable Diffusion have recently gained significant popularity for creative text-to-image generation. However, for domain-specific scenarios, tuning-free Text-guided Image Editing (TIE) is of greater importance for application developers. This approach modifies objects or object properties in images by manipulating feature components in attention layers during the generation process. Nevertheless, little is known about the semantic meanings that these attention layers have learned and which parts of the attention maps contribute to the success of image editing. In this paper, we conduct an in-depth probing analysis and demonstrate that cross-attention maps in Stable Diffusion often contain object attribution information, which can result in editing failures. In contrast, self-attention maps play a crucial role in preserving the geometric and shape details of the source image during the transformation to the target image. Our analysis offers valuable insights into understanding cross and self-attention mechanisms in diffusion models. Furthermore, based on our findings, we propose a simplified, yet more stable and efficient, tuning-free procedure that modifies only the self-attention maps of specified attention layers during the denoising process. Experimental results show that our simplified method consistently surpasses the performance of popular approaches on multiple datasets.¹

Figure 1. An example showing that our method can perform more consistent and realistic TIE compared to P2P [10]. (Columns: source image, P2P, ours; prompts: "a photo of a silver robot" and "a red car".)

1. Introduction

Text-to-Image Synthesis (TIS) models, such as Stable Diffusion [26], DALL-E 2 [25], and Imagen [29], have demonstrated remarkable visual effects for text-to-image generation, capturing substantial attention from both academia and industry [6, 20, 37, 38]. These TIS models are trained on vast amounts of image-text pairs, such as Laion [30, 31], and employ cutting-edge techniques, including large-scale pre-trained language models [23, 24], variational auto-encoders [14], and diffusion models [11, 32], to achieve success in generating realistic images with vivid details. Specifically, Stable Diffusion stands out as a popular and extensively studied model, making significant contributions to the open-source community.

In addition to image generation, these TIS models possess powerful image editing capabilities, which hold great importance as they aim to modify images while ensuring realism, naturalness, and meeting human preferences. Text-guided Image Editing (TIE) involves modifying an input image based on a descriptive prompt. Existing TIE methods [1, 2, 5, 10, 17, 18, 21, 22, 34] achieve remarkable effects in image translation, style transfer, and appearance replacement, as well as preserving the input structure and scene layout. To this end, Prompt-to-Prompt (P2P) [10] modifies image regions by replacing cross-attention maps corresponding to the target edit words in the source prompt. Plug-and-Play (PnP) [34] first extracts the spatial features

* Co-corresponding authors.
¹ Source code and datasets are available at https://github.com/alibaba/EasyNLP/tree/master/diffusion/FreePromptEditing.

7817
and self-attention of the original image in the attention layers and then injects them into the target image generation process. Among these methods, attention layers play a crucial role in controlling the image layout and the relationship between the generated image and the input prompt. However, inappropriate modifications to attention layers can yield varied editing outcomes and even lead to editing failures. For example, as depicted in Figure 1, editing authentic images on cross-attention layers can result in editing failures; converting a man into a robot or changing the color of a car to red fails. Moreover, some operations in the above-mentioned methods can be revised and optimized.

In our paper, we explore attention map modification to gain comprehensive insights into the underlying mechanisms of TIE using diffusion-based models. Specifically, we focus on the attribution of TIE and ask the fundamental question: how does the modification of attention layers contribute to diffusion-based TIE? To answer this question, we carefully construct new datasets and meticulously investigate the impact of modifying the attention maps on the resulting images. This is accomplished by probing analysis [3, 16] and systematic exploration of attention map modification with different blocks in the diffusion model. We find that (1) editing cross-attention maps in diffusion models is optional for image editing: replacing or refining cross-attention maps between the source and target image generation process is dispensable and can result in failed image editing. (2) The cross-attention map is not only a weight measure of the conditional prompt at the corresponding positions in the generated image but also contains the semantic features of the conditional token. Therefore, replacing the target image's cross-attention map with the source image's map may yield unexpected outcomes. (3) Self-attention maps are crucial to the success of the TIE task, as they reflect the association between image features and retain the spatial information of the image. Based on our findings, we propose a simplified and effective algorithm called Free-Prompt-Editing (FPE). FPE performs image editing by replacing the self-attention map in specific attention layers during denoising, without needing a source prompt. This is beneficial for real image editing scenarios. The contributions of our paper are as follows:
• We conduct a comprehensive analysis of how attention layers impact image editing results in diffusion models and answer why TIE methods based on cross-attention map replacement can lead to unstable results.
• We design experiments to prove that cross-attention maps not only serve as the weight of the corresponding token on the corresponding pixel but also contain the characteristic information of the token. In contrast, self-attention is crucial in ensuring that the edited image retains the original image's layout information and shape details.
• Based on our experimental findings, we simplify currently popular tuning-free image editing methods and propose FPE, making the image editing process simpler and more effective. In experimental tests over multiple datasets, FPE outperforms current popular methods.

Overall, our paper contributes to the understanding of attention maps in Stable Diffusion and provides a practical solution for overcoming the limitations of inaccurate TIE.

2. Related Works

Text-guided Image Editing (TIE) [39] is a crucial task involving the modification of an input image with requirements expressed by texts. These approaches can be broadly categorized into two groups: tuning-free methods and fine-tuning based methods.

2.1. Tuning-free Methods

Tuning-free TIE methods aim to control the generated image in the denoising process. To achieve this goal, SDEdit [17] uses the given guidance image as the initial noise in the denoising step, which leads to impressive results. Other methods operate in the feature space of diffusion models to achieve successful editing results. One notable example is P2P [10], which discovers that manipulating cross-attention layers allows for controlling the relationship between the spatial layout of the image and each word in the text. Null-text inversion [18] further employs an optimization method to reconstruct the guidance image and utilizes P2P for real image editing. DiffEdit [5] automatically generates a mask by comparing different text prompts to help guide the areas of the image that need editing. PnP [34] focuses on spatial features and self-affinities to control the generated image's structure without restricting interaction with the text. Additionally, MasaCtrl [2] converts self-attention in diffusion models into a mutual and mask-guided self-attention strategy, enabling pose transformation. In this paper, we aim to provide in-depth insights into the attention layers of diffusion models and further propose a more streamlined tuning-free TIE approach.

2.2. Fine-tuning Based Methods

The core idea of fine-tuning-based TIE methods is to synthesize ideal new images by model fine-tuning over the knowledge of domain-specific data [8, 12, 13, 27] or by introducing additional guidance information [1, 19, 40]. DreamBooth [27] fine-tunes all the parameters in the diffusion model while keeping the text transformer frozen and utilizes generated images as the regularization dataset. Textual Inversion [8] optimizes a new word embedding token for each concept. Imagic [13] learns the approximate text embedding of the input image through tuning and then edits the posture of the object in the image by interpolating the approximate text embedding and the target text embedding. ControlNet [40] and T2I-Adapter [19] allow users to guide

the generated images through input images by tuning additional network modules. InstructPix2Pix [1] fully fine-tunes the diffusion model by constructing image-text-image triples in the form of instructions, enabling users to edit authentic images using instruction prompts, such as "turn a man into a cyborg". In contrast to these works, our method focuses on tuning-free techniques without the fine-tuning process.

Figure 2. Cross and self-attention layers in Stable Diffusion. (Upper part: the cross-attention layer, with queries Q_cross from the noisy image and keys K_cross and values V_cross from the prompt. Lower part: the self-attention layer, with queries, keys, and values all from the noisy image.)

3. Analysis on Cross and Self-Attention

In this section, we analyze how cross and self-attention maps in Stable Diffusion contribute to the effectiveness of TIE.

3.1. Cross-Attention in Stable Diffusion

In Stable Diffusion and other similar models, cross-attention layers play a crucial role in fusing images and texts, allowing T2I models to generate images that are consistent with textual descriptions. As depicted in the upper part of Figure 2, the cross-attention layer receives the query, key, and value matrices, i.e., Q_cross, K_cross, and V_cross, from the noisy image and prompt. Specifically, Q_cross is derived from the spatial features of the noisy image φ_cross(z_t) by a learned linear projection ℓ_q, while K_cross and V_cross are projected from the textual embedding P_emb of the input prompt P using learned linear projections denoted as ℓ_k and ℓ_v, respectively. The cross-attention map is defined as:

    Q_cross = ℓ_q(φ_cross(z_t)),  K_cross = ℓ_k(P_emb),    (1)

    M_cross = Softmax(Q_cross K_cross^T / √d_cross),    (2)

where d_cross is the dimension of the keys and queries. The final output is defined as the fused feature of the text and image, denoted as φ̂(z_t) = M_cross V_cross, where V_cross = ℓ_v(P_emb). Intuitively, each cell in the cross-attention map, denoted as M_ij, determines the weight attributed to the value of the j-th token relative to the spatial feature i of the image. The cross-attention map enables the diffusion model to locate/align the tokens of the prompt in the image area.

3.2. Self-Attention in Stable Diffusion

As depicted in Figure 2, unlike cross-attention, the self-attention layer receives the keys matrix K_self and the query matrix Q_self from the noisy image φ_self(z_t) through learned linear projections ℓ̄_K and ℓ̄_Q, respectively. The self-attention map is defined as:

    Q_self = ℓ̄_Q(φ_self(z_t)),  K_self = ℓ̄_K(φ_self(z_t)),    (3)

    M_self = Softmax(Q_self K_self^T / √d_self),    (4)

where d_self is the dimension of K_self and Q_self. M_self determines the weights assigned to the relevance of the i-th and j-th spatial features in the image and can affect the spatial layout and shape details of the generated image. Consequently, the self-attention map can be utilized to preserve the spatial structure characteristics of the original image throughout the image editing process.

3.3. Probing Analysis

Yet, the semantics of cross and self-attention maps remain unclear. Are these attention maps merely weight matrices, or do they contain feature information of the image? To answer these questions, we aim to explore the meaning of attention maps in diffusion models. Inspired by probing analysis methods [3, 16] in the field of NLP, we propose building datasets and training classification networks to explore the properties of attention maps. Our fundamental idea is that if a trained classifier can accurately classify attention maps from different categories, then the attention map contains a meaningful feature representation of the category information. Therefore, we introduce a task-specific classifier on top of the diffusion model's cross-attention and self-attention layers. This classifier is a two-layer MLP designed to predict specific semantic properties of the attention maps. To present the analysis results more visually, we utilize color adjectives and animal nouns to form prompt datasets, each containing ten categories. For the color adjectives, there are two prompt formats: "a <color> car" and "a <color> <object>". The prompt format for animal nouns is "a/an <animal> standing in the park". After generating the prompts, we employ the probing method to extract the cross-attention maps corresponding to the words <color> and <animal>, along with the self-attention maps in the attention layers. Finally, by training and evaluating the performance of the classifiers, we gain insights into the semantic knowledge captured by the attention maps.
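The two map types in Equations (1)-(4) differ only in where the keys come from: the prompt embedding for cross-attention, and the image features themselves for self-attention. A minimal NumPy sketch of both computations follows; the random matrices stand in for the learned projections ℓ_q, ℓ_k and the features φ(z_t), and all shapes are illustrative, not Stable Diffusion's actual dimensions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_map(Q, K):
    # M = Softmax(Q K^T / sqrt(d)), as in Eqs. (2) and (4)
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]))

rng = np.random.default_rng(0)
n_pixels, n_tokens, d = 64, 8, 16          # illustrative sizes only
phi_z = rng.normal(size=(n_pixels, d))     # stand-in for spatial features phi(z_t)
P_emb = rng.normal(size=(n_tokens, d))     # stand-in for text embedding P_emb
W_q = rng.normal(size=(d, d))              # stand-ins for the learned projections
W_k = rng.normal(size=(d, d))

# Cross-attention: queries from the image, keys from the prompt (Eqs. 1-2)
M_cross = attention_map(phi_z @ W_q, P_emb @ W_k)   # shape (n_pixels, n_tokens)

# Self-attention: queries and keys both from the image (Eqs. 3-4)
M_self = attention_map(phi_z @ W_q, phi_z @ W_k)    # shape (n_pixels, n_pixels)

assert M_cross.shape == (64, 8) and M_self.shape == (64, 64)
# each row is a probability distribution over tokens / pixels
assert np.allclose(M_cross.sum(axis=-1), 1.0) and np.allclose(M_self.sum(axis=-1), 1.0)
```

Each row of M_cross distributes a pixel's attention over the prompt tokens, while each row of M_self distributes it over the other pixels, which is why only the latter carries pixel-to-pixel layout structure.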

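The probing protocol of Section 3.3 can be sketched end to end: flatten the maps, train a two-layer MLP probe on them, and read off held-out accuracy. The sketch below substitutes synthetic category-dependent vectors for real attention maps, so it only illustrates the protocol and its logic (high accuracy implies the inputs carry category information), not the paper's actual data or results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for flattened attention maps: each category draws
# from a different Gaussian, i.e. the "maps" do encode category identity.
n_per_class, dim, n_classes = 100, 64, 3
means = rng.normal(scale=2.0, size=(n_classes, dim))
X = np.concatenate([m + rng.normal(size=(n_per_class, dim)) for m in means])
y = np.repeat(np.arange(n_classes), n_per_class)
perm = rng.permutation(len(X))
X, y = X[perm], y[perm]
X_tr, y_tr, X_te, y_te = X[:240], y[:240], X[240:], y[240:]

# Two-layer MLP probe (ReLU hidden layer), trained with plain gradient
# descent on softmax cross-entropy.
h = 32
W1 = rng.normal(scale=0.1, size=(dim, h)); b1 = np.zeros(h)
W2 = rng.normal(scale=0.1, size=(h, n_classes)); b2 = np.zeros(n_classes)
onehot = np.eye(n_classes)[y_tr]
for _ in range(300):
    A = np.maximum(X_tr @ W1 + b1, 0.0)
    Z = A @ W2 + b2
    P = np.exp(Z - Z.max(axis=1, keepdims=True)); P /= P.sum(axis=1, keepdims=True)
    G = (P - onehot) / len(X_tr)              # gradient of mean cross-entropy
    gW2, gb2 = A.T @ G, G.sum(axis=0)
    GA = (G @ W2.T) * (A > 0)                 # backprop through ReLU
    gW1, gb1 = X_tr.T @ GA, GA.sum(axis=0)
    for p, g in ((W1, gW1), (b1, gb1), (W2, gW2), (b2, gb2)):
        p -= 0.5 * g

pred = (np.maximum(X_te @ W1 + b1, 0.0) @ W2 + b2).argmax(axis=1)
acc = (pred == y_te).mean()
assert acc > 0.9  # separable "maps" yield high probing accuracy
```

In the paper's setting the same conclusion runs in reverse: when the probe classifies cross-attention maps accurately (Table 1), the maps must contain category features, and when it fails on color-prompt self-attention maps (Table 2), they largely do not.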
Figure 3. The heatmaps of cross-attention and self-attention maps in a generated image with the prompt "a white horse in the park". The visualization of the cross-attention map corresponds to each word in the prompt. The visualization of the self-attention map is the top-6 components obtained after SVD [36].

Class    Layer 3  Layer 6  Layer 9  Layer 10  Layer 12  Layer 14  Layer 16  Avg.
dog       1.00     1.00     1.00     1.00      0.89      0.76      1.00     0.95
horse     0.96     1.00     1.00     1.00      0.64      1.00      0.91     0.93
sheep     0.97     1.00     1.00     1.00      1.00      0.90      0.97     0.98
leopard   0.97     1.00     1.00     1.00      0.97      0.79      0.87     0.94
tiger     1.00     1.00     0.97     1.00      0.88      1.00      0.97     0.97
green     0.93     0.91     0.91     0.96      0.67      0.38      0.60     0.77
white     0.97     1.00     0.94     0.97      0.97      0.61      0.85     0.90
orange    0.97     1.00     0.94     0.92      0.89      0.94      0.83     0.93
yellow    0.96     0.77     1.00     0.98      1.00      0.36      0.68     0.82
red       0.97     0.97     0.93     0.85      0.70      0.23      0.65     0.76

Table 1. Probing accuracy of cross-attention maps in different layers.

Figure 4. Results of cross-attention and self-attention map replacements in different layers of the diffusion model. (Columns: source image, Layers 1~16, 4~14, 7~12, 10, no replacement, direct generation; prompts: "a coral car" and "a rabbit standing in the park".)

Class    Layer 3  Layer 6  Layer 9  Layer 10  Layer 12  Layer 14  Layer 16  Avg.
dog       0.53     0.60     0.78     0.60      0.53      0.47      0.38     0.55
horse     0.50     0.70     0.82     0.65      0.68      0.53      0.28     0.59
sheep     0.53     0.45     0.25     0.45      0.62      0.53      0.25     0.44
leopard   0.47     0.65     0.57     0.60      0.47      0.65      0.60     0.57
tiger     0.23     0.12     0.55     0.20      0.45      0.42      0.53     0.36
green     0.00     0.00     0.05     0.00      0.05      0.00      0.12     0.03
white     0.00     0.05     0.30     0.55      0.03      0.15      0.25     0.19
orange    0.00     0.00     0.00     0.00      0.00      0.00      0.00     0.00
yellow    0.00     0.42     0.07     0.05      0.00      0.30      0.07     0.13
red       0.00     0.15     0.28     0.20      0.00      0.20      0.10     0.13

Table 2. Probing accuracy of self-attention maps in different layers.

3.4. Probing Results on Cross-Attention Maps

What does the cross-attention map learn? We directly visualize the attention maps, as demonstrated in Figure 3. Each word in the prompt has a corresponding attention map associated with the image, indicating that the information related to the word exists in specific areas of the image. However, is this information exclusive to these areas? Referring to Equation 2, we observe that M_cross is derived from K_cross and Q_cross, indicating that M_cross carries information from both. To validate this hypothesis, we conduct probing experiments on M_cross, with the results presented in Table 1. Due to space limitations, we show only the probing results for five colors and five animals from the last layer of the down, middle, and up blocks. As evident in Table 1, the trained classifier achieves high accuracy in both the color and animal classification tasks. For instance, the average accuracy for classifying "sheep" reaches 98%, and that for "orange" reaches 93%. These results demonstrate that the cross-attention map acts as a reliable category representation, indicating that it reflects not only weight information but also contains category-related features. This explains the failure of image editing using cross-attention map replacement. The upper part of Figure 4 illustrates the editing results obtained by replacing the cross-attention map of the corresponding word ("rabbit" and "coral") at different cross-attention layers. It is apparent that when all layers are replaced, the editing results are the least satisfactory: the dog fails to transform completely into a rabbit, and the black car cannot turn into a coral car. Conversely, when the cross-attention map is left unaltered, correct editing results can be achieved. Complete and additional experimental results are available in Section 8 of the Supplementary Material.

3.5. Probing Results on Self-Attention Maps

What does the self-attention map learn? Table 2 presents the results of the probing experiments. The results indicate that the trained classifier struggles to classify the self-attention maps generated from images containing color prompts. For animals, the results are better, although not as precise as those using cross-attention maps. This discrepancy may be attributed to the irregular spatial structure present in the self-attention map corresponding to the color prompt. Conversely, the self-attention map corresponding to the animal prompt contains structural information of different animals, enabling the learning of category information through recognizing structural or contour features. As shown in the lower part of Figure 3, the first component of the horse's self-attention map clearly expresses the outline information of the horse. The lower part of Figure 4 showcases our experimental results of operating on the self-attention map across different attention layers. When the self-attention map of all layers in the source image is replaced during the generation process of the target image, the resulting target image retains

all the structural information from the original image but hinders successful editing. Conversely, if we do not replace the self-attention map, we obtain an image identical to that generated directly using the target prompt. As a compromise, replacing the self-attention map in Layers 4 to 14 allows for preserving the structural information of the original image to the greatest extent while ensuring successful editing. This experimental result further supports the idea that the self-attention map in Layers 4 to 14 does not serve as a reliable category representation but does contain valuable spatial structure information of the image.

Figure 5. Editing results on replacing attention maps of different tokens in a prompt. "-" is a minus sign; - "a" represents subtracting the cross-attention map corresponding to "a". (Columns: source image, all tokens, - "a", - "a, car", - "a, blue, car", - all, direct generation; prompts: "a brown car" and "a turquoise car".)

3.6. Probing Results for Other Tokens

Do cross-attention maps corresponding to non-edited words contain category information? Furthermore, we explore the attention maps associated with non-edited words. This is relevant because within a text sequence, the text embedding for each word retains the contextual information of the sentence, particularly when a transformer-based text encoder [7, 23] is utilized. We employ the prompt data in the format of "a <color> car" for our probing experiments. The experimental results are presented in Table 3. The findings demonstrate that the article "a" does not encompass any category information of color. In contrast, the noun "car", when modified by the color adjective, does contain color category information. Consequently, if we replace the cross-attention map corresponding to a non-edited word with the cross-attention map of the target image, color information may be introduced, ultimately resulting in editing failures. This observation is also evident from the experimental results in Figure 5, where replacing the cross-attention maps of non-edited words likewise leads to editing failures.

Class    Layer 3  Layer 6  Layer 9  Layer 10  Layer 12  Layer 14  Layer 16  Avg.
green     0.03     0.00     0.00     0.00      0.00      0.00      0.00     0.00
white     0.20     0.00     0.70     0.00      0.72      0.30      0.05     0.28
orange    0.00     0.00     0.00     0.00      0.00      0.00      0.00     0.00
yellow    0.12     0.00     0.00     0.00      0.00      0.00      0.00     0.02
red       0.50     0.00     0.82     0.00      0.00      0.00      0.00     0.19
green     0.67     0.67     0.00     0.00      0.00      0.50      0.00     0.26
white     0.33     1.00     0.83     0.58      0.00      0.00      1.00     0.54
orange    0.60     1.00     0.80     1.00      1.00      0.40      0.80     0.80
yellow    0.50     0.25     0.00     0.00      0.12      0.25      0.00     0.16
red       0.38     0.88     0.75     0.12      0.12      0.38      0.00     0.38

Table 3. Probing analysis of cross-attention maps w.r.t. different tokens. The upper part shows the classification results corresponding to the token "a", and the lower part shows results for "car".

4. Our Approach

Based on our exploration of attention layers, we propose a more straightforward yet more stable and efficient approach named Free-Prompt-Editing (FPE). Let I_src be the image to be edited. Our goal is to synthesize a new desired image I_dst based on the target prompt P_dst while preserving the content and structure of the original image I_src. Current editing methods like P2P [10] replace the cross-attention map in the source and target image generation process. This requires modifying the original prompt to find the corresponding attention map for replacement. However, this limitation prevents the direct application of P2P to editing real images, as they do not come with an original prompt.

Our core idea is to combine the layout and contents of I_src with the semantic information synthesized with the target prompt P_dst to synthesize the desired image I_dst that retains the structure and content information of the original image I_src. To achieve this, we adapt the self-attention hijack mechanism in the diffusion model's attention layers 4 to 14 during the denoising process between the source and target images. For generated image editing, we substitute the target image's self-attention map with the source image's self-attention map during the diffusion denoising process. When working with actual images, we first obtain the necessary latents for reconstructing the real image by employing the inversion operation [33]. Subsequently, during the editing process, we replace the self-attention map of the real image within the generation process of the target image. We can accomplish the TIE task for the following reasons: 1) the cross-attention mechanism [26, 35] facilitates the fusion of the synthetic image and the target prompt, allowing the target prompt and the image to be automatically aligned even without introducing the cross-attention map of the source prompt; 2) the self-attention map contains spatial layout and shape details of the source image, and the self-attention mechanism [35] allows for the injection of structural information from the original image into the generated target image. Algorithms 1 and 2 present the pseudocode for our simplified method applied to generated and real images, respectively. FPE can also be combined with null text inversion for real image editing (refer to Section 10 in the Supplementary Material).

5. Experiments

5.1. Experimental Settings

Since there are no publicly available datasets specifically designed to verify the effectiveness of image editing algorithms, we construct two types of image-prompt pairs datasets: one
Algorithm 1 Free-Prompt-Editing for a generated image.
Input: P_src: a source prompt; P_dst: a target prompt; S: random seed;
Output: I_src: source image; I_dst: edited image;
1: z_T ~ N(0, 1), a unit Gaussian random value sampled with random seed S;
2: z*_T ← z_T;
3: for t = T, T-1, ..., 1 do
4:    z_{t-1}, M_self ← DM(z_t, P_src, t);
5:    z*_{t-1} ← DM(z*_t, P_dst, t){M*_self ← M_self};
6: end for
7: Return (I_src ← Decoder(z_0), I_dst ← Decoder(z*_0));

Algorithm 2 Free-Prompt-Editing for a real image.
Input: P_dst: a target prompt; I_src: real image;
Output: I_dst: edited image; I_res: reconstructed image;
1: {z_t}_{t=0}^{T} ← DDIM_inv(I_src);
2: z*_T ← z_T;
3: for t = T, T-1, ..., 1 do
4:    z_{t-1}, M_self ← DM(z_t, t);
5:    z*_{t-1} ← DM(z*_t, P_dst, t){M*_self ← M_self};
6: end for
7: Return (I_res ← Decoder(z_0), I_dst ← Decoder(z*_0));
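In code, line 5 of both algorithms is a hook inside the attention layer: the target branch computes its own queries, keys, and values, but its self-attention map is overwritten with the one recorded from the source branch. A toy NumPy sketch of that hijack at a single layer follows; the function and variable names are illustrative, and the actual released implementation lives in the repository cited in the footnote of Section 1.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(feats, W_q, W_k, W_v, hijack_map=None):
    # One self-attention layer. If hijack_map is given, it replaces the
    # layer's own map, mirroring line 5 of Algorithms 1/2 (M*_self <- M_self).
    Q, K, V = feats @ W_q, feats @ W_k, feats @ W_v
    M = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    if hijack_map is not None:
        M = hijack_map
    return M @ V, M

rng = np.random.default_rng(0)
n, d = 32, 16                                  # illustrative sizes only
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
src_feats = rng.normal(size=(n, d))            # source-branch noisy-latent features
dst_feats = rng.normal(size=(n, d))            # target-branch noisy-latent features

# Source branch runs normally and records its self-attention map ...
_, M_src = self_attention(src_feats, W_q, W_k, W_v)
# ... which the target branch then uses in place of its own map, injecting
# the source layout while the values still come from the target branch.
dst_out, M_used = self_attention(dst_feats, W_q, W_k, W_v, hijack_map=M_src)

assert np.array_equal(M_used, M_src)   # the target branch used the source map
assert dst_out.shape == (n, d)
```

Because only M is swapped while V remains the target branch's own, the output keeps the target prompt's semantics arranged according to the source image's spatial structure, which is exactly the division of labor established in Section 3.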

for generated images and one for real images. The generated images dataset includes Car-fake-edit and ImageNet-fake-edit, where Car-fake-edit contains 756 prompt pairs and ImageNet-fake-edit contains 1182 prompt pairs sampled from FlexIT [4] and ImageNet [28]. The real image datasets include Car-real-edit, sampled from the Stanford Cars (CARS196) dataset [15], containing 3321 image-prompt pairs, and ImageNet-real-edit, which contains 1092 pairs. For more details, see Section 7.2 in the Supplementary Material. In addition, we also use benchmarks constructed by PnP [34]. These benchmarks contain two datasets: Wild-TI2I and ImageNet-R-TI2I. For generated images, Wild-TI2I contains 70 prompt pairs, and ImageNet-R-TI2I contains 150 pairs. For real images, Wild-TI2I contains 78 image-prompt pairs, and ImageNet-R-TI2I includes 150 pairs.

We utilize CLIP Score (CS) and CLIP Directional Similarity (CDS) [9, 23] to quantitatively analyze and compare our method with currently popular image editing algorithms. The underlying model for our experiments is Stable Diffusion 1.5². The experimental results of comparative methods are produced using the publicly disclosed code from their original papers with unified random seeds.

5.2. Image Editing Results

We evaluate our method through quantitative and qualitative analyses. As illustrated in Figure 6, we showcase the editing outcomes of our method, demonstrating that it successfully transforms various attributes, styles, scenes, and categories of the original images.

Figure 6. Results of our method on image-text pairs from Wild-TI2I and ImageNet-R-TI2I. (Rows show generated and real source images edited with prompts such as "a graffiti of a goldfish", "an embroidery of a goldfish", "an oil painting of a zebra in the snow", "a photo of a husky on the grass", "a white tower in snow", and "a polygonal illustration of a mountain".)

5.2.1 Comparison to Prior/Concurrent Work

In this section, we compare our work with state-of-the-art image editing methods, including (i) P2P [10] (with null text inversion [18] for the real image scene), (ii) PnP [34], (iii) SDEdit [17] under two noise levels (0.5 and 0.75), (iv) DiffEdit [5], (v) MasaCtrl [2], (vi) Pix2pix-zero [22], (vii) Shape-guided [21], and (viii) InstructPix2Pix [1]. We further present image editing results using other Stable Diffusion-based models to demonstrate the universality of our method, including Realistic-V2³, Deliberate⁴, and Anything-V4⁵.

Comparison to P2P. We first compare our method with P2P [10] for synthetic image editing scenes and P2P combined with null text inversion [18] for real image scenes, both denoted as P2P. The experimental results are shown in Figure 7 and Table 4. In Figure 7, it is evident that when performing color transformation on a real image by modifying the cross-attention map, the editing fails. The editing

² https://huggingface.co/runwayml/stable-diffusion-v1-5
³ https://huggingface.co/SG161222/Realistic_Vision_V2.0
⁴ https://huggingface.co/XpucT/Deliberate
⁵ https://huggingface.co/xyn-ai/anything-v4.0
results of P2P for car color tend to replicate the color (white) of the original image. Regarding the category conversion results for generated images, we observe that while P2P can accurately transform different animals, the edited results still retain the appearance of sheep. This leads to an incomplete conversion for patterned animals such as giraffes, leopards, and tigers. Unlike P2P, our method operates only at the self-attention layers and is not susceptible to editing failures caused by modifications to the cross-attention map.

                      CS ↑              CDS ↑
Dataset               P2P     Ours     P2P      Ours
Car-fake-edit         25.96   26.02    0.2451   0.2659
Car-real-edit         24.64   24.85    0.2288   0.2605
ImageNet-fake-edit    27.42   27.80    0.2401   0.2560
ImageNet-real-edit    26.17   26.35    0.2426   0.2468

Table 4. Quantitative experimental results over Car-fake-edit, ImageNet-fake-edit, Car-real-edit and ImageNet-real-edit.

Figure 7. Comparison results with P2P [10] on Car-real-edit and ImageNet-fake-edit. Upper part: P2P. Lower part: ours.

Comparison to Other Methods Further, we compare our method with other state-of-the-art (SOTA) image editing methods [1, 2, 5, 10, 17, 21, 22, 34] over the Wild-TI2I and ImageNet-R-TI2I benchmarks. The experimental results are presented in Figure 8 and Table 5. As shown in Figure 8, our method successfully converts different inputs for both real and synthetic images. In all examples, our method achieves high-fidelity editing that aligns with the target prompt while preserving the original image's structural information to the greatest extent possible. In contrast, SDEdit and InstructPix2Pix struggle to preserve the structural information of the original image. SDEdit aligns the editing results better with the target prompt when there is high-level noise but fails in the presence of low-level noise. InstructPix2Pix retains consistency with the target prompt but loses the original structural information. DiffEdit and Pix2pix-zero also struggle to perform better editing based on the target prompt. Similarly, PnP achieves good editing results, but it is a two-step method that leads to significant computational overhead; editing a single image in a generated image editing scenario takes approximately 335.65 seconds. In contrast, our method only requires around 6.30 seconds on an A100 GPU with 40 GB of memory, as Table 5 indicates.

Table 5 presents the quantitative experimental results of different editing algorithms on the Wild-TI2I and ImageNet-R-TI2I benchmarks. From Table 5, it is evident that our method outperforms all others in terms of the CDS metric. This indicates that our method excels in preserving the spatial structure of the original image and performing editing according to the requirements of the target prompt, yielding superior results. Meanwhile, our method achieves a good balance between time consumption and effectiveness, as demonstrated in Table 5.

Figure 9. Experimental results of our method using other TIS models, including Realistic-V2, Deliberate and Anything-V4.

5.2.2 Results in Other TIS Models

We have applied our method to other TIS models based on Stable Diffusion-style frameworks to demonstrate its transferability. Figure 9 showcases the editing results of our method on the Realistic-V2, Deliberate, and Anything-V4 TIS models. From these results, it can be observed that our method is capable of effectively editing images on other diffusion models as well. For example, it can transform a girl into a boy, change a boy's age to 10 or 80, modify hairstyles, change hair colors, alter backgrounds, and switch categories.

5.3. Limitations and Discussion

Although our method employs probe analysis to elucidate the role of attention layers in the TIS model and proposes a novel method for editing images in multiple scenarios without complex operations, it still has some limitations.
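The self-attention control compared above, reusing the source pass's self-attention maps in the editing pass while leaving the target prompt's cross-attention untouched, can be sketched in a few lines of NumPy. This is an illustrative toy, not the authors' released implementation: the tensor shapes and the `attn_override` argument are assumptions made for the example.

```python
import numpy as np

def self_attention(q, k, v, attn_override=None):
    """Scaled dot-product self-attention. If attn_override is given, the
    computed attention map is discarded and the override used instead --
    this is the injection step."""
    scale = q.shape[-1] ** -0.5
    logits = (q @ k.T) * scale
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)   # rows sum to 1
    if attn_override is not None:
        attn = attn_override                   # reuse the source map
    return attn @ v, attn

rng = np.random.default_rng(0)
tokens, dim = 4, 8  # toy stand-ins for latent spatial tokens

# Source (reconstruction) pass: record its self-attention map.
q_s, k_s, v_s = rng.normal(size=(3, tokens, dim))
_, attn_src = self_attention(q_s, k_s, v_s)

# Target (editing) pass: keep the target's value vectors but inject the
# source map, preserving the source image's spatial layout.
q_t, k_t, v_t = rng.normal(size=(3, tokens, dim))
out_t, attn_used = self_attention(q_t, k_t, v_t, attn_override=attn_src)
```

In practice the same swap is applied inside every self-attention layer of the denoising U-Net at each timestep, rather than on toy matrices.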
Figure 8. Comparison to prior works. Left to right: source image, target prompt, our result, P2P [10], PnP [34], SDEdit [17] with two noising levels, DiffEdit [5], Pix2pix-zero [22], Shape-guided [21], MasaCtrl [2] and InstructPix2Pix [1] (fine-tuning based method). Rows cover ImageNet-R-TI2I generation prompts ("a photo of a poodle", "a photo of rubber ducks walking on street", "an embroidery of a penguin", "a photo of a jeep") and Wild-TI2I real prompts ("a photo of a silver robot walking on the moon", "a bronze horse in a museum").
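The per-image editing times reported alongside the quality metrics in Table 5 are wall-clock measurements. A minimal harness for collecting such numbers might look like the following, where `edit_image` is a hypothetical stand-in for any of the compared editing methods:

```python
import time

def edit_image(prompt: str) -> str:
    """Hypothetical stand-in for a diffusion-based editing call."""
    time.sleep(0.01)  # simulate work
    return f"edited: {prompt}"

def mean_edit_seconds(prompts, edit_fn):
    """Average wall-clock seconds per edited image."""
    start = time.perf_counter()
    for p in prompts:
        edit_fn(p)
    return (time.perf_counter() - start) / len(prompts)

avg = mean_edit_seconds(["a photo of a jeep", "a bronze horse in a museum"],
                        edit_image)
print(f"{avg:.3f} s/image")
```

Real measurements would of course also fix the GPU, batch size, and number of denoising steps so that methods are compared under identical conditions.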

                ImageNet-R-TI2I fake   ImageNet-R-TI2I real   Wild-fake           Wild-real           Editing time (s)
Method          CS ↑     CDS ↑         CS ↑     CDS ↑         CS ↑     CDS ↑      CS ↑     CDS ↑      fake ↓   real ↓
SDEdit (0.5)    -        -             28.37    0.1415        -        -          27.48    0.1220     -        2.59
SDEdit (0.75)   -        -             30.17    0.2171        -        -          29.79    0.2007     -        3.35
Shape-Guided    -        -             26.01    0.1090        -        -          26.53    0.1330     -        16.02
DiffEdit        26.68    0.0748        26.50    0.0909        25.59    0.0794     26.33    0.0879     9.02     4.85
Pix2pix-zero    27.94    0.2271        28.96    0.1415        28.19    0.2864     29.55    0.1462     24.92    36.76
P2P             28.88    0.3394        28.56    0.2146        27.85    0.2796     28.42    0.1930     6.41     55.32
PnP             28.83    0.2318        28.76    0.2073        28.20    0.2838     28.46    0.2020     335.65   384.26
MasaCtrl        29.66    0.3024        31.40    0.2170        29.96    0.3474     29.33    0.2101     6.18     10.90
Ours            29.79    0.3559        29.05    0.2271        27.88    0.3116     29.04    0.2234     6.30     10.75

Table 5. Quantitative experimental results over the Wild-TI2I and ImageNet-R-TI2I benchmarks, including real and generated guidance images. CS: CLIP score [23]; CDS: CLIP directional similarity [9, 23]. Editing time: seconds per image.
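Both metrics in Table 5 reduce to cosine similarities in CLIP embedding space: CS compares the edited image with the target prompt, and CDS compares the image-edit direction with the text-edit direction [9, 23]. The sketch below assumes the embeddings have already been extracted with a CLIP encoder; the toy random vectors merely stand in for real CLIP features.

```python
import numpy as np

def _unit(v):
    """Normalize a vector to unit length."""
    return v / np.linalg.norm(v)

def clip_score(image_emb, text_emb):
    """CS: cosine similarity of the edited image with the target prompt."""
    return float(_unit(image_emb) @ _unit(text_emb))

def clip_directional_similarity(src_img, edit_img, src_txt, tgt_txt):
    """CDS: cosine similarity between the image-edit direction and the
    text-edit direction in CLIP embedding space."""
    d_img = _unit(_unit(edit_img) - _unit(src_img))
    d_txt = _unit(_unit(tgt_txt) - _unit(src_txt))
    return float(d_img @ d_txt)

# Toy 4-d "embeddings" standing in for real CLIP features.
rng = np.random.default_rng(0)
src_img, edit_img, src_txt, tgt_txt = rng.normal(size=(4, 4))
cs = clip_score(edit_img, tgt_txt)
cds = clip_directional_similarity(src_img, edit_img, src_txt, tgt_txt)
```

A high CS alone can reward edits that match the prompt while destroying the source; CDS penalizes exactly that, which is why the paper reports both.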

Firstly, our method is constrained by the generative capabilities of the TIS model: our editing method will fail if the generative model cannot produce images consistent with the target prompt description. When editing real images, the original image must first be reconstructed. Some detailed information, especially facial details, may be lost during the reconstruction process, primarily due to the limitations of the VQ autoencoder [14]. Optimizing the VQ autoencoder is beyond the scope of this paper, as our objective is to provide a simple and universal editing framework. Addressing these challenges will be part of our future work.

6. Conclusion

In this work, we utilized probe analysis and conducted experiments to elucidate the following insights into TIS models: the cross-attention map carries the semantic information of the prompt, which leads to the ineffectiveness of image editing methods that rely on it. In contrast, the self-attention map captures the spatial structural information of the original image, playing an essential role in preserving the image's inherent structure during editing. Based on our comprehensive analysis and empirical evidence, we have streamlined current image editing algorithms and proposed an innovative image editing approach. Our approach does not require additional tuning or the alignment of target and source prompts to achieve effective object or background editing in images. In extensive experiments across multiple datasets, our simplified method has outperformed existing image editing algorithms. Furthermore, our algorithm can be seamlessly adapted to other TIS models.

Acknowledgements This work is partially supported by Alibaba Cloud through the Research Talent Program with South China University of Technology, and the Program for Guangdong Introducing Innovative and Entrepreneurial Teams (No. 2017ZT07X183).
References

[1] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
[2] Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 22560–22570, 2023.
[3] Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286. Association for Computational Linguistics, 2019.
[4] Guillaume Couairon, Asya Grechka, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. Flexit: Towards flexible semantic image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18270–18279, 2022.
[5] Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. Diffedit: Diffusion-based semantic image editing with mask guidance. In The Eleventh International Conference on Learning Representations, 2023.
[6] Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics, 2019.
[8] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
[9] Rinon Gal, Or Patashnik, Haggai Maron, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. Stylegan-nada: Clip-guided domain adaptation of image generators. ACM Transactions on Graphics (TOG), 41(4):1–13, 2022.
[10] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross-attention control. In The Eleventh International Conference on Learning Representations, 2023.
[11] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[12] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[13] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6007–6017, 2023.
[14] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations, 2014.
[15] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 554–561, 2013.
[16] Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1073–1094. Association for Computational Linguistics, 2019.
[17] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022.
[18] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6038–6047, 2023.
[19] Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
[20] OpenAI. Improving image generation with better captions. https://cdn.openai.com/papers/dall-e-3.pdf, 2023.
[21] Dong Huk Park, Grace Luo, Clayton Toste, Samaneh Azadi, Xihui Liu, Maka Karalashvili, Anna Rohrbach, and Trevor Darrell. Shape-guided diffusion with inside-outside attention. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4198–4207, 2024.
[22] Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image translation. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1–11, 2023.
[23] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[24] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
[25] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
[26] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[27] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
[28] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[29] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
[30] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
[31] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
[32] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
[33] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.
[34] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1921–1930, 2023.
[35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[36] Michael E Wall, Andreas Rechtsteiner, and Luis M Rocha. Singular value decomposition and principal component analysis. In A Practical Approach to Microarray Data Analysis, pages 91–109. Springer, 2003.
[37] Chengyu Wang, Zhongjie Duan, Bingyan Liu, Xinyi Zou, Cen Chen, Kui Jia, and Jun Huang. Pai-diffusion: Constructing and serving a family of open chinese diffusion models for text-to-image synthesis on the cloud. arXiv preprint arXiv:2309.05534, 2023.
[38] Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 2022.
[39] Fangneng Zhan, Yingchen Yu, Rongliang Wu, Jiahui Zhang, Shijian Lu, Lingjie Liu, Adam Kortylewski, Christian Theobalt, and Eric Xing. Multimodal image synthesis and editing: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[40] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
