2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

Text-to-image Editing by Image Information Removal



Zhongping Zhang¹   Jian Zheng²   Jacob Zhiyuan Fang²   Bryan A. Plummer¹

¹Boston University   ²Amazon Alexa AI
¹{zpzhang,bplum}@bu.edu   ²{nzhengji, zyfang}@amazon.com

Abstract

Diffusion models have demonstrated impressive performance in text-guided image generation. Current methods that leverage the knowledge of these models for image editing either fine-tune them using the input image (e.g., Imagic) or incorporate structure information as additional constraints (e.g., ControlNet). However, fine-tuning large-scale diffusion models on a single image can lead to severe overfitting issues and lengthy inference time. Information leakage from pretrained models also makes it challenging to preserve image content not related to the text input. Additionally, methods that incorporate structural guidance (e.g., edge maps, semantic maps, keypoints) find it difficult to retain attributes like colors and textures. Using the input image as a control could mitigate these issues, but since these models are trained via reconstruction, a model can simply hide information about the original image when encoding it, perfectly reconstructing the image without learning the editing task. To address these challenges, we propose a text-to-image editing model with an Image Information Removal module (IIR) that selectively erases color-related and texture-related information from the original image, allowing us to better preserve the text-irrelevant content and avoid issues arising from information hiding. Our experiments on CUB, Outdoor Scenes, and COCO show that our approach achieves the best editability-fidelity trade-off. In addition, a user study on COCO shows that our edited images are preferred 35% more often than those of prior work.

Figure 1. We aim to edit the specific content of the input image according to text descriptions while preserving text-irrelevant image content. Prior work based on large-scale diffusion models has followed two major approaches for image editing: (A) fine-tuning the pretrained models or text embeddings (e.g., Imagic [9] or DreamBooth [29]), or (B) introducing structural guidance as an additional constraint to control the spatial information of the generated image (e.g., ControlNet [40] or MaskSketch [2]). In our work, shown in (C), our approach conditions on both the original image and the structural guidance to better preserve the text-irrelevant content of the image. E.g., our model successfully preserves the original attributes of the airplane (outlined by the green bounding box) in the generated image. In contrast, previous methods such as Imagic (A) and ControlNet (B) not only alter the sky and background but also modify the attributes of the airplane (outlined by the red bounding boxes), which is unwanted in this example.

1. Introduction

Text-driven image editing aims to modify the specific content of an image based on its textual descriptions. Inspired by the powerful capability of large-scale text-to-image generation models [16, 26, 28, 30], recent approaches have leveraged the prior knowledge of these pretrained models for image editing [1, 2, 9, 29, 40, 42]. The majority of existing editing approaches follow two strategies: 1) optimization-based methods, which update network parameters or feature embeddings for each input image, as shown in Figure 1 (A); or 2) structure-guided methods, which introduce structural guidance (e.g., an edge map, user scribble, segmentation map, or pose estimation) as additional constraints for image generation, as shown in Figure 1 (B). The effectiveness of these models has been demonstrated on tasks like style transfer [42], texture editing [1], shape editing [9], appearance modification [29], and color editing [40], among others. However, for optimization-based methods, fine-tuning large-scale models on single or few images results in severe over-fitting issues and prolongs inference time [42].
Images generated by finetuned models and embeddings may contain unexpected visual artifacts due to information leakage and fail to preserve the text-irrelevant content of the original image [42]. Structure-guided methods also have pitfalls: since structural guidance usually contains no information about colors or textures, these frameworks have difficulty preserving the text-irrelevant content of the original image. As outlined by the red bounding boxes in Figure 1 (A) and (B), both Imagic [9] and ControlNet [40] fail to preserve the text-irrelevant content of the original image: Imagic modifies the shape of the airplane, while ControlNet changes the color and textures of the airplane.

To address the aforementioned issues, we introduce the original image as an additional control for image editing. In this way the model can fully incorporate the content of the input image, allowing it to effectively preserve the text-irrelevant content. However, this results in an identity mapping issue [12], where the model can simply map the input directly to the output. This is primarily caused by the image reconstruction objective in the editing task, which is perfectly optimized by an identity mapping. Prior works attempt to alleviate this issue by either learning disentangled features [19, 37, 38] or using an attribute classifier to remove the target attribute [14, 15]. Both of these approaches unavoidably introduce additional computational overhead and greatly limit their application scenarios. For example, in Figure 1, the input image only has text annotations and does not have scene attribute labels such as "daylight" or "sunset." Therefore, these methods cannot be applied to convert the input image from "daylight" to "sunset."

We propose an Image Information Removal module (IIR-Net) to partially remove the image information from the input image, as illustrated in Figure 1 (C). Specifically, this erasure of image information arises from two components. First, we localize the Region of Interest (RoI¹) and erase the color-related information. Second, we apply Gaussian noise to the input image, which randomly eliminates the texture-related information. By tweaking the noise intensity, the model is capable of adapting to various tasks. For example, in color editing tasks, we decrease the noise intensity to zero to preserve most information from the input image except the color. In texture editing tasks, a higher noise intensity is used to eliminate most information from the target region, leaving only the structural prior. Given the original image, we then concatenate the structure map with the attribute-excluded features as additional controls to the editing model. With our simple yet effective image information removal module, we avoid the identical mapping issue, as the model is now forced to not only reconstruct the original image but also predict the noised image regions.

¹We refer to the modified regions of the target image as the RoI. In our work, the RoI is localized by Grounded-SAM [10, 18]. For tasks where the entire image is subject to modification, such as scene attribute transfer or style transfer, we simply define the entire image as the RoI.

We summarize the contributions of our work as follows:
• We introduce the original image as an additional guidance to pretrained generative diffusion models for image editing tasks. Compared with existing image editing methods [9, 40], IIR-Net more effectively preserves the text-irrelevant content of the input image while also generating new features according to the language descriptions.
• We propose an image information removal module to counter the identical mapping issue [12]. IIR-Net partially erases image information such as colors or textures from the input, and reconstructs the original image according to the text descriptions and the attribute-excluded features. Compared with prior work for solving this issue [14, 37, 38], IIR-Net does not require attribute labels to learn disentangled features or attribute classifiers and can thus be applied to images without attribute labels.
• We conduct extensive quantitative and qualitative experiments on three public datasets: CUB [35], COCO [17], and Outdoor Scenes [11]. Our results demonstrate that our model improves the fidelity-editability trade-off over the state-of-the-art with clear inference speed advantages. E.g., compared to Imagic [9], IIR-Net improves the LPIPS score from 0.57 to 0.30 on COCO, with an inference speed improvement of two orders of magnitude.

2. Related Work

Feed-forward transformation image generation and editing. Early work in text-to-image generation and editing often used text-to-image generators based on conditional GANs [3, 12, 13, 21, 27, 33, 36, 39]. Limited by the scalability of conditional GANs and the size of image datasets, these methods only supported specific image domains and language descriptions. More recent methods typically train conditional diffusion models [25, 26, 28, 30] on massive datasets (e.g., LAION-400M [31]). Due to the difficulty of obtaining image pairs before and after editing, current image editing frameworks [2, 9, 22, 29, 40, 42] are mostly developed on top of pretrained text-to-image generation models [28, 30]. However, among these methods, those that leverage the feed-forward transformation mechanism mostly focus on structural guidance. E.g., ControlNet [40] leverages structure maps like the edge map, semantic map, or pose estimation to control the spatial structure of generated images, and MaskSketch [2] uses a sketch as additional control to generate images. Thus, these methods cannot preserve the other attributes of the image, such as colors or textures, well, and may produce significant deviations from the input image. To solve this, we incorporate the original image as input to our model and propose an image information removal module to solve the identical mapping issue [12].

Optimization-based Methods. Prior work has demonstrated that optimization-based methods, which update network parameters on each image input, work well for image generation [32, 34, 43].
Several methods use CLIP [23] as a constraint to guide the embedding features of predicted images [1, 5, 8, 20]. Inspired by the success of pretrained text-to-image generation frameworks [25, 26, 30], researchers have also proposed methods to finetune these models for image editing (e.g., Imagic [9], DreamBooth [29], SINE [42], Textual Inversion [6]). Compared to feed-forward transformation methods [2, 40], these models retain more information from the original image since they take the whole image, instead of just a structure map, as additional guidance. However, as we will show in Section 4.2, some image content such as the background or irrelevant attributes of target objects may still be changed in this process. In addition, the inference time of these optimization-based methods is much longer than that of feed-forward transformation methods due to image-specific finetuning.

3. IIR-Net: Text-to-Image Editing by Image Information Removal

Given an input image x and its corresponding text prompt S, our task aims to create the desired content according to S while preserving the text-irrelevant content of x. To achieve this, we incorporate the original image x as an additional control to a pretrained text-to-image generation model, which is discussed in Section 3.1. However, since the model is trained on the image reconstruction task, incorporating the original image can lead to the identical mapping issue, in which the model simply maps the input image to the output. To address this challenge, we propose our image information removal module in Section 3.2. Figure 2 provides an overview of our approach.

3.1. Conditional Latent Diffusion Model

As discussed in the Introduction, preserving the text-irrelevant content of the original image is critical for text-to-image editing. Leveraging structural guidance as an additional hint (e.g., ControlNet [40], MaskSketch [2]) can lead to significant information loss from the original image. To address this, we introduce the original image as an additional control to our model, which preserves all information from the input image. In this section, we first introduce the pretrained text-to-image generation model, Stable Diffusion [28], as a preliminary to our method, and discuss our IIR component in Section 3.2.

Given an input image x0 and its corresponding noisy image xT, Stable Diffusion [28] consists of a series of equally weighted denoising autoencoders εθ(xt, t), where t ranges from 1 to T. The denoising autoencoders are trained to predict the noise ε in xt according to the time step t and the noisy input xt. The objective function of Stable Diffusion is

\[ \mathcal{L}_{LDM} := \mathbb{E}_{\mathcal{E}(x),\,\epsilon\sim\mathcal{N}(0,1),\,t}\big[\,\|\epsilon - \epsilon_\theta(z_t, t)\|_2^2\,\big], \tag{1} \]

where E is the pretrained VQGAN encoder [4], which encodes the image xt to latent features zt (and vice versa via its decoder). For conditional generation, the denoising autoencoders εθ take τθ(y) as an additional input, where τθ(y) is a domain-specific encoder that extracts feature embeddings from the condition y. The condition y represents elements like text prompts and semantic maps, among others. Given image-condition pairs, the Conditional Latent Diffusion Model (CLDM) is optimized by

\[ \mathcal{L}_{CLDM} := \mathbb{E}_{\mathcal{E}(x),\,y,\,\epsilon\sim\mathcal{N}(0,1),\,t}\big[\,\|\epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y))\|_2^2\,\big], \tag{2} \]

where τθ and εθ are jointly optimized. In our model, the condition y consists of the text description S and the original image x0, and is defined as

\[ \tau_\theta(y) := \{\,\tau_{\theta_1}(S),\;\tau_{\theta_2}(R(x_0))\,\}, \tag{3} \]

where we use the CLIP model [24] as τθ1(·) to encode the text description S and use ControlNet [40] as τθ2(·) to encode the feature R(x0). R(·) denotes our image information removal module, which we discuss in the next section.
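To make the conditioning in Eqs. (2)-(3) concrete, the sketch below shows the shape of a training step under simplifying assumptions: ToyDenoiser, the toy feature sizes, and the linear noising schedule are hypothetical stand-ins for the Stable Diffusion UNet, the CLIP text encoder τθ1, and the ControlNet-style encoder τθ2, not the actual implementation.

```python
# Schematic of the conditional denoising objective in Eqs. (2)-(3).
# ToyDenoiser and the toy feature sizes are hypothetical stand-ins for the
# Stable Diffusion UNet, the CLIP text encoder (tau_theta1), and the
# ControlNet-style encoder of R(x0) (tau_theta2).
import torch
import torch.nn as nn


class ToyDenoiser(nn.Module):
    """Stand-in for eps_theta(z_t, t, tau1(S), tau2(R(x0)))."""

    def __init__(self, latent_ch=4, text_dim=32):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, latent_ch)
        self.ctrl_proj = nn.Conv2d(latent_ch, latent_ch, kernel_size=1)
        self.net = nn.Conv2d(latent_ch, latent_ch, kernel_size=3, padding=1)

    def forward(self, z_t, t, text_emb, control_feat):
        h = z_t + self.ctrl_proj(control_feat)              # inject image condition
        h = h + self.text_proj(text_emb)[:, :, None, None]  # inject text condition
        return self.net(h)                                  # predicted noise


def cldm_loss(model, z0, text_emb, control_feat, num_steps=1000):
    """Sample a timestep, noise the latent, and regress the noise (Eq. 2)."""
    b = z0.shape[0]
    t = torch.randint(0, num_steps, (b,))
    noise = torch.randn_like(z0)
    # Simplified linear schedule; Stable Diffusion uses a fixed beta schedule instead.
    alpha_bar = 1.0 - t.float() / num_steps
    a = alpha_bar.view(-1, 1, 1, 1)
    z_t = a.sqrt() * z0 + (1.0 - a).sqrt() * noise
    pred = model(z_t, t, text_emb, control_feat)
    return torch.mean((noise - pred) ** 2)


if __name__ == "__main__":
    model = ToyDenoiser()
    z0 = torch.randn(2, 4, 16, 16)       # E(x0): latent of the input image (toy)
    text = torch.randn(2, 32)            # tau_theta1(S): text embedding (toy)
    ctrl = torch.randn(2, 4, 16, 16)     # tau_theta2(R(x0)): control feature (toy)
    print(cldm_loss(model, z0, text, ctrl).item())
```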
3.2. Image Information Removal

As discussed in the Introduction, training solely on image reconstruction can lead to the identical mapping issue. Previous approaches address this issue by learning disentangled features [7] or attribute classifiers [14]. However, these methods require annotated attributes, restricting their application scenarios. To overcome this challenge, we propose our image information removal module, which incorporates color and texture removal operations. Our removal operations effectively mitigate the identical mapping issue without requiring additional annotated labels.

Color-related Information Removal. In Figure 2 (B), we present our color information removal operation. Given the input image x0 and its corresponding text prompt S, we employ Grounded-SAM [10, 18] to localize the RoI. The color information of x0 is then erased by

\[ x'_0 = \mathrm{rgb2gray}(x_0 \odot m_{RoI}) + x_0 \odot (1 - m_{RoI}), \tag{4} \]

where m_RoI is the Grounded-SAM segmentation mask. Through the application of color-related information removal to the input image x0, our model demonstrates proficiency in color-related editing tasks, such as transforming a "white airplane" into a "green airplane." However, as depicted in Figure 5, the model encounters challenges when attempting to modify texture-related information, such as changing "lawn" to "sand." To address this limitation, we introduce our texture-related information removal module.
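The color-removal step of Eq. (4) amounts to a masked grayscale conversion. Below is a minimal sketch under the assumption that the RoI mask has already been produced (by Grounded-SAM in our pipeline); here a hard-coded rectangle stands in for the real mask.

```python
# Minimal sketch of Eq. (4): gray out the RoI, keep the rest of the image untouched.
# The rectangular mask is a placeholder for the Grounded-SAM segmentation mask.
import numpy as np


def remove_color(image_rgb: np.ndarray, roi_mask: np.ndarray) -> np.ndarray:
    """image_rgb: (H, W, 3) floats in [0, 1]; roi_mask: (H, W) binary array."""
    gray = image_rgb @ np.array([0.299, 0.587, 0.114])   # luminance-weighted rgb2gray
    gray_rgb = np.repeat(gray[..., None], 3, axis=-1)
    m = roi_mask[..., None].astype(image_rgb.dtype)
    return gray_rgb * m + image_rgb * (1.0 - m)           # Eq. (4)


if __name__ == "__main__":
    img = np.random.rand(64, 64, 3)
    mask = np.zeros((64, 64))
    mask[16:48, 16:48] = 1.0                              # placeholder RoI
    out = remove_color(img, mask)
    print(out.shape, float(out.min()), float(out.max()))
```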
Figure 2. IIR-Net overview. Our model mainly consists of two modules: (A) Conditional Latent Diffusion Model: we introduce the original image x as an additional control to our model to preserve the text-irrelevant features of x (see Section 3.1 for a detailed discussion); (B) Image Information Removal Module: we erase the image information mainly by two operations. First, we convert RGB values to gray values in the RoI to exclude the color information. Second, we add Gaussian noise to the input image to partially erase the texture-related information. This module is applied to address the identical mapping issue (see Section 3.2 for a detailed discussion).

Texture-related Information Removal. We eliminate the texture-related information by adding noise to the image condition x'0 of our model, denoted by

\[ q(x_K \mid x'_0) = \prod_{k=1}^{K} q(x_k \mid x_{k-1}), \tag{5} \]

\[ q(x_k \mid x_{k-1}) = \mathcal{N}\!\left(x_k;\; \sqrt{1-\beta_k}\,x_{k-1},\; \beta_k \mathbf{I}\right), \tag{6} \]

where k denotes the time step applied to xk, which is different from the time step t applied to xt. Note that xt is obtained by adding noise to the original image x0 in the diffusion model, whereas xk is obtained by adding noise to the image condition x'0. During training, we randomly sample xk from {x'0, . . . , xK}.
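A rough sketch of this noise augmentation is shown below, using the closed form of the forward process implied by Eqs. (5)-(6); the beta schedule and the value of K are illustrative assumptions rather than the paper's exact settings.

```python
# Sketch of the texture-removal step (Eqs. 5-6): apply k forward-diffusion steps of
# Gaussian noise to the color-removed condition image x0'. Schedule values are assumptions.
import torch


def noise_condition(x0_prime, k, K=100, beta_start=1e-4, beta_end=0.02):
    """Return a sample of x_k given x0', using the closed form of the forward process."""
    betas = torch.linspace(beta_start, beta_end, K)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)            # cumulative product of (1 - beta_k)
    a = alpha_bar[k - 1] if k > 0 else torch.tensor(1.0)     # k = 0 leaves the condition clean
    return a.sqrt() * x0_prime + (1.0 - a).sqrt() * torch.randn_like(x0_prime)


if __name__ == "__main__":
    x0p = torch.rand(1, 3, 64, 64)
    x_low = noise_condition(x0p, k=5)      # low noise intensity, e.g., for color editing
    x_high = noise_condition(x0p, k=90)    # high noise intensity, e.g., for texture editing
    print(x_low.shape, x_high.shape)
```

Lowering k toward zero corresponds to the color-editing setting described in the Introduction, while a large k leaves mostly structural information.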
While xk inherently preserves the structure information of x0, we find that explicitly incorporating additional structural guidance, such as edges, helps the model better capture structural information. Thus, we concatenate xk with the prediction of a Canny edge detector, C(x0). The output of our image information removal module is therefore

\[ R(x_0) = \left[\,x_k,\; C(x_0)\,\right]. \tag{7} \]

Given the output of our image information removal module R(x0), the final objective of IIR-Net is defined as

\[ \mathcal{L}_{IIR\text{-}Net} := \mathbb{E}_{\mathcal{E}(x),\,y,\,\epsilon\sim\mathcal{N}(0,1),\,t}\Big[\,\big\|\epsilon - \epsilon_\theta\big(z_t, t, \tau_{\theta_1}(S), \tau_{\theta_2}(R(x_0))\big)\big\|_2^2\,\Big]. \tag{8} \]

4. Experiments

4.1. Datasets and Experiment Settings

Datasets. We evaluate the performance of our model on three standard datasets: CUB [35], Outdoor Scenes [11], and COCO [17]. CUB [35] contains 200 bird species and is split into 8,855 training images and 2,933 test images. Outdoor Scenes [11] contains 8,571 images captured from 101 webcams, with each webcam collecting 60-120 images showcasing different attributes like weather, season, or time of day. COCO [17] contains 82,783 training images and 40,504 validation images. Following [9], we randomly select 150 test images from each dataset to evaluate the performance of each method.

Metrics. Following [9], we adopt the perceptual metric LPIPS [41] and the CLIP score [23] as our quantitative metrics. LPIPS measures image fidelity and CLIP evaluates the model's editability. Additionally, we perform quantitative experiments via a user study and inference time measurements to evaluate the effectiveness and efficiency of our model.

Baselines. We compare IIR-Net with three state-of-the-art approaches: Text2LIVE [1], Imagic [9], and ControlNet [40]. For Text2LIVE, we set the number of optimization steps to 600. For Imagic, both the text embedding optimization steps and the model fine-tuning steps are set to 500; we sample the interpolation hyperparameter η from 0.1 to 1 with a 0.1 interval, and the guidance scale is set to 3. For ControlNet and IIR-Net, we generate images with a CFG scale of 9.0 and 20 DDIM steps by default.

Implementation Details. We initialized our model weights from Stable Diffusion 1.5 [28] and ControlNet [40]. During training, we applied a batch size of 8 and a maximum learning rate of 1 × 10⁻⁶. We finetuned our models for approximately 100 epochs on the CUB [35] dataset, and around 5 epochs on the Outdoor Scenes [11] and COCO [17] datasets. The training process was parallelized on 2 NVIDIA RTX A6000s. To adapt the image conditions in our model, we configured the image encoder block with 4 channels: 3 channels for the RGB image and 1 channel for the edge map. We finetuned the Stable Diffusion decoder for experiments on CUB, as these images primarily focus on various birds with a consistent style. We froze the Stable Diffusion decoder for the Outdoor Scenes and COCO datasets, since these datasets comprise natural images with diverse objects and varying styles.
Figure 3. Qualitative comparison on CUB and Outdoor Scenes. From top to bottom: input image, Text2LIVE [1], Imagic [9], ControlNet [40], and ours. Generated images have 512 pixels on their shorter side. See Section 4.2 for discussion.
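One plausible way to compute the two metrics described in Section 4.1 with off-the-shelf packages is sketched below; the preprocessing details and the conventional 100x scaling of the CLIP score are assumptions, not necessarily the authors' evaluation code.

```python
# Sketch of the LPIPS / CLIP-score evaluation described in Sec. 4.1, using the
# lpips and transformers packages. Preprocessing and score scaling are assumptions.
import numpy as np
import torch
import lpips
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

lpips_fn = lpips.LPIPS(net="alex")   # perceptual distance: lower means higher fidelity
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def to_lpips_tensor(img: Image.Image) -> torch.Tensor:
    x = torch.from_numpy(np.asarray(img.convert("RGB"))).float() / 255.0
    return x.permute(2, 0, 1).unsqueeze(0) * 2.0 - 1.0       # (1, 3, H, W) in [-1, 1]


@torch.no_grad()
def evaluate_pair(original: Image.Image, edited: Image.Image, prompt: str):
    fidelity = lpips_fn(to_lpips_tensor(original), to_lpips_tensor(edited)).item()
    inputs = clip_proc(text=[prompt], images=edited, return_tensors="pt", padding=True)
    out = clip_model(**inputs)
    img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    clip_score = 100.0 * (img_emb * txt_emb).sum().item()
    return fidelity, clip_score
```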

4.2. Qualitative Results

Entire-image Editing on the CUB and Outdoor Scenes Datasets. Figure 3 presents a qualitative comparison of the edited images generated by our model and the baselines. In Figure 3 (A), we present a comparison on the CUB [35] dataset. We observe that our model can accurately manipulate parts of the bird while preserving the text-irrelevant content of the original image. For example, in the first column of Figure 3 (A), while baselines such as ControlNet and Imagic can recognize "yellow" and "blue" from the text prompt, they both fail to apply them to the correct corresponding parts of the bird: Imagic generates a bird with a blue crown and yellow wings, while ControlNet generates a blue head and a red breast. In contrast, our model accurately edits the bird part by part according to the prompt and produces a bird with blue wings, a yellow body, and a red crown. In addition, we observe that the background of the images generated by Imagic and ControlNet has been changed. This is because Imagic and ControlNet do not directly use the original image as their input: Imagic optimizes the text embeddings to obtain features that reflect the attributes of the original image, and ControlNet uses the Canny edge map as input. Thus, it is challenging for these methods to preserve the text-irrelevant content of the original image. In contrast, our model takes the original image as input and only erases the text-relevant content, thus preserving the text-irrelevant content effectively.

In Figure 3 (B), we present a comparison on the Outdoor Scenes [11] dataset. Consistent with our findings on the CUB dataset, we observe that baselines like Imagic and ControlNet tend to modify the text-irrelevant content of the original image, such as textures and background, while Text2LIVE only introduces limited visual effects to the original image and may fail to generate images aligned with the text descriptions. For example, in the second column of Figure 3 (B), the images produced by Imagic and ControlNet are well aligned with the text descriptions ("summer," "daylight"), but they introduce unexpected objects such as trees or a lake. In contrast, Text2LIVE preserves the original image well but fails to align with the text descriptions, as seen with the snow-covered field in summer. Our method, however, effectively modifies the desired content, such as changing "winter" to "summer," while preserving the original content of the image.

Figure 4. Qualitative comparison for various editing tasks on the COCO dataset. From top to bottom: color editing, scene attribute transfer, texture editing, and style transfer. Generated images have 512 pixels on their shorter side. Objects that are the target of modification are in red bounding boxes, whereas objects that should be preserved are in green bounding boxes. See Section 4.2 for discussion.

Region-based Image Editing on COCO. Unlike object-centric datasets such as CUB and Outdoor Scenes, COCO images can contain complex scenes with many objects, yet only parts of the input image may require modification. Thus, we apply Grounding-DINO [18] and SAM [10] to localize the Region of Interest (RoI) that requires editing². Figure 4 presents a qualitative comparison of our method and prior work on various image editing tasks. We find that our method produces images that are well aligned with the text descriptions while the non-edited components better represent the original images. E.g., in the color editing task, although Imagic and ControlNet generate a blue bus according to the text prompt, Imagic changes the original shape of the bus and ControlNet modifies the bus' texture. In contrast, our method only modifies the color attribute while preserving irrelevant attributes. Furthermore, our model generates images that appear more natural and visually appealing. E.g., in the scene attribute transfer task, the visual effect of "sunset" produced by our model is naturally aligned with the original image, whereas Text2LIVE introduces obvious artificial effects to the airplane.

Finally, we evaluate our model on tasks where the original ControlNet performs well, such as texture editing and style transfer. Our results show that adapting text-to-image generation models to image editing tasks does not notably compromise their capabilities. For example, in the style transfer examples on COCO images, our method still retains the ability to transfer a photorealistic image to an artistic style. See the supplementary for additional examples.

²Since Text2LIVE and Imagic automatically localize the RoI, we apply Grounding-DINO and SAM to ControlNet and our method.
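As a concrete illustration of this RoI localization, the sketch below produces a mask with the segment_anything package from a bounding box. In our pipeline the box is predicted by Grounding-DINO from the text prompt; here it is replaced by a hypothetical placeholder function, and the checkpoint path is an assumption.

```python
# Sketch of box-prompted RoI segmentation with the segment_anything package.
# detect_box_with_grounding_dino is a hypothetical placeholder: in IIR-Net the box
# comes from Grounding-DINO; here we simply return a fixed box for illustration.
import numpy as np
from segment_anything import SamPredictor, sam_model_registry


def detect_box_with_grounding_dino(image: np.ndarray, phrase: str) -> np.ndarray:
    """Placeholder for Grounding-DINO: return an xyxy box covering the image center."""
    h, w = image.shape[:2]
    return np.array([w // 4, h // 4, 3 * w // 4, 3 * h // 4])


def localize_roi(image: np.ndarray, phrase: str, ckpt: str = "sam_vit_b.pth") -> np.ndarray:
    """Return a binary (H, W) RoI mask for the object described by `phrase`."""
    sam = sam_model_registry["vit_b"](checkpoint=ckpt)       # assumes a downloaded checkpoint
    predictor = SamPredictor(sam)
    predictor.set_image(image)                               # (H, W, 3) uint8 RGB
    box = detect_box_with_grounding_dino(image, phrase)
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)
    return masks[0]
```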


Figure 5. Ablation Study. We perform experiments to evaluate the effectiveness of our color removal and texture removal operations.
Images generated without our image information removal module are outlined by the blue bounding box. See Section 4.2 for discussion.

Ablation Study. In Figure 5, we provide an ablation study of IIR-Net. We find that without our unsupervised image content removal mechanism, the model always outputs the input image as the predicted image, i.e., the identical mapping issue [12]. E.g., the images in the blue bounding box retain the white airplane and green grass, showing a lack of alignment with the text descriptions. By incorporating the color removal mechanism (see the images with a low noise level), our model performs well on tasks such as color editing. For example, when changing the airplane's color from white to green, our model preserves most of the airplane's attributes, modifying only the color. We observe that the color removal mechanism can find texture editing challenging. For example, as seen in the second row of the figure, the images generated with a low noise level still exhibit the grass texture instead of the intended "sand" texture. Therefore, we incorporate noise augmentation on the input images to better handle such editing tasks. As shown in the second row, our model successfully modifies the grass texture to sand under high-level noise conditions. In practical applications, users can adjust the noise level according to the editing task to achieve optimal performance.

4.3. Quantitative Results

Editability-fidelity Tradeoff. Table 1 reports our quantitative results on CUB, Outdoor Scenes, and COCO. As observed in our qualitative experiments, our model achieves a better tradeoff between image fidelity and editability compared to other state-of-the-art methods. E.g., our model achieves the best LPIPS scores (0.138 and 0.301) and comparable CLIP scores (29.57 and 24.30) on CUB and COCO. On Outdoor Scenes, our model achieves the highest CLIP score and the second best LPIPS score. Text2LIVE achieves a better LPIPS score than our method on Outdoor Scenes. However, this may be because Text2LIVE mainly augments the scenes with new visual effects rather than directly modifying their attributes. E.g., Text2LIVE fails to change the grassland to a snowy landscape or convert lush trees to bare ones in these scenes.

User Study. We conducted a user study to quantitatively evaluate the performance of IIR-Net, as shown in Table 2. We randomly selected 30 images from COCO and applied each model to generate the modified images, resulting in a total of 120 generated images. Each image was annotated three times, and we asked our annotators to judge whether the image is correctly manipulated based on the text guidance while preserving the text-irrelevant content of the original image. As reported in the table, IIR-Net significantly outperforms the baselines. See the supplementary for additional details on our user study.

Inference Time. Table 2 presents a comparison of the inference time and its standard error using the same Stable Diffusion v1.5 [28] backbone for Imagic, ControlNet, and our method. All methods are benchmarked on an NVIDIA RTX A6000 GPU. We find our method has significantly faster inference times than Imagic, boosting inference speed by two orders of magnitude when processing 512×512 images. In addition, our method is approximately 50x faster than Text2LIVE. We note that both ControlNet and our method have an inference time of around 5s, demonstrating that our approach introduces negligible overhead over ControlNet.
                        CUB                Outdoor Scenes          COCO
Method             LPIPS ↓  CLIP ↑      LPIPS ↓  CLIP ↑      LPIPS ↓  CLIP ↑
Imagic [9]          0.406    27.03       0.551    22.85       0.567    21.53
Text2LIVE [1]       0.162    30.37       0.218    22.64       0.495    25.11
ControlNet [40]     0.528    29.49       0.618    23.89       0.606    23.57
IIR-Net (ours)      0.138    29.57       0.479    25.45       0.301    24.30

Table 1. Quantitative experiments of image manipulation on CUB [35], Outdoor Scenes [11], and COCO [17] datasets. CLIP [23] is used
to evaluate the image editing performance and LPIPS is applied to evaluate image fidelity. Generated images have been resized to 224×224
resolution for CLIP score. We use the “ViT-B/32” version of CLIP. See Section 4.3 for discussion.

Method             User Preference    Inference Time
Text2LIVE [1]           30.0%          281.6 ± 1.72 s
Imagic [9]              23.3%          483.4 ± 1.31 s
ControlNet [40]         33.3%            5.0 ± 0.04 s
IIR-Net (ours)          68.3%            5.0 ± 0.03 s

Table 2. We randomly select 30 images from COCO for the user study and speed evaluation. The second column reports user judgments on the correctness of the image manipulation; the third column reports the speed of our method vs. the baselines. Our method has negligible overhead compared to ControlNet, and is significantly faster than Text2LIVE and Imagic. See Section 4.3 for discussion.

5. Limitations & Broader Impacts

Limitations. We identify three failure cases of our method in this section. First, the attributes of the original image are likely to be modified in non-rigid image editing tasks. Second, it is challenging for our method to change the brightness of the input image drastically. Third, the target object may be localized and segmented inaccurately. We present examples of these three failure cases in Figure 6. As shown in the top row, though our method can achieve non-rigid image editing according to the input image and a modified structural guidance, we observe that the model fails to map some attributes to the correct parts. E.g., the bird in the input image has a grey crown while the edited image shows a bird with a gray head; the color of the wings is also slightly different from the input bird. In the middle row, we find that our model fails to change the brightness of the image in some cases. E.g., the input image is a night view; the brightness of the image is therefore low, and the model tends to reconstruct an image with low brightness even if the target text is "daylight," "sunny." In the bottom row, we observe that our segmentation module fails to accurately localize the target object according to the prompt due to text ambiguity: while the prompt specifies the chair on the right-hand side, our model modifies the attributes of the chair on the left-hand side.

Figure 6. Failure cases include inconsistencies with the original image in the non-rigid image editing task (top), challenges in notably modifying the brightness of the image (middle), and inaccurate localization or segmentation (bottom). See Section 5 for discussion.

Broader Impacts. Our model is designed to perform image editing according to user-provided language descriptions. Thus, it enables modification of attributes such as colors, textures, or styles in the original images. As with other image generation and editing approaches, our model may be used to synthesize images that contain misinformation. Therefore, it is important for practitioners to review and control how images are manipulated to avoid misinformation. Further research on detecting machine-generated images is needed to mitigate this potential issue.

6. Conclusion

In this paper, we propose IIR-Net, a text-to-image editing model that incorporates the original image by selectively erasing its image information. IIR-Net mainly consists of two modules: a conditional diffusion model that takes the original image as an additional control, and an image information removal module that addresses the identical mapping issue. We demonstrate that IIR-Net outperforms the state-of-the-art in both qualitative and quantitative evaluations on the CUB, Outdoor Scenes, and COCO datasets. For example, compared to Imagic, IIR-Net improves the LPIPS score from 0.57 to 0.30 and the CLIP score from 21.53 to 24.30 on COCO, with a speed improvement of two orders of magnitude. We also use qualitative examples to demonstrate the effectiveness of our model on various image editing tasks, validating that our model can modify the target attribute according to language descriptions while preserving the text-irrelevant content of the original image well.

Acknowledgements. This material is based upon work supported, in part, by DARPA under agreement number HR00112020054. Any opinions, findings, and conclusions or recommendations are those of the author(s) and do not necessarily reflect the views of the supporting agencies.
References

[1] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2LIVE: Text-driven layered image and video editing. In ECCV, pages 707–723. Springer, 2022.
[2] Dina Bashkirova, Jose Lezama, Kihyuk Sohn, Kate Saenko, and Irfan Essa. MaskSketch: Unpaired structure-guided masked image generation. In CVPR, 2023.
[3] Helisa Dhamo, Azade Farshad, Iro Laina, Nassir Navab, Gregory D. Hager, Federico Tombari, and Christian Rupprecht. Semantic image manipulation using scene graphs. In CVPR, pages 5213–5222, 2020.
[4] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, pages 12873–12883, 2021.
[5] Kevin Frans, Lisa Soros, and Olaf Witkowski. CLIPDraw: Exploring text-to-drawing synthesis through language-image encoders. Advances in Neural Information Processing Systems, 35:5207–5218, 2022.
[6] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
[7] Rinon Gal, Or Patashnik, Haggai Maron, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. StyleGAN-NADA: CLIP-guided domain adaptation of image generators. ACM Transactions on Graphics (TOG), 41(4):1–13, 2022.
[8] Ajay Jain, Ben Mildenhall, Jonathan T. Barron, Pieter Abbeel, and Ben Poole. Zero-shot text-guided object generation with dream fields. In CVPR, pages 867–876, 2022.
[9] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In CVPR, 2023.
[10] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
[11] Pierre-Yves Laffont, Zhile Ren, Xiaofeng Tao, Chao Qian, and James Hays. Transient attributes for high-level understanding and editing of outdoor scenes. ACM Transactions on Graphics (TOG), 33(4):1–11, 2014.
[12] Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, and Philip H. S. Torr. ManiGAN: Text-guided image manipulation. In CVPR, pages 7880–7889, 2020.
[13] Bowen Li, Xiaojuan Qi, Philip Torr, and Thomas Lukasiewicz. Lightweight generative adversarial networks for text-guided image manipulation. Advances in Neural Information Processing Systems, 33:22020–22031, 2020.
[14] Nannan Li and Bryan A. Plummer. Supervised attribute information removal and reconstruction for image manipulation. In ECCV, pages 457–473. Springer, 2022.
[15] Nannan Li, Kevin J. Shih, and Bryan A. Plummer. Collecting the puzzle pieces: Disentangled self-driven human pose transfer by permuting textures. arXiv preprint arXiv:2210.01887, 2022.
[16] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. GLIGEN: Open-set grounded text-to-image generation. In CVPR, pages 22511–22521, 2023.
[17] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755. Springer, 2014.
[18] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
[19] Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In ICML, pages 4114–4124. PMLR, 2019.
[20] Oscar Michel, Roi Bar-On, Richard Liu, Sagie Benaim, and Rana Hanocka. Text2Mesh: Text-driven neural stylization for meshes. In CVPR, pages 13492–13502, 2022.
[21] Seonghyeon Nam, Yunji Kim, and Seon Joo Kim. Text-adaptive generative adversarial networks: Manipulating images with natural language. Advances in Neural Information Processing Systems, 31, 2018.
[22] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In ICML, 2022.
[23] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021.
[24] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[25] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[26] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, pages 8821–8831. PMLR, 2021.
[27] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In ICML, pages 1060–1069. PMLR, 2016.
[28] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.
[29] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022.
[30] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
[31] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
[32] Tamar Rott Shaham, Tali Dekel, and Tomer Michaeli. SinGAN: Learning a generative model from a single natural image. In ICCV, pages 4570–4580, 2019.
[33] Ming Tao, Bing-Kun Bao, Hao Tang, Fei Wu, Longhui Wei, and Qi Tian. DE-Net: Dynamic text-guided image editing adversarial networks. In AAAI, 2023.
[34] Narek Tumanyan, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Splicing ViT features for semantic appearance transfer. In CVPR, pages 10748–10757, 2022.
[35] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.
[36] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In CVPR, pages 1316–1324, 2018.
[37] Guoxing Yang, Nanyi Fei, Mingyu Ding, Guangzhen Liu, Zhiwu Lu, and Tao Xiang. L2M-GAN: Learning to manipulate latent space semantics for facial attribute editing. In CVPR, pages 2951–2960, 2021.
[38] Xu Yao, Alasdair Newson, Yann Gousseau, and Pierre Hellier. A latent transformer for disentangled face editing in images and videos. In ICCV, pages 13789–13798, 2021.
[39] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, pages 5907–5915, 2017.
[40] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023.
[41] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, pages 586–595, 2018.
[42] Zhixing Zhang, Ligong Han, Arnab Ghosh, Dimitris Metaxas, and Jian Ren. SINE: Single image editing with text-to-image diffusion models. arXiv preprint arXiv:2212.04489, 2022.
[43] Zhongping Zhang, Huiwen He, Bryan A. Plummer, Zhenyu Liao, and Huayan Wang. Complex scene image editing by scene graph comprehension. In BMVC, 2023.