Text-to-Image Editing by Image Information Removal
…age generation [32, 34, 43]. Several methods use CLIP [23] as a constraint to guide the embedding features of predicted images [1, 5, 8, 20]. Inspired by the success of pretrained text-to-image generation frameworks [25, 26, 30], researchers have also proposed methods to finetune these models for image editing (e.g., Imagic [9], Dreambooth [29], SINE [42], Textual Inversion [6]). Compared to feed-forward transformation methods [2, 40], these models retain more information from the original image, since they take the whole image rather than just a structure map as additional guidance. However, as we will show in Section 4.2, some image content, such as the background or irrelevant attributes of target objects, may still be changed in this process. In addition, the inference time of these optimization-based methods is much longer than that of feed-forward transformation methods due to image-specific finetuning.

…where $\mathcal{E}$ is the pretrained encoder of VQGAN [4], used to encode the image $x_t$ into latent features $z_t$, and vice versa. For conditional generation, the denoising autoencoders take $\tau_\theta(y)$ as additional input, where $\tau_\theta$ is a domain-specific encoder that extracts feature embeddings from the condition $y$. The condition $y$ can be, for example, a text prompt or a semantic map. Given image-condition pairs, the Conditional Latent Diffusion Model (CLDM) is optimized by

$$\mathcal{L}_{\mathrm{CLDM}} := \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\!\left[\, \lVert \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y)) \rVert_2^2 \,\right], \tag{2}$$

where $\tau_\theta$ and $\epsilon_\theta$ are jointly optimized. In our model, the condition $y$ consists of the text description $S$ and the original image $x_0$.
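To make the objective in Eq. (2) concrete, the following is a minimal PyTorch-style sketch of one training step; the callables `encoder`, `tau`, and `eps_model`, the `alphas_cumprod` schedule tensor, and all names are our own stand-ins rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def cldm_loss(encoder, tau, eps_model, alphas_cumprod, x, y):
    """One training step of the conditional latent diffusion objective (Eq. 2).

    encoder:        frozen VQGAN/VAE encoder, maps images x -> latents z
    tau:            domain-specific condition encoder tau_theta(y)
    eps_model:      denoising network eps_theta(z_t, t, tau_theta(y))
    alphas_cumprod: 1-D tensor of cumulative noise-schedule products, one per step
    """
    z0 = encoder(x)                                       # clean latents
    t = torch.randint(0, alphas_cumprod.shape[0], (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)                            # epsilon ~ N(0, I)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps  # forward diffusion of z0
    eps_pred = eps_model(z_t, t, tau(y))                  # predict the added noise
    return F.mse_loss(eps_pred, eps)                      # || eps - eps_theta(...) ||^2
```

The same skeleton carries over to the IIR-Net objective in Eq. (8) below, where the condition embedding is built from both the text description and the information-removed image.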
[Figure 2 diagram: (A) Conditional Latent Diffusion Model: diffusion process z0 → zT and denoising process with a reconstruction loss, conditioned on the image hint and on the text (e.g., "This bird is yellow (red / black) in color with a short beak", encoded by CLIP). (B) Image Information Removal: Grounded SAM produces a segmentation mask of the RoI, followed by RGB→gray conversion and noise augmentation of the input image to form the condition.]
Figure 2. IIR-Net Overview. Our model mainly consists of two modules: (A) Conditional Latent Diffusion Model: We introduce the
original image x as additional control to our model to preserve the text-irrelevant features of x. See Section 3.1 for detailed discussion; (B)
Image Information Removal Module: We erase image information mainly through two operations. First, we convert RGB values to gray values within the RoI to remove color information. Second, we add Gaussian noise to the input image to partially erase texture-related
information. This module is applied to address the identical mapping issue. See Section 3.2 for detailed discussion.
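As a concrete reading of the two removal operations in the caption above, and of the condition $R(x_0) = [x_k, C(x_0)]$ introduced in Eq. (7) below, here is a minimal sketch using OpenCV and NumPy. The function name, the precomputed RoI mask, and the single Gaussian-noise step (standing in for sampling $x_k$ from the noising chain of Eqs. (5)-(6)) are our simplifications, not the authors' code.

```python
import cv2
import numpy as np

def remove_image_information(image_bgr, roi_mask, noise_std=0.3):
    """Sketch of R(x0) = [x_k, C(x0)]: grayscale the RoI, add noise, append Canny edges.

    image_bgr: uint8 array of shape (H, W, 3), the original image x0 (OpenCV BGR order)
    roi_mask:  boolean array of shape (H, W), the RoI (e.g., a Grounded-SAM mask)
    noise_std: noise strength; stands in for sampling x_k from the noising chain
    """
    gray_u8 = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)          # single-channel copy
    x0 = image_bgr.astype(np.float32) / 255.0
    gray = gray_u8.astype(np.float32) / 255.0

    # 1) Color removal: inside the RoI, replace the RGB values with the gray value.
    x_k = x0.copy()
    x_k[roi_mask] = gray[roi_mask][:, None]

    # 2) Texture removal: Gaussian noise augmentation partially erases texture information.
    noise = np.random.normal(0.0, noise_std, x_k.shape).astype(np.float32)
    x_k = np.clip(x_k + noise, 0.0, 1.0)

    # 3) Structural guidance: concatenate the Canny edge map C(x0) as a fourth channel.
    edges = cv2.Canny(gray_u8, 100, 200).astype(np.float32) / 255.0
    return np.concatenate([x_k, edges[..., None]], axis=-1)        # (H, W, 4) condition
```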
…condition $x_0$ of our model, denoted by

$$q(x_K \mid x_0) = \prod_{k=1}^{K} q(x_k \mid x_{k-1}); \tag{5}$$

$$q(x_k \mid x_{k-1}) = \mathcal{N}\!\left(x_k;\ \sqrt{1-\beta_k}\, x_{k-1},\ \beta_k \mathbf{I}\right), \tag{6}$$

where $k$ denotes the time step applied to $x_k$, which is different from the time step $t$ applied to $x_t$. Note that $x_t$ is obtained by adding noise to the original image $x_0$ in the diffusion model, whereas $x_k$ is obtained by adding noise to the image condition $x_0$. During training we randomly sample $x_k$ from $\{x_0, \dots, x_K\}$.

While $x_k$ inherently preserves the structural information of $x_0$, we find that explicitly incorporating additional structural guidance, such as edges, helps the model better capture structure. Thus, we concatenate $x_k$ with the predictions of a Canny edge detector $C(x_0)$, so the output of our image information removal module is

$$R(x_0) = [\,x_k,\ C(x_0)\,]. \tag{7}$$

Given the output of our image information removal module $R(x_0)$, the final objective of IIR-Net is defined as

$$\mathcal{L}_{\mathrm{IIR\text{-}Net}} := \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\!\left[\, \lVert \epsilon - \epsilon_\theta(z_t, t, \tau_{\theta_1}(S), \tau_{\theta_2}(R(x_0))) \rVert_2^2 \,\right]. \tag{8}$$

4. Experiments

4.1. Datasets and Experiment Settings

Datasets. We evaluate the performance of our model on three standard datasets: CUB [35], Outdoor Scenes [11], and COCO [17]. CUB [35] contains 200 bird species that we split into 8,855 training images and 2,933 test images. Outdoor Scenes [11] contains 8,571 images captured from 101 webcams, with each webcam collecting 60∼120 images showcasing different attributes such as weather, season, or time of day. COCO [17] contains 82,783 training images and 40,504 validation images. Following [9], we randomly select 150 test images from each dataset to evaluate the performance of each method.

Metrics. Following [9], we adopt the perceptual metric LPIPS [41] and the CLIP score [23] as our quantitative metrics. LPIPS measures image fidelity, and CLIP evaluates the model's editability. Additionally, we perform a user study and measure inference time to evaluate the effectiveness and efficiency of our model.

Baselines. We compare IIR-Net with three state-of-the-art approaches: Text2LIVE [1], Imagic [9], and ControlNet [40]. For Text2LIVE, we set the number of optimization steps to 600. For Imagic, both the text-embedding optimization steps and the model fine-tuning steps are set to 500; we sample the interpolation hyperparameter η from 0.1 to 1 with a 0.1 interval, and the guidance scale is set to 3. For ControlNet and IIR-Net, we generate images with a CFG scale of 9.0 and 20 DDIM steps by default.

Implementation Details. We initialized our model weights from Stable Diffusion 1.5 [28] and ControlNet [40]. During training, we applied a batch size of 8 and a maximum learning rate of $1 \times 10^{-6}$. We finetuned our models for approximately 100 epochs on the CUB [35] dataset, and for around 5 epochs on the Outdoor Scenes [11] and COCO [17] datasets. The training process was parallelized on 2 NVIDIA RTX A6000 GPUs.
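For reference, the sampling configuration given in the Baselines paragraph above (CFG scale 9.0, 20 DDIM steps) maps onto a call like the following. The use of Hugging Face diffusers, the checkpoint IDs, and the file names are our assumptions, since the paper does not state its inference tooling.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, DDIMScheduler, StableDiffusionControlNetPipeline

# Assumed checkpoints: public SD 1.5 weights and a Canny ControlNet as a stand-in
# for the paper's finetuned model.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)  # DDIM sampling

condition = Image.open("condition.png")  # image hint, e.g. output of the removal module
edited = pipe(
    "A green airplane",
    image=condition,
    num_inference_steps=20,   # DDIM steps reported in the settings above
    guidance_scale=9.0,       # CFG scale reported in the settings above
).images[0]
edited.save("edited.png")
```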
[Figure 3 grid: (A) CUB with prompts such as "This bird has a yellow body with blue wings and red crown", "A small bird sports its white belly and black head", "This bird has yellow body with a yellow belly, a brown beak", and "Bird with a red body, red crown and black wings"; (B) Outdoor Scenes with prompts "Winter, Snow, Cold", "Summer, Daylight", and "Evening, Sunset". Rows: Input Image, Text2LIVE, Imagic, ControlNet, Ours.]
Figure 3. Qualitative comparison on CUB and Outdoor Scenes. From top to bottom: input image, Text2LIVE [1], Imagic [9], Control-
Net [40], and ours. Generated images have 512 pixels on their shorter side. See Section 4.2 for discussion.
To adapt the image conditions in our model, we configured the image encoder block with 4 channels: 3 channels for the RGB image and 1 channel for the edge map. We finetuned the Stable Diffusion decoder for experiments on CUB, as these images primarily focus on various birds with a consistent style. We froze the Stable Diffusion decoder for the Outdoor Scenes and COCO datasets, since these datasets comprise natural images with diverse objects and varying styles.

4.2. Qualitative Results

Entire-image Editing on the CUB and Outdoor Scenes Datasets. Figure 3 presents a qualitative comparison of the edited images generated by our model and the baselines. In Figure 3 (A), we present a comparison on the CUB [35] dataset. We observe that our model can accurately manipulate parts of the bird while preserving the text-irrelevant content of the original image. For example, in the first column of Figure 3 (A), while baselines such as ControlNet and Imagic can recognize "yellow" and "blue" from the text prompt, they both fail to apply them to the correct corresponding parts of the bird. Imagic generates a bird with a blue crown and yellow wings, while ControlNet generates a blue head and a red breast. In contrast, our model accurately edits the bird part by part according to the prompt and produces a bird with blue wings, a yellow body, and a red crown. In addition, we observe that the background of images generated by Imagic and ControlNet has been changed. This is because Imagic and ControlNet do not directly use the original image as their input: Imagic optimizes the text embeddings to obtain features that reflect the attributes of the original image, and ControlNet uses the Canny edge map as input. Thus, it is challenging for these methods to preserve the text-irrelevant content of the original image. In contrast, our model takes the original image as input and erases only the text-relevant content, thus preserving the text-irrelevant content effectively.

In Figure 3 (B), we present a comparison on the Outdoor Scenes [11] dataset.
[Figure 4 grid: rows with prompts "Blue bus on street", "Airplane in sunset", "Cow on water", and "van gogh style"; columns: input, Text2LIVE, Imagic, ControlNet, Ours.]
Figure 4. Qualitative comparison for various editing tasks on the COCO dataset. From top to bottom: color editing, scene attribute transfer,
texture editing, and style transfer. Generated images have 512 pixels on their shorter side. Objects that are the target of modification are in
red bounding boxes, whereas objects that should be preserved are in green bounding boxes. See Section 4.2 for discussion.
Consistent with our findings on the CUB dataset, we observe that baselines like Imagic and ControlNet tend to modify the text-irrelevant contents of the original image, such as textures and background, while Text2LIVE only introduces limited visual effects to the original image and may fail to generate images aligned with the text descriptions. For example, in the second column of Figure 3 (B), the images produced by Imagic and ControlNet are well aligned with the text descriptions ("summer," "daylight"), but they introduce unexpected objects such as trees or a lake into the image. In contrast, Text2LIVE preserves the original image well but fails to align with the text descriptions, as seen with the snow-covered field in summer. Our method, however, effectively modifies the desired content, such as changing "winter" to "summer," while preserving the original content of the image.

Region-based Image Editing on COCO. Unlike object-centric datasets such as CUB and Outdoor Scenes, COCO images can contain complex scenes with many objects, yet only parts of the input image may require modification. Thus, we apply Grounding-DINO [18] and SAM [10] to localize the Region of Interest (RoI) that requires editing (since Text2LIVE and Imagic localize the RoI automatically, we apply Grounding-DINO and SAM only to ControlNet and our method). Figure 4 presents a qualitative comparison of our method and prior work on various image editing tasks. We find that our method produces images that are well aligned with the text descriptions while the non-edited components better represent the original images. For example, in the color editing task, although Imagic and ControlNet generate a blue bus according to the text prompt, Imagic changes the original shape of the bus and ControlNet modifies the bus's texture. In contrast, our method modifies only the color attribute while preserving the irrelevant attributes. Furthermore, our model generates images that appear more natural and visually appealing. For example, in the scene attribute transfer task, the visual effect of "sunset" introduced by our model is naturally aligned with the original image, whereas Text2LIVE introduces obvious artificial effects to the airplane.

Finally, we evaluate our model on tasks where the original ControlNet performs well, such as texture editing and style transfer. Our results show that adapting text-to-image generation models to image editing tasks does not notably compromise their capabilities. For example, in the style transfer examples on COCO images, our method still retains the ability to transfer a photorealistic image to an artistic style. See the supplementary for additional examples.

Ablation Study. In Figure 5, we provide an ablation study of IIR-Net.
[Figure 5 grid: columns range from the full image as input to inputs with decreasing noise; rows show the input image and results for the target texts "A green airplane" and "Tree on sand".]
Figure 5. Ablation Study. We perform experiments to evaluate the effectiveness of our color removal and texture removal operations.
Images generated without our image information removal module are outlined by the blue bounding box. See Section 4.2 for discussion.
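The noise-level control exercised in this ablation (and recommended to users in the discussion below) can be sketched as a simple sweep over the noise strength, reusing the hypothetical remove_image_information() sketch from Section 3.2 above; file names and values are illustrative only.

```python
import cv2

# Hypothetical inputs; the RoI mask would come from Grounding-DINO + SAM in practice.
image_bgr = cv2.imread("input.png")
roi_mask = cv2.imread("roi_mask.png", cv2.IMREAD_GRAYSCALE) > 127

# Low noise is typically enough for color edits; stronger noise helps texture edits
# such as turning grass into sand, as discussed in the ablation below.
conditions = {s: remove_image_information(image_bgr, roi_mask, noise_std=s)
              for s in (0.0, 0.1, 0.3, 0.6)}
```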
We find that without our unsupervised image content removal mechanism, the model always outputs the input image as the predicted image, i.e., the identical mapping issue [12]. For example, the images in the blue bounding box retain the white airplane and green grass, showing a lack of alignment with the text descriptions. By incorporating the color removal mechanism (see the images with a low noise level), our model performs well on tasks such as color editing. For example, when changing the airplane's color from white to green, our model preserves most of the airplane's attributes, modifying only the color. We observe that the color removal mechanism alone can find texture editing challenging. For example, as seen in the second row of the figure, the images generated with a low noise level still exhibit the grass texture instead of the intended "sand" texture. Therefore, we incorporate noise augmentation into the input images to better handle such editing tasks. As shown in the second row, our model successfully modifies the grass texture to sand under high-level noise conditions. In practical applications, users can adjust the noise level according to the editing task to achieve optimal performance.

4.3. Quantitative Results

Editability-fidelity Tradeoff. Table 1 reports our quantitative results on CUB, Outdoor Scenes, and COCO. As observed in our qualitative experiments, our model achieves a better tradeoff between image fidelity and editability than other state-of-the-art methods. For example, our model achieves the best LPIPS scores (0.138 and 0.301) and comparable CLIP scores (29.57 and 24.30) on CUB and COCO. On Outdoor Scenes, our model achieves the highest CLIP score and the second-best LPIPS score. Text2LIVE achieves a better LPIPS score than our method on Outdoor Scenes; however, this may be because Text2LIVE mainly augments the scenes with new visual effects rather than directly modifying the attributes of the scenes. For example, Text2LIVE fails to change the grassland to a snowy landscape or to convert lush trees to bare ones.

User Study. We conducted a user study to quantitatively evaluate the performance of IIR-Net, as shown in Table 2. We randomly selected 30 images from COCO and applied each model to generate the modified images, resulting in a total of 120 generated images. Each image was annotated three times, and we asked our annotators to judge whether the image was correctly manipulated according to the text guidance while preserving the text-irrelevant content of the original image. As reported in the table, IIR-Net significantly outperforms the baselines. See the supplementary for additional details on our user study.

Inference Time. Table 2 presents a comparison of inference times and their standard errors using the same Stable Diffusion v1.5 [28] backbone for Imagic, ControlNet, and our method. All methods are benchmarked on an NVIDIA RTX A6000 GPU. Our method has significantly faster inference than Imagic, boosting inference speed by two orders of magnitude when processing 512×512 images. In addition, our method is approximately 50× faster than Text2LIVE. We note that both ControlNet and our method have an inference time of around 5 s, demonstrating that our approach introduces negligible overhead over ControlNet.

5. Limitations & Broader Impacts

Limitations. We identify three failure cases of our method in this section. First, attributes of the original image are likely to be modified in non-rigid image editing tasks. Second, it is challenging for our method to drastically change the brightness of the input image. Third, the target object may be localized and segmented inaccurately. We present examples of these three failure cases in Figure 6. As shown in the top row, though our method can achieve non-rigid image editing given the input image and a modified structural guidance, we observe that the model fails to map some attributes to the correct parts. For example, the bird in the input image has a grey crown while the edited image gen-
Method          | CUB              | Outdoor Scenes   | COCO
                | LPIPS ↓  CLIP ↑  | LPIPS ↓  CLIP ↑  | LPIPS ↓  CLIP ↑
Imagic [9]      | 0.406    27.03   | 0.551    22.85   | 0.567    21.53
Text2LIVE [1]   | 0.162    30.37   | 0.218    22.64   | 0.495    25.11
ControlNet [40] | 0.528    29.49   | 0.618    23.89   | 0.606    23.57
Ours            | 0.138    29.57   | 0.479    25.45   | 0.301    24.30
Table 1. Quantitative experiments of image manipulation on CUB [35], Outdoor Scenes [11], and COCO [17] datasets. CLIP [23] is used
to evaluate the image editing performance and LPIPS is applied to evaluate image fidelity. Generated images have been resized to 224×224
resolution for CLIP score. We use the “ViT-B/32” version of CLIP. See Section 4.3 for discussion.
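A sketch of the evaluation protocol described in the caption (LPIPS for fidelity, CLIP score with the ViT-B/32 model, whose preprocessing resizes images to 224×224) might look as follows; the `lpips` and `clip` packages and the cosine-similarity-times-100 convention are our assumptions about the exact implementation.

```python
import clip    # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
import lpips   # pip install lpips
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)  # preprocess resizes to 224x224
lpips_fn = lpips.LPIPS(net="alex").to(device)

def evaluate_edit(original_path, edited_path, prompt):
    # LPIPS (lower = better fidelity): perceptual distance between original and edited image.
    # Assumes both images share the same resolution.
    img0 = lpips.im2tensor(lpips.load_image(original_path)).to(device)
    img1 = lpips.im2tensor(lpips.load_image(edited_path)).to(device)
    lpips_score = lpips_fn(img0, img1).item()

    # CLIP score (higher = better text alignment): cosine similarity between the edited
    # image and the prompt under ViT-B/32, scaled by 100 (a common convention; the exact
    # scaling used in the paper is not stated).
    image = clip_preprocess(Image.open(edited_path).convert("RGB")).unsqueeze(0).to(device)
    tokens = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        img_feat = clip_model.encode_image(image)
        txt_feat = clip_model.encode_text(tokens)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    clip_score = 100.0 * (img_feat @ txt_feat.T).item()
    return lpips_score, clip_score
```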
References

[1] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2live: Text-driven layered image and video editing. In Computer Vision–ECCV 2022: 17th European Conference, Proceedings, Part XV, pages 707–723. Springer, 2022.
[2] Dina Bashkirova, Jose Lezama, Kihyuk Sohn, Kate Saenko, and Irfan Essa. Masksketch: Unpaired structure-guided masked image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[3] Helisa Dhamo, Azade Farshad, Iro Laina, Nassir Navab, Gregory D Hager, Federico Tombari, and Christian Rupprecht. Semantic image manipulation using scene graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5213–5222, 2020.
[4] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883, 2021.
[5] Kevin Frans, Lisa Soros, and Olaf Witkowski. Clipdraw: Exploring text-to-drawing synthesis through language-image encoders. Advances in Neural Information Processing Systems, 35:5207–5218, 2022.
[6] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
[7] Rinon Gal, Or Patashnik, Haggai Maron, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. Stylegan-nada: Clip-guided domain adaptation of image generators. ACM Transactions on Graphics (TOG), 41(4):1–13, 2022.
[8] Ajay Jain, Ben Mildenhall, Jonathan T Barron, Pieter Abbeel, and Ben Poole. Zero-shot text-guided object generation with dream fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 867–876, 2022.
[9] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[10] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
[11] Pierre-Yves Laffont, Zhile Ren, Xiaofeng Tao, Chao Qian, and James Hays. Transient attributes for high-level understanding and editing of outdoor scenes. ACM Transactions on Graphics (TOG), 33(4):1–11, 2014.
[12] Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, and Philip HS Torr. Manigan: Text-guided image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7880–7889, 2020.
[13] Bowen Li, Xiaojuan Qi, Philip Torr, and Thomas Lukasiewicz. Lightweight generative adversarial networks for text-guided image manipulation. Advances in Neural Information Processing Systems, 33:22020–22031, 2020.
[14] Nannan Li and Bryan A Plummer. Supervised attribute information removal and reconstruction for image manipulation. In Computer Vision–ECCV 2022: 17th European Conference, Proceedings, Part XVII, pages 457–473. Springer, 2022.
[15] Nannan Li, Kevin J Shih, and Bryan A Plummer. Collecting the puzzle pieces: Disentangled self-driven human pose transfer by permuting textures. arXiv preprint arXiv:2210.01887, 2022.
[16] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22511–22521, 2023.
[17] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Proceedings, Part V, pages 740–755. Springer, 2014.
[18] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
[19] Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning, pages 4114–4124. PMLR, 2019.
[20] Oscar Michel, Roi Bar-On, Richard Liu, Sagie Benaim, and Rana Hanocka. Text2mesh: Text-driven neural stylization for meshes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13492–13502, 2022.
[21] Seonghyeon Nam, Yunji Kim, and Seon Joo Kim. Text-adaptive generative adversarial networks: Manipulating images with natural language. Advances in Neural Information Processing Systems, 31, 2018.
[22] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning (ICML), 2022.
[23] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[24] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[25] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
[26] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
[27] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In International Conference on Machine Learning, pages 1060–1069. PMLR, 2016.
[28] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[29] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022.
[30] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
[31] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
[32] Tamar Rott Shaham, Tali Dekel, and Tomer Michaeli. Singan: Learning a generative model from a single natural image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4570–4580, 2019.
[33] Ming Tao, Bing-Kun Bao, Hao Tang, Fei Wu, Longhui Wei, and Qi Tian. De-net: Dynamic text-guided image editing adversarial networks. In Proceedings of the AAAI Conference on Artificial Intelligence, 2023.
[34] Narek Tumanyan, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Splicing vit features for semantic appearance transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10748–10757, 2022.
[35] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.
[36] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1316–1324, 2018.
[37] Guoxing Yang, Nanyi Fei, Mingyu Ding, Guangzhen Liu, Zhiwu Lu, and Tao Xiang. L2m-gan: Learning to manipulate latent space semantics for facial attribute editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2951–2960, 2021.
[38] Xu Yao, Alasdair Newson, Yann Gousseau, and Pierre Hellier. A latent transformer for disentangled face editing in images and videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13789–13798, 2021.
[39] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5907–5915, 2017.
[40] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
[41] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
[42] Zhixing Zhang, Ligong Han, Arnab Ghosh, Dimitris Metaxas, and Jian Ren. Sine: Single image editing with text-to-image diffusion models. arXiv preprint arXiv:2212.04489, 2022.
[43] Zhongping Zhang, Huiwen He, Bryan A. Plummer, Zhenyu Liao, and Huayan Wang. Complex scene image editing by scene graph comprehension. In British Machine Vision Conference (BMVC), 2023.