
MULTI-OBJECT EDITING IN PERSONALIZED TEXT-TO-IMAGE DIFFUSION MODEL VIA SEGMENTATION GUIDANCE

Haruka Matsuda†, Ren Togo††, Keisuke Maeda††, Takahiro Ogawa††, Miki Haseyama††

†School of Engineering, Hokkaido University, Japan
††Faculty of Information Science and Technology, Hokkaido University, Japan
E-mail: {matsuda, togo, maeda, ogawa, mhaseyama}@lmd.ist.hokudai.ac.jp

This work was partly supported by JSPS KAKENHI Grant Numbers JP21H03456 and JP23K11141.

ABSTRACT

This paper presents a personalized text-to-image diffusion model for multi-object editing that improves the visual fidelity of the target image and the editing ability through a segmentation-based restriction and continual learning. Multiple personalization tasks face the problem of destabilization, especially when the number of targets increases and the concepts of the targets are similar. The proposed method introduces a segmentation guide into continual learning to improve performance for multiple objects. The segmentation guide helps to separate each concept by restricting the regions of the target objects during both training and inference. The proposed method learns these concepts by continual learning with Elastic Weight Consolidation, and achieves the output of multiple target objects with concept separation while maintaining visual fidelity. Experimental results demonstrate that the proposed method successfully maintains visual fidelity for multiple target objects.

Fig. 1: Example of generated images. "V*" is an identifier of the object in the target image. The state-of-the-art methods have the problem of losing the visual fidelity of the target objects when outputting similar concepts.

Index Terms— Personalization, diffusion model, ControlNet, continual learning, generative AI.

1. INTRODUCTION

The digital transformation has rapidly advanced in the creative industry, and many creators effectively utilize digital devices to create their works [1]. In particular, text-to-image models such as Stable Diffusion [2] are widely utilized as tools for creative expression. While users can control the output images by changing the input text prompt, generating images that align with personal preferences requires users to search for appropriate text prompts. With the increasing demand to reduce this burden on users, personalization methods are attracting attention [3, 4]. These methods enable image generation that reflects the characteristics of user-provided objects by training the model to capture their specific features.

DreamBooth [3] is one of the main approaches for fine-tuning a diffusion model from several images of a specific target. By forming a correspondence between a user-specified concept and a unique identifier that distinguishes the concept, the model can generate images of the specific target. Since DreamBooth has the potential to be extended to more convenient personalization tasks, numerous related methods have been proposed [5–8]. For instance, various methods achieve the simultaneous output of multiple targets while maintaining their visual fidelity [7–11], which is a challenge that has been difficult for previous methods to overcome. These methods generate multiple objects by merging separately trained models or performing the joint training of all targets.

Although the above methods have succeeded in improving performance in the output of multiple targets, there are issues with balancing reconstruction and editability [7, 11]. Specifically, it is difficult to maintain the visual fidelity of objects when reconstructing them in new scenes and to edit the output image based on the input text prompt. In particular, Kumari et al. [7] report that these capabilities are significantly impaired when generating similar concepts, resulting in failed outputs such as the generation of only one object or of a mixed-concept object, as shown in Fig. 1. While training with limited attention regions based on mask images helps to maintain visual fidelity [9, 12–14], too much attention to the target objects causes overfitting and a loss of editing ability. Conversely, it has been shown that models with high editing ability generally cannot ensure the visual fidelity of the target [15]. Although there is a trade-off between these two capabilities, a model that performs the personalization task for multiple subjects should maintain both capabilities at a level sufficient to separate similar concepts.

In this paper, we propose a personalization method for multiple targets that can maintain visual fidelity and editing ability in the output of similar concepts. The proposed method introduces a segmentation guide through the encoder mechanism of ControlNet [16] and performs continual learning based on DreamBooth. The purpose of introducing the segmentation guide is to separate concepts, and continual learning aims to handle multiple target outputs. Specifically, by associating a specific segmented region with a personalization target in the training phase, the proposed method achieves output control not only with a specific text containing a unique identifier, but also with a specific segmentation guide. In the case of personalization for multiple targets, we separate each target concept by assigning a specific segmented region to each object. Moreover, we prevent performance degradation on previous tasks during continual learning by introducing Elastic Weight Consolidation (EWC) [17], which contributes to maintaining the visual fidelity of the target objects.

Finally, we summarize the contributions of this paper:

• By associating the segmentation guide with the personalization target, the proposed method succeeds in outputting the target from a specific segmented region and improves the editing ability. In particular, we can separate similar concepts by mapping a specific segmented region to each target when personalizing multiple targets.

• Since the continual learning method with EWC restricts parameter updates when learning multiple target concepts, the proposed method can maintain the visual fidelity of previously learned targets.

Consequently, the proposed method achieves the output of multiple objects with similar concepts, maintaining an appropriate balance between visual fidelity and editing ability, as shown in Fig. 1.

2. PROPOSED METHOD

2.1. Personalized Text-to-Image Diffusion Model

The proposed method performs fine-tuning based on the DreamBooth framework. In order to make the model learn the target object S, we use content image data x_S and its corresponding text y_S for personalization, and regularization image data x_S^reg and its corresponding text y_S^reg for regularization as inputs. Note that the roles of x_S^reg and y_S^reg are to address the overfitting and language-drift problems [3]. Then y_S consists of a unique identifier and a class descriptor of the object, such as "a [identifier] [class noun]", and y_S^reg includes only the class, such as "a [class noun]". Although x_S^reg is generated by the frozen diffusion model in the original DreamBooth, we alternatively use real images as x_S^reg because utilizing real images is effective in maintaining visual fidelity [7]. The loss function L_S can be expressed as follows:

$$\mathcal{L}_S = \mathcal{L}_{\mathrm{LDM}}(x_S, y_S) + \lambda_{\mathrm{reg}}\,\mathcal{L}_{\mathrm{LDM}}(x_S^{\mathrm{reg}}, y_S^{\mathrm{reg}}), \qquad (1)$$

where λ_reg indicates the importance of the class prior, and L_LDM is the loss of the Latent Diffusion Model (LDM) [2] expressed by the following equation:

$$\mathcal{L}_{\mathrm{LDM}}(x, y) = \mathbb{E}\!\left[ \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t,\; \tau_\theta(y)\right) \right\|_2^2 \right], \qquad (2)$$

where t is a timestep, ᾱ_t is a step size based on a variance schedule, ε is random noise, τ_θ(·) is the text encoder of Contrastive Language-Image Pre-Training (CLIP) [18], and ε_θ(·, ·, ·) is the U-Net [19]. The proposed method adds further terms to L_S to learn the correspondence with the segmentation image and to limit excessive updating of the parameters.
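For concreteness, the following PyTorch-style sketch shows how the objective of Eqs. (1)–(2) could be computed. The callables `unet` and `text_encoder` and the tensor `alphas_cumprod` stand in for the pretrained LDM components; all names and the default λ_reg are illustrative assumptions, not the authors' implementation.

```python
# A minimal PyTorch-style sketch of Eqs. (1)-(2). `unet`, `text_encoder`, and
# `alphas_cumprod` are placeholders for the pretrained LDM components.
import torch
import torch.nn.functional as F

def ldm_loss(unet, text_encoder, alphas_cumprod, latents, prompt_ids):
    """Noise-prediction loss of Eq. (2) on latent images x."""
    noise = torch.randn_like(latents)                              # epsilon
    t = torch.randint(0, alphas_cumprod.shape[0],
                      (latents.shape[0],), device=latents.device)  # timestep
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)                    # \bar{alpha}_t
    noisy = a_bar.sqrt() * latents + (1.0 - a_bar).sqrt() * noise
    cond = text_encoder(prompt_ids)                                # tau_theta(y)
    pred = unet(noisy, t, cond)                                    # epsilon_theta(...)
    return F.mse_loss(pred, noise)

def dreambooth_loss(unet, text_encoder, alphas_cumprod,
                    z_s, y_s, z_reg, y_reg, lambda_reg=1.0):
    """Eq. (1): personalization term plus the class-prior (regularization) term."""
    return (ldm_loss(unet, text_encoder, alphas_cumprod, z_s, y_s)
            + lambda_reg * ldm_loss(unet, text_encoder, alphas_cumprod, z_reg, y_reg))
```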
2.2. Initial Training with ControlNet Loss

The purpose of the first training session is to form a correspondence between the specific target and the segmentation image, not only the specific text. To perform this additional conditioning, the following loss L_CLDM is added to Eq. (1):

$$\mathcal{L}_{\mathrm{CLDM}}(x, y, i) = \mathbb{E}\!\left[ \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t,\; \tau_\theta(y),\; i\right) \right\|_2^2 \right], \qquad (3)$$

where i is the segmentation image data corresponding to the content image data x. The first term of Eq. (1) and L_CLDM have the same purpose in terms of mapping conditions to target objects. Here, simply adding these two losses promotes overfitting, which is a known problem of DreamBooth. Moreover, if the weight of L_CLDM is too large, the segmentation guide is ignored during inference and the training target is simply generated. To prevent these problems, we introduce a hyperparameter γ ∈ [0, 1] that changes as the training progresses. As a result, the overall loss function in the fine-tuning for the first object S_1 is expressed as follows:

$$\mathcal{L}_{S_1} = (1 - \gamma)\,\mathcal{L}_{\mathrm{LDM}}(x_{S_1}, y_{S_1}) + \lambda_{\mathrm{reg}}\,\mathcal{L}_{\mathrm{LDM}}(x_{S_1}^{\mathrm{reg}}, y_{S_1}^{\mathrm{reg}}) + \gamma\,\mathcal{L}_{\mathrm{CLDM}}(x_{S_1}, y_{S_1}, i_{S_1}). \qquad (4)$$

By performing fine-tuning based on the proposed loss mentioned above, our method enables the formation of the correspondence between the specific target and the specific segmentation image.
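As a sketch of how Eq. (4) could be assembled in code, the snippet below combines the three terms with a schedule for γ. The linear schedule is purely an assumption for illustration, since the paper only states that γ ∈ [0, 1] changes as training progresses; `cldm_loss_value` denotes an Eq. (3)-style loss that additionally receives the segmentation condition through the frozen ControlNet encoder.

```python
# Sketch of the gamma-weighted objective in Eq. (4). The linear schedule is an
# illustrative assumption; the paper only states that gamma in [0, 1] changes
# as training progresses.
def gamma_schedule(step, total_steps):
    """Shift weight from the plain LDM term toward the ControlNet term."""
    return min(1.0, step / float(total_steps))

def first_session_loss(step, total_steps, ldm_loss_value, reg_loss_value,
                       cldm_loss_value, lambda_reg=1.0):
    """Eq. (4): (1 - gamma) * L_LDM + lambda_reg * L_LDM^reg + gamma * L_CLDM."""
    gamma = gamma_schedule(step, total_steps)
    return ((1.0 - gamma) * ldm_loss_value
            + lambda_reg * reg_loss_value
            + gamma * cldm_loss_value)
```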
2.3. Subsequent Training with EWC Loss

The objective of the subsequent training is to prevent performance degradation on previous tasks, in addition to achieving the goals of the initial training. Following the ideas of Kirkpatrick et al. [17], we address this issue by updating the diffusion model parameters without deviating from the past distribution. Specifically, the following loss L_EWC is added to Eq. (4) in the second and later fine-tuning sessions for objects S_n (n = 2, 3, ..., N; N being the number of target objects):

$$\mathcal{L}_{\mathrm{EWC}}(\theta_{S_n,i}) = \sum_i F_{S_{n-1},i}\,\bigl(\theta_{S_n,i} - \theta^{*}_{S_{n-1},i}\bigr)^2, \qquad (5)$$

where θ_{S_n} is a network parameter being updated, θ*_{S_{n−1}} is a network parameter obtained from the fine-tuned model of the previous session, F_{S_{n−1}} is a Fisher information matrix calculated in each fine-tuning session, and i is the index of the parameter corresponding to the layer to be updated. Note that both θ*_{S_{n−1}} and F_{S_{n−1}} are treated as constants in the calculation of the loss L_EWC. Here, the Fisher information matrix serves as an approximation of the posterior distribution [20], and the importance of the previous task is aggregated for each layer of the model [21]. The matrix value is defined as follows:

$$F = \mathbb{E}\!\left[ \left( \frac{\partial}{\partial\theta} \log L(\theta \mid x) \right)^{2} \right], \qquad (6)$$

where L(θ|x) is the likelihood function and can be calculated by approximating the variational lower bound [22]. Here, according to the theory of Denoising Diffusion Probabilistic Models (DDPM) [23], the definition of the diffusion model loss starts from optimizing a negative log-likelihood. In the subsequent steps, this loss is transformed to the base form of L_LDM by splitting the equation, dropping a constant term, and simplifying the coefficients. Since these operations are all linear transformations, Eq. (2) can be regarded as proportional to the log-likelihood. Therefore, we can calculate the Fisher matrix of the latent diffusion model as expressed by the following equation:

$$F_{S_n,i} \approx \mathbb{E}\!\left[ \left( \frac{\partial}{\partial\theta} \mathcal{L}_{\mathrm{LDM}}(x_{S_n}, y_{S_n}) \right)^{2} \Bigg|_{\theta = \theta_{S_n,i}} \right]. \qquad (7)$$
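One way to realize Eq. (7) in practice is to accumulate squared gradients of the LDM loss as a diagonal Fisher estimate. The sketch below is an assumption about such an implementation; the variable names, the number of batches, and the per-parameter (rather than per-layer) granularity are illustrative choices, not taken from the paper.

```python
# Sketch of a diagonal Fisher estimate for Eq. (7): squared gradients of the
# LDM loss, averaged over a few batches.
import torch

def estimate_fisher(model, loss_fn, data_loader, num_batches=64):
    """Accumulate squared gradients of L_LDM(x_{S_n}, y_{S_n}) as a Fisher proxy."""
    params = {n: p for n, p in model.named_parameters() if p.requires_grad}
    fisher = {n: torch.zeros_like(p) for n, p in params.items()}
    for b, batch in enumerate(data_loader):
        if b >= num_batches:
            break
        model.zero_grad()
        loss = loss_fn(batch)          # the LDM loss on this batch
        loss.backward()
        for n, p in params.items():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2 / num_batches
    return fisher                       # treated as a constant in the next session
```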

The calculated matrix contains information about how each parameter contributes to the task performance. Finally, the loss function in the fine-tuning for the subsequent objects S_n is expressed as follows:

$$\mathcal{L}_{S_n} = (1 - \gamma)\,\mathcal{L}_{\mathrm{LDM}}(x_{S_n}, y_{S_n}) + \lambda_{\mathrm{reg}}\,\mathcal{L}_{\mathrm{LDM}}(x_{S_n}^{\mathrm{reg}}, y_{S_n}^{\mathrm{reg}}) + \gamma\,\mathcal{L}_{\mathrm{CLDM}}(x_{S_n}, y_{S_n}, i_{S_n}) + \lambda_{S_{n-1}}\,\mathcal{L}_{\mathrm{EWC}}(\theta_{S_n,i}), \qquad (8)$$

where λ_{S_{n−1}} is a hyperparameter that indicates the importance of the past task relative to the current task and balances the EWC term against Eq. (4). Consequently, the proposed method can restrict the parameter updates related to the previous tasks.
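A hedged sketch of Eqs. (5) and (8) is given below, reusing the Fisher estimate and the Eq. (4) loss from the earlier sketches; `fisher_prev` and `theta_prev` are the (constant) Fisher estimate and parameters saved at the end of the previous session, and all names are illustrative.

```python
# Sketch of the EWC penalty of Eq. (5) and the combined loss of Eq. (8).
# `fisher_prev` and `theta_prev` are saved at the end of the previous session
# and treated as constants.
def ewc_penalty(model, fisher_prev, theta_prev):
    """sum_i F_{S_{n-1},i} * (theta_{S_n,i} - theta*_{S_{n-1},i})^2."""
    penalty = 0.0
    for name, p in model.named_parameters():
        if name in fisher_prev:
            penalty = penalty + (fisher_prev[name] * (p - theta_prev[name]) ** 2).sum()
    return penalty

def subsequent_session_loss(session_loss, model, fisher_prev, theta_prev,
                            lambda_prev=1.0):
    """Eq. (8): the Eq. (4) objective plus the EWC term weighted by lambda_{S_{n-1}}."""
    return session_loss + lambda_prev * ewc_penalty(model, fisher_prev, theta_prev)
```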
2.4. Training Phase

Fig. 2: Architecture overview of the proposed method in the training phase. The frozen ControlNet is utilized to propagate the segmentation feature to the U-Net.

In order to perform fine-tuning with the additional conditioning, we connect a pre-trained ControlNet to the U-Net as shown in Fig. 2. Since ControlNet functions as an image encoder that takes the segmentation image as an additional condition, we freeze all layers of the ControlNet so that the feature map obtained from the segmentation image is not modified. Furthermore, the proposed method freezes the blocks of the U-Net that do not contain an attention layer. This procedure is based on the analysis of Kumari et al. [7], which revealed that the attention layers contribute significantly to the personalization task in fine-tuning. Specifically, the U-Net of the diffusion model consists of nine main blocks: four down blocks, one middle block, and four up blocks. The down block and the up block nearest to the middle block do not contain an attention layer. Therefore, the proposed method freezes these two blocks and updates the other blocks. This operation prevents the update of parameters unrelated to the personalization task and contributes to maintaining model performance.

The proposed method performs fine-tuning with the above structure based on the training flow of LDM [2]. Specifically, the main input to the U-Net structure is the noisy image x_{S,t}, and the main input to the ControlNet structure is the segmentation image i_S, which corresponds to x_{S,t} within one training step. The two structures also receive a text y_S and the timestep t ∼ [0, . . . , T] as inputs. The U-Net structure estimates the noise to be removed from x_{S,t} at time t. The ControlNet part adds residual blocks from its down blocks and middle block to the corresponding middle and up blocks of the U-Net. By repeating the denoising loop with the above-mentioned training flow, the proposed method performs training with segmentation image conditioning.
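The wiring and freezing described above might look as follows in a diffusers-style implementation; the class names, checkpoint identifiers, and the block indices (down_blocks[3] and up_blocks[0] as the attention-free blocks of the Stable Diffusion v1 U-Net) are assumptions made for illustration, not the authors' released code.

```python
# Sketch of the training-time wiring in Fig. 2, assuming diffusers-style
# modules; checkpoint IDs and block indices are illustrative assumptions.
from diffusers import ControlNetModel, UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet")   # base LDM U-Net
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-seg")                       # segmentation encoder

# Freeze the whole ControlNet so the segmentation feature map is not modified.
controlnet.requires_grad_(False)

# Freeze the two U-Net blocks without attention layers (those nearest to the
# middle block); in the SD v1 U-Net these are down_blocks[3] and up_blocks[0].
unet.down_blocks[3].requires_grad_(False)
unet.up_blocks[0].requires_grad_(False)

def conditioned_noise_pred(noisy_latents, t, text_emb, seg_image):
    """One noise-prediction pass: the frozen ControlNet encodes the segmentation
    guide and injects residuals into the U-Net's middle and up blocks."""
    down_res, mid_res = controlnet(
        noisy_latents, t, encoder_hidden_states=text_emb,
        controlnet_cond=seg_image, return_dict=False)
    return unet(
        noisy_latents, t, encoder_hidden_states=text_emb,
        down_block_additional_residuals=down_res,
        mid_block_additional_residual=mid_res).sample
```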
2.5. Inference Phase

In the inference phase, the proposed method outputs multiple targets, separating each concept by referencing the segmentation guide via the semantic segmentation conditioning of ControlNet. We control the output of multiple objects through the input text prompt and the segmentation guide image. Specifically, the segmentation guide contains specific segmented regions corresponding to each target; this correspondence is formed in the training phase. Similarly, the input text prompt contains words with the unique identifiers corresponding to each target. The proposed method sets the detailed content of the output using the text prompt, as in conventional text-to-image methods, and specifies the approximate pixel position of each target using the segmentation guide. Consequently, the proposed method succeeds in outputting multiple objects while separating similar concepts.
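For illustration, inference with a segmentation guide could be run through a diffusers-style ControlNet pipeline as below; the checkpoint paths, the segmentation-ControlNet weights, and the identifiers "V1*"/"V2*" are placeholders, not artifacts released with the paper.

```python
# Illustrative inference call with a segmentation guide, assuming a
# diffusers-style pipeline. Checkpoint paths and identifiers are placeholders.
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-seg")
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "path/to/fine-tuned-personalized-model", controlnet=controlnet)

# Segmentation guide: one colored region per personalization target, pasted at
# the desired positions on a solid-color canvas (see Sec. 3, Training details).
seg_guide = Image.open("guide_with_two_regions.png")

image = pipe("a photo of a V1* dog and a V2* cat on the beach",
             image=seg_guide).images[0]
image.save("multi_object_output.png")
```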
3. EXPERIMENTS

Dataset. For the target images for personalization, we used the datasets provided by DreamBooth [3] and Custom Diffusion [7]. The DreamBooth dataset consists of 30 subjects with 5–6 images per target, and the Custom Diffusion dataset consists of 101 subjects with 3–15 images per target. For the regularization images, we used the MSCOCO [24] and Flickr30K [25] datasets. We compared our method with the following three state-of-the-art methods: (1) Textual Inversion (TI) [4], (2) Custom Diffusion (CD) [7], and (3) ED-LoRA [9].

Evaluation Metrics. To evaluate the performance of the model, we used two indices based on the CLIP score. (1) Image-Alignment: We calculate the similarity between the generated image containing the personalization target and the real images of the personalization target used for training. This evaluates whether the model produces output that captures the characteristics of the target. (2) Text-Alignment: We calculate the similarity between the text prompt used to generate the image containing the personalization target and the generated image. However, words containing identifiers used in the generation prompt (e.g., "a V* dog") are replaced with words of the class to which the target belongs (e.g., "a dog"). This evaluates whether the model reflects the textual content in the images.
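A minimal sketch of the two CLIP-based indices above is given next, assuming the transformers CLIP implementation; the backbone choice ("openai/clip-vit-base-patch32") is an assumption, since the paper does not state which CLIP model was used for scoring.

```python
# Sketch of the Image-Alignment and Text-Alignment scores described above.
# The CLIP backbone is an assumed choice.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_image_alignment(generated_images, reference_images):
    """Mean cosine similarity between generated and real target images."""
    inputs = processor(images=generated_images + reference_images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    gen, ref = feats[:len(generated_images)], feats[len(generated_images):]
    return (gen @ ref.T).mean().item()

def clip_text_alignment(generated_images, prompt_without_identifier):
    """Mean cosine similarity between generated images and the identifier-free
    prompt (e.g., "a V* dog" is evaluated as "a dog")."""
    img_in = processor(images=generated_images, return_tensors="pt")
    txt_in = processor(text=[prompt_without_identifier], return_tensors="pt", padding=True)
    img = model.get_image_features(**img_in)
    txt = model.get_text_features(**txt_in)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()
```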
Training details. We performed fine-tuning for four targets and evaluated the model in two scenarios: single-target output and multiple-target output. Following the experiments of Kumari et al. [7], we generated the following four types of prompts for each target: (A) scene change without changing the target object, (B) insertion of a new object along with the target object, (C) style change of the target object, and (D) material change of the target object. We used two prompts for each type, totaling eight prompts per target object. From each prompt, we generated 25 images, for a total of 200 images per object for evaluation. We created separate segmentation images for reference during training and during inference. The segmentation images for training were generated by passing the target images through MaskDINO [26], a unified transformer-based segmentation model. For inference, we generated segmentation images by randomly enlarging and reducing the segmented regions used in training, and then pasting them at arbitrary locations on a solid-color image.
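A sketch of how such an inference-time guide could be composed from a training-time region is shown below; the scale range, canvas size, and background color are assumptions, as the paper does not give these values.

```python
# Sketch of composing an inference-time segmentation guide: randomly rescale a
# training-time region and paste it at an arbitrary location on a solid canvas.
import random
from PIL import Image

def make_inference_guide(region, canvas_size=(512, 512), scale_range=(0.7, 1.3),
                         background=(0, 0, 0)):
    """Randomly enlarge or reduce a segmented region and paste it on a
    solid-color canvas at a random position."""
    scale = random.uniform(*scale_range)
    w, h = region.size
    region = region.resize((max(1, int(w * scale)), max(1, int(h * scale))))
    canvas = Image.new("RGB", canvas_size, background)
    max_x = max(0, canvas_size[0] - region.size[0])
    max_y = max(0, canvas_size[1] - region.size[1])
    canvas.paste(region, (random.randint(0, max_x), random.randint(0, max_y)))
    return canvas
```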

Fig. 3: Qualitative comparison results. The segmentation image is used for the proposed method.

3.1. Qualitative Comparison

Figure 3 shows the qualitative results of outputting multiple personalization targets from the four types of prompts (A)–(D). For all four types of prompts, the proposed method successfully generates images that capture features of the target image while also reflecting the textual content. For example, TI [4] struggles to output multiple objects simultaneously while maintaining visual fidelity. CD [7] and ED-LoRA [9] generate almost the same number of objects as specified, but do not fully separate the concepts. Besides, CD has difficulty reflecting the textual contents of prompts (C) and (D). In contrast, our proposed method succeeds in minimizing concept mixing and generates images that maintain the visual fidelity of the target.

Fig. 4: Results of the CLIP scores for single target objects (left) and multiple target objects (right). The crossed markers indicate the average values over the four targets. The blue dotted line shows the trajectory of the proposed method as the number of training steps changes.

3.2. Quantitative Comparison

Figure 4 shows the quantitative results of our method and the state-of-the-art methods [4, 7, 9]. The figure shows that, at training step 800, the proposed method achieves the best image alignment for both single- and multiple-target outputs, indicating that the proposed method is able to generate images that better capture the visual fidelity of the target. TI records better text alignment than the other methods, but tends to have lower image alignment. Conversely, the optimal balance between image and text alignment is reached at training step 600 in our method, confirming its overall superior performance.

4. CONCLUSION

We have proposed a novel personalization method for multiple targets with continual learning by leveraging a segmentation guide. Our method achieves the separation of similar concepts by utilizing the segmentation guide and preserves the visual fidelity of multiple objects through the use of EWC in continual learning. Experimental results demonstrate that our method outperforms state-of-the-art approaches in terms of visual fidelity and is capable of generating multiple targets with distinct concepts.

5. REFERENCES

[1] M. Henderson, K. Bokor, M. Dookie, and M. Shirotori, "Creative economy outlook 2022," United Nations, pp. 70–75, 2022.

[2] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10684–10695.

[3] N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman, "DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation," in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22500–22510.

[4] R. Gal, Y. Alaluf, Y. Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-Or, "An image is worth one word: Personalizing text-to-image generation using textual inversion," in Proc. International Conference on Learning Representations, 2023, pp. 1–31.

[5] W. Chen, H. Hu, Y. Li, N. Rui, X. Jia, M. Chang, and W. W. Cohen, "Subject-driven text-to-image generation via apprenticeship learning," arXiv preprint arXiv:2304.00186, 2023.

[6] R. Gal, M. Arar, Y. Atzmon, A. H. Bermano, G. Chechik, and D. Cohen-Or, "Encoder-based domain tuning for fast personalization of text-to-image models," ACM Transactions on Graphics, vol. 42, no. 4, pp. 1–13, 2023.

[7] N. Kumari, B. Zhang, R. Zhang, E. Shechtman, and J. Zhu, "Multi-concept customization of text-to-image diffusion," in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1931–1941.

[8] Y. Tewel, R. Gal, G. Chechik, and Y. Atzmon, "Key-locked rank one editing for text-to-image personalization," in Proc. ACM Special Interest Group on Computer Graphics, 2023, pp. 1–11.

[9] Y. Gu, X. Wang, J. Z. Wu, Y. Shi, Y. Chen, Z. Fan, W. Xiao, R. Zhao, S. Chang, W. Wu, Y. Ge, Y. Shan, and M. Z. Shou, "Mix-of-Show: Decentralized low-rank adaptation for multi-concept customization of diffusion models," Advances in Neural Information Processing Systems, 2023.

[10] L. Han, Y. Li, H. Zhang, P. Milanfar, D. Metaxas, and F. Yang, "SVDiff: Compact parameter space for diffusion fine-tuning," in Proc. IEEE International Conference on Computer Vision, 2023, pp. 7323–7334.

[11] Z. Liu, R. Feng, K. Zhu, Y. Zhang, K. Zheng, Y. Liu, D. Zhao, J. Zhou, and Y. Cao, "Cones: Concept neurons in diffusion models for customized generation," in Proc. International Conference on Machine Learning, 2023, pp. 21548–21566.

[12] Y. Wei, Y. Zhang, Z. Ji, J. Bai, L. Zhang, and W. Zuo, "ELITE: Encoding visual concepts into textual embeddings for customized text-to-image generation," in Proc. IEEE International Conference on Computer Vision, 2023, pp. 15943–15953.

[13] R. Zhang, Z. Jiang, Z. Guo, S. Yan, J. Pan, H. Dong, P. Gao, and H. Li, "Personalize segment anything model with one shot," arXiv preprint arXiv:2305.03048, 2023.

[14] Y. Watanabe, R. Togo, K. Maeda, T. Ogawa, and M. Haseyama, "Text-guided facial image manipulation for wild images via manipulation direction-based loss," in Proc. IEEE International Conference on Image Processing, 2023, pp. 361–365.

[15] O. Avrahami, K. Aberman, O. Fried, D. Cohen-Or, and D. Lischinski, "Break-a-scene: Extracting multiple concepts from a single image," in Proc. ACM SIGGRAPH Conference and Exhibition on Computer Graphics and Interactive Techniques in Asia, 2023, pp. 1–12.

[16] L. Zhang, A. Rao, and M. Agrawala, "Adding conditional control to text-to-image diffusion models," in Proc. IEEE International Conference on Computer Vision, 2023, pp. 3836–3847.

[17] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al., "Overcoming catastrophic forgetting in neural networks," Proceedings of the National Academy of Sciences, vol. 114, no. 13, pp. 3521–3526, 2017.

[18] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., "Learning transferable visual models from natural language supervision," in Proc. International Conference on Machine Learning, 2021, pp. 8748–8763.

[19] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 234–241.

[20] A. Ly, M. Marsman, J. Verhagen, R. P. P. P. Grasman, and E. Wagenmakers, "A tutorial on Fisher information," Journal of Mathematical Psychology, vol. 80, pp. 40–55, 2017.

[21] Y. Li, R. Zhang, J. C. Lu, and E. Shechtman, "Few-shot image generation with elastic weight consolidation," Advances in Neural Information Processing Systems, vol. 33, pp. 15885–15896, 2020.

[22] C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner, "Variational continual learning," in Proc. International Conference on Learning Representations, 2018, pp. 1–18.

[23] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.

[24] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Proc. IEEE European Conference on Computer Vision, 2014, pp. 740–755.

[25] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, "From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions," Transactions of the Association for Computational Linguistics, vol. 2, pp. 67–78, 2014.

[26] F. Li, H. Zhang, H. Xu, S. Liu, L. Zhang, L. M. Ni, and H. Y. Shum, "Mask DINO: Towards a unified transformer-based framework for object detection and segmentation," in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3041–3050.
