Multi-Object Editing in Personalized Text-To-Image Diffusion Model via Segmentation Guidance
Haruka Matsuda†, Ren Togo††, Keisuke Maeda††, Takahiro Ogawa††, Miki Haseyama††
† School of Engineering, Hokkaido University, Japan
†† Faculty of Information Science and Technology, Hokkaido University, Japan
E-mail: {matsuda, togo, maeda, ogawa, mhaseyama}@lmd.ist.hokudai.ac.jp
ABSTRACT
learning by introducing Elastic Weight Consolidation (EWC) [17], which contributes to maintaining the visual fidelity of the target objects. Finally, we summarize the contributions of this paper.

• By associating the segmentation guide with the personalization target, the proposed method succeeds in outputting the target from a specific segmented region and improves the editing ability. In particular, we can separate similar concepts by mapping a specific segmented region to each target when personalizing multiple targets.

• Since the continual learning method with EWC restricts parameter updates when learning multiple target concepts, the proposed method can maintain the visual fidelity of previously learned targets.

Consequently, the proposed method achieves the output of multiple objects with similar concepts, maintaining an appropriate balance between visual fidelity and editing ability, as shown in Fig. 1.
2. PROPOSED METHOD

2.1. Personalized Text-to-Image Diffusion Model

The proposed method performs fine-tuning based on the DreamBooth framework. In order to make the model learn the target object S, we use content image data x_S and its corresponding text y_S for personalization, and regularized image data x_S^{reg} and its corresponding text y_S^{reg} for regularization as inputs. Note that the roles of x_S^{reg} and y_S^{reg} are to address the overfitting and language-drift problems [3]. Then y_S consists of a unique identifier and a class descriptor of the object, such as "a [identifier] [class noun]", and y_S^{reg} includes only the class, such as "a [class noun]". Although x_S^{reg} are generated by the frozen diffusion model in the original DreamBooth, we alternatively use real images as x_S^{reg} because utilizing real images is effective in maintaining the visual fidelity [7]. The loss function L_S can be expressed as follows:

L_S = L_{LDM}(x_S, y_S) + \lambda_{reg} L_{LDM}(x_S^{reg}, y_S^{reg}),   (1)

where \lambda_{reg} indicates the importance of the class prior, and L_{LDM} is the loss of the Latent Diffusion Model (LDM) [2] expressed by the following equation:

L_{LDM}(x, y) = \mathbb{E}\big[ \| \epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t}\, x + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ t,\ \tau_\theta(y)) \|_2^2 \big],   (2)

where t is a timestep, \bar{\alpha}_t is a step size based on a variance schedule, \epsilon is random noise, \tau_\theta(\cdot) is the text encoder of Contrastive Language-Image Pre-Training (CLIP) [18], and \epsilon_\theta(\cdot, \cdot, \cdot) is the U-Net [19]. The proposed method adds an additional term to L_S to learn the correspondence with the segmentation image and to limit excessive updating of the parameters.
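For concreteness, the following PyTorch-style sketch shows how the losses in Eqs. (1) and (2) can be computed. The callables unet and text_encoder and the tensor alphas_cumprod are assumed stand-ins for a latent diffusion backbone and its noise schedule, not the authors' released implementation.

import torch
import torch.nn.functional as F

def ldm_loss(unet, text_encoder, alphas_cumprod, x, y_tokens):
    # L_LDM(x, y) of Eq. (2): predict the noise added to the (latent) input x.
    b = x.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=x.device)  # timestep t
    eps = torch.randn_like(x)                            # random noise epsilon
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)           # \bar{\alpha}_t
    x_t = a_bar.sqrt() * x + (1.0 - a_bar).sqrt() * eps  # noised input
    eps_hat = unet(x_t, t, text_encoder(y_tokens))       # epsilon_theta(., t, tau_theta(y))
    return F.mse_loss(eps_hat, eps)                      # mean squared error stands in for the squared L2 norm

def dreambooth_loss(unet, text_encoder, alphas_cumprod,
                    x_s, y_s, x_reg, y_reg, lambda_reg=1.0):
    # L_S of Eq. (1): personalization term plus the prior-preservation term.
    return (ldm_loss(unet, text_encoder, alphas_cumprod, x_s, y_s)
            + lambda_reg * ldm_loss(unet, text_encoder, alphas_cumprod, x_reg, y_reg))

In practice, x would be the VAE-encoded latent of a training image, and y_tokens the tokenized prompt: "a [identifier] [class noun]" for the personalization set and "a [class noun]" for the regularization set.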
2.2. Initial Training with ControlNet Loss

The purpose of the first training session is to form a correspondence between the specific target and the segmented image, not only the specific text. To perform this additional conditioning, the following loss L_{CLDM} is added to Eq. (1):

L_{CLDM}(x, y, i) = \mathbb{E}\big[ \| \epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t}\, x + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ t,\ \tau_\theta(y),\ i) \|_2^2 \big],   (3)

where i is the segmentation image data corresponding to the content image data x. The first term of Eq. (1) and L_{CLDM} have the same purpose in terms of mapping conditions to target objects. Here, simply adding these two losses promotes overfitting, which is a known problem of DreamBooth. Moreover, if the weight of L_{CLDM} is too large, the segmentation guide is ignored during inference and the training target is simply generated. To prevent these problems, we introduce a hyperparameter \gamma \in [0, 1] that changes as training progresses. As a result, the overall loss function for fine-tuning on the first object S_1 is expressed as follows:

L_{S_1} = (1 - \gamma)\, L_{LDM}(x_{S_1}, y_{S_1}) + \lambda_{reg}\, L_{LDM}(x_{S_1}^{reg}, y_{S_1}^{reg}) + \gamma\, L_{CLDM}(x_{S_1}, y_{S_1}, i_{S_1}).   (4)

By performing fine-tuning based on the proposed loss above, our method forms the correspondence between the specific target and the specific segmentation image.
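A minimal sketch of Eqs. (3) and (4) is given below. The predictor cond_unet(x_t, t, cond, seg) stands in for a ControlNet-style U-Net that accepts the segmentation image as an extra condition, and the linear schedule for gamma is an assumption for illustration; the paper only states that \gamma \in [0, 1] changes as training progresses. The L_LDM terms are computed as in the previous sketch.

import torch
import torch.nn.functional as F

def cldm_loss(cond_unet, text_encoder, alphas_cumprod, x, y_tokens, seg):
    # L_CLDM(x, y, i) of Eq. (3): noise prediction conditioned on the
    # text tokens y and the segmentation image i (here: seg).
    b = x.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=x.device)
    eps = torch.randn_like(x)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x + (1.0 - a_bar).sqrt() * eps
    eps_hat = cond_unet(x_t, t, text_encoder(y_tokens), seg)
    return F.mse_loss(eps_hat, eps)

def gamma_schedule(step, total_steps):
    # Assumed schedule: gamma grows linearly from 0 to 1, gradually shifting
    # weight from the plain L_LDM term to the segmentation-guided L_CLDM term.
    return min(1.0, step / max(1, total_steps))

def loss_s1(l_ldm_target, l_ldm_reg, l_cldm, gamma, lambda_reg=1.0):
    # L_{S_1} of Eq. (4): gamma-weighted mix of the DreamBooth terms and L_CLDM.
    return (1.0 - gamma) * l_ldm_target + lambda_reg * l_ldm_reg + gamma * l_cldm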
2.3. Subsequent Training with EWC Loss

The objective of the subsequent training sessions is to prevent performance degradation on previous tasks, in addition to achieving the goals of the initial training. Following the ideas of Kirkpatrick et al. [17], we address this issue by updating the diffusion model parameters without deviating from the past distribution. Specifically, the following loss L_{EWC} is added to Eq. (4) in the second and later fine-tuning sessions for objects S_n (n = 2, 3, ..., N, with N being the number of target objects):

L_{EWC}(\theta_{S_n, i}) = \sum_{i} F_{S_{n-1}, i}\, (\theta_{S_n, i} - \theta^{*}_{S_{n-1}, i})^{2},   (5)

where \theta_{S_n} is a network parameter being updated, \theta^{*}_{S_{n-1}} is a network parameter obtained from the fine-tuned model of the previous session, F_{S_{n-1}} is the Fisher information matrix calculated in each fine-tuning session, and i is the index of the parameter corresponding to the layer to be updated. Note that both \theta^{*}_{S_{n-1}} and F_{S_{n-1}} are treated as constants in the calculation of the loss L_{EWC}. Here, the Fisher information matrix serves as an approximation of the posterior distribution [20], and the importance of the previous task is aggregated for each layer of the model [21]. The matrix value is defined as follows:

F = \mathbb{E}\left[ \left( \frac{\partial}{\partial \theta} \log L(\theta \mid x) \right)^{2} \right],   (6)

where L(\theta \mid x) is the likelihood function, which can be calculated by approximating the variational lower bound [22]. According to the theory of Denoising Diffusion Probabilistic Models (DDPM) [23], the definition of the diffusion model loss starts from optimizing a negative log-likelihood. This loss is then transformed into the base form of L_{LDM} by splitting the equation, dropping a constant term, and simplifying the coefficients. Since these operations are all linear transformations, Eq. (2) can be regarded as proportional to the log-likelihood. Therefore, we can calculate the Fisher matrix of the latent diffusion model as expressed by the following equation:

F_{S_n, i} \approx \mathbb{E}\left[ \left( \frac{\partial}{\partial \theta} L_{LDM}(x_{S_n}, y_{S_n}) \right)^{2} \right] \Bigg|_{\theta = \theta_{S_n, i}}.   (7)
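The EWC penalty of Eq. (5) and the diagonal Fisher estimate of Eq. (7) could be realized as in the sketch below. Here, model is any torch.nn.Module holding the diffusion parameters, and loss_fn(model, batch) is assumed to return the L_LDM value of Eq. (2) on one batch of the previous object's data; both names are placeholders rather than the authors' implementation.

import torch

def estimate_fisher(model, loss_fn, batches):
    # Diagonal Fisher of Eq. (7): average squared gradient of L_LDM with
    # respect to each parameter, evaluated at the current weights.
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()
              if p.requires_grad}
    for batch in batches:                     # batches: a list of data batches
        model.zero_grad()
        loss_fn(model, batch).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(1, len(batches)) for n, f in fisher.items()}

def ewc_loss(model, fisher, old_params):
    # L_EWC of Eq. (5): Fisher-weighted squared drift from the parameters
    # theta* saved at the end of the previous fine-tuning session.
    loss = torch.zeros((), device=next(model.parameters()).device)
    for n, p in model.named_parameters():
        if n in fisher:
            loss = loss + (fisher[n] * (p - old_params[n].detach()) ** 2).sum()
    return loss

As in Eq. (5), fisher and old_params are held fixed while the new object is learned; only the live parameters receive gradients.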
U-Net. By repeating the denoising loop with the above-mentioned training flow, the proposed method performs training with segmentation image conditioning.
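To show how these pieces might be chained across objects, the sketch below runs the sequential fine-tuning sessions described in Sects. 2.2 and 2.3, reusing the ewc_loss and estimate_fisher helpers from the previous sketch. The session containers, the per-batch loss callback (returning Eq. (4) for the current object), and the optimizer are assumptions for illustration, not settings reported in the paper.

import torch

def fine_tune_sequentially(model, optimizer, sessions, session_loss_fn, fisher_loss_fn):
    # sessions: one entry per object S_1, ..., S_N, each providing training
    # batches and batches used for the Fisher estimate of Eq. (7).
    fisher, old_params = None, None
    for n, session in enumerate(sessions, start=1):
        for batch in session["batches"]:
            loss = session_loss_fn(model, batch)           # Eq. (4) for object S_n
            if n > 1:                                      # from S_2 on, add Eq. (5)
                loss = loss + ewc_loss(model, fisher, old_params)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # Snapshot theta* and the Fisher matrix for the next session.
        old_params = {k: v.detach().clone() for k, v in model.named_parameters()}
        fisher = estimate_fisher(model, fisher_loss_fn, session["fisher_batches"])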
Fig. 3: Qualitative comparison results. The segmentation image is used for the proposed method.
5. REFERENCES

[1] M. Henderson, K. Bokor, M. Dookie, and M. Shirotori, "Creative economy outlook 2022," United Nations, pp. 70–75, 2022.
[2] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10684–10695.
[3] N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman, "DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation," in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22500–22510.
[4] R. Gal, Y. Alaluf, Y. Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-Or, "An image is worth one word: Personalizing text-to-image generation using textual inversion," in Proc. International Conference on Learning Representations, 2023, pp. 1–31.
[5] W. Chen, H. Hu, Y. Li, N. Rui, X. Jia, M. Chang, and W. W. Cohen, "Subject-driven text-to-image generation via apprenticeship learning," arXiv preprint arXiv:2304.00186, 2023.
[6] R. Gal, M. Arar, Y. Atzmon, A. H. Bermano, G. Chechik, and D. Cohen-Or, "Encoder-based domain tuning for fast personalization of text-to-image models," ACM Transactions on Graphics, vol. 42, no. 4, pp. 1–13, 2023.
[7] N. Kumari, B. Zhang, R. Zhang, E. Shechtman, and J. Zhu, "Multi-concept customization of text-to-image diffusion," in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1931–1941.
[8] Y. Tewel, R. Gal, G. Chechik, and Y. Atzmon, "Key-locked rank one editing for text-to-image personalization," in Proc. ACM Special Interest Group on Computer Graphics, 2023, pp. 1–11.
[9] Y. Gu, X. Wang, J. Z. Wu, Y. Shi, Y. Chen, Z. Fan, W. Xiao, R. Zhao, S. Chang, W. Wu, Y. Ge, Y. Shan, and M. Z. Shou, "Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models," Advances in Neural Information Processing Systems, 2023.
[10] L. Han, Y. Li, H. Zhang, P. Milanfar, D. Metaxas, and F. Yang, "SVDiff: Compact parameter space for diffusion fine-tuning," in Proc. IEEE International Conference on Computer Vision, 2023, pp. 7323–7334.
[11] Z. Liu, R. Feng, K. Zhu, Y. Zhang, K. Zheng, Y. Liu, D. Zhao, J. Zhou, and Y. Cao, "Cones: Concept neurons in diffusion models for customized generation," in Proc. International Conference on Machine Learning, 2023, pp. 21548–21566.
[12] Y. Wei, Y. Zhang, Z. Ji, J. Bai, L. Zhang, and W. Zuo, "ELITE: Encoding visual concepts into textual embeddings for customized text-to-image generation," in Proc. IEEE International Conference on Computer Vision, 2023, pp. 15943–15953.
[13] R. Zhang, Z. Jiang, Z. Guo, S. Yan, J. Pan, H. Dong, P. Gao, and H. Li, "Personalize segment anything model with one shot," arXiv preprint arXiv:2305.03048, 2023.
[14] Y. Watanabe, R. Togo, K. Maeda, T. Ogawa, and M. Haseyama, "Text-guided facial image manipulation for wild images via manipulation direction-based loss," in Proc. IEEE International Conference on Image Processing, 2023, pp. 361–365.
[15] O. Avrahami, K. Aberman, O. Fried, D. Cohen-Or, and D. Lischinski, "Break-a-scene: Extracting multiple concepts from a single image," in Proc. ACM SIGGRAPH Conference and Exhibition on Computer Graphics and Interactive Techniques in Asia, 2023, pp. 1–12.
[16] L. Zhang, A. Rao, and M. Agrawala, "Adding conditional control to text-to-image diffusion models," in Proc. IEEE International Conference on Computer Vision, 2023, pp. 3836–3847.
[17] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al., "Overcoming catastrophic forgetting in neural networks," Proceedings of the National Academy of Sciences, vol. 114, no. 13, pp. 3521–3526, 2017.
[18] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., "Learning transferable visual models from natural language supervision," in Proc. International Conference on Machine Learning, 2021, pp. 8748–8763.
[19] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in Proc. International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 234–241.
[20] A. Ly, M. Marsman, J. Verhagen, R. P. P. P. Grasman, and E. Wagenmakers, "A tutorial on Fisher information," Journal of Mathematical Psychology, vol. 80, pp. 40–55, 2017.
[21] Y. Li, R. Zhang, J. C. Lu, and E. Shechtman, "Few-shot image generation with elastic weight consolidation," Advances in Neural Information Processing Systems, vol. 33, pp. 15885–15896, 2020.
[22] C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner, "Variational continual learning," in Proc. International Conference on Learning Representations, 2018, pp. 1–18.
[23] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.
[24] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Proc. European Conference on Computer Vision, 2014, pp. 740–755.
[25] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, "From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions," Transactions of the Association for Computational Linguistics, vol. 2, pp. 67–78, 2014.
[26] F. Li, H. Zhang, H. Xu, S. Liu, L. Zhang, L. M. Ni, and H. Y. Shum, "Mask DINO: Towards a unified transformer-based framework for object detection and segmentation," in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3041–3050.