custom_diffusion

The document discusses advancements in customizing text-to-image diffusion models, focusing on efficient training and low storage requirements. It highlights methods for merging weights of individual concepts and generating images with multiple concepts, achieving significant improvements in speed and efficiency compared to traditional models. The proposed approach allows for personalized and compositional image generation while maintaining a manageable storage footprint.


Multi-Concept Customization of Text-to-Image Diffusion

Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, Jun-Yan Zhu

CVPR, 2023.
Large-scale text-to-image models

"teddy bears mixing sparkling chemicals as mad scientists in a steampunk style"

"A photograph of the inside of a subway train. There are raccoons sitting on the seats. One of them is reading a newspaper. The window shows the city in the background."

"A teddy bear on a skateboard in Times Square."

Diffusion models (DALL-E 2, Stable Diffusion) | Autoregressive models (Image GPT, Parti) | GANs (GigaGAN)
Text-to-image isn't perfect… (Stable Diffusion)

Photo of a moongate
Actual moongate images
Customization (Custom Diffusion)

Photo of a moongate
Actual moongate images
Unseen contexts (Custom Diffusion)

Actual moongate images
Moongate in the middle of highway
Moongate in snowy ice
A puppy in front of moongate
No knowledge of personal concepts (Stable Diffusion)

Jun-Yan's dog, Stark
A dark grey weimaraner dog
Customization (Custom Diffusion)

Jun-Yan's dog, Stark
V* dog wearing sunglasses

Multiple concepts (Custom Diffusion)

Jun-Yan's dog, Stark + Actual moongate images
V* dog wearing sunglasses in front of moongate
The Bottleneck
The space of training images we can describe

is smaller than

the space of images we can imagine.


How to efficiently customize
text-to-image diffusion models?
Diffusion models

Forward diffusion process (fixed)

Reverse diffusion process (learned generative model)

*slides adapted from https://cvpr2022-tutorial-diffusion-models.github.io


Diffusion model training
Prompt: Photo of a moongate

noisy image + prompt → Diffusion Model (U-Net) → predicted noise

Trained with an L2 loss between the predicted and true noise.
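The training step above can be sketched as follows. The linear `denoiser` and the linear beta schedule are toy stand-ins (assumptions for illustration), not the actual U-Net or the schedule used by Stable Diffusion; only the noising formula and the L2 noise-prediction loss mirror the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the U-Net: a fixed linear map (hypothetical).
W_model = rng.normal(size=(8, 8)) * 0.1

def denoiser(x_t, t):
    # Predicts the noise that was added at step t.
    return x_t @ W_model

# Noise schedule: alpha_bar[t] is the cumulative product of (1 - beta_t).
betas = np.linspace(1e-4, 0.02, 100)
alpha_bar = np.cumprod(1.0 - betas)

def training_loss(x0):
    t = rng.integers(0, len(betas))        # sample a random timestep
    eps = rng.normal(size=x0.shape)        # sample Gaussian noise
    # Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    eps_pred = denoiser(x_t, t)
    return np.mean((eps_pred - eps) ** 2)  # L2 loss on the noise

loss = training_loss(rng.normal(size=(4, 8)))
```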
Fine-tuning all model weights

Photo of a moongate → Moongate in snowy ice

Storage requirement: 4GB of storage for each fine-tuned model.
Compute requirement: more VRAM and longer training time.
Compositionality: hard to combine multiple fine-tuned models.
Analyze change in weights

Δ_l = ‖θ'_l − θ_l‖ / ‖θ_l‖,   where θ'_l : updated weights, θ_l : pretrained weights

Comparing Δ_l across layer types, the cross-attention weights change the most during fine-tuning, despite being a small fraction of the total parameters.
Text-image Cross-Attention

Given latent image features f and text features c (e.g., for "photo of a moongate"):

Q = W^q f,   K = W^k c,   V = W^v c

Output = softmax(QK^⊤ / √d') V

Text features are the only input to W^k and W^v.
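A minimal single-head version of this cross-attention, with hypothetical feature dimensions; note that only `Wk` and `Wv` ever touch the text features:

```python
import numpy as np

def cross_attention(f, c, Wq, Wk, Wv):
    """Single-head text-image cross-attention.
    f: (n, d_img) latent image features;  c: (s, d_txt) text features.
    Only Wk and Wv consume the text features."""
    Q = f @ Wq                                   # (n, d')
    K = c @ Wk                                   # (s, d')
    V = c @ Wv                                   # (s, d')
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # Numerically stable softmax over the text tokens.
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V                                 # (n, d')
```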
Only fine-tune cross-attention layers

Within the diffusion model U-Net, only the cross-attention K and V projections (W^k, W^v), which take the text features as input, are trainable; the text transformer, ResNet blocks, self-attention layers, and Q projections stay frozen.
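Selecting the trainable subset can be sketched as a name filter over the model's parameters. The parameter names below loosely follow Stable Diffusion's convention (`attn2` for cross-attention, `to_k`/`to_v` for the projections) but are illustrative, not an exact module listing:

```python
# Hypothetical parameter names, loosely modeled on Stable Diffusion's U-Net.
params = [
    "down.0.resnet.conv.weight",
    "down.0.attn1.to_q.weight",       # self-attention: frozen
    "down.0.attn2.to_q.weight",       # cross-attention query: frozen
    "down.0.attn2.to_k.weight",       # cross-attention key: trainable
    "down.0.attn2.to_v.weight",       # cross-attention value: trainable
    "text_transformer.layer.0.weight",
]

def is_trainable(name):
    # Only the K/V projections of cross-attention layers are fine-tuned.
    return ".attn2.to_k." in name or ".attn2.to_v." in name

trainable = [n for n in params if is_trainable(n)]
```

In a PyTorch model this filter would set `requires_grad = False` on every parameter it rejects.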
Generated samples for target concept (before regularization)
Photo of a moongate

Pretrained Model Fine-tuned Model


Generated samples for similar concepts
Photo of a moon

Pretrained Model Fine-tuned Model


How to prevent overfitting?

Target images: Photo of a {moongate} …
Add regularization images: real images with captions similar to the target, e.g., "Photo of a sky full of stars", "Blood moon and the moon" …
Generated samples for target concept (after adding regularization)
Photo of a moongate

Pretrained Model Fine-tuned Model


Generated samples for similar concepts
Photo of a moon

Pretrained Model Fine-tuned Model


Personalized concepts

How to describe personalized concepts?

Jun-Yan's dog, Stark → "V* dog", where V* is a modifier token in the text embedding space.
Personalized concepts
Also fine-tune the modifier token V* that describes the personalized concept (e.g., "photo of a V* dog"), together with the cross-attention K/V projections; the rest of the text transformer and U-Net stay frozen.
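A sketch of adding the V* modifier token to the embedding table and updating only its row. Initializing from an existing (e.g., rarely used) token's embedding is one common choice, stated here as an assumption; the vocabulary size and dimensions are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_embeddings = rng.normal(size=(1000, 64))   # pretrained token embeddings

# Add a new row for the modifier token V*, initialized from an existing
# token's embedding (one common choice; initialization schemes vary).
v_star_init = vocab_embeddings[999].copy()
vocab_embeddings = np.vstack([vocab_embeddings, v_star_init[None, :]])
V_STAR_ID = vocab_embeddings.shape[0] - 1

# During fine-tuning, only this row (plus the cross-attention K/V weights)
# receives gradient updates; simulate one SGD step on the new row only.
grad = rng.normal(size=(64,))
vocab_embeddings[V_STAR_ID] -= 1e-2 * grad
```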
Single concept results

V* dog wearing headphones


Single concept results

A watercolor painting of V* tortoise plushy on a mountain

Single concept results

V* table and an orange sofa

Single concept results

Drawings from Aaron Hertzmann
Painting of dog in the style of V* art
Multiple new concepts?

Joint training
1. Combine the training datasets of multiple concepts

Target images: V* dog, Moongate
Regularization images: Dog, Cute dog, Wisdom moon, Gated entry
Joint training

Requires re-training for each choice of composition:
100 concepts → 4,950 combinations of two concepts
100 concepts → 161,700 combinations of three concepts
Can we merge weights of individual concepts?

V* dog + moongate → "V* dog wearing sunglasses in front of a moongate"
Objective function for merging weights

For each concept, the merged weight W should reproduce the corresponding fine-tuned model's output on that concept's target prompts:

W c ≈ W_{V* dog} c       for embeddings c of prompts like "photo of a V* dog"
W c ≈ W_{moongate} c     for embeddings c of prompts like "photo of a moongate"
Objective function for merging weights

Ŵ = argmin_W ‖W C_reg^⊤ − W_0 C_reg^⊤‖_F   subject to   W C^⊤ = V^⊤

where C stacks the embeddings of the target prompts, e.g., {photo of a V* dog, photo of a moongate}, V^⊤ stacks the corresponding fine-tuned outputs, and C_reg stacks the embeddings of a collection of random text prompts: match each fine-tuned model on its target prompts while staying similar to the pretrained weights W_0 on the regularization prompts.

Constrained least square problem

Closed-form solution using Lagrange multipliers: differentiating the Lagrangian with respect to W and the multipliers and solving the resulting linear system yields the merged weights.
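The constrained least-squares merge above decomposes row-wise and can be sketched in numpy as follows. The small ridge term added to the Gram matrix is a numerical-stability assumption not in the slides, and the shapes are illustrative:

```python
import numpy as np

def merge_weights(W0, finetuned, target_embeds, C_reg):
    """Merge per-concept fine-tuned matrices into a single W.
    Solves  min_W ||(W - W0) @ C_reg.T||_F  s.t.  W @ C.T = V.T
    in closed form via Lagrange multipliers.
    W0: (out, d) pretrained; finetuned: list of (out, d) matrices;
    target_embeds: list of (s_n, d) prompt embeddings; C_reg: (r, d)."""
    C = np.vstack(target_embeds)                              # (s, d)
    Vt = np.hstack([Wn @ Cn.T                                 # (out, s)
                    for Wn, Cn in zip(finetuned, target_embeds)])
    # Gram matrix of the regularization prompts (small ridge for stability).
    M = C_reg.T @ C_reg + 1e-6 * np.eye(W0.shape[1])          # (d, d)
    Minv_Ct = np.linalg.solve(M, C.T)                         # (d, s)
    S = C @ Minv_Ct                                           # (s, s)
    # W = W0 + (V.T - W0 C.T) S^{-1} C M^{-1}
    return W0 + (Vt - W0 @ C.T) @ np.linalg.solve(S, Minv_Ct.T)
```

By construction the returned W satisfies the constraint exactly: multiplying by C^⊤ cancels the S^{-1} against C M^{-1} C^⊤, leaving W C^⊤ = V^⊤.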
Two concept results

V1* dog in front of moongate

Two concept results

V1* flower in the V2* wooden pot on a table

Two concept results

V1* chair with the V2* cat sitting on it near a beach

Two concept results

The V1* cat is sitting inside a V2* wooden pot and looking up

Two concept results

Drawings from Aaron Hertzmann
V1* art style painting of V2* wooden pot
Concurrent works
• DreamBooth: https://dreambooth.github.io/
• Fine-tuning all the weights

• Textual Inversion: https://textual-inversion.github.io/


• Optimizing text embedding with frozen weights
Qualitative comparison (single-concept)
Target Images | Custom Diffusion (Ours) | DreamBooth | Textual Inversion

V* teddybear in Times Square
Qualitative comparison (multi-concept)
Target Images | Custom Diffusion (Ours) | DreamBooth | Textual Inversion

V1* flower in the V2* wooden pot on a table
Quantitative metrics

Image alignment: Sim(CLIP image embedding of the generated image, CLIP image embedding of the target image)

Text alignment: Sim(CLIP image embedding of the generated image, CLIP text embedding of the prompt, e.g., "dog playing with a ball")
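Both metrics are cosine similarities between CLIP embeddings. The sketch below uses random vectors as stand-ins for actual CLIP image/text encoder outputs (an assumption; in practice these come from a CLIP model):

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
# Stand-ins for CLIP embeddings (in practice, from CLIP's encoders).
emb_generated = rng.normal(size=512)   # image embedding of generated image
emb_target    = rng.normal(size=512)   # image embedding of target image
emb_prompt    = rng.normal(size=512)   # text embedding of the prompt

image_alignment = cosine_sim(emb_generated, emb_target)   # image-image
text_alignment  = cosine_sim(emb_generated, emb_prompt)   # image-text
```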
Quantitative comparison (single-concept)
Quantitative comparison (multi-concept)
Memory requirement
Each Custom Diffusion model: 75MB of storage.

Analyze the difference between pretrained and fine-tuned weights.

Compressing fine-tuned weights
Custom Diffusion (75MB) → top 20% of ranks (15MB) → rank 1 (0.1MB) → rank 0 (0.08MB)
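One way to realize such compression is a truncated SVD of the weight difference, keeping only the top singular directions. This is a generic sketch of the technique, not necessarily the exact scheme behind the 15MB/0.1MB numbers:

```python
import numpy as np

def compress_delta(W0, W_ft, rank):
    """Low-rank compression of the fine-tuning update: keep only the top
    `rank` singular directions of (W_ft - W0) and add them back to W0."""
    U, s, Vh = np.linalg.svd(W_ft - W0, full_matrices=False)
    delta_lr = (U[:, :rank] * s[:rank]) @ Vh[:rank]
    return W0 + delta_lr

rng = np.random.default_rng(0)
W0 = rng.normal(size=(32, 32))
# Construct a fine-tuned weight whose update is exactly rank 1.
W_ft = W0 + rng.normal(size=(32, 1)) @ rng.normal(size=(1, 32))
W_c = compress_delta(W0, W_ft, rank=1)
```

Storing only the truncated factors (U, s, Vh up to `rank`) instead of the full delta is what shrinks the per-concept footprint.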
Limitations

Ours: V1* dog and a V2* cat playing together
Pretrained model: dog and a cat playing together

The two concepts are entangled.
Summary
• Efficient training (~6 minutes on 2 A100 GPUs)
• Low storage: 15–75 MB per concept (vs. 4GB for the complete model)
• On-the-fly weight merging for 2–3 concepts within 1 second
