custom_diffusion
Image Diffusion
CVPR, 2023.
Large-scale text-to-image models
“teddy bears mixing sparkling chemicals as mad scientists in a steampunk style”
Text-to-image isn’t perfect…
[Figure: Stable Diffusion samples for “Photo of a moongate”, shown beside actual moongate images]
Customization
[Figure: Stable Diffusion and Custom Diffusion samples for “Photo of a moongate”, shown beside actual moongate images]
Unseen contexts
[Figure: Custom Diffusion samples, shown beside actual moongate images, for the prompts “Moongate in the middle of highway”, “Moongate in snowy ice”, and “A puppy in front of Moongate”]
No knowledge of personal concepts (Stable Diffusion)
is smaller than
Diffusion
+ =
Model (U-Net)
L2 loss
Fine-tuning all model weights
Analyze change in weights
$\Delta_l = \dfrac{\lVert \theta'_l - \theta_l \rVert}{\lVert \theta_l \rVert}$
where
$\theta'_l$: updated weights of layer $l$
$\theta_l$: pretrained weights of layer $l$
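One way to make the weight-change analysis concrete is a per-layer relative norm: the size of the difference between updated and pretrained weights, divided by the size of the pretrained weights. The sketch below assumes weights are available as NumPy arrays keyed by layer name. The function name, layer names, and toy values are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def relative_weight_change(pretrained, updated):
    """Return {layer name: ||updated - pretrained|| / ||pretrained||}."""
    changes = {}
    for name, w0 in pretrained.items():
        w1 = updated[name]
        # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
        changes[name] = float(np.linalg.norm(w1 - w0) / np.linalg.norm(w0))
    return changes

# Toy example with made-up layer names and values (assumptions).
pretrained = {"cross_attn.k": np.ones((2, 2)), "resnet.conv": np.ones((2, 2))}
updated = {"cross_attn.k": 1.5 * np.ones((2, 2)), "resnet.conv": 1.01 * np.ones((2, 2))}
changes = relative_weight_change(pretrained, updated)
```

Sorting `changes` by value then gives a rough ranking of which layers moved most during fine-tuning.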
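Text-image Cross-Attention
[Diagram: text features $c$ for the prompt “photo of a moon gate”]
$K = W^k \cdot c$, $V = W^v \cdot c$
i.e., $\text{Output} = \mathrm{softmax}\left(\dfrac{QK^\top}{\sqrt{d}}\right) V$, with $Q = W^q \cdot f$ computed from the image features $f$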
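Text-image Cross-Attention
Text features $c$ (prompt “a photo of moon gate”) are the input to $W^k$ and $W^v$, the key and value projection matrices
[Legend: Trainable / Frozen]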
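The following is a minimal single-head sketch of text-image cross-attention, assuming the standard scaled dot-product form. The names `f`, `c`, `W_q`, `W_k`, `W_v` and all shapes are illustrative assumptions. The point it shows is that the text features only enter through the key and value projections, while the query comes from the image features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(f, c, W_q, W_k, W_v):
    """f: (n_pixels, d_f) image features; c: (n_tokens, d_c) text features.
    W_q: (d, d_f); W_k, W_v: (d, d_c). Rows are feature vectors."""
    Q = f @ W_q.T                       # queries from image features
    K = c @ W_k.T                       # keys from text features
    V = c @ W_v.T                       # values from text features
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (n_pixels, n_tokens)
    return attn @ V, attn
```

Each row of `attn` is a distribution over text tokens for one spatial location.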
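Only fine-tune cross-attention layers
[Diagram: the prompt “a photo of moon gate” goes through the text transformer into the K and V inputs of the attention layers, which alternate with ResNet blocks; Q is shown separately. Legend: Trainable / Frozen]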
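A hedged sketch of how the trainable/frozen split might be expressed in code: given a list of parameter names, keep only the cross-attention key and value projections trainable. This assumes, as the K/V labels in the diagram suggest, that only those projections are updated. The marker strings `attn2.to_k` and `attn2.to_v` and the example parameter names are assumptions about a naming scheme, not something stated in the slides.

```python
def split_parameters(param_names, trainable_markers=("attn2.to_k", "attn2.to_v")):
    """Partition parameter names into (trainable, frozen) lists.

    A name is trainable if it contains any marker substring.
    The default markers are an assumed naming convention.
    """
    trainable, frozen = [], []
    for name in param_names:
        if any(marker in name for marker in trainable_markers):
            trainable.append(name)
        else:
            frozen.append(name)
    return trainable, frozen

# Hypothetical parameter names for illustration only.
names = [
    "down.0.resnet.conv1.weight",
    "down.0.attn1.to_k.weight",   # self-attention key (stays frozen)
    "down.0.attn2.to_q.weight",   # cross-attention query (stays frozen)
    "down.0.attn2.to_k.weight",   # cross-attention key
    "down.0.attn2.to_v.weight",   # cross-attention value
]
trainable, frozen = split_parameters(names)
```

In a real training loop the same split would decide which parameters are handed to the optimizer.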
Add regularization images
Target images: “Photo of a {moongate}”, “Photo of a {moongate}”, ...
+
Regularization images: “Blood moon”, “sky full of stars and the moon”, ...
Generated samples for target concept: “Photo of a moongate”
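V* dog
[Diagram: the prompt “photo of a V* dog” goes through the text transformer into the K and V inputs of the attention layers, which alternate with ResNet blocks; Q is shown separately. Legend: Trainable / Frozen]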
Single concept results
[Figure: “A watercolor painting of V* tortoise plushy on a mountain”]
+ ?
Joint training
1. Combine the training dataset of multiple concepts
[Figure: target images and regularization images for “V* dog” and “Moongate” are combined (+). Example prompt: “V* dog wearing sunglasses in front of a moongate”]
Objective function for merging weights
[Diagram: the prompt “photo of a V* dog” goes through the text transformer on both sides of the equation]
Merged weight * text features = Fine-tuned weights for V* dog * text features
Objective function for merging weights
[Diagram: the prompt “photo of a moongate” goes through the text transformer on both sides of the equation]
Merged weight * text features = Fine-tuned weights for moongate * text features
Objective function for merging weights
i.e., a constrained least squares problem
Constraints: for each concept, merged weight * text features = fine-tuned weight * text features
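The slide text gives only the constraints, so the sketch below fills in an objective as an explicit assumption: keep the merged matrix close to the pretrained one on a set of regularization text features, subject to the merged matrix reproducing each fine-tuned matrix's output on that concept's text features. Under that assumption the problem has a closed-form solution via Lagrange multipliers. All names (`merge_weights`, `W0`, `C_reg`, and so on) are illustrative, and the code assumes `C_reg.T @ C_reg` is invertible and the constraints are linearly independent.

```python
import numpy as np

def merge_weights(W0, finetuned, concept_feats, C_reg):
    """Solve min_W ||W C_reg^T - W0 C_reg^T||_F  s.t.  W c_i^T = W_i c_i^T.

    W0: (out, in) pretrained matrix.
    finetuned: list of (out, in) matrices, one per concept.
    concept_feats: list of (s_i, in) text features, one per concept.
    C_reg: (n_reg, in) regularization text features (assumed objective).
    """
    C = np.concatenate(concept_feats, axis=0)                  # (s, in)
    # Target outputs: each fine-tuned matrix applied to its own features.
    V = np.concatenate(
        [c @ W.T for W, c in zip(finetuned, concept_feats)], axis=0
    ).T                                                        # (out, s)
    A = C_reg.T @ C_reg                                        # (in, in), symmetric
    d = np.linalg.solve(A, C.T).T                              # C A^{-1}, (s, in)
    R = V - W0 @ C.T                                           # constraint residual
    lam = np.linalg.solve((d @ C.T).T, R.T).T                  # R (d C^T)^{-1}
    return W0 + lam @ d
```

The returned matrix equals the pretrained one plus a low-rank correction whose rank is at most the total number of constrained text tokens.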
Merging weights of individual concepts
V* teddybear in Times Square??
Qualitative comparison (single-concept)
[Figure columns: Target Images | Custom Diffusion (Ours) | DreamBooth | Textual Inversion]
Text alignment: Sim(generated image, text)
[Diagram: the text “dog playing with a ball” goes through the CLIP text encoder and is compared with the generated image]
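As a rough illustration of the Sim( , ) score, the sketch below assumes it is a cosine similarity between two embedding vectors, one for the generated image and one for the text. That choice of similarity is an assumption, and the vectors here are placeholders rather than real CLIP features.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder embeddings (assumptions, not real encoder outputs).
image_embedding = np.array([1.0, 0.0, 1.0])
text_embedding = np.array([1.0, 0.0, 0.0])
score = cosine_similarity(image_embedding, text_embedding)
```

A higher score means the two embeddings point in more similar directions.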
Quantitative comparison (single-concept)
Quantitative comparison (multi-concept)
Memory requirement
Each custom diffusion model: 75MB storage