Less-to-More Generalization: Unlocking More Controllability by In-Context Generation
this LLMs’ synthetic data-driven self-improvement, this study proposes that achieving stable and scalable customized generation necessitates an analogous model-data co-evolution paradigm, where less-controllable preceding customized models systematically synthesize better customization data for successive more-controllable variants, enabling persistent co-evolution between enhanced customized models and enriched customization data, as illustrated in Fig. 2.

Technically, to achieve this model-data co-evolution, this study addresses two fundamental challenges: (1) how to establish a systematic synthetic data curation framework that reliably harnesses knowledge distillation from less-controllable models; and (2) how to develop a generalized customization model framework capable of hierarchical controllability adaptation, ensuring seamless scalability across varying degrees of controllability. Specifically, for the synthetic data curation framework, we introduce a progressive synthesis pipeline that transitions from single-subject to multi-subject in-context generation, combined with a multi-stage filtration mechanism that curates high-resolution, high-quality paired customization data through fine-grained ensembled filtering. For the customization model framework, we develop UNO to fully unlock the multi-condition contextual capabilities of Diffusion Transformers (DiT) through iterative simple-to-hard training, preserving the base architecture's scalability with minimal modifications. Moreover, we propose Universal Rotary Position Embedding (UnoPE) to effectively equip UNO with the capability of mitigating attribute confusion when scaling visual subject controls.

Our contributions are summarized as follows:

Conceptual Contribution. We identify that current data-driven approaches to customized model design inherently suffer from scalability constraints rooted in fundamental data bottlenecks. To address this limitation, we pioneer a model-data co-evolution paradigm that achieves enhanced controllability while enabling stable and scalable customized generation.

Technical Contribution. (1) We develop a systematic framework for synthetic data curation that produces high-fidelity, high-resolution paired customization datasets through progressive in-context synthesis and multi-stage filtering. (2) We propose UNO, a universal customization architecture that enables seamless scalability across multi-condition control through minimal yet effective modifications of DiT.

Experimental Contribution. We conduct extensive experiments on DreamBench [33] and multi-subject driven generation benchmarks. UNO achieves the highest DINO and CLIP-I scores on both tasks, demonstrating strong subject similarity and text controllability and showcasing its capability to deliver state-of-the-art (SOTA) results.

2. Related Work

2.1. Text-to-image generation

Recent years have witnessed explosive growth in text-to-image (T2I) models [6, 7, 17, 25, 28, 30, 31, 44, 46]. Apart from some work that adopts the GAN or autoregressive paradigm, most current text-to-image work adopts denoising diffusion [11, 37] as its image generation framework. Early exploratory work [21, 29, 34] validated the feasibility of using diffusion models for text-to-image generation and demonstrated their superior performance compared with other methods. The efficiency, quality, and capacity of T2I diffusion models have kept improving in subsequent work. LDM [31] shows that training the diffusion model in latent space significantly improves efficiency and output resolution, which has become the default choice for many subsequent works such as the Stable Diffusion series [7, 25], Imagen 3 [2], and FLUX [17]. Recent work [7, 17, 24] replaces the U-Net [32] with a transformer and shows the impressive quality and scalability of the transformer backbone.

2.2. Subject-driven generation

Subject-driven generation has been widely studied in the context of diffusion models. DreamBooth [33], textual inversion [8], and LoRA [12] introduce subject-driven generation capability by adding lightweight new parameters and performing parameter-efficient tuning for each subject. The major drawback of these methods is the cumbersome fine-tuning process for every new subject. IP-Adapter [45] and BLIP-Diffusion [18] use an extra image encoder and new layers to encode the reference image of the subject and inject it into the diffusion model, achieving subject-driven generation without further fine-tuning for a new concept. For DiT, IC-LoRA [13] and OminiControl [38] have explored the inherent image-reference capability of the transformer, showing that the DiT itself can be used as the image encoder for the subject reference. Many further works follow this reference-image injection approach and improve various aspects such as facial identity [10, 40], joint image-text controllability [14], and multiple-reference subject support [20, 41]. Despite these advances, the aforementioned work relies heavily on paired images, which are hard to collect, especially for multi-subject scenes.

3. Methodology

This section introduces our proposed model-data co-evolution paradigm, encompassing the systematic synthetic data curation framework detailed in Sec. 3.2 and the generalized customization model framework (i.e., UNO) expounded upon in Sec. 3.3.
Figure 3. Illustration of our proposed synthetic data curation framework based on in-context data generation. [(a) Single-Subject In-Context Generation: subject and scene sets from a taxonomy tree are combined with a diptych text template (e.g., "A diptych with two side-by-side images of same <subject1>. Left: <subject1> in <scene1>. Right: <subject1> together with <subject2> in <scene2>."), passed to a T2I model, and filtered by a VLM-based filter to yield training pairs (I_ref^1, I_tgt). (b) Multi-Subject In-Context Generation: an OVD crops additional subjects, which are regenerated by an S2I model from subject prompts and filtered by the VLM-based filter to yield (I_ref^1, I_ref^2, I_tgt) training pairs.]

Specifically, the foundational work of DiT [24] is elucidated in Sec. 3.1. Section 3.2 provides an in-depth exploration of the construction of our subject-consistent dataset, comprising meticulously curated single-subject and multi-subject image pairs. Furthermore, Sec. 3.3 outlines our methodology for transforming a Text-to-Image (T2I) DiT model into a Subject-to-Image (S2I) model, showcasing its contextual generation capabilities. This adaptation involves an iterative training framework designed to facilitate multi-image perception, textual comprehension, and conditional generation conducive to subject-driven synthesis.

3.1. Preliminary

The original DiT architecture focuses solely on class-conditional image generation. It departs from the commonly used U-Net backbone, instead employing full transformer layers that operate on latent patches. More recently, image generators such as Stable Diffusion 3 [7] and FLUX.1 [17] are built upon MM-DiT, which incorporates a multi-modal attention mechanism and takes as input the concatenation of the embeddings of text and image inputs. The multi-modal attention operation projects position-encoded tokens into query Q, key K, and value V representations, enabling attention computation across all tokens:

Attention([z_t, c]) = softmax(Q K^T / sqrt(d)) V,   (1)

where Z = [z_t, c] denotes the concatenation of image and text tokens. This allows both representations to function within their own respective spaces while still taking the other into account.
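For concreteness, Eq. (1) can be read as a single joint attention over the concatenated token sequence. The snippet below is a didactic, single-head sketch without RoPE, multi-head splitting, or the separate text/image projection weights of MM-DiT; it only illustrates that image and text tokens attend to each other within one operation.

import math
import torch

def mm_attention(z, w_q, w_k, w_v):
    # z: (L, d) position-encoded concatenation [z_t, c] of image and text tokens
    # w_q, w_k, w_v: (d, d) projection matrices (single head for illustration)
    q, k, v = z @ w_q, z @ w_k, z @ w_v
    attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1]), dim=-1)
    return attn @ v  # every token attends to every other token, Eq. (1)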
3.2. Synthetic Data Curation Framework

The paucity of high-quality, subject-consistent datasets has long presented a formidable obstacle for subject-driven generation, severely constraining the scalability of model training. As depicted in Fig. 3, we introduce a high-resolution, highly consistent data synthesis pipeline to tackle this challenge, capitalizing on the intrinsic in-context generation capabilities of DiT-based models. Through the utilization of meticulously crafted text inputs, DiT models exhibit the capacity to generate subject-consistent grid outcomes. In contrast to prior methodologies such as OminiControl [38], which generate single-subject-consistent data at a resolution of 512 × 512, our approach establishes a more comprehensive pipeline that progresses from single-subject to multi-subject data generation. This advancement enables the direct production of three distinct high-resolution image pairs (i.e., 1024 × 1024, 1024 × 768, and 768 × 1024), significantly broadening the application spectrum and diversity of the synthesized data.

We emphasize that the superior quality of the synthesized data can significantly enhance model performance. To substantiate this assertion, we developed a filtering mechanism based on a Vision-Language Model (VLM) to evaluate the quality of the generated image pairs. Subsequently, we conducted experiments utilizing synthesized data across various quality score levels. As depicted in Fig. 5, image pairs with high quality scores significantly enhance the subject similarity of the results, yielding higher DINO [22] and CLIP-I scores, which verifies that our automated data curation framework can continuously supplement high-quality data and improve model performance.

Single-Subject In-Context Generation. To increase dataset diversity, we initially formulated a taxonomy tree comprising 365 overarching classes sourced from Objects365 [36], alongside finer-grained categories encompassing distinctions in age, profession, and attire styles. Within each category, we leverage the capabilities of a Large Language Model (LLM) to generate an extensive array of subjects and varied settings. By combining these outputs with predefined text templates, we derive millions of text prompts for the T2I model, facilitating the generation of subject-consistent image pairs.

Initially generated image pairs often suffer from several issues, such as subject inconsistency and missing subjects. To efficiently filter the data, we first split each image pair into the reference image I_ref and the target image I_tgt, then calculate the DINOv2 [22] similarity between the two images. This step is effective in filtering out images with significantly lower consistency. Subsequently, a VLM is further employed to provide a score list evaluating different aspects (i.e., appearance, details, and attributes), which can be represented by the following equations:

S = VLM(I_ref, I_tgt, c_y) ∈ R^{N×1},   (2)

score = Average(S),   (3)

where c_y represents the input text to the VLM, N denotes the number of evaluated dimensions that are automatically generated by the VLM, S signifies the output, which is parsed into a score list, and score indicates the final consistency score of the generated image pairs.
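The two-stage filter in Eqs. (2)-(3) can be summarized with a short sketch. This is a minimal illustration rather than the released pipeline: dino_embed and vlm_score_aspects are hypothetical stand-ins for a DINOv2 feature extractor and a VLM scoring call, and the threshold value is an assumed placeholder (the paper only states that a threshold is used).

import torch
import torch.nn.functional as F

DINO_THRESHOLD = 0.6  # assumed cutoff, not a value reported in the paper

def keep_pair(i_ref, i_tgt, vlm_prompt, dino_embed, vlm_score_aspects):
    # Stage 1: coarse consistency check via DINOv2 cosine similarity.
    sim = F.cosine_similarity(dino_embed(i_ref), dino_embed(i_tgt), dim=-1)
    if sim.item() < DINO_THRESHOLD:
        return False, 0.0
    # Stage 2: fine-grained VLM scoring over appearance, details, and attributes.
    S = torch.tensor(vlm_score_aspects(i_ref, i_tgt, vlm_prompt), dtype=torch.float32)  # Eq. (2)
    score = S.mean().item()  # Eq. (3)
    return True, score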
Figure 4. Illustration of the training framework of UNO. It introduces two pivotal enhancements to the model: progressive cross-modal alignment and Universal Rotary Position Embedding (UnoPE). The progressive cross-modal alignment is divided into two stages. In Stage I, we use single-subject in-context generated data to finetune the pretrained T2I model into an S2I model. In Stage II, we continue training on generated multi-subject data pairs. UnoPE effectively equips UNO with the capability of mitigating the attribute confusion issue when scaling visual subject controls.

Figure 5. Model performance on DreamBench [33]. We conduct experiments under different quality score levels. [Panels: DINO, CLIP-I, and CLIP-T over training steps for data-quality levels 2.5-3, 3.5-4, and 4.]

Multi-Subject In-Context Generation. The comprehensive dataset from the preceding phase is utilized to train an S2I model, which is conditioned on both single-image and text inputs. Subsequently, this trained S2I model, along with the dataset, is employed to generate multi-subject-consistent data in the current stage. As illustrated in Fig. 3(b), we initially employ an open-vocabulary detector (OVD) to identify subjects beyond those present in I_ref^1. The extracted cropped images and their corresponding subject prompts are then input into our trained S2I model to derive new results for I_ref^2. Traditional approaches often encountered significant failure rates as models struggled to preserve the original subject's identity. However, models trained using our proposed in-context training methodology can effectively surmount this challenge, yielding highly consistent outcomes with ease. Further elaboration on this topic is provided in the subsequent section.

Some may question the necessity of generating new data, suggesting that the cropped part could be treated simply as I_ref^2. However, we contend that relying solely on cropped images as the training dataset may introduce copy-paste issues. This scenario arises when the model fails to adhere to the textual prompt and merely "pastes" the input reference image onto the resulting image. Our proposed pipeline effectively mitigates this concern.

3.3. Customization Model Framework (UNO)

In this section, we provide a detailed explanation of how to iteratively train a multi-image-conditioned S2I model from a DiT-based T2I model. It should be noted that all the training data we use originate from images generated by our in-context data generation method proposed in the previous section.

Progressive cross-modal alignment. Original T2I models gradually transform pure Gaussian noise into text-adherent images through an iterative denoising process. During this process, the VAE encoder E(·) first encodes the target image I_tgt into a noisy latent z_t = E(I_tgt). z_t is then concatenated with the encoded text tokens c, forming the input z for the DiT model. This process can be formulated as:

z = Concatenate(c, z_t),   (4)

To incorporate multi-image conditions I_ref = [I_ref^1, I_ref^2, ..., I_ref^N], we introduce a progressive training paradigm that progresses from simpler to more complex scenarios. We view the training phase with single-image conditions as the initial phase for cross-modal alignment. Given that the original input comprises solely text tokens and noisy latents, the introduction of noise-free reference image tokens could potentially disrupt the original convergence distribution. Such disruption may result in training instability or suboptimal outcomes. Hence, we opt for a gradual-complexity approach rather than directly exposing the model to multiple reference image inputs. In Stage I, depicted in Fig. 4, only a single image serves as the reference image. We utilize z_1 as the input multi-modal tokens for the DiT model:

z_1 = Concatenate(c, z_t, E(I_ref^1)),   (5)
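As a rough illustration of Eqs. (4)-(5), the multimodal input sequence could be assembled as in the sketch below. This assumes that text tokens, the noisy target latent, and the VAE-encoded reference latents are all flattened to token sequences of a shared hidden size; text_encode and vae_encode are hypothetical stand-ins rather than FLUX APIs. Stage II simply appends further reference latents in the same way (Eqs. (6) and (7) below).

import torch

def build_tokens(prompt, target_img, ref_imgs, text_encode, vae_encode):
    # c: (L_text, d) text tokens; z_t: (L_img, d) noisy target latent tokens
    c = text_encode(prompt)
    z_t = vae_encode(target_img)
    # Noise-free reference latents E(I_ref^i); one for Stage I, several for Stage II.
    z_refs = [vae_encode(r) for r in ref_imgs]
    # Eq. (4) with no references, Eq. (5)/(7) with one or more references.
    return torch.cat([c, z_t, *z_refs], dim=0)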
[Figure: qualitative comparison of UNO (Ours) with OminiControl, FLUX IP-Adapter, OmniGen, RealCustom++, and SSR-Encoder, each row showing a reference image and a prompt, covering single-subject prompts (e.g., "A clock on top of green grass with sunflowers around it.") and multi-subject prompts (e.g., "A candle and a clock on top of a purple rug in a forest."); see Sec. 4.2.]
After Stage I training, the model is capable of processing single-subject driven generation tasks. We then train the model with multi-image conditions to tackle more complex multi-subject driven generation scenarios. z_2 can be described as follows:

z_ref^i = E(I_ref^i), i = 1, ..., N,   (6)

z_2 = Concatenate(c, z_t, z_ref^1, z_ref^2, ..., z_ref^N),   (7)

where N is set to 2 in our paper. During Stage I, the T2I model is trained to refer to the input reference image and prompt, with the goal of generating single-subject-consistent results. Stage II is designed to enable the S2I model to refer to multiple input images and inject their information into the corresponding latent spaces. Through iterative training, the inherent in-context generation capability of the T2I model is unlocked, eliciting more controllability from a single text-to-image model.

Universal Rotary Position Embedding (UnoPE). An important consideration for incorporating multi-image conditions into a DiT-based T2I model pertains to position encoding. In the context of FLUX.1 [17], the utilization of Rotary Position Embedding (RoPE) necessitates the assignment of position indices (i, j) to both text and image tokens, thereby influencing the interaction among multimodal tokens. Within the original model architecture, text tokens are assigned a consistent position index of (0, 0), while noisy image tokens are allocated position indices (i, j) where i ∈ [0, w − 1] and j ∈ [0, h − 1]. Here, h and w denote the height and width of the noisy latent, respectively.

Our newly introduced image conditions reuse the same format to inherit the implicit position correspondence of the original model. However, we start from the maximum height and width of the noisy image tokens, as shown in Fig. 4, beginning at the diagonal position. The position index for the latent z_ref^N is defined as:

(i', j') = (i + w_{N-1}, j + h_{N-1}),   (8)

where i ∈ [0, w_N), j ∈ [0, h_N), with w_N and h_N representing the width and height of the latent z_ref^N, respectively. Here, i' and j' are the adjusted position indices. To prevent the generated image from over-referencing the spatial structure of the reference image, we adjust the position indices within a certain range. In the scenario of multi-image conditions, different reference images inherently have a semantic gap. Our proposed UnoPE can further prevent the model from learning the original spatial distribution of the reference images, thereby focusing on obtaining layout information from the text features. This enables the model to improve its performance in subject similarity while maintaining good text controllability.
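A minimal sketch of the UnoPE index assignment in Eq. (8) is given below. It assumes 2D RoPE indices stored as integer (i, j) grids and that the diagonal offsets accumulate across successive reference latents, which is one reading of Eq. (8); the function is illustrative and not the released implementation.

import torch

def unope_indices(h_latent, w_latent, ref_sizes):
    # h_latent, w_latent: token-grid height/width of the noisy image latent.
    # ref_sizes: list of (h_N, w_N) for each reference latent z_ref^N.
    # Text tokens keep index (0, 0); the noisy latent keeps its original grid.
    j, i = torch.meshgrid(torch.arange(h_latent), torch.arange(w_latent), indexing="ij")
    indices = [(i, j)]
    w_off, h_off = w_latent, h_latent  # start from the max width/height (diagonal start)
    for h_n, w_n in ref_sizes:
        j_n, i_n = torch.meshgrid(torch.arange(h_n), torch.arange(w_n), indexing="ij")
        indices.append((i_n + w_off, j_n + h_off))  # (i', j') = (i + w offset, j + h offset), Eq. (8)
        w_off, h_off = w_off + w_n, h_off + h_n     # assumed accumulation for N > 1 references
    return indices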
4. Experiments

4.1. Experiments Setting

Implementation Details. To self-evolve our base DiT-based T2I model, we first take FLUX.1 dev [17] as the pretrained model. We train the model with a learning rate of 10^-5 and a total batch size of 16. For the progressive cross-modal alignment, we first train the model using single-subject pair data for 5,000 steps. Then, we continue training on multi-subject pair data for another 5,000 steps. Specifically, we generated 230k and 15k data pairs for these two stages, respectively, using the in-context data generation method described above. We conduct the entire experiment on 8 NVIDIA A100 GPUs and train the model using a LoRA [12] rank of 512 throughout the training process.

Comparative Methods. As a tuning-free method, our model is capable of handling both single-subject and multi-subject driven generation. We compare it with leading methods on these two tasks, including OmniGen [44], OminiControl [38], FLUX IP-Adapter v2 [39], MS-Diffusion [41], MIP-Adapter [15], RealCustom++ [20], and SSR-Encoder [48].

Evaluation Metrics. Following previous works, we use standard automatic metrics to evaluate both subject similarity and text fidelity. Specifically, we employ cosine similarity between generated images and reference images within CLIP [27] and DINO [22] spaces, referred to as CLIP-I and DINO scores, respectively, to assess subject similarity. Additionally, we calculate the cosine similarity between the prompt and image CLIP embeddings (CLIP-T) to evaluate text fidelity. For single-subject driven generation, we measure all methods on DreamBench [33] for fairness. For multi-subject driven generation, we follow previous studies [15, 19] that involve 30 different combinations of two subjects from DreamBench, including combinations of non-live and live objects. For each combination, we generate 6 images per prompt using 25 text prompts from DreamBench, resulting in 4,500 image groups for all subjects.
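The automatic metrics above all reduce to cosine similarities in the corresponding embedding spaces. The sketch below is purely illustrative; clip_image_embed, clip_text_embed, and dino_embed are hypothetical stand-ins for CLIP and DINO encoders rather than specific library calls.

import torch.nn.functional as F

def cosine(a, b):
    return F.cosine_similarity(a, b, dim=-1).item()

def evaluate(gen_img, ref_img, prompt, clip_image_embed, clip_text_embed, dino_embed):
    # Subject similarity (CLIP-I, DINO) and text fidelity (CLIP-T) for one sample.
    return {
        "CLIP-I": cosine(clip_image_embed(gen_img), clip_image_embed(ref_img)),
        "DINO":   cosine(dino_embed(gen_img), dino_embed(ref_img)),
        "CLIP-T": cosine(clip_image_embed(gen_img), clip_text_embed(prompt)),
    }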
4.2. Qualitative Analyses

We compare with various state-of-the-art methods to verify the effectiveness of our proposed UNO. We show the comparison of single-image condition generation results in Fig. 6. In the first two rows, our UNO nearly perfectly keeps the subject details (e.g., the numbers on the dial of the clock) of the reference image, while other methods struggle to maintain the details. In the following two rows, we demonstrate editability. UNO can maintain subject similarity while editing attributes, specifically colors, whereas other methods either fail to maintain subject similarity or do not follow the text editing instructions. In contrast, OminiControl [38] has good retention ability but may encounter copy-paste risks, e.g., a red robot in the last row of Fig. 6. We show the comparison results of multi-image condition generation in Fig. 7. Our method keeps all reference images while adhering to the text prompts, whereas other methods either fail to maintain subject consistency or miss the input text editing instructions.

4.3. Quantitative Evaluations

Automatic scores. Tab. 1 compares our UNO on DreamBench [33] against both tuning-based and tuning-free methods. UNO has a significant lead over previous methods, with the highest DINO and CLIP-I scores of 0.760 and 0.835, respectively, in zero-shot scenarios, and a leading CLIP-T score of 0.304. We also compare our method in the multi-image condition scenario in Tab. 2. UNO achieves the highest DINO and CLIP-I scores and has competitive CLIP-T scores compared with existing leading methods. This shows that UNO can greatly improve subject similarity while adhering to text descriptions.

[Oracle (reference images): DINO 0.774, CLIP-I 0.885.]

Table 2. Quantitative results for multi-subject driven generation. Our method achieves state-of-the-art performance among both tuning methods and tuning-free methods.

User study. We further conduct a user study via online questionnaires to showcase the superiority of UNO. For the subjective assessment, 30 evaluators, including both domain experts and non-experts, assessed 300 image combinations covering both single-subject and multi-subject driven generation tasks. For each case, evaluators rank the best results across five dimensions: text fidelity at the subject level, text fidelity at the background level, subject similarity, composition quality, and visual appeal. As shown in Fig. 8, the results reveal that our UNO not only excels in subject similarity and text fidelity but also achieves strong performance in the other dimensions.

[Figure 8: user evaluations of single-subject and multi-subject driven generation across the five ranking dimensions.]

4.4. Ablation Study

Effect of synthetic data curation framework. Tab. 3 shows the effects of the different modules of UNO. When using augmented cropped-part images from the target image instead of the generated I_ref^2, we observe a significant decline in all metrics in Tab. 3. In Fig. 9, the results tend to merely copy-paste the subjects and almost do not respond to the text prompt description.

[Figure 9. Ablation study of UNO. Zoom in for details. Columns: Reference Images & Prompt, UNO (Ours), w/o generated I_ref^2, w/o cross-modal alignment, w/o UnoPE.]

Effect of progressive cross-modal alignment. As shown in Tab. 3 and Fig. 9, there is a significant drop in both DINO and CLIP-I scores, as well as in subject similarity, when the model is directly exposed to multiple reference image inputs without progressive cross-modal alignment. Furthermore, as shown in Tab. 4, progressive cross-modal alignment can increase the upper limit of the model in single-image condition scenarios.

Table 4. Effect of progressive cross-modal alignment. The model exhibits superior performance on DreamBench [33] after undergoing progressive cross-modal alignment, in contrast to being trained exclusively on single-subject pair data, despite both models undergoing an identical number of training steps.

Effect of UnoPE. As shown in Tab. 3, there is a significant drop in both DINO and CLIP-I scores when cloning the position index from the target image without using UnoPE. In Fig. 9, the generated images can follow the text descriptions but hardly reference the input images. We further compare different forms of position index offsets, as shown in Tabs. 5 and 6, and our method achieves the best results, which demonstrates the superiority of our proposed UnoPE.

Table 6. Comparison with different forms of position index offsets. We report the results on the multi-subject driven generation benchmark.

Method            DINO ↑   CLIP-I ↑   CLIP-T ↑
w/o offset        0.386    0.674      0.323
w/ width-offset   0.508    0.724      0.321
w/ height-offset  0.501    0.719      0.306
UNO (Ours)        0.542    0.733      0.322

5. Conclusion

In this paper, we present UNO, a universal customization architecture that unlocks the multi-condition contextual capabilities of the diffusion transformer. This is achieved through progressive cross-modal alignment and Universal Rotary Position Embedding. The training of UNO consists of two steps. The first step uses single-image inputs to elicit subject-to-image capabilities in diffusion transformers. The next step involves further training on multi-subject data pairs. Our proposed Universal Rotary Position Embedding also significantly improves subject similarity. Additionally, we present a progressive synthesis pipeline that evolves from single-subject to multi-subject in-context generation. This pipeline generates high-quality synthetic data, effectively reducing the copy-paste phenomenon. Extensive experiments show that UNO achieves high-quality similarity and controllability in both single-subject and multi-subject customization.

References

[1] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022. 2

[2] Jason Baldridge, Jakob Bauer, Mukul Bhutani, Nicole Brichtova, Andrew Bunner, Lluis Castrejon, Kelvin Chan, Yichang Chen, Sander Dieleman, Yuqing Du, et al. Imagen 3. arXiv preprint arXiv:2408.07009, 2024. 3

[3] Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, et al. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. arXiv preprint arXiv:2312.09390, 2023. 2

[4] Shengqu Cai, Eric Chan, Yunzhi Zhang, Leonidas Guibas, Jiajun Wu, and Gordon Wetzstein. Diffusion self-distillation for zero-shot customized image generation. arXiv preprint arXiv:2411.18616, 2024. 13

[5] Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W Cohen. Re-imagen: Retrieval-augmented text-to-image generator. arXiv preprint arXiv:2209.14491, 2022. 8

[6] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers. NIPS, 34:19822–19835, 2021. 3

[7] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024. 3, 4

[8] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022. 2, 3, 8

[9] Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, Lucy Campbell-Gillingham, Jonathan Uesato, Po-Sen Huang, Ramona Comanescu, Fan Yang, Abigail See, Sumanth Dathathri, Rory Greig, Charlie Chen, Doug Fritz, Jaume Sanchez Elias, Richard Green, Soňa Mokrá, Nicholas Fernando, Boxi Wu, Rachel Foley, Susannah Young, Iason Gabriel, William Isaac, John Mellor, Demis Hassabis, Koray Kavukcuoglu, Lisa Anne Hendricks, and Geoffrey Irving. Improving alignment of dialogue agents via targeted human judgements, 2022. 2
[10] Zinan Guo, Yanze Wu, Zhuowei Chen, Lang Chen, Peng Zhang, and Qian He. Pulid: Pure and lightning id customization via contrastive alignment. In NIPS, 2024. 3

[11] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NIPS, 33:6840–6851, 2020. 3

[12] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021. 3, 7

[13] Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, and Jingren Zhou. In-context lora for diffusion transformers. arXiv preprint arXiv:2410.23775, 2024. 3

[14] Mengqi Huang, Zhendong Mao, Mingcong Liu, Qian He, and Yongdong Zhang. Realcustom: narrowing real text word for real-time open-domain text-to-image customization. In CVPR, pages 7476–7485, 2024. 2, 3, 8

[15] Qihan Huang, Siming Fu, Jinlong Liu, Hao Jiang, Yipeng Yu, and Jie Song. Resolving multi-condition confusion for finetuning-free personalized image generation. arXiv preprint arXiv:2409.17920, 2024. 7, 8

[16] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In CVPR, pages 1931–1941, 2023. 2

[17] Black Forest Labs. Flux: Official inference repository for flux.1 models, 2024. Accessed: 2025-02-07. 3, 4, 7, 12, 13

[18] Dongxu Li, Junnan Li, and Steven Hoi. Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing. Advances in Neural Information Processing Systems, 36:30146–30166, 2023. 3, 8

[19] Jian Ma, Junhao Liang, Chen Chen, and Haonan Lu. Subject-diffusion: Open domain personalized text-to-image generation without test-time fine-tuning. In ACM SIGGRAPH 2024 Conference Papers, pages 1–12, 2024. 7, 8

[20] Zhendong Mao, Mengqi Huang, Fei Ding, Mingcong Liu, Qian He, and Yongdong Zhang. Realcustom++: Representing images as real-word for real-time customization. arXiv preprint arXiv:2408.09744, 2024. 3, 7, 8

[21] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021. 3

[22] Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. Dinov2: Learning robust visual features without supervision, 2023. 4, 7, 12, 13

[23] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022. 2

[24] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, pages 4195–4205, 2023. 3, 4

[25] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In ICLR, 2024. 3

[26] Senthil Purushwalkam, Akash Gokul, Shafiq Joty, and Nikhil Naik. Bootpig: Bootstrapping zero-shot personalized image generation capabilities in pretrained diffusion models. arXiv preprint arXiv:2401.13974, 2024. 8

[27] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021. 7

[28] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, pages 8821–8831. PMLR, 2021. 3

[29] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022. 3

[30] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In ICML, pages 1060–1069. PMLR, 2016. 3

[31] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022. 3

[32] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015. 3

[33] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, pages 22500–22510, 2023. 2, 3, 5, 7, 8, 9

[34] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. NIPS, 35:36479–36494, 2022. 3
[35] William Saunders, Catherine Yeh, Jeff Wu, Steven Bills,
Long Ouyang, Jonathan Ward, and Jan Leike. Self-critiquing
models for assisting human evaluators. arXiv preprint
arXiv:2206.05802, 2022. 2
[36] Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang
Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365:
A large-scale, high-quality dataset for object detection. In
CVPR, pages 8430–8439, 2019. 4, 12
[37] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan,
and Surya Ganguli. Deep unsupervised learning using
nonequilibrium thermodynamics. In ICML, pages 2256–
2265. pmlr, 2015. 3
[38] Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue,
and Xinchao Wang. Ominicontrol: Minimal and uni-
versal control for diffusion transformer. arXiv preprint
arXiv:2411.15098, 3, 2024. 3, 4, 7, 8, 12
[39] XLabs AI team. x-flux, 2025. Accessed: 2025-02-07. 7
[40] Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, and An-
thony Chen. Instantid: Zero-shot identity-preserving gener-
ation in seconds. arXiv preprint arXiv:2401.07519, 2024.
3
[41] Xierui Wang, Siming Fu, Qihan Huang, Wanggui He, and
Hao Jiang. MS-diffusion: Multi-subject zero-shot image per-
sonalization with layout guidance. In ICLR, 2025. 3, 7, 8
[42] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten
Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al.
Chain-of-thought prompting elicits reasoning in large lan-
guage models. Advances in neural information processing
systems, 35:24824–24837, 2022. 12, 13
[43] Yuxiang Wei, Yabo Zhang, Zhilong Ji, Jinfeng Bai, Lei
Zhang, and Wangmeng Zuo. Elite: Encoding visual con-
cepts into textual embeddings for customized text-to-image
generation. In CVPR, pages 15943–15953, 2023. 8
[44] Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xin-
grun Xing, Ruiran Yan, Shuting Wang, Tiejun Huang, and
Zheng Liu. Omnigen: Unified image generation. arXiv
preprint arXiv:2409.11340, 2024. 3, 7, 8
[45] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-
adapter: Text compatible image prompt adapter for text-to-
image diffusion models. arXiv preprint arXiv:2308.06721,
2023. 2, 3
[46] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gun-
jan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yin-
fei Yang, Burcu Karagol Ayan, et al. Scaling autoregres-
sive models for content-rich text-to-image generation. arXiv
preprint arXiv:2206.10789, 2(3):5, 2022. 3
[47] Xianggang Yu, Mutian Xu, Yidan Zhang, Haolin Liu,
Chongjie Ye, Yushuang Wu, Zizheng Yan, Chenming Zhu,
Zhangyang Xiong, Tianyou Liang, et al. Mvimgnet: A large-
scale dataset of multi-view images. In CVPR, pages 9150–
9161, 2023. 2
[48] Yuxuan Zhang, Yiren Song, Jiaming Liu, Rui Wang, Jin-
peng Yu, Hao Tang, Huaxia Li, Xu Tang, Yao Hu, Han Pan,
et al. Ssr-encoder: Encoding selective subject representation
for subject-driven generation. In CVPR, pages 8069–8078,
2024. 7, 8
Less-to-More Generalization:
Unlocking More Controllability by In-Context Generation
Supplementary Material
F. In-Context Data Generation Pipeline
In this section, we give a detailed description of our in-context data generation pipeline. We first build a taxonomy tree in Sec. F.1 to obtain various subject instances and scenes. Then we generate subject-consistent image-pair data with the in-context ability of a pretrained Text-to-Image (T2I) model and utilize Chain-of-Thought (CoT) [42] prompting to filter the synthesized data in Sec. F.2. Finally, for multi-subject data, we train a Subject-to-Image (S2I) model to generate subject-consistent reference images instead of cropped ones, avoiding the copy-paste issue, in Sec. F.3.
Template_1
A diptych with two side-by-side images of same <\subject1>.
Left: <\subject1> in <\scene1>.
Right: <\subject1> together with <\subject2> in <\scene2>.
Template_2
A diptych with two side-by-side images of same <\subject1>.
Top: <\subject1> in <\scene1>.
Bottom: <\subject1> together with <\subject2> in <\scene2>.
Figure 11. Diptych text templates for generating subject-consistent image pairs with FLUX.1 [17].
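To make the templates concrete, the sketch below fills Template_1 and splits the generated diptych into its two halves. It is a simplified illustration under stated assumptions: generate is a hypothetical stand-in for a FLUX.1 inference wrapper returning a PIL image, and the even half-split assumes a horizontally concatenated, side-by-side diptych as described in Template_1.

TEMPLATE_1 = ("A diptych with two side-by-side images of same {subject1}. "
              "Left: {subject1} in {scene1}. "
              "Right: {subject1} together with {subject2} in {scene2}.")

def make_pair(subject1, subject2, scene1, scene2, generate):
    # Fill the diptych prompt and synthesize one wide image containing both panels.
    prompt = TEMPLATE_1.format(subject1=subject1, subject2=subject2,
                               scene1=scene1, scene2=scene2)
    diptych = generate(prompt)               # PIL.Image, e.g. 2048 x 1024
    w, h = diptych.size
    i_ref = diptych.crop((0, 0, w // 2, h))  # left panel: subject1 alone (reference)
    i_tgt = diptych.crop((w // 2, 0, w, h))  # right panel: subject1 with subject2 (target)
    return i_ref, i_tgt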
contain the same subject1, while I_tgt has another subject2. To ensure that I_ref^1 and I_tgt have a consistent subject1, we then calculate the cosine similarity between them with DINOv2 [22] and set a threshold to filter out image pairs with significantly low consistency.

However, since the reference image I_ref^1 and the target image I_tgt have different scene settings, subject1 in the image pair may not be spatially aligned, resulting in inaccurate cosine similarity from the DINOv2 [22] features. We therefore further employ a VLM to provide a fine-grained score list that adaptively evaluates various aspects, i.e., appearance, details, and attributes. We only keep the data with the highest VLM score, which indicates the highest quality and subject consistency in the synthesized data. Specifically, inspired by [4], we utilize CoT [42] prompting for better discrimination of subject1 in I_ref^1 and I_tgt, as shown in Fig. 14. To demonstrate the effectiveness of the CoT filter, we sample data from different VLM score intervals in Fig. 15. Image pairs with low scores suffer from severe subject inconsistency, while those with the highest score (i.e., a score of 4) show a highly consistent subject across the reference image and the target image. We also count the amount of data in each score interval, as shown in Fig. 17, indicating that around 35.43% of the data is retained by the VLM CoT filter. Moreover, little data with extremely low VLM scores remains after the DINOv2 filter, demonstrating its effectiveness.
[Figure: system prompts of the LLM used to generate subject instances. (a) System prompt of the LLM used to generate subject instances of the creative type. (b) System prompt of the LLM used to generate subject instances of the realistic type. (c) System prompt of the LLM used to generate subject instances of the text-decorated type.]
[Figure: system prompt of the LLM used to generate scene descriptions for each subject:]

Role:
Please be very creative and generate 50 breif subject prompts for text-to-image generation.
Follow these rules:
1. Given a brief subject prompt of an asset, you need to generate 8 detailed Scene Description for the asset.
2. Each Scene Description should be a detailed description, which describes the background area you imagine for an identical extracted asset, under different environments/camera views/lighting conditions, etc (please be very very creative here).
3. Each Scene Description should be one line and be as short and precise as possible, do not exceed 77 tokens, Be very creative!
Example1
[asset]: Scientist with exploding beakers
[SceneDescription1]: The scientist with exploding beakers stands in a futuristic laboratory with holographic equations swirling around them.
[SceneDescription2]: Amidst the chaos of a stormy outdoor field lab, the scientist with exploding beakers conducts dramatic experiments as lightning crashes overhead.
[SceneDescription3]: In an ancient alchemist's den filled with dusty tomes, the scientist with exploding beakers looks surprised as colorful liquid bursts forth.
[SceneDescription4]: The scientist with exploding beakers is immersed in a vibrant neon-lit urban laboratory, surrounded by robotic assistants.
[SceneDescription5]: A desert makeshift tent serves as the lab where the scientist with exploding beakers creates a plume of shimmering dust.
[SceneDescription6]: On an alien planet bathed in ethereal light, the scientist with exploding beakers observes bioluminescent reactions in awe.
[SceneDescription7]: In a steampunk inspired workshop, the scientist with exploding beakers wears goggles and smiles amidst gears and steam as an experiment erupts.
[SceneDescription8]: The scientist with exploding beakers stands on a floating platform in the clouds, conducting experiments as colorful bursts light up the sky.
[asset]:
[Figure 14: prompts of the filter VLM.]

(a) System prompt of the filter VLM.

(b) Prompt for the first-round CoT of the filter VLM:
Step 1:
Briefly describe these two images, as well as the most prominent subject that exists. Think carefully about which parts of the subject you need to break down in order to make an objective and thorough evaluation. Don't make evaluations at this step.

(c) Prompt for the second-round CoT of the filter VLM:
Step 2:
For each part you have identified, compare this aspect of the subject in the two images and describe the differences in extreme extreme extreme extreme extreme extreme extreme detail. You need to be meticulous and precise, noting every tiny detail.
Important Notes
- Provide quantitative differences whenever possible. For example, "The subject's chest in the first image has 3 blue circular lights, while the subject's chest in the second image has only one blue light and it is not circular."
- Ignore differences in the subject's background, environment, position, size, etc.
- Ignore differences in the subject's actions, poses, expressions, viewpoints, additional accessories, etc.
- Ignore the extra accessory of the subject in the second image, such as a hat, glasses, etc.
- Consider that when the subject has a large perspective change, the part may not appear in the new perspective, and no judgment is needed at this time. For example, if the subject in the first image is the back of the sofa, and the subject in the second image is the front of the sofa, determine the similarity of the two sofas based on your association ability.

(d) Prompt for the third-round CoT of the filter VLM:
Step 3:
Based on the differences analyzed in Step 2, assign a specific integer score to each part. More and larger differences result in a lower score. The score ranges from 0 to 4:
- Very Poor (0): No resemblance. This subject part in the second image has no relation to the part in the first image.
- Poor (1): Minimal resemblance. This subject part in the second image has significant differences from the part in the first image.
- Fair (2): Moderate resemblance. This subject part in the second image has modest differences from the part in the first image.
- Good (3): Strong resemblance. This subject part in the second image has minor but noticeable differences from the part in the first image.
- Excellent (4): Near-identical. This subject part in the second image is virtually indistinguishable from the part in the first image.
Output Format
[Part 1]: [Part 1 Score]
[Part 2]: [Part 2 Score]
[Part 3]: [Part 3 Score]
...
[Part N]: [Part N Score]
You must adhere to the output format strictly. Each part name and its score must be separated by a colon and a space.
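Given the strict output format required in the third CoT round ("[Part i]: [Part i Score]"), the per-part scores can be parsed and averaged into the final consistency score. The sketch below is one plausible parser, not the authors' code; it assumes one "name: integer score" pair per line of the VLM reply.

import re

def parse_cot_scores(vlm_reply: str) -> float:
    # Collect the trailing integer of every "[Part i]: score" line and average them.
    scores = [int(m.group(1)) for m in re.finditer(r":\s*(\d+)\s*$", vlm_reply, flags=re.M)]
    if not scores:
        raise ValueError("no scores found in VLM reply")
    return sum(scores) / len(scores)  # final score in [0, 4]; pairs at 4 form the highest-quality subset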
[Figure 15: sampled image pairs from different VLM score intervals (score 0-1, score 2-3, score 4).]
Figure 16. Sampled data from our final multi-subject in-context data.
[Figure 17: distribution of VLM scores over the synthesized image pairs: [2.0, 2.5) 8.67%, [2.5, 3.0) 13.02%, [3.0, 3.5) 23.51%, [3.5, 4.0) 12.66%, [4.0, 4.0] 35.43%.]
[Figure: DINO, CLIP-I, and CLIP-T over training steps for LoRA ranks 4, 16, 64, 128, and 512.]
with sufficient generalization capabilities, the types of synthetic data may somewhat restrict its abilities. In the future, we
plan to expand our data types to further unlock UNO’s potential and cover a broader range of tasks.
Figure 19. More comparison with different methods on multi-subject driven generation. We italicize the subject-related editing part of the prompts. [Columns: Reference Images, Prompt, UNO (Ours), OmniGen, MS-Diffusion, MIP-Adapter, SSR-Encoder.]
Figure 20. More multi-subject generation results from our UNO model.
Figure 21. More virtual try-on results from our UNO model.
Figure 22. More identity preservation results from our UNO model.
Figure 23. More stylized generation results from our UNO model.
Scenarios               Prompts
One2One                 "A clock on the beach is under a red sun umbrella"
                        "A doll holds a 'UNO' sign under the rainbow on the grass"
Two2One                 "The figurine is in the crystal ball"
                        "The boy and girl are walking in the street"
Many2One                "A penguin doll, a car and a pillow are scattered on the bed"
                        "A boy in a red hat wear a sunglasses"
Stylized Generation     "Ghibli style, a woman"
                        "Ghibli style, a man"
Virtual Try-on          "A man wears the black hoodie and pants"
                        "A girl wears the blue dress in the snow"
Product Design          "The logo and words 'Let us unlock!' are printed on the clothes"
                        "The logo is printed on the cup"
Identity-preservation   "The figurine is in the crystal ball"
                        "A penguin doll, a car and a pillow are scattered on the bed"
Story Generation        "A boy in green is in the arcade"
                        "A man strolls down a bustling city street under moonlight"
                        "The man and a boy in green clothes are standing among the flowers by the lake"
                        "The man met a boy dressed in green at the foot of the tower"