0% found this document useful (0 votes)

37 views21 pages

Less-to-More Generalization: Unlocking More Controllability by In-Context Generation

This document presents a novel approach to customized image generation, addressing challenges in data scalability and subject expandability through a model-data co-evolution paradigm. The proposed method, named UNO, utilizes a systematic synthetic data curation framework and diffusion transformers to generate high-consistency multi-subject paired data. Extensive experiments demonstrate that UNO achieves superior performance in both single and multi-subject driven generation tasks, enhancing controllability without compromising text alignment.

Uploaded by

cosmos2022weirdo

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

37 views21 pages

Less-to-More Generalization: Unlocking More Controllability by In-Context Generation

Uploaded by

cosmos2022weirdo

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 21

Less-to-More Generalization:

Unlocking More Controllability by In-Context Generation

Shaojin Wu Mengqi Huang∗ Wenxu Wu Yufeng Cheng Fei Ding† Qian He

Intelligent Creation Team, ByteDance
{wushaojin, huangmengqi.98, wuwenxu.01, chengyufeng.cb1, dingfei.212, heqian}@bytedance.com
Project Page: https://bytedance.github.io/UNO
arXiv:2504.02160v1 [cs.CV] 2 Apr 2025

One2One Two2One Many2One

👦 👨

👦👨 👦👨
🏛

Stylized Generation Story Generation

Virtual Try-on Product Design Identity-preservation

Figure 1. Our UNO evolves as an universal customization from single to multi-subject.

1
Abstract
Although subject-driven generation has been extensively ex-
plored in image generation due to its wide applications,
it still has challenges in data scalability and subject ex-
pansibility. For the first challenge, moving from curating
single-subject datasets to multiple-subject ones and scal-
ing them is particularly difficult. For the second, most re-
cent methods center on single-subject generation, making
it hard to apply when dealing with multi-subject scenarios.
In this study, we propose a highly-consistent data synthesis
pipeline to tackle this challenge. This pipeline harnesses
the intrinsic in-context generation capabilities of diffusion
transformers and generates high-consistency multi-subject
paired data. Additionally, we introduce UNO, which con-
sists of progressive cross-modal alignment and universal ro-
tary position embedding. It is a multi-image conditioned Figure 2. The illustration of our motivation. We propose a novel
subject-to-image model iteratively trained from a text-to- model-data co-evolution paradigm, where less-controllable pre-
image model. Extensive experiments show that our method ceding models systematically synthesize better customization data
for successive more-controllable variants, enabling persistent co-
can achieve high consistency while ensuring controllability
evolution between enhanced model and enriched data.
in both single-subject and multi-subject driven generation.
Code and model: https://github.com/bytedance/UNO.
Existing customized generation methods can be cate-
gorized into two streams based on how they utilize the
1. Introduction
data to design the model, i.e., few-data fine-tuning and
As the material medium for abstract linguistic semantics large-data training stream. Early few-data fine-tuning ap-
and spatial embodiments of concrete visual subjects, im- proaches [8, 33] primarily employ per-subject optimiza-
ages constitute the foundational modality in intelligent con- tion through model fine-tuning [16, 33] or textual inversion
tent generation. In recent years, customized image genera- [8], incurring substantial computational overhead and time-
tion, which aims to create images that align with both the consuming during inference, which hinders real-world de-
text semantics and the subjects in the reference images, has ployment. Therefore, more recent researches focus on the
garnered significant interest across academic and industrial latter stream, which train adapters or image encoders on a
communities. This task unifies the flexibility of text con- large set of visual subjects to achieve real-time customiza-
trol and the accuracy of visual controls, providing foun- tion. Their corresponding data used for training are either
dational infrastructure for diverse real-world applications real images with diversity limitations (eg, restricted subject
ranging from film production to industrial design. As the variations [45, 47]), or synthetic data with limited image
field advances, the research challenge in customized image quality (typically, ≤ 512 × 512) and narrow domain cover-
generation now centers on developing a stable and scalable age. Therefore, these methods often exhibit a trade-off be-
paradigm for unlocking more controllability, i.e., continu- tween subject similarity and text controllability [45], or un-
ously increasing the amount of visual subject control with- stable generation [14]. Essentially, the existing customized
out compromising the original text controllability. models are designed based on their corresponding available
Data, while serving as the foundation for training gen- customized data, resulting in limited scalability due to their
erative models, has long been the bottleneck in customized data bottleneck.
generation. An ideal model should be capable of generat- Diverging from the existing data-driven model de-
ing visual subjects in diverse poses, locations, sizes, and sign, research on large language models (LLMs) demon-
other attributes based on text prompts. This necessitates strates their capacity for strategic synthetic data genera-
data that encompasses multi-perspective subject variations, tion toward model self-enhancement. Their bidirectional
a requirement hindered by the impracticality of acquiring data-model knowledge transfer manifests through either
such comprehensive real paired datasets. high-performance models providing supervisory signals for
∗Corresponding author
weaker counterparts[1, 9], or conversely, less capable mod-
† Project lead els could provide supervision to elicit higher capabili-
ties leading to a stronger variant[3, 23, 35]. Inspired by

2
this LLMs’ synthetic data-driven self-improvement, this (SOTA) results.
study proposes that achieving stable and scalable cus-
tomized generation necessitates an analogous model-data 2. Related Work
co-evolution paradigm, where less-controllable preceding
customized models systematically synthesize better cus-
2.1. Text-to-image generation
tomization data for successive more-controllable variants, Recent years have witnessed explosive growth in text-to-
enabling persistent co-evolution between enhanced cus- image (T2I) models [6, 7, 17, 25, 28, 30, 31, 44, 46].
tomized models and enriched customization data, as illus- Apart from some work that chooses the GAN or autoregres-
trated in Figure. 2. sive paradigm, most of current text-to-image work chose
Technically, to achieve the model-data co-evolution, this the denoising diffuison [11, 37] as their image generation
study addresses two fundamental challenges, i.e., (1) how framework. Early exploratory work [21, 29, 34] have val-
to establish a systematic synthetic data curation framework idated the feasibility of using diffusion models for text-to-
that reliably harnesses knowledge distillation from less- image generation and demonstrated their superior perfor-
controllable models; and (2) how to develop a general- mance compared to other methods. The efficiency, qual-
ized customization model framework capable of hierarchi- ity, and capacity for T2I diffusion models are keeping im-
cal controllability adaptation, ensuring seamless scalability proved in the following work. LDM [31] suggests training
across varying degrees of controllability. To be specific, as the diffusion model in latent space significantly improves
for the synthetic data curation framework, we introduce a the efficiency and output resolution, which become the de-
progressive synthesis pipeline that transitions from single- fault choice for many subsequent works such as Stable Dif-
subject to multi-subject in-context generation, combined fusion series [7, 25], Imagen3 [2] and Flux [17]. Recent
with a multi-stage filtration mechanism to curate unprece- work [7, 17, 24] replaces the unet [32] to transformer and
dented high-resolution, high-quality paired customization shows the impressive quality and scalability of the trans-
data through fine-grained ensembled filtering. As for the former backbone.
customization model framework, we develop UNO to fully
2.2. Subject-driven generation
unlock the multi-condition contextual capabilities of Dif-
fusion Transformers (DiT) through iterative simple-to-hard Subject-driven generation has been widely studied in the
training, preserving the base architecture’s scalability with context of diffusion models. Dreambooth [33], textual in-
minimal modifications. Moreover, we propose universal version [8] and LoRA [12] introduce the subject-driven gen-
Rotary Position Embedding (UnoPE) to effectively equip eration capability by introducing lightweight new parame-
UNO with the capability of mitigating the attribute confu- ters and perform parameter efficiency tuning for each sub-
sion issue when scaling visual subject controls. ject. The major drawback of those methods is the cum-
Our contributions are summarized as: bersome fine-tuning process for each new-coming subject.
Conceptual Contribution. We identify that current IP-adapter [45], BLIP Diffusion [18] use an extra image
data-driven approaches in customized model design inher- encoder and new layers to encode the reference image of
ently suffer from scalability constraints rooted in fundamen- the subject and inject it into the diffusion model, achiev-
tal data bottlenecks. To address this limitation, we pio- ing a subject-driven generation without further finetuning
neer a model-data co-evolution paradigm that achieves en- for a new concept. For DiT, IC LoRA [13] and Ominicon-
hanced controllability while enabling stable and scalable trol [38] have explored the inherent image reference capa-
customized generation. bility in the transformer, pointing that the DiT itself can
be used as the image encoder for the subject reference.
Technical Contribution. (1) We develop a system-
Many further works follow this reference image injection
atic framework for synthetic data curation that produces
approach and have made improvements in various aspects
high-fidelity, high-resolution paired customization datasets
like facial identity [10, 40], image-text joint controlabil-
through progressive in-context synthesis and multi-stage fil-
ity [14], multiple reference subject support [20, 41]. De-
tering. (2) We propose UNO, a universal customization
spite these advances, the aforementioned work heavily re-
architecture that enables seamless scalability across multi-
lies on paired images, which are hard to collect, especially
condition control through minimal yet effective modifica-
for multi-subject scenes.
tion of DiT.
Experimental Contribution.We conduct extensive ex-
3. Methodology
periments on DreamBench [33] and multi-subject driven
generation benchmarks. Our UNO achieve the highest This section introduces our proposed model-data co-
DINO and CLIP-I scores among these two tasks. This evolution paradigm, encompassing the systematic synthetic
demonstrates its strong subject similarity and text control- data curation framework detailed in Sec. 3.2 and the gen-
lability, showcasing its capability to deliver state-of-the-art eralized customization model framework (i.e., UNO) ex-

3
(a) Single-Subject In-Context Generation
Taxonomy Tree Training Pairs
ing. As depicted in Fig. 3, we introduce a high-resolution,
Generated
highly-consistent data synthesis pipeline to tackle this chal-
Subject and Scene Set
T2I
VLM-based
Filter
lenge, capitalizing on the intrinsic in-context generation ca-

…
Model
'A diptych with two side-by-
side images of same <\subject1>.
Left: <\subject1> in <\scene1>.
$
𝐼!"# '
𝐼!"# 𝐼%&%
pabilities of DiT-based models. Through the utilization of
Right: <\subject1> together
with <\subject2> in <\scene2>’ An elderly woman with gray hair wearing
a brown coat and a scarf, holding a globe.
meticulously crafted text inputs, DiT models exhibit the ca-
Text Template $
𝐼!"# + 𝐼%&%
pacity to generate subject-consistent grid outcomes. In con-

…
(b) Multi-Subject In-Context Generation
Training
trast to prior methodologies like OminiControl [38], which
generate single-subject consistent data at a resolution of
… VLM-based

pitcher
OVD
S2I
Model
Filter

$ '
512 × 512, our approach establishes a more comprehen-
𝐼!"# 𝐼!"# 𝐼%&%
Cropped Subject

Subject Prompts
A European Robin perched on the
sive pipeline that progresses from single-subject to multi-
edge of a white ceramic pitcher.

globe
subject data generation. This advancement enables the di-
'
𝐼!"#
rect production of three distinct high-resolution image pairs
(i.e., 1024 × 1024, 1024 × 768, and 768 × 1024), signifi-
Figure 3. Illustration of our proposed synthetic data curation
cantly broadening the application spectrum and diversity of
framework based on in-context data generation.
the synthesized data.
We emphasize that the superior quality of the synthe-
pounded upon in Sec. 3.3. Specifically, the foundational sized data can significantly enhance model performance. To
work of DiT [24] is elucidated in Sec. 3.1. Section 3.2 substantiate this assertion, we developed a filtering mecha-
provides an in-depth exploration of the construction of our nism based on the Vision-Language Model (VLM) to eval-
innovative subject-consistent dataset, comprising meticu- uate the quality of the generated image pairs. Subsequently,
lously curated single-subject and multi-subject image pairs. we conducted experiments utilizing synthesized data across
Furthermore, Sec. 3.3 outlines our methodology for trans- various quality score levels. As depicted in Fig. 5, high-
forming a Text-toImage (T2I) DiT model into a Subject- quality scores image pairs can significantly enhance the
to-Image (S2I) model, showcasing its contextual generation subject similarity of the results with higher DINO [22] and
capabilities. This adaptation involves an iterative training CLIP-I score, which verifies that our automated data cura-
framework designed to facilitate multi-image perception, tion framework can continuously supplement high-quality
textual comprehension, and condition generation conducive data and improve model performance.
to subject-driven synthesis. Single-Subject In-Context Generation. In order to in-
crease dataset diversity, we initially formulated a taxonomy
3.1. Preliminary tree comprising 365 overarching classes sourced from Ob-
The original DiT architecture focuses solely on class- ject365 [36], alongside finer-grained categories encompass-
conditional image generation. It departs from the coming distinctions in age, profession, and attire styles. Within
monly used U-Net backbone, instead employing full trans- each category, we leverage the capabilities of a Large Lan-
former layers that operate on latent patches. More re- guage Model (LLM) to generate an extensive array of sub-
cently, image generators such as Stable Diffusion 3 [7] and jects and varied settings. Through the amalgamation of
FLUX.1 [17] are built upon MM-DiT, which incorporates these outputs with predefined text templates, we are able
a multi-modal attention mechanism and takes as input the to derive millions of text prompts for the T2I model, facili-
concatenation of the embeddings of text and image inputs. tating the generation of subject-consistent image pairs.
The multi-modal attention operation projects position- Initially generated image pairs often encounter several
encoded tokens into query Q, key K, and value V represen- issues, such as subject inconsistency and missing subjects.
tations, enabling attention computation across all tokens: To efficiently filter the data, we first split the image pair into
1
the reference image Iref and the target image Itgt , then calcu-
QK⊤

Attention ([zt , c]) = softmax √ V, (1) late the DINOv2 [22] between the two images. This method
d is effective in filtering out images with significantly lower
consistency. Subsequently, VLM will be further employed
where Z = [zt , c] denotes the concatenation of image and to provide a score list evaluating different aspects (i.e., ap-
text tokens. This allows both representations to function pearance, details, and attributes), which can be represented
within their own respective spaces while still taking the by the following equation:
other into account.
S = VLM(Iref , Itgt , cy ) ∈ RN ×1 , (2)
3.2. Synthetic Data Curation Framework
The paucity of high-quality, subject-consistent datasets has score = Average(S), (3)
long presented a formidable obstacle for subject-driven gen- where cy represents the input text to VLM, N denotes the
eration, severely constraining the scalability of model train- number of evaluated dimensions that are automatically gen-

4
Text Prompt Target Image 𝐼$%$ Reference Images 𝐼!"#
$
𝐼!"# %
𝐼!"#
Stage Ⅰ Input Stage Ⅱ Input Learnable Frozen
A soccer ball next to a
retro-style microphone
on a wooden floor. N× … Universal Rotary Position
Embedding (UnoPE)
Width
FFN
T5 Text Encoder VAE Encoder

Height
+ Noise
MM-Attention

UnoPE
UNO Sa
m
pl
… … es
Position index
UNO-DiT Blocks

Figure 4. Illustration of the training framework of UNO. It introduces two pivotal enhancements to the model: progressive cross-modal
alignment and universal rotary position embedding(UnoPE). The progressive cross-modal alignment is divided into two stages. In the Stage
I, we use single-subject in-context generated data to finetune the pretrained T2I model into an S2I model. In the Stage II, we continue
training on generated multiple-subject data pairs. The UnoPE can effectively equip UNO with the capability of mitigating the attribute
confusion issue when scaling visual subject controls.

0.34
0.82 Data quality
image onto the resulting image. Our proposed pipeline ef-
0.70 0.33 2.5-3
0.80
0.65 0.32
3.5-4
fectively mitigates this concern.
CLIP-T
CLIP-I

0.78
DINO

4
0.60
0.76 0.31
0.55
0.50
0.74
0.72
0.30 3.3. Customization Model Framework (UNO)
0.29
1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
step(×1000) step(×1000) step(×1000) In this section, we will provide a detailed explanation
of how to iteratively train a multi-image conditioned S2I
Figure 5. Model performance on Dreambench [33]. We conduct model from a DiT-based T2I model. It should be noted that
experiments under different quality score levels. all the training data we used originate from images gener-
ated by our in-context data generation method proposed in
the previous chapter.
erated by VLM, S signifies the output which is parsed into
Progressive cross-modal alignment. Original T2I models
a score list, and score indicates the final consistency score
gradually transform pure Gaussian noise into text-adherent
of generated image pairs.
images through an iterative denoising process. During this
Multi-Subject In-Context Generation. The comprehen-
process, the VAE encoder E(·) first encodes the target image
sive dataset from the preceding phase will be utilized to
Itgt into a noisy latent zt = E(Itgt ). zt is then concatenated
train a S2I model, which is conditioned on both single-
with the encoded text token c, forming the input z for the
image and text inputs. Subsequently, this trained S2I model,
DiT model. This process can be formulated as:
along with the dataset, will be employed to generate multi-
subject consistent data in the current stage. Illustrated in z = Concatenate(c, zt ), (4)
Fig. 3(b), we will initially employ an open-vocabulary de-
To incorporate multi-image conditions Iref =
tector (OVD) to identify subjects beyond those present in 1 2 N
1 [Iref , Iref , . . . , Iref ], we introduce a progressive training
Iref . The extracted cropped images and their correspond-
paradigm that progresses from simpler to more complex
ing subject prompts will then be input into our trained S2I
2 scenarios. We view the training phase with single-image
model to derive new results for Iref . Traditional approaches
conditions as the initial phase for cross-modal alignment.
often encountered significant failure rates as models strug-
Given that the original input comprises solely text tokens
gled to preserve the original subject’s identity. How-
and noisy latents, the introduction of noise-free reference
ever, models trained using our proposed in-context training
image tokens could potentially disrupt the original conver-
methodology can effectively surmount this challenge, yield-
gence distribution. Such disruption may result in training
ing highly consistent outcomes with ease. Further elabora-
instability or suboptimal outcomes. Hence, we opt for a
tion on this topic will be provided in the subsequent section.
gradual complexity approach rather than directly exposing
Some may question the necessity of generating new data,
the model to multiple reference image inputs. In Stage
suggesting that the cropped part could be treated simply as
2 I, depicted in Fig. 4, only a single image serves as the
Iref . However, we contend that relying solely on cropped
reference image. We utilize z1 as the input multi-modal
images as the training dataset may introduce copy-paste is-
tokens for the DiT model:
sues. This scenario arises when the model fails to adhere to
1
the textual prompt and merely “pastes" the input reference z1 = Concatenate(c, zt , E(Iref )), (5)

5
Reference Image Prompt UNO(Ours) OminiControl FLUX IP-Adapter OmniGen RealCustom++ SSR-Encoder

A clock on
top of green
grass with
sunflowers
around it.

A sneaker
with the
Eiffel Tower
in the
background.

A purple toy.

A red toy.

Figure 6. Qualitative comparison with different methods on single-subject driven generation.

Reference Images Prompt UNO(Ours) OmniGen MS-Diffusion MIP-Adapter SSR-Encoder

A candle and
a clock on top
of a purple
rug in a forest.

A bowl and a
can on top of
a white rug

A sneaker
and a teapot
on top of
pink fabric

A vase and
a stuffed
animal with
a mountain
in the
background

Figure 7. Qualitative comparison with different methods on multi-subject driven generation.

6
After Stage I training, the model is capable of processing 4. Experiments
single-subject driven generation tasks. We then train the
model with multi-image conditions to tackle more complex 4.1. Experiments Setting
multi-subject driven generation scenarios. z2 can be de- Implementation Details. To self-evolution our base DiT-
scribed as follows: based T2I model, we firstly take the FLUX.1 dev [17] as
the pretrained model. We train the model with a learning
i i
zref = E(Iref ), i = 1, . . . , N, (6) rate of 10−5 and a total batch size of 16. For the progres-
sive cross-modal alignment, we first train the model using
1 2 N
z2 = Concatenate(c, zt , zref , zref , . . . , zref ), (7) single-subject pair-data for 5, 000 steps. Then, we continue
training on multi-subject pair-data for another 5, 000 steps.
where N is set to 2 in our paper. During Stage I, the
Specifically, we generated 230k and 15k data pairs for these
T2I model is trained to refer to the input reference im-
two stages respectively using the in-context data generation
age and prompt, with the goal of generating single subject-
method mentioned above. We conduct the entire experi-
consistent results. Stage II is designed to enable the S2I
ment on 8 NVIDIA A100 GPUs and trained the model using
model to refer to multiple input images and inject infor-
a LoRA [12] rank of 512 throughout the training process.
mation into corresponding latent spaces. Through iterative
training, the inherent in-context generation capability of T2I Comparative Methods. As a tuning-free method, our
model is unlocked, eliciting more controllability from sin- model is capable of handling both single-subject and multi-
gle text-to-image models. subject driven generation. We compare it with some leading
methods in these two tasks respectively, including Omni-
Universal rotary position embedding(UnoPE). An im-
gen [44], Ominicontrol [38], FLUX IPAdapter v2 [39], Ms-
portant consideration for incorporating multi-image condi-
diffusion [41], MIP-Adapter [15], RealCustom++ [20], and
tions into a DiT-based T2I model pertains to the aspect of
SSR-Encoder [48].
position encoding. In the context of FLUX.1 [17], the uti-
lization of Rotary Position Embedding (RoPE) necessitates Evaluation Metrics. Following previous works, we use
the assignment of position indices (i, j) to both text and standard automatic metrics to evaluate both subject simi-
image tokens, thereby influencing the interaction among larity and text fidelity. Specifically, we employ cosine sim-
multimodal tokens. Within the original model architec- ilarity measures between generated images and reference
ture, text tokens are assigned a consistent position index of images within CLIP [27] and DINO [22] spaces, referred
(0, 0), while noisy image tokens are allocated position in- to as CLIP-I and DINO scores, respectively, to assess sub-
dices (i, j) where i ∈ [0, w − 1] and j ∈ [0, h − 1]. Here, ject similarity. Additionally, we calculate the cosine simi-
h and w denote the height and width of the noisy latent, larity between the prompt and the image CLIP embeddings
respectively. (CLIP-T) to evaluate text fidelity. For single-subject driven
Our newly introduced image conditions reuse the same generation, we measure all methods on DreamBench [33]
format to inherit the implicit position correspondence of for fairness. For multi-subject driven generation, we follow
the original model. However, we start from the maximum previous studies [15, 19] that involve 30 different combi-
height and width of the noisy image tokens, as shown in nations of two subjects from DreamBench, including com-
Fig. 4, which begins with the diagonal position. The posi- binations of non-live and live objects. For each combina-
N
tion index for the latent zref is defined as: tion, we generate 6 images per prompt using 25 text prompts
from DreamBench, resulting in 4, 500 image groups for all
subjects.
(i′ , j ′ ) = (i + w(N −1) , j + h(N −1) ), (8)
4.2. Qualitative Analyses
where i ∈ [0, wN ), j ∈ [0, hN ), with wN and hN repre-
N
senting the width and height of the latent zref , respectively. We compare with various state-of-the-art methods to ver-
′ ′
Here, i and j are the adjusted position indices. To prevent ify the effectiveness of our proposed UNO. We show the
the generated image from over-referencing the spatial struc- comparison of single-image condition generation results in
ture of the reference image, we adjust the position indices Fig. 6. In the first two rows, our UNO nearly perfectly
within a certain range. In the scenario of multi-image condi- keeps the subject detail (e.g., the numbers on the dial of the
tions, different reference images inherently have a semantic clock) in the reference image, while other methods strug-
gap. Our proposed UnoPE can further prevent the model gle to maintain the details. In the following two rows, we
from learning the original spatial distribution of reference demonstrate the editability. UNO can maintain subject sim-
images, thereby focusing on obtaining layout information ilarity while editing attributes, specifically colors, whereas
from text features. This enables the model to improve its other methods either fail to maintain subject similarity or do
performance in subject similarity while maintaining good not follow the text edit instructions. As a contrast, Omini-
text controllability. Control [38] has good retention ability but may encounter

7
Method DINO ↑ CLIP-I ↑ CLIP-T ↑ User Evaluations of Single-subject Driven Generation User Evaluations of Multi-subject Driven Generation
text fidelity text fidelity
(subject) (subject)
Oracle(reference images) 0.774 0.885 - 5 5

Textual Inversion [8] 0.569 0.780 0.255

visual 3 text fidelity visual 3 text fidelity
DreamBooth [33] 0.668 0.803 0.305 appeal (background) appeal (background)

BLIP-Diffusion [18] 0.670 0.805 0.302 1 1

ELITE [43] 0.647 0.772 0.296

Re-Imagen [5] 0.600 0.740 0.270
BootPIG[26] 0.674 0.797 0.311
SSR-Encoder[48] 0.612 0.821 0.308 composition subject
similarity composition subject
similarity

RealCustom++ [14, 20] 0.702 0.794 0.318 UNO

OmniControl
Flux
OmniGen RealCustom++
MS- MIP- SSR-
(ours) IP-Adapter Diffusion Adapter Encoder

OmniGen [44] 0.693 0.801 0.315

OminiControl [38] 0.684 0.799 0.312 Figure 8. Radar charts of user evaluation of methods for single-
FLUX.1 IP-Adapter 0.582 0.820 0.288 subject driven and multi-subject driven generation on different di-
UNO (Ours) 0.760 0.835 0.304 mensions

Table 1. Quantitative results for single-subject driven gener-

Method DINO ↑ CLIP-I ↑ CLIP-T ↑
ation on Dreambench. We present the oracle results in the first
2
w/o generated Iref 0.529 0.730 0.308
row and compare both tuning methods and tuning-free methods.
w/o cross-modal alignment 0.511 0.721 0.322
We highlight the best and second-best values for each metric. w/o UnoPE 0.386 0.674 0.323
UNO (Ours) 0.542 0.733 0.322
Method DINO ↑ CLIP-I ↑ CLIP-T ↑
DreamBooth [33] 0.430 0.695 0.308 Table 3. Ablation Study of our proposed in-context data gener-
BLIP-Diffusion [18] 0.464 0.698 0.300 ation and in-context training method. We report the results on
Subject Diffusion [19] 0.506 0.696 0.310 the multi-subject driven generation benchmark.
MIP-Adapter [15] 0.482 0.726 0.311
MS-Diffusion [41] 0.525 0.726 0.319 Method DINO ↑ CLIP-I ↑ CLIP-T ↑
OmniGen [44] 0.511 0.722 0.331 w/ single subject pair-data 0.730 0.821 0.309
UNO (Ours) 0.542 0.733 0.322 w/ cross-modal alignment 0.760 0.835 0.304

Table 2. Quantitative results for multi-subject driven gener- Table 4. Effect of progressive cross-modal alignment. The
ation. Our method achieves state-of-the-art performance among model exhibits superior performance on DreamBench [33] after
both tuning methods and tuning-free methods. undergoing progressive cross-modal alignment, in contrast to be-
ing trained exclusively on single-subject pair-data, despite both
models undergoing an identical number of training steps.
copy-paste risks, e.g., a red robot in the last row of Fig. 6.
We show the comparison results of multi-image condition
generation in Fig. 7. Our method can keep all reference im- questionnaires to showcase the superiority of UNO. For
ages while adhering to text responses, whereas other meth- subjective assessment, 30 evaluators including both domain
ods either fail to maintain subject consistency or miss the experts and non-experts, assessed 300 image combinations
input text editing instructions. covering both single-subject and multi-subject driven gen-
eration tasks. For each case, evaluators rank the best results
4.3. Quantitative Evaluations across five dimensions, including text fidelity at the subject
Automatic scores. Tab. 1 compares our UNO on Dream- level, text fidelity at the background level, subject similar-
Bench [33] against both tuning-based and tuning-free meth- ity, composition quality, and visual appeal. As shown in
ods. UNO has a significant lead over previous methods with Fig. 8, the results reveal that our UNO not only excels in
the highest DINO and CLIP-I scores of 0.760 and 0.835 subject similarity and text fidelity but also achieves strong
respectively in zero-shot scenarios, and a leading CLIP-I performance in other dimensions.
score of 0.304. We also compare our method in the multi-
4.4. Ablation Study
image condition scenario in Tab. 2. UNO achieves the high-
est DINO and CLIP-I scores and has competitive CLIP-T Effect of synthetic data curation framework. Tab. 3
scores compared to existing leading methods. This shows shows the effects of different modules of UNO. When us-
that UNO can greatly improve subject similarity while ad- ing augmented cropped part images from the target image
2
hering to text descriptions. instead of generated Iref , we observe a significant decline in
User study. We further conduct a user study via online all metrics in Tab. 3. In Fig. 9, the results tend to merely

8
w/o cross-modal
Method DINO ↑ CLIP-I ↑ CLIP-T ↑ Reference Images&Prompt UNO(Ours) $
w/o generated 𝐼!"#
alignment
w/o UnoPE

w/o offset 0.470 0.722 0.308

w/ width-offset 0.717 0.813 0.304
w/ height-offset 0.678 0.797 0.308 A backpack and a stuffed
UNO (Ours) 0.730 0.821 0.309 animal floating on top of water

Table 5. Comparison with different forms of position index offsets.

We report the results on DreamBench[33]. A dog and
a toy on the beach

Method DINO ↑ CLIP-I ↑ CLIP-T ↑ Figure 9. Ablation study of UNO. Zoom in for details.
w/o offset 0.386 0.674 0.323
w/ width-offset 0.508 0.724 0.321
w/ height-offset 0.501 0.719 0.306 effectively reducing the copy-paste phenomenon. Extensive
UNO (Ours) 0.542 0.733 0.322 experiments show that UNO achieves high-quality similar-
ity and controllability in both single-subject and multiple-
Table 6. Comparison with different forms of position index off- subject customization.
sets. We report the results on the multi-subject driven generation
benchmark.
References
[1] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell,
copy-paste the subjects and almost do not respond to the Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort,
text prompt description. Deep Ganguli, Tom Henighan, et al. Training a helpful and
Effect of progressive cross-modal alignment. As shown harmless assistant with reinforcement learning from human
in Tab. 3 and Fig. 9, there is a significant drop in both DINO feedback. arXiv preprint arXiv:2204.05862, 2022. 2
and CLIP-I scores, as well as in subject similarity, when the [2] Jason Baldridge, Jakob Bauer, Mukul Bhutani, Nicole
model is directly exposed to multiple reference image in- Brichtova, Andrew Bunner, Lluis Castrejon, Kelvin Chan,
Yichang Chen, Sander Dieleman, Yuqing Du, et al. Imagen
puts without progressive cross-modal alignment. Further-
3. arXiv preprint arXiv:2408.07009, 2024. 3
more, as shown in Tab. 4, progressive cross-modal align-
[3] Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen
ment can increase the upper limit of the model in single- Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen,
image condition scenarios. Adrien Ecoffet, Manas Joglekar, Jan Leike, et al. Weak-to-
Effect of UnoPE. As shown in Tab. 3, there is a signifi- strong generalization: Eliciting strong capabilities with weak
cant drop in both DINO and CLIP-I scores when cloning supervision. arXiv preprint arXiv:2312.09390, 2023. 2
the position index from the target image without using Un- [4] Shengqu Cai, Eric Chan, Yunzhi Zhang, Leonidas Guibas,
oPE. In Fig. 9, the generated images can follow the text de- Jiajun Wu, and Gordon Wetzstein. Diffusion self-distillation
scriptions but hardly reference the input images. We further for zero-shot customized image generation. arXiv preprint
compared with different forms of position index offsets, as arXiv:2411.18616, 2024. 13
shown in Tabs. 5 and 6, and our method achieves the best [5] Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W
results, which demonstrates the superiority of our proposed Cohen. Re-imagen: Retrieval-augmented text-to-image gen-
UnoPE. erator. arXiv preprint arXiv:2209.14491, 2022. 8
[6] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng,
Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao,
5. Conclusion
Hongxia Yang, et al. Cogview: Mastering text-to-image gen-
In this paper, we present UNO, a universal customization eration via transformers. NIPS, 34:19822–19835, 2021. 3
architecture that unlock the multi-condition contextual ca- [7] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim
pabilities of diffusion transformer. This is achieved through Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik
progressive cross-modal alignment and universal rotary po- Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti-
fied flow transformers for high-resolution image synthesis.
sition embedding. The training of UNO consists of two
In ICML, 2024. 3, 4
steps. The first step uses single-image input to elicit the
[8] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patash-
subject-to-image capabilities in diffusion transformers. The
nik, Amit H Bermano, Gal Chechik, and Daniel Cohen-
next step involves further training on multiple-subject data Or. An image is worth one word: Personalizing text-to-
pairs. Our proposed universal rotary position embedding image generation using textual inversion. arXiv preprint
can also significantly improves subject similarity. Addi- arXiv:2208.01618, 2022. 2, 3, 8
tionally, we present a progressive synthesis pipeline that [9] Amelia Glaese, Nat McAleese, Maja Tr˛ebacz, John
evolves from single-subject to multi-subject in-context gen- Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh,
eration. This pipeline generates high-quality synthetic data, Laura Weidinger, Martin Chadwick, Phoebe Thacker,

9
Lucy Campbell-Gillingham, Jonathan Uesato, Po-Sen Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nico-
Huang, Ramona Comanescu, Fan Yang, Abigail See, las Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou,
Sumanth Dathathri, Rory Greig, Charlie Chen, Doug Fritz, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bo-
Jaume Sanchez Elias, Richard Green, Soňa Mokrá, Nicholas janowski. Dinov2: Learning robust visual features without
Fernando, Boxi Wu, Rachel Foley, Susannah Young, Iason supervision, 2023. 4, 7, 12, 13
Gabriel, William Isaac, John Mellor, Demis Hassabis, Ko- [23] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Car-
ray Kavukcuoglu, Lisa Anne Hendricks, and Geoffrey Irv- roll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini
ing. Improving alignment of dialogue agents via targeted Agarwal, Katarina Slama, Alex Ray, et al. Training language
human judgements, 2022. 2 models to follow instructions with human feedback. Ad-
[10] Zinan Guo, Yanze Wu, Zhuowei Chen, Lang Chen, Peng vances in neural information processing systems, 35:27730–
Zhang, and Qian He. Pulid: Pure and lightning id customiza- 27744, 2022. 2
tion via contrastive alignment. In NIPS, 2024. 3 [24] William Peebles and Saining Xie. Scalable diffusion models
[11] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- with transformers. In ICCV, pages 4195–4205, 2023. 3, 4
sion probabilistic models. NIPS, 33:6840–6851, 2020. 3 [25] Dustin Podell, Zion English, Kyle Lacey, Andreas
[12] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and
Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Robin Rombach. SDXL: Improving latent diffusion models
Lora: Low-rank adaptation of large language models. arXiv for high-resolution image synthesis. In ICLR, 2024. 3
preprint arXiv:2106.09685, 2021. 3, 7 [26] Senthil Purushwalkam, Akash Gokul, Shafiq Joty, and Nikhil
[13] Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Naik. Bootpig: Bootstrapping zero-shot personalized image
Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, and Jin- generation capabilities in pretrained diffusion models. arXiv
gren Zhou. In-context lora for diffusion transformers. arXiv preprint arXiv:2401.13974, 2024. 8
preprint arXiv:2410.23775, 2024. 3 [27] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
[14] Mengqi Huang, Zhendong Mao, Mingcong Liu, Qian He, Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
and Yongdong Zhang. Realcustom: narrowing real text word Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn-
for real-time open-domain text-to-image customization. In ing transferable visual models from natural language super-
CVPR, pages 7476–7485, 2024. 2, 3, 8 vision. In ICML, pages 8748–8763. PmLR, 2021. 7
[28] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray,
[15] Qihan Huang, Siming Fu, Jinlong Liu, Hao Jiang, Yipeng
Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever.
Yu, and Jie Song. Resolving multi-condition confusion
Zero-shot text-to-image generation. In ICML, pages 8821–
for finetuning-free personalized image generation. arXiv
8831. Pmlr, 2021. 3
preprint arXiv:2409.17920, 2024. 7, 8
[29] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu,
[16] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli
and Mark Chen. Hierarchical text-conditional image gener-
Shechtman, and Jun-Yan Zhu. Multi-concept customiza-
ation with clip latents. arXiv preprint arXiv:2204.06125, 1
tion of text-to-image diffusion. In CVPR, pages 1931–1941,
(2):3, 2022. 3
2023. 2
[30] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Lo-
[17] Black Forest Labs. Flux: Official inference repository for geswaran, Bernt Schiele, and Honglak Lee. Generative ad-
flux.1 models, 2024. Accessed: 2025-02-07. 3, 4, 7, 12, 13 versarial text to image synthesis. In ICML, pages 1060–1069.
[18] Dongxu Li, Junnan Li, and Steven Hoi. Blip-diffusion: Pre- PMLR, 2016. 3
trained subject representation for controllable text-to-image [31] Robin Rombach, Andreas Blattmann, Dominik Lorenz,
generation and editing. Advances in Neural Information Pro- Patrick Esser, and Björn Ommer. High-resolution image syn-
cessing Systems, 36:30146–30166, 2023. 3, 8 thesis with latent diffusion models. In CVPR, pages 10684–
[19] Jian Ma, Junhao Liang, Chen Chen, and Haonan Lu. Subject- 10695, 2022. 3
diffusion: Open domain personalized text-to-image genera- [32] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-
tion without test-time fine-tuning. In ACM SIGGRAPH 2024 net: Convolutional networks for biomedical image segmen-
Conference Papers, pages 1–12, 2024. 7, 8 tation. In Medical image computing and computer-assisted
[20] Zhendong Mao, Mengqi Huang, Fei Ding, Mingcong Liu, intervention–MICCAI 2015: 18th international conference,
Qian He, and Yongdong Zhang. Realcustom++: Represent- Munich, Germany, October 5-9, 2015, proceedings, part III
ing images as real-word for real-time customization. arXiv 18, pages 234–241. Springer, 2015. 3
preprint arXiv:2408.09744, 2024. 3, 7, 8 [33] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch,
[21] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine
Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and tuning text-to-image diffusion models for subject-driven
Mark Chen. Glide: Towards photorealistic image generation generation. In CVPR, pages 22500–22510, 2023. 2, 3, 5,
and editing with text-guided diffusion models. arXiv preprint 7, 8, 9
arXiv:2112.10741, 2021. 3 [34] Chitwan Saharia, William Chan, Saurabh Saxena, Lala
[22] Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour,
Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans,
Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Rus- et al. Photorealistic text-to-image diffusion models with deep
sell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang- language understanding. NIPS, 35:36479–36494, 2022. 3

10
[35] William Saunders, Catherine Yeh, Jeff Wu, Steven Bills,
Long Ouyang, Jonathan Ward, and Jan Leike. Self-critiquing
models for assisting human evaluators. arXiv preprint
arXiv:2206.05802, 2022. 2
[36] Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang
Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365:
A large-scale, high-quality dataset for object detection. In
CVPR, pages 8430–8439, 2019. 4, 12
[37] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan,
and Surya Ganguli. Deep unsupervised learning using
nonequilibrium thermodynamics. In ICML, pages 2256–
2265. pmlr, 2015. 3
[38] Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue,
and Xinchao Wang. Ominicontrol: Minimal and uni-
versal control for diffusion transformer. arXiv preprint
arXiv:2411.15098, 3, 2024. 3, 4, 7, 8, 12
[39] XLabs AI team. x-flux, 2025. Accessed: 2025-02-07. 7
[40] Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, and An-
thony Chen. Instantid: Zero-shot identity-preserving gener-
ation in seconds. arXiv preprint arXiv:2401.07519, 2024.
3
[41] Xierui Wang, Siming Fu, Qihan Huang, Wanggui He, and
Hao Jiang. MS-diffusion: Multi-subject zero-shot image per-
sonalization with layout guidance. In ICLR, 2025. 3, 7, 8
[42] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten
Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al.
Chain-of-thought prompting elicits reasoning in large lan-
guage models. Advances in neural information processing
systems, 35:24824–24837, 2022. 12, 13
[43] Yuxiang Wei, Yabo Zhang, Zhilong Ji, Jinfeng Bai, Lei
Zhang, and Wangmeng Zuo. Elite: Encoding visual con-
cepts into textual embeddings for customized text-to-image
generation. In CVPR, pages 15943–15953, 2023. 8
[44] Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xin-
grun Xing, Ruiran Yan, Shuting Wang, Tiejun Huang, and
Zheng Liu. Omnigen: Unified image generation. arXiv
preprint arXiv:2409.11340, 2024. 3, 7, 8
[45] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-
adapter: Text compatible image prompt adapter for text-to-
image diffusion models. arXiv preprint arXiv:2308.06721,
2023. 2, 3
[46] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gun-
jan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yin-
fei Yang, Burcu Karagol Ayan, et al. Scaling autoregres-
sive models for content-rich text-to-image generation. arXiv
preprint arXiv:2206.10789, 2(3):5, 2022. 3
[47] Xianggang Yu, Mutian Xu, Yidan Zhang, Haolin Liu,
Chongjie Ye, Yushuang Wu, Zizheng Yan, Chenming Zhu,
Zhangyang Xiong, Tianyou Liang, et al. Mvimgnet: A large-
scale dataset of multi-view images. In CVPR, pages 9150–
9161, 2023. 2
[48] Yuxuan Zhang, Yiren Song, Jiaming Liu, Rui Wang, Jin-
peng Yu, Hao Tang, Huaxia Li, Xu Tang, Yao Hu, Han Pan,
et al. Ssr-encoder: Encoding selective subject representation
for subject-driven generation. In CVPR, pages 8069–8078,
2024. 7, 8

11
Less-to-More Generalization:
Unlocking More Controllability by In-Context Generation
Supplementary Material
F. In-Context Data Generation Pipeline
In this section, we give a detailed description of our in-context data generation pipeline. We first build a taxonomy tree
in Sec. F.1 to obtain various subject instances and scenes. Then we generate subject-consistent image-pair data with the in-
context ability of pretrained Text-to-Image (T2I) model and utilize Chain-of-Thought (CoT) [42] to filter the synthesized data
in Sec. F.2. Finally, for multi-subject data, we train a Subject-to-Image (S2I) model to generate subject-consistent reference
image instead of the cropped one to avoid the copy-paste issue in Sec. F.3.

F.1. Taxonomy Tree Generation

To ensure the diversity of the generated dataset, we first construct a taxonomy tree that includes common categories of people
and objects, as shown in Fig. 10. Specifically, we use the 365 general classes from Object365 [36] as the basis for our
taxonomy tree. To obtain more diverse categories, we employ Large Language Model (LLM) to generate various subject
instances and diverse scenes. The instructions in Fig. 12 make LLM generate subject instances according to the given asset
category in creative, realistic, and text-decorated ways. In addition, we instruct LLM to generate scene descriptions according
to the given subject with system prompt in Fig. 13. Following the steps above, we build a taxonomy tree and get plenty of
diverse subjects and scene descriptions.

F.2. Single-Subject In-Context Data Generation

In-context ability of T2I model has been proved in OminiControl [38] and we also utilize it to generate subject-consistent
image-pair data as the basic part of our final in-context synthesized data in Sec. F.2.1. Moreover, we filter out low quality data
with bad subject consistency according to the similarity from DINOv2 [22] and Vision-Language Model (VLM) in Sec. F.2.2.

Class Instance Scene

Generation Generation Generation

A hat with rainbow stripes

Creative The **white panama hat**
Speaker A mushroom-shaped hat Rich rests on a sunlit beach chair,
Person … background
swaying gently in the ocean
… Realistic Military officer's cap
breeze with palm trees
Tricycle White panama hat
Object365 casting delicate shadows.
Hat …
…
… A fedora with "Jazz Time"
Text- Empty with a white background, a
Lion stitched into the band.
Decorated background pure background, shot in a
Butterfly A bucket hat with "Stay Cool"
studio, very detailed.
on the rim.

Figure 10. Illustration of the taxonomy tree.

F.2.1. Subject-Consistent Image-Pair Generation

Combining the constructed taxonomy tree with the predifined diptych text template in Fig. 11, we utilize the inherent in-
context ability of FLUX.1 [17], one of the state-of-the-art T2I model, to generate subject-consistent image-pair. Since
FLUX.1 [17] has multi-resolution generation ability, we directly produce three different high-resolution (i.e., 1024×1024,
1024×768, 768×1024) image-pairs, with great balance of quality and efficiency.
F.2.2. Subject-Consistent Image-Pair Filter
Though FLUX.1 [17] shows great in-context generation ability, synthesized image-pairs suffer several issues, especially
subject inconsistency and missing subjects. We highlight that the high quality of synthesized data can notably accelerate the
convergence and improve the subject consistency. To efficiently filter synthesized data, we first split the diptych image-pair
1 1
into reference image Iref and target image Itgt with Hough Transform. According to the template in Fig. 11, both Iref and Itgt

12
Template_1
A diptych with two side-by-side images of same <\subject1>.
Left: <\subject1> in <\scene1>.
Right: <\subject1> together with <\subject2> in <\scene2>.

Template_2
A diptych with two side-by-side images of same <\subject1>.
Top: <\subject1> in <\scene1>.
Bottom: <\subject1> together with <\subject2> in <\scene2>.

Figure 11. Diptych text template for generating subject-consistent image-pair with FLUX.1[17].

1
contain the same subject1 while Itgt has another subject2. To ensure Iref and Itgt have consistent subject1, we then calculate
cosine similarity with DINOv2 [22] and set a threshold to filter out image-pairs with significantly low consistency.
1
However, since the reference image Iref and the target image Itgt have different scene settings, the subject1 in the image-
pairs may not be spatially aligned, resulting in incorrect cosine similarity with feature from DINOv2 [22]. We further employ
VLM to provide a fine-grained score list evaluating various aspects adaptively, i.e. appearance, details, and attributes. We
only keep the data with highest VLM score, which indicating the highest quality and subject consistency in the synthesized
1
data. Specifically, inspired by [4], we utilize CoT [42] for better discrimination of the subject1 in Iref and Itgt , as shown in
Fig. 14. To demonstrate the effectiveness of the CoT filter, we sample data from different VLM score intervals in Fig. 15.
Image-pairs with low score suffer severe subject inconsistency while those with highest score (i.e. score is 4) show highly
consistent subject in the reference image and the target image. We also count the amount of data in each score interval as
shown in Fig. 17, indicating that around 35.43% data would be remained with the VLM CoT filter. Also, there are seldom of
data with extremely low VLM score after DINOv2 filter, showing its effectiveness.

Role:
Please be very creative and generate 50 breif subject prompts for text-to-image generation.
Role: Role:
Follow these rules: Please be very careful and generate 50 breif subject prompts for text-to-image generation.
1. You will be given an <asset category>, you need to create an asset(breif subject prompt) based on Please be very careful and generate 50 breif subject prompts for text-to-image generation.
Follow these rules: Follow these rules:
the <asset category>.
1. You will be given an <asset category>, you need to create an asset(breif subject prompt) based on
2. These descriptions can refer only to appearance descriptions/or to certain brands. e.g. "Elon Musk 1. You will be given an <asset category>, you need to create an asset(breif subject prompt) based on
the <asset category>.
in pajamas", "a tiger in a black hat", "A Mercedes sports car", "A blonde", "A door red on the left the <asset category>.
2. These descriptions can refer only to appearance descriptions/or to certain brands. But it has to be
and green on the right" 2. Please add a text description in the breif subject. The text can appear anywhere.
something that can exist in the real world. e.g. "Elon Musk in pajamas", "a white tiger", "A Mercedes
3. Do not repeat each asset, you need to use your imagination and common sense of life to create. 2. These descriptions can refer only to appearance descriptions/or to certain brands. e.g. "Elon Musk
sports car", "A blonde", "A rotten wooden door"
3. No more than 12 words. in his pajamas with the words' beat it.'", "a white tiger holding a sign that says' go '", "A Mercedes
3. Do not repeat each asset, you need to use common sense of life to create.
Example1 3. No more than 12 words.
sports car with '101' written on it", "A blonde"
[asset category]: Book 3. Do not repeat each asset, you need to use common sense of life to create.
Example1 3. No more than 12 words.
[asset1]: A book with a green cover
[asset category]: Book
[asset2]: commic book Example1
[asset1]: A book with a green cover
[asset3]: math book [asset category]: Person
[asset2]: commic book
[asset4]: An open book [asset1]: A surfer with "Catch the Waves" on a surfboard.
[asset3]: math book
[asset5]: Rotten books [asset2]: A guitarist with "Rock On" on a t-shirt.
[asset4]: An open book
[asset6]: A book made of candy [asset3]: A dancing ballerina with "Grace" written on her tutu.
[asset5]: Rotten books
[asset7]: The book with "love and power" on the cover [asset4]: A chef wearing a hat that says "Cook Master."
[asset6]: The book with "Harry Potter and the Sorcerer's Stone" on the cover
[asset8]: A book with a blue key on it ...
...
[asset9]: Triangular book (Up to [asset50])
(Up to [asset50])
...
(Up to [asset50]) [asset category]:
[asset category]:
[asset category]:

(a) System prompt of LLM used to generate (b) System prompt of LLM used to generate (c) System prompt of LLM used to generate
subject instances in creative type. subject instances in realistic type. subject instances in text-decorated type.

Figure 12. System prompt of LLM used to generate subject instances.

F.3. Multi-Subject In-Context Data Generation

1
Following the above pipeline, we construct single-subject in-context data containing subject-consistent image-pairs (Iref , Itgt ).
Both the reference image and the target image have subject1 while only the target image contain subject2. Since Itgt has multi-
subject, the simplest way to build multi-subject in-context data is utilizing open-vocabulary detector (OVD) to identify and
crop the subject2 in Itgt as the second reference image I¯ref
2
. However, we find that the cropped I¯ref 2
would make severe copy-
paste issue. To alleviate the issue, a S2I model is trained with the single-subject in-context data and then used to generate
new reference image Iref 2
, which has the consistent subject with the cropped I¯ref
2
but different scenes. Thus we have high
1 2
quality multi-subject in-context data with subject-consistent image-pairs (Iref , Iref , Itgt ) after very similar filter pipeline for
2
the synthesized Iref . There are some case randomly sample from our final data in Fig. 16. Interestingly, we find that a small
part of image-pairs have more than 2 reference images, due to the randomness of T2I generation and OVD, empowering
generalization for more-subject generation.

13
Role:
Please be very creative and generate 50 breif subject prompts for text-to-image generation.
Follow these rules:
1. Given a brief subject prompt of an asset, you need to generate 8 detailed Scene Description for the asset.
2. Each Scene Description should be a detailed description, which describes the background area you imagine for an identical extracted asset, under different
environments/camera views/lighting conditions, etc (please be very very creative here).
3. Each Scene Description should be one line and be as short and precise as possible, do not exceed 77 tokens, Be very creative!
Example1
[asset]: Scientist with exploding beakers
[SceneDescription1]: The scientist with exploding beakers stands in a futuristic laboratory with holographic equations swirling around them.
[SceneDescription2]: Amidst the chaos of a stormy outdoor field lab, the scientist with exploding beakers conducts dramatic experiments as lightning crashes overhead.
[SceneDescription3]: In an ancient alchemist's den filled with dusty tomes, the scientist with exploding beakers looks surprised as colorful liquid bursts forth.
[SceneDescription4]: The scientist with exploding beakers is immersed in a vibrant neon-lit urban laboratory, surrounded by robotic assistants.
[SceneDescription5]: A desert makeshift tent serves as the lab where the scientist with exploding beakers creates a plume of shimmering dust.
[SceneDescription6]: On an alien planet bathed in ethereal light, the scientist with exploding beakers observes bioluminescent reactions in awe.
[SceneDescription7]: In a steampunk inspired workshop, the scientist with exploding beakers wears goggles and smiles amidst gears and steam as an experiment erupts.
[SceneDescription8]: The scientist with exploding beakers stands on a floating platform in the clouds, conducting experiments as colorful bursts light up the sky.

[asset]:

Figure 13. System prompt of LLM used to generate scene descriptions.

Step 1:
Briefly describe these two images, as well as the most prominent subject that exist. Think carefully about which parts of the subject you need to
break down in order to make an objective and thoroughly evaluation. Don't make evaluations at this step.

Role Important Notes

You are an expert AI assistant specializing in the objective evaluation of the consistency of subjects in two images. - Focus solely on the subject itself.
- If there is text on the subject, each text itself should be considered as an important separate part.
- Ignore the difference of subject's background, environment, position, size, etc.
- Ignore the difference of subject's actions, poses, expressions, viewpoints, lightning, etc.
Input Format
You will receive two images. You need to describe two pictures and determine whether the subject in the first picture is in the second picture.
Output Format
[subject]: [subject in IMG1, e.g., a man in a white shirt and black pants]
[caption1]: [IMG1 caption, e.g., a man in a white shirt and black pants]
[caption2]: [IMG2 caption, e.g., a man in a white shirt and black pants holds a blue cup, butterflies and flowers swirled around him]
[Break down]: [Break down the evaluation parts of the subject in IMG1]

(a) System prompt of the filter VLM. (b) Prompt for the first round CoT of the filter VLM.

Step 3:
Based on the differences analyzed in Step 2, assign a specific integer score to each part. More and larger differences result in a lower score. The
Step 2: score ranges from 0 to 4:
For each part you have identified, compare this aspect of the subject in the two images and describe the differences in extreme extreme extreme
extreme extreme extreme extreme detail. You need to be meticulous and precise, noting every tiny detail. - Very Poor (0): No resemblance. This subject part in the second image has no relation to the part in the first image.
- Poor (1): Minimal resemblance. This subject part in the second image has significant differences from the part in the first image.
Important Notes - Fair (2): Moderate resemblance. This subject part in the second image has modest differences from the part in the first image.
- Provide quantitative differences whenever possible. For example, "The subject's chest in the first image has 3 blue circular lights, while the - Good (3): Strong resemblance. This subject part in the second image has minor but noticeable differences from the part in the first image.
subject's chest in the second image has only one blue light and it is not circular." - Excellent (4): Near-identical. This subject part in the second image is virtually indistinguishable from the part in the first image.
- Ignore differences in the subject's background, environment, position, size, etc.
- Ignore differences in the subject's actions, poses, expressions, viewpoints, additional accessories, etc. Output Format
- Ignore the extra accessory of the subject in the second image, such as hat, glasses, etc. [Part 1]: [Part 1 Score]
- Consider that when the subject has a large perspective change, the part may not appear in the new perspective, and no judgment is needed at this [Part 2]: [Part 2 Score]
time. For example, if the subject in the first image is the back of the sofa, and the subject in the second image is the front of the sofa, determine the [Part 3]: [Part 3 Score]
similarity of the two sofas based on your association ability. ...
[Part N]: [Part N Score]
You must adhere to the output format strictly. Each part name and its score must be separated by a colon and a space.

Figure 14. CoT prompt of the filter VLM.

G. Analysis on LoRA Rank

In this section, we further conduct an ablation study on the LoRA rank. Since training parameters are strongly related to the
final performance, we scale the rank from 4 to 512. As shown in the Fig. 18, increasing the rank gradually brings sustained
gains, but when the rank reaches 128, the performance improvement slows down. Finally, considering both performance and
resource consumption, we set UNO to a rank of 512.

H. More Qualitative Results

H.1. Qualitative Results on Multi-Subject Driven Generation
We show a more qualitative comparison of multi-subject driven generation in Fig. 19. It is clear that UNO generate images
with best multi-subject consistency, following edit instructions to the subject and background.

14
score 0~1 score 2~3 score 4

Figure 15. Sampled data from different VLM score intervals.

H.2. Application Scenarios

We evaluated our UNO model across diverse multi-image conditional scenarios, such as identity preservation, virtual try-
on, and stylized generation. We found that UNO demonstrated exceptional generalization capabilities, even with minimal
exposure to such data during training.
Multi-subject Driven Generation: we have showcased additional results from our UNO model in Fig. 20. Beyond effec-
tively handling multi-subject scenarios, UNO excels in complex applications like logo design and the integration of virtual
and real elements, demonstrating its strong generalization capabilities. Virtual Try-on: as shown in Fig. 21, UNO performs
exceptionally well in virtual try-on scenarios, despite the absence of specialized training on such datasets. This demonstrates
that UNO has learned to understand relationships between objects rather than simply performing copy-paste operations. It
also suggests that UNO could provide novel optimization strategies for virtual try-on applications, a promising direction we
leave to further exploration. Identity Preservation: another notable observation is that UNO performs well in both pure ID
scenarios and ID-subject combinations in Fig. 22. This flexibility reduces reliance on additional ID plugins, fostering open-
source community development. We attribute this capability to our systematic training data construction. As mentioned
in Sec. F.1, our taxonomy tree covers extensive human-object combinations, enabling this versatile performance. Stylized
Generation: as depicted in Fig. 23, UNO has inherited stylization ability inherited from the original DiT model, despite the
lack of specific paired data in our training set. This stems from our training approach, which smoothly transitions from T2I
to S2I, allowing the model to evolve multi-condition control while maintaining strong semantic alignment.

I. Limitation and Discussion

Although we have established an automated data curation framework, this paper primarily focuses on subject-driven genera-
tion. Our dataset currently contains limited editing and stylization data. While UNO is a unified and customizable framework

15
reference images target image reference images target image

Figure 16. Sampled data from our final multi-subject in-context data.

[2.5, 3.0)
13.02%

[2.0, 2.5)
8.67%
[3.0, 3.5)
23.51%
4.0

2.0

0.0

[3.5, 4.0)
12.66%

[4.0, 4.0]
35.43%

Figure 17. Amount of data in each VLM score interval.

0.34
0.82 rank=4
rank=16
0.70
rank=64
0.80
0.33 rank=128
rank=512
0.65
0.78

0.32
0.60 0.76
CLIP-T
CLIP-I
DINO

0.55 0.74 0.31

0.72
0.50 0.30

0.70
0.45
0.29
1000 2000 3000 4000 5000 6000 7000 8000 1000 2000 3000 4000 5000 6000 7000 8000 1000 2000 3000 4000 5000 6000 7000 8000
step step step

Figure 18. Analysis of model performance under different LoRA ranks.

with sufficient generalization capabilities, the types of synthetic data may somewhat restrict its abilities. In the future, we
plan to expand our data types to further unlock UNO’s potential and cover a broader range of tasks.

16
Reference Images Prompt UNO (Ours) OmniGen MS-Diffusion MIP-Adapter SSR-Encoder

a toy and a
toy on top of a
white rug

a purple
sneaker and a
purple toy

a boot and a
stuffed animal
with a city in
the
background

a cartoon and
a toy with a
wheat field in
the
background

a dog in a
purple wizard
outfit, next to
it is a can

a dog in a
police outfit,
next to it is a
glasses

a dog in a
firefighter
outfit, next to
it is a teapot

a cat wearing
pink glasses,
next to it is a
cat

Figure 19. More comparison with different methods on multi-subject driven generation. We italicize the subject-related editing part of the
prompts.

17
Input Output Input Output

“A vase is next to the case, in the desert” “A colorful block on top of the toy car”

“A black cat wearing a black hat is riding a “A little bear carries a blue bag”
little yellow duck, in the forest”

“The girl wears a clothes with the red logo” “The logo is printed on the clothes”

“” “”

Figure 20. More multi-subject generation results from our UNO model.

18
Input Output

in the flowers near the sea in the street

A pretty woman wears a flower petal dress,

A girl is wearing a white blouse and a blue skirt,

A man wears a blue tank top and black shorts,

A woman wears the dress and holds a bag,

Figure 21. More virtual try-on results from our UNO model.

19
Input Output

1.The girl stands before a large painting in an art gallery. Her dark eyes reflect curiosity and appreciation as she studies the brushstrokes up close.
She wears a tailored black blazer over a simple white blouse, exuding an air of sophistication. Around her, other patrons whisper in admiration,
but she remains lost in her own artistic contemplation. 2. The girl poses in a botanical garden. She wears a floral dress, holding a bouquet,
blending natural and artistic beauty. 3. The girl navigates busy city streets. She wears a trench coat and trousers, exploring the urban landscape.

1. The woman sits on a park bench, a brown leather handbag by her side. She wears a floral dress, enjoying the sunshine. The bag rests on her lap,
a small umbrella tucked inside. She occasionally rummages through it for snacks. 2. The woman walks at dusk, a brown velvet evening bag in
hand. 3. The woman is sitting in a cafe, with a brown bag on the chair.

1. The man was lecturing on the podium, and the blackboard was full of mathematical formulas. 2. The man stands in a futuristic laboratory,
wearing a sleek silver jumpsuit. He adjusts the settings on a bizarre machine, its glowing panels and wires humming with energy. 3. The man plays
chess.

1. The man is sitting in a cafe, with a violin on the chair. 2. The man wears a suit and played the violin on the stage. 3. This man has fairy ears and
plays the violin in the forest.

Figure 22. More identity preservation results from our UNO model.

20
Input Output

Ghibli style, Comic style, 3D cartoon style,

a woman

a man

Figure 23. More stylized generation results from our UNO model.

Scenarios Prompt
One2One “A clock on the beach is under a red sun umbrella"
“A doll holds a ‘UNO’ sign under the rainbow on the grass"
Two2One “The figurine is in the crystal ball"
“The boy and girl are walking in the street"
Many2One “A penguin doll, a car and a pillow are scattered on the bed"
“A boy in a red hat wear a sunglasses"
Stylized Generation “Ghibli style, a woman"
“Ghibli style, a man"
Virtual Try-on “A man wears the black hoodie and pants"
“A girl wears the blue dress in the snow"
Product Design “The logo and words ‘Let us unlock!’ are printed on the clothes"
“The logo is printed on the cup"
Identity-preservation “The figurine is in the crystal ball"
“A penguin doll, a car and a pillow are scattered on the bed"
Story Generation “A boy in green is in the arcade"
“A man strolls down a bustling city street under moonlight"
“The man and a boy in green clothes are standing among the flowers by the lake"
“The man met a boy dressed in green at the foot of the tower"