
Personalize Anything for Free with Diffusion Transformer

Haoran Feng1 *  Zehuan Huang2 †*  Lin Li3  Hairong Lv1  Lu Sheng2 ✉

1 Tsinghua University   2 Beihang University   3 Renmin University of China

Project page: https://fenghora.github.io/Personalize-Anything-Page/

arXiv:2503.12590v1 [cs.CV] 16 Mar 2025

Figure 1. Personalize Anything is a training-free framework based on Diffusion Transformers (DiT) for personalized image generation. The framework demonstrates advanced versatility, excelling in single-subject personalization (top), multi-subject or subject-scene composition, inpainting and outpainting (middle), as well as applications like visual storytelling (bottom), all without any training or fine-tuning.
[Figure panels: single-subject personalization and layout-guided subject personalization (prompts: "a can stuck on the beach", "a dragon flying in the sky", "a dog playing with a ball", "a robot walking on the road"); multi-subject personalization; subject-scene composition; inpainting and outpainting; visual storytelling.]

Abstract

Personalized image generation aims to produce images of user-specified concepts while enabling flexible editing. Recent training-free approaches, while exhibiting higher computational efficiency than training-based methods, struggle with identity preservation, applicability, and compatibility with diffusion transformers (DiTs). In this paper, we uncover the untapped potential of DiT, where simply replacing denoising tokens with those of a reference subject achieves zero-shot subject reconstruction. This simple yet effective feature injection technique unlocks diverse scenarios, from personalization to image editing. Building upon this observation, we propose Personalize Anything, a training-free framework that achieves personalized image generation in DiT through: 1) timestep-adaptive token replacement that enforces subject consistency via early-stage injection and enhances flexibility through late-stage regularization, and 2) patch perturbation strategies to boost structural diversity. Our method seamlessly supports layout-guided generation, multi-subject personalization, and mask-controlled editing. Evaluations demonstrate state-of-the-art performance in identity preservation and versatility. Our work establishes new insights into DiTs while delivering a practical paradigm for efficient personalization.

∗ Equal contribution  † Project lead  ✉ Corresponding author

1. Introduction

Personalized image generation aims to synthesize images of user-specified concepts while enabling flexible editing. The advent of text-to-image diffusion models [4, 27, 34, 39, 40, 42, 44, 48, 55] has revolutionized this field, enabling applications in areas like advertising production.

Previous research on subject image personalization relies on test-time optimization or large-scale fine-tuning. Optimization-based approaches [12, 21, 23, 24, 26, 32, 47, 49, 71] enable pre-trained models to learn the specific concept through fine-tuning on a few images of a subject. While achieving identity preservation, these methods demand substantial computational resources and time due to per-subject optimization requiring hundreds of iterative steps. Large-scale fine-tuning alternatives [2, 3, 7, 11, 13, 15, 18–20, 22, 28–30, 33, 36–38, 43, 52, 54, 61–65, 67, 68, 72] seek to circumvent this limitation by training auxiliary networks on large-scale datasets to encode reference images. However, these approaches both demand heavy training requirements and risk overfitting to narrow data distributions, degrading their generalizability.

Recent training-free solutions [1, 5, 25, 50, 57, 69, 70] exhibit higher computational efficiency than training-based approaches. These methods typically leverage an attention sharing mechanism to inject reference features, processing denoising and reference subject tokens jointly in pre-trained self-attention layers. However, these attention-based methods lack constraints on subject consistency and often fail to preserve identity. Moreover, their application to advanced text-to-image diffusion transformers (DiTs) [8, 27, 39] proves challenging, stemming from DiT's positional encoding mechanism. As analyzed in Sec. 3.2, we attribute this limitation to the strong influence of the explicitly encoded positional information on DiT's attention mechanism. This makes it difficult for generated images to correctly attend to the reference subject's tokens within traditional attention sharing.

In this paper, we delve into diffusion transformers (DiTs) [8, 27, 39] and observe that simply replacing the denoising tokens with those of a reference subject allows for high-fidelity subject reconstruction. As illustrated in Fig. 2, DiT exhibits exceptional reconstruction fidelity under this manipulation, while U-Net [45] often induces blurred edges and artifacts. We attribute this to the separate embedding of positional information in DiT, achieved via its explicit positional encoding mechanism. This decoupling of semantic features and position enables the substitution of purely semantic tokens, avoiding positional interference. Conversely, U-Net's convolutional mechanism binds texture and spatial position together, causing positional conflicts when replacing tokens and leading to low-quality image generation. This discovery establishes token replacement as a viable pathway for zero-shot subject personalization in DiT, unlocking various scenarios ranging from personalization to inpainting and outpainting, without necessitating complicated attention engineering.

Figure 2. Simple token replacement in DiT (right) achieves high-fidelity subject reconstruction through its position-disentangled representation, while U-Net's convolutional entanglement (left) induces blurred edges and artifacts.

Building on this foundation, we propose "Personalize Anything", a training-free framework enabling personalized image generation in DiT through timestep-adaptive token replacement and patch perturbation strategies. Specifically, we inject reference subject tokens (excluding positional information) in the earlier steps of the denoising process to enforce subject consistency, while enhancing flexibility in the later steps through multi-modal attention. Furthermore, we introduce patch perturbation to the reference tokens before token replacement, locally shuffling them and applying morphological operations to the subject mask. This encourages the model to introduce more global appearance information and enhances structural and textural diversity. Additionally, our framework seamlessly supports 1) layout-guided generation through translations of the replaced regions, 2) multi-subject personalization and subject-scene composition via sequential injection of reference subjects or scenes, and 3) extended applications (e.g., inpainting and outpainting) via incorporating user-specified mask conditions.

Comprehensive evaluations on multiple personalization tasks demonstrate that our training-free method exhibits superior identity preservation, fidelity, and versatility, outperforming existing approaches including those fine-tuned on DiTs. Our contributions are summarized as follows:
• We uncover DiT's potential for high-fidelity subject reconstruction via simple token replacement, and characterize its position-disentangled properties.
• We introduce a simple yet effective framework, denoted as "Personalize Anything", which starts with subject reconstruction and enhances flexibility via timestep-adaptive replacement and patch perturbation.
• Experiments demonstrate that the proposed framework exhibits high consistency, fidelity, and versatility across multiple personalization tasks and applications.
2. Related Work

2.1. Text-to-Image Diffusion Models

Text-to-image generation has been revolutionized by diffusion models [17, 51] that progressively denoise Gaussian distributions into images. Among a series of effective works [4, 27, 34, 39, 40, 42, 44, 48, 55], the Latent Diffusion Model [44] employs a U-Net backbone [45] for efficient denoising within a compressed latent space, becoming the foundation for subsequent improvements in resolution [40]. A recent architectural shift replaces convolutional U-Nets with vision transformers [6, 39], exploiting their global attention mechanisms and geometrically-aware positional encodings. These diffusion transformers (DiTs) demonstrate superior scalability [8, 27]: performance improvements consistently correlate with increased model capacity and training compute, establishing them as the new state-of-the-art paradigm.

2.2. Personalized Image Generation

Training-Based Approaches. Previous subject personalization methods primarily adopt two strategies: i) test-time optimization techniques [10, 12, 14, 21, 23, 24, 26, 32, 47, 49, 56, 58, 71] that fine-tune foundation models on target concepts at inference time, often requiring 30 GPU-minutes of optimization per subject; and ii) large-scale training-based methods [2, 3, 7, 11, 13, 15, 18–20, 22, 28–30, 33, 36–38, 43, 52, 54, 61–65, 67, 68, 72] that learn concept embeddings through auxiliary networks pre-trained on large datasets. While achieving notable fidelity, both paradigms suffer from computational overheads and distribution shifts that limit real-world application.

Training-Free Alternatives. Emerging training-free methods [1, 5, 25, 50, 57, 69, 70] exhibit higher computational efficiency than training-based approaches. These methods typically leverage an attention sharing mechanism to inject reference features, processing denoising and reference subject tokens jointly in pre-trained self-attention layers. However, these attention-based methods lack constraints on subject consistency and fail to preserve identity. Moreover, their application to advanced diffusion transformers (DiTs) [8, 27, 39] proves challenging due to DiT's explicit positional encoding, thereby limiting their scalability to larger-scale text-to-image generation models [8, 27].

3. Methodology

This paper introduces a training-free paradigm for personalized generation using diffusion transformers (DiTs) [39], synthesizing high-fidelity depictions of user-specified concepts while preserving textual controllability. In the following sections, we start with an overview of standard architectures in text-to-image diffusion models. Sec. 3.2 systematically reveals architectural distinctions that impede the application of existing attention sharing mechanisms to DiTs. Sec. 3.3 uncovers DiT's potential for subject reconstruction via simple token replacement, culminating in the presentation of our Personalize Anything framework in Sec. 3.4.

3.1. Preliminaries

Diffusion models progressively denoise a latent variable z_T through a network ϵ_θ, with architectural choices being of paramount importance. We analyze two main paradigms:

U-Net Architectures. The convolutional U-Net [45] in Stable Diffusion [44] comprises pairs of down-sampling and up-sampling blocks connected by a middle block. Each block interleaves residual blocks for feature extraction with spatial attention layers for capturing spatial relationships.

Diffusion Transformer (DiT). Modern DiTs [39] in advanced models [8, 27] leverage transformers [6] to process discretized latent representations, including image tokens X ∈ R^{N×d} and text tokens C ∈ R^{M×d}, where d is the embedding dimension and N and M are the sequence lengths. These models typically encode positional information of X through RoPE [53], which applies rotation matrices based on the token's coordinate (i, j) in the 2D grid:

    Ẋ_{i,j} = X_{i,j} · R(i, j)    (1)

where R(i, j) denotes the rotation matrix at position (i, j), with 0 ≤ i < w and 0 ≤ j < h. Text tokens C receive fixed positional anchors (i = 0, j = 0) to maintain modality distinction. The multi-modal attention mechanism (MMA) is then applied to all position-encoded tokens [Ẋ; Ċ_T] ∈ R^{(N+M)×d}, enabling full bidirectional attention across both modalities.
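To make the positional-encoding convention concrete, the following minimal PyTorch sketch (our own illustration, not the FLUX or HunyuanDiT implementation) applies a 2D RoPE rotation that depends on the grid coordinate (i, j) to the image tokens, anchors the text tokens at (0, 0), and assembles the joint sequence fed to multi-modal attention. Dimension splits, frequency schedule, and tensor shapes are assumptions for illustration.

```python
# Minimal sketch of 2D RoPE (Eq. 1) and joint-sequence assembly for MMA.
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs of x (..., d) by angles pos * theta_k (standard 1D RoPE)."""
    d = x.shape[-1]
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)   # (d/2,)
    ang = pos[..., None] * theta                                        # (..., d/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(tokens: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
    """tokens: (N, d); coords: (N, 2) integer grid positions (i, j).
    Half of the channels encode the row index, the other half the column index."""
    d = tokens.shape[-1]
    i, j = coords[:, 0].float(), coords[:, 1].float()
    return torch.cat([rope_1d(tokens[:, : d // 2], i),
                      rope_1d(tokens[:, d // 2:], j)], dim=-1)

# Text tokens get the fixed anchor (0, 0), so only image tokens carry spatial positions.
N, M, d, w = 64 * 64, 77, 128, 64
X = torch.randn(N, d)                                    # image (denoising) tokens
C = torch.randn(M, d)                                    # text tokens
grid = torch.stack(torch.meshgrid(torch.arange(w), torch.arange(w), indexing="ij"), -1).reshape(-1, 2)
X_dot = rope_2d(X, grid)                                 # Eq. (1): position-encoded image tokens
C_dot = rope_2d(C, torch.zeros(M, 2, dtype=torch.long))  # anchored at (0, 0): rotation is identity
joint = torch.cat([X_dot, C_dot], dim=0)                 # (N + M, d), input to multi-modal attention
```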
3.2. Attention Sharing Fails in DiT

We systematically investigate why established U-Net-based personalization techniques [5, 57] fail when naively applied to DiT architectures [27], identifying positional encoding conflicts as the core challenge.

Positional Encoding Collision. Implementing the attention sharing of existing methods [5, 57] in DiT, we concatenate position-encoded denoising tokens Ẋ and reference tokens Ẋ_ref (obtained via flow inversion [60]) into a unified sequence [Ẋ; Ẋ_ref]. Both sets of tokens keep their original positions (i, j) ∈ [0, w) × [0, h), causing destructive interference in the attention computation. As visualized in Fig. 3a, this forces denoising tokens to over-attend to reference tokens at the same positions, resulting in ghosting artifacts of the reference subject in the generated image. Quantitative analysis in the supplementary materials reveals that the attention score between denoising and reference tokens at the same position in DiT is 723% higher than in U-Net, confirming DiT's position sensitivity.

Modified Encoding Strategies. Motivated by DiT's position-disentangled encoding (Sec. 3.1), we engineer two positional adjustments on X_ref to obtain non-conflicting Ẋ_ref: i) remove positions and fix all reference positions to (0, 0), akin to text tokens, and ii) shift reference tokens to (i′, j′) = (i + w, j), creating non-overlapping regions. As shown in Fig. 3 (b) and (c), while eliminating collisions, both methods struggle to preserve identity, as attention is almost absent on the reference tokens.

In summary, the explicitly encoded positional information exerts a strong influence on the attention mechanism in DiT, a fundamental divergence from U-Net's implicit position handling. This makes it difficult for generated images to correctly attend to the reference subject's tokens within traditional attention sharing.

Figure 3. Attention sharing [5, 57] fails in DiT due to the explicit positional encoding mechanism. When keeping the original positions (i, j) ∈ [0, w) × [0, h) in reference tokens, denoising tokens over-attend to reference tokens at the same positions (shown in the attention maps of (a)), resulting in ghosting artifacts in the generated image. Modified strategies, (b) removing positions and (c) shifting to non-overlapping regions, avoid collisions but lose identity alignment, as attention is almost absent on the reference tokens.

3.3. Token Replacement in DiT

Building on this foundational observation of DiT's architectural distinctions, we extend our investigation to the latent representation in DiT. We uncover that simply replacing the denoising tokens with those of a reference subject allows for high-fidelity subject reconstruction.

Specifically, we apply inversion techniques [46, 60] to the reference image, obtaining the reference tokens X_ref without encoded positional information, as well as the reference subject's mask M_ref. We then inject X_ref into a specific region M of the denoising tokens X via token replacement:

    X̂ = X ⊙ (1 − M) + X_ref ⊙ M    (2)

where M can be obtained by translating M_ref. As shown in Fig. 2, token replacement in DiT reconstructs high-fidelity images with consistent subjects in specified positions, while U-Net's convolutional entanglement manifests as blurred edges and artifacts.

We attribute this to the separate embedding of positional information in DiT, achieved via its explicit positional encoding mechanism (Sec. 3.1). This decoupling of semantic features and position enables the substitution of purely semantic tokens, avoiding positional interference. Conversely, U-Net's convolutional mechanism binds texture and spatial position together, causing positional conflicts when replacing tokens and leading to low-quality image generation. This discovery establishes token replacement as a viable pathway for zero-shot subject personalization in DiT. It unlocks various scenarios ranging from personalization to inpainting and outpainting, without necessitating complicated attention engineering, and establishes the foundation for our personalization framework in Sec. 3.4.
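As a concrete illustration of Eq. (2), the sketch below (a simplification under our own assumptions, not the authors' released code) replaces the denoising tokens inside a translated subject mask with the inverted reference tokens; token grids are stored as flattened (h·w, d) sequences, as in a DiT latent.

```python
# Minimal sketch of mask-guided token replacement, Eq. (2): X_hat = X * (1 - M) + X_ref * M.
import torch

def translate_mask(t: torch.Tensor, dy: int, dx: int) -> torch.Tensor:
    """Shift a (h, w, ...) grid to the user-specified location. Note: torch.roll wraps
    around; a full implementation would crop or zero the wrapped region instead."""
    return torch.roll(t, shifts=(dy, dx), dims=(0, 1))

def replace_tokens(x: torch.Tensor, x_ref: torch.Tensor, mask_ref: torch.Tensor,
                   offset=(0, 0)) -> torch.Tensor:
    """x, x_ref: (h*w, d) denoising / inverted reference tokens (no positional encoding on x_ref).
    mask_ref: (h, w) binary mask of the reference subject."""
    h, w = mask_ref.shape
    m = translate_mask(mask_ref, *offset).reshape(h * w, 1).to(x.dtype)            # region M
    x_ref_moved = translate_mask(x_ref.reshape(h, w, -1), *offset).reshape(h * w, -1)
    return x * (1 - m) + x_ref_moved * m                                           # Eq. (2)
```

Applying this replacement at every denoising step corresponds to the subject-reconstruction setting of Fig. 2; the framework below restricts it to early timesteps.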
3.4. Personalize Anything

Building upon these discoveries, we propose Personalize Anything, a novel training-free personalization framework for diffusion transformers (DiTs). This framework draws inspiration from zero-shot subject reconstruction in DiTs, and effectively enhances flexibility via timestep-adaptive token replacement and patch perturbation strategies (Fig. 4).

Figure 4. Method overview. Our framework anchors subject identity in early denoising through mask-guided token replacement with preserved positional encoding, and transitions to multi-modal attention for semantic fusion with text in later steps. During token replacement, we inject variations via patch perturbations. This timestep-adaptive strategy balances identity preservation and generative flexibility.

Timestep-adaptive Token Replacement. Our method begins by inverting a reference image containing the desired subject [60]. This process yields reference tokens X_ref (excluding positional encodings) and a corresponding subject mask M_ref [59]. Instead of continuous injection throughout the denoising process as employed in subject reconstruction, we introduce a timestep-dependent strategy:

❶ Early-stage subject anchoring via token replacement (t > τ). During the initial denoising steps (t > τ, where τ is an empirically determined threshold set at 80% of the total denoising steps T), we anchor the subject's identity by replacing the denoising tokens X within the subject region M with the reference tokens X_ref (Eq. (2)). The region M can be obtained by translating M_ref to the user-specified location. We preserve the positional encodings associated with the denoising tokens X to maintain spatial coherence.

❷ Later-stage semantic fusion via multi-modal attention (t ≤ τ). In later denoising steps t ≤ τ, we transition to semantic fusion. Here we concatenate zero-positioned reference tokens Ẋ_ref with denoising tokens Ẋ and text embeddings Ċ. The unified sequence [Ẋ; Ẋ_ref; Ċ_T] undergoes Multi-Modal Attention (MMA) to harmonize subject guidance with textual conditions. This adaptive threshold τ balances the preservation of subject identity with the flexibility afforded by the text prompt.
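The two-stage schedule can be summarized in the sketch below. This is our own simplification: `denoise_step` and `mma_step` are hypothetical callables wrapping the underlying DiT sampler, and indexing the inverted reference trajectory by timestep is an assumption rather than the paper's exact interface.

```python
# Sketch of the timestep-adaptive strategy: token replacement for t > tau, joint MMA for t <= tau.
def personalize(denoise_step, mma_step, x_T, x_ref_traj, mask, text_emb, timesteps, tau_ratio=0.8):
    """denoise_step / mma_step: callables wrapping the DiT sampler (hypothetical stand-ins).
    x_ref_traj[t]: inverted reference tokens at step t; mask: (h*w, 1) binary subject region M."""
    tau = tau_ratio * max(timesteps)              # threshold tau = 0.8 * T
    x = x_T
    for t in timesteps:                           # timesteps run from T down toward 0
        if t > tau:
            # Stage 1 (early): anchor identity by injecting (perturbed) reference tokens into M, Eq. (2).
            x = x * (1 - mask) + x_ref_traj[t] * mask
            x = denoise_step(x, text_emb, t)      # positional encodings of x itself are preserved
        else:
            # Stage 2 (late): semantic fusion; zero-positioned reference tokens join the MMA sequence.
            x = mma_step(x, x_ref_traj[t], text_emb, t)
    return x
```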
Patch Perturbation for Variation. To prevent identity overfitting while preserving identity consistency, we introduce two complementary perturbations: 1) random local token shuffling within 3×3 windows, which disrupts rigid texture alignment, and 2) augmentation of the mask M_ref, including simulating natural shape variations using morphological dilation/erosion with a 5px kernel, or manual selection of regions emphasizing identity. The idea behind this local interference technique is to encourage the model to introduce more global textural features while enhancing structural and local diversity.
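A minimal sketch of the two perturbations is given below, assuming the reference tokens are kept on an (h, w, d) grid; the window size, kernel size, and use of `torch.randperm` and pooling-based morphology are illustrative choices rather than the paper's exact implementation.

```python
# Sketch of patch perturbation: local 3x3 token shuffling + morphological mask augmentation.
import torch
import torch.nn.functional as F

def shuffle_local(tokens: torch.Tensor, win: int = 3) -> torch.Tensor:
    """Randomly permute tokens inside each win x win window of an (h, w, d) token grid."""
    h, w, d = tokens.shape
    out = tokens.clone()
    for y in range(0, h, win):
        for x in range(0, w, win):
            patch = out[y:y + win, x:x + win].reshape(-1, d)
            out[y:y + win, x:x + win] = patch[torch.randperm(patch.shape[0])].reshape(
                out[y:y + win, x:x + win].shape)
    return out

def morph_mask(mask: torch.Tensor, kernel: int = 5, dilate: bool = True) -> torch.Tensor:
    """Dilate or erode a binary (h, w) mask with a square kernel via max/min pooling."""
    m = mask[None, None].float()
    if not dilate:
        m = 1.0 - m                      # erosion = dilation of the complement
    m = F.max_pool2d(m, kernel, stride=1, padding=kernel // 2)
    if not dilate:
        m = 1.0 - m
    return m[0, 0]
```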
Seamless Extensions. As illustrated in Fig. 5, our framework naturally extends to complex scenarios through geometric programming: translating M enables the spatial arrangement of subjects, thereby achieving layout-guided generation, while sequential injection of multiple {X_ref^k} into disjoint regions {M^k} and unified Multi-Modal Attention MMA([Ẋ; {Ẋ_ref^k}; Ċ_T]) facilitate multi-subject or subject-scene composition. For image editing tasks, we incorporate user-specified masks in the inversion process to obtain a reference X_ref and M_ref that should be preserved. Meanwhile, we disable perturbations and set τ to 10% of the total steps, preserving the original image content as much as possible and achieving coherent inpainting or outpainting.

Figure 5. Seamless extensions. Our framework enables: (a) layout-guided generation by translating token-injected regions, (b) multi-subject composition through sequential token injection, and (c) inpainting and outpainting via specifying masks and increased replacement.
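To illustrate the multi-subject extension, the sketch below sequentially applies Eq. (2) for several references in disjoint regions during the early stage; `replace_tokens` is the helper sketched after Sec. 3.3, and the per-subject offsets are user-supplied layout choices (an assumption, not an API from the paper).

```python
# Sketch of sequential injection for multi-subject or subject-scene composition.
def inject_subjects(x, references, masks, offsets):
    """x: (h*w, d) denoising tokens; references[k]: inverted tokens of subject k;
    masks[k]: (h, w) subject mask; offsets[k]: (dy, dx) placing subject k in a disjoint region."""
    for x_ref_k, m_k, off_k in zip(references, masks, offsets):
        x = replace_tokens(x, x_ref_k, m_k, offset=off_k)   # Eq. (2), applied once per subject
    return x
```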

4. Experiments

4.1. Experimental Setup

Implementation Details. Our framework builds upon the open-source HunyuanDiT [31] and FLUX.1-dev [27]. We adopt 50-step sampling with classifier-free guidance (w = 3.5), generating 1024 × 1024 resolution images. The token replacement threshold τ is set to 80% of the total steps.

Benchmark Protocols. We establish three evaluation tiers: 1) single-subject personalization, compared against 10 approaches spanning training-based [7, 28, 29, 38, 47, 54, 63, 65, 68] and training-free [57] paradigms; 2) multi-subject personalization, evaluated against 6 representative methods [5, 19, 24, 32, 38, 63]; and 3) subject-scene composition, benchmarked using AnyDoor [2] as the reference for contextual adaptation.

Evaluation Metrics. We evaluate Personalize Anything on DreamBench [47], which comprises 30 base objects each accompanied by 25 textual prompts. We extend this dataset to 750, 1000, and 100 test cases for single-subject, multi-subject, and subject-scene personalization using combinatorial rules. Quantitative assessment leverages multi-dimensional metrics: FID [16] for quality analysis, CLIP-T [41] for image-text alignment, and DINO [35], CLIP-I [41], and DreamSim [9] for identity preservation in single-subject evaluation, while SegCLIP-I [71] is used in multi-subject evaluation. DreamSim [9] is a recent metric for perceptual image similarity that bridges the gap between "low-level" metrics (e.g., PSNR, SSIM, LPIPS [66]) and "high-level" measures (e.g., CLIP [41]). SegCLIP-I is similar to CLIP-I, but all the subjects in the source images are segmented before computing the score.
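For reference, identity-preservation metrics of this kind are typically cosine similarities between embeddings of the generated and reference images. The sketch below shows a CLIP-I-style computation with Hugging Face `transformers`; the model choice and preprocessing are assumptions, as the paper does not specify its exact evaluation code.

```python
# Sketch of a CLIP-I-style identity score: cosine similarity of CLIP image embeddings.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def clip_i(generated_path: str, reference_path: str) -> float:
    images = [Image.open(p).convert("RGB") for p in (generated_path, reference_path)]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)            # (2, embed_dim)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float((feats[0] * feats[1]).sum())             # cosine similarity
```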
4.2. Comparison to State-of-the-Arts

Single-Subject Personalization. Fig. 6 shows qualitative comparisons with representative baseline methods. Existing test-time fine-tuning methods [47] require 30 GPU-minutes of optimization for each concept and sometimes exhibit concept confusion for single-image inputs, manifesting as treating the background color as a characteristic of the subject. Training-based but test-time tuning-free methods [28, 54, 63, 65], despite being trained on large datasets, struggle to preserve identity in detail for real image inputs. Training-free methods [57] generate inconsistent subjects with single-image input. In contrast, our method produces high-fidelity images that are highly consistent with the specified subjects, without necessitating training or fine-tuning. Quantitative results in Tab. 1 confirm our excellent performance on identity preservation and image-text alignment.

Figure 6. Qualitative comparisons on single-subject personalization. More results can be found in the supplementary materials.

Table 1. Quantitative results on single-subject personalization.

Method                    CLIP-T↑  CLIP-I↑  DINO↑  DreamSim↓
DreamBooth [47]           0.271    0.819    0.550  0.290
BLIP-Diffusion [29]       0.251    0.835    0.641  0.283
IP-Adapter [65]           0.249    0.861    0.652  0.256
λ-ECLIPSE [38]            0.235    0.866    0.682  0.224
SSR-Encoder [68]          0.244    0.860    0.701  0.220
EZIGen [7]                0.263    0.825    0.662  0.247
MS-Diffusion [63]         0.283    0.824    0.539  0.261
OneDiffusion [28]         0.255    0.817    0.603  0.298
OminiControl [54]         0.275    0.820    0.516  0.301
ConsiStory [57]           0.284    0.753    0.472  0.434
Ours (HunyuanDiT [31])    0.291    0.869    0.679  0.206
Ours (FLUX [27])          0.307    0.876    0.683  0.179

Multi-Subject Personalization. From a qualitative perspective in Fig. 7, existing approaches [19, 24, 32, 63, 68] may suffer from conceptual fusion when generating multiple subjects, struggling to maintain their individual identities, or produce fragmented results due to incorrect modeling of inter-subject relationships. In contrast, our approach manages to maintain natural interactions among subjects via layout-guided generation, while ensuring each subject retains its identity characteristics and distinctiveness. Quantitatively, the results in Tab. 2 demonstrate the strength of Personalize Anything in SegCLIP-I and CLIP-T, showing that our approach not only effectively captures the identity of multiple subjects but also excellently preserves text control capabilities.

Figure 7. Qualitative comparisons on multi-subject personalization.

Table 2. Quantitative results on multi-subject personalization.

Method                  CLIP-T↑  CLIP-I↑  SegCLIP-I↑
λ-ECLIPSE [38]          0.258    0.738    0.757
SSR-Encoder [68]        0.234    0.720    0.761
Cones2 [32]             0.255    0.747    0.702
Custom Diffusion [24]   0.228    0.727    0.781
MIP-Adapter [19]        0.276    0.765    0.751
MS-Diffusion [63]       0.278    0.780    0.748
FreeCustom [5]          0.248    0.749    0.734
Ours (HunyuanDiT [31])  0.284    0.817    0.832
Ours (FLUX)             0.302    0.843    0.891

Subject-Scene Composition. We further evaluate Personalize Anything on subject-scene composition, conducting a comparison with AnyDoor [2]. We show the visualization results in Fig. 8, where AnyDoor produces incoherent results, manifested as inconsistencies between the subject and environmental factors such as lighting in the generated images. In contrast, our method successfully generates natural images while effectively preserving the details of the subjects. This demonstrates the strong potential and generalization capabilities of Personalize Anything in generating high-fidelity personalized images.

Figure 8. Qualitative results on subject-scene composition.

4.3. Ablation Study

We conduct ablation studies on single-subject personalization, examining the effects of the token replacement timestep threshold τ and the patch perturbation strategy.

Table 3. Ablation studies. We evaluate the effects of the token replacement threshold τ and patch perturbation.

τ       Pertur.  CLIP-T↑  CLIP-I↑  DINO↑  DreamSim↓
T       ✗        0.317    0.764    0.625  0.305
0.95 T  ✗        0.313    0.773    0.632  0.294
0.90 T  ✗        0.306    0.849    0.680  0.199
0.80 T  ✗        0.302    0.882    0.741  0.163
0.70 T  ✗        0.282    0.920    0.769  0.140
0.80 T  ✓        0.307    0.876    0.683  0.179

Effects of Threshold τ. Our systematic investigation of the timestep threshold τ reveals its critical role in balancing reference subject consistency and flexibility. As visualized in Fig. 9, early-to-mid replacement phases (τ > 0.8 T) progressively incorporate geometric and appearance priors from reference tokens, initially capturing coarse layouts (0.9 T) and then refining color patterns and textures (0.8 T). Beyond τ = 0.8 T, late-stage color fusion dominates, producing subjects almost completely identical to the reference subject (0.7 T). Quantitative results are shown in Tab. 3, where the balanced τ = 0.8 T achieves 0.882 reference similarity preservation (CLIP-I) while maintaining 0.302 image-text alignment (CLIP-T).

Figure 9. Qualitative ablation studies on the token replacement threshold τ and patch perturbation.

Effects of Patch Perturbation. By combining local token shuffling and mask morphing, our perturbation strategy reduces both texture and structure overfitting. With τ = 0.8 T, the generated subject and the reference are structurally similar without perturbation (Fig. 9). Conversely, applying perturbation makes the structure and texture more flexible while maintaining identity consistency.

4.4. Applications

As illustrated in Fig. 5, Personalize Anything naturally extends to diverse real-world applications, including subject-driven image generation with layout guidance, inpainting, and outpainting. Visualization results in Fig. 10 demonstrate the capabilities of our framework on layout-guided personalization and precise editing with mask conditions, all without architectural modification or fine-tuning.

Figure 10. Visualization results of Personalize Anything in layout-guided generation (top), inpainting (middle), and outpainting (bottom). [Example prompts include "a boy eating bread", "a European-style castle", "a western knight", "a mask on the ground", "a girl in a maple forest", and "a girl in a futuristic city".]

5. Conclusion

This paper reveals that simple token replacement achieves high-fidelity subject reconstruction in diffusion transformers (DiTs), due to the position-disentangled representation in DiTs. The decoupling of semantic features and position enables the substitution of purely semantic tokens, avoiding positional interference. Based on this discovery, we propose Personalize Anything, a training-free framework that achieves high-fidelity personalized image generation through timestep-adaptive token injection and strategic patch perturbation. Our method eliminates per-subject optimization or large-scale training while delivering superior identity preservation and unprecedented scalability to layout-guided generation, multi-subject personalization, and
mask-controlled editing. DiTs' geometric programming establishes new paradigms for controllable synthesis, with spatial manipulation principles extensible to video/3D generation, redefining scalable customization in generative AI.

References

[1] Omri Avrahami, Amir Hertz, Yael Vinker, Moab Arar, Shlomi Fruchter, Ohad Fried, Daniel Cohen-Or, and Dani Lischinski. The chosen one: Consistent characters in text-to-image diffusion models. In Special Interest Group on Computer Graphics and Interactive Techniques Conference Papers '24, pages 1–12. ACM, 2024.
[2] Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. Anydoor: Zero-shot object-level image customization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6593–6602, 2024.
[3] Xi Chen, Zhifei Zhang, He Zhang, Yuqian Zhou, Soo Ye Kim, Qing Liu, Yijun Li, Jianming Zhang, Nanxuan Zhao, Yilin Wang, et al. Unireal: Universal image generation and editing via learning real-world dynamics. arXiv preprint arXiv:2412.07774, 2024.
[4] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
[5] Ganggui Ding, Canyu Zhao, Wen Wang, Zhen Yang, Zide Liu, Hao Chen, and Chunhua Shen. Freecustom: Tuning-free customized image generation for multi-concept composition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9089–9098, 2024.
[6] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[7] Zicheng Duan, Yuxuan Ding, Chenhui Gou, Ziqin Zhou, Ethan Smith, and Lingqiao Liu. Ezigen: Enhancing zero-shot subject-driven image generation with precise subject encoding and decoupled guidance. arXiv preprint arXiv:2409.08091, 2024.
[8] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024.
[9] Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data, 2023.
[10] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion, 2022.
[11] Rinon Gal, Or Lichter, Elad Richardson, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. Lcm-lookahead for encoder-based text-to-image personalization. In European Conference on Computer Vision, pages 322–340. Springer, 2024.
[12] Yuchao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yunpeng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, et al. Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models. Advances in Neural Information Processing Systems, 36, 2024.
[13] Zinan Guo, Yanze Wu, Zhuowei Chen, Lang Chen, Peng Zhang, and Qian He. Pulid: Pure and lightning id customization via contrastive alignment. arXiv preprint arXiv:2404.16022, 2024.
[14] Ligong Han, Yinxiao Li, Han Zhang, Peyman Milanfar, Dimitris Metaxas, and Feng Yang. Svdiff: Compact parameter space for diffusion fine-tuning, 2023.
[15] Junjie He, Yuxiang Tuo, Binghui Chen, Chongyang Zhong, Yifeng Geng, and Liefeng Bo. Anystory: Towards unified single and multiple subject personalization in text-to-image generation. arXiv preprint arXiv:2501.09503, 2025.
[16] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium, 2018.
[17] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020.
[18] Jiehui Huang, Xiao Dong, Wenhui Song, Hanhui Li, Jun Zhou, Yuhao Cheng, Shutao Liao, Long Chen, Yiqiang Yan, Shengcai Liao, et al. Consistentid: Portrait generation with multimodal fine-grained identity preserving. arXiv preprint arXiv:2404.16771, 2024.
[19] Qihan Huang, Siming Fu, Jinlong Liu, Hao Jiang, Yipeng Yu, and Jie Song. Resolving multi-condition confusion for finetuning-free personalized image generation. arXiv preprint arXiv:2409.17920, 2024.
[20] Zehuan Huang, Hongxing Fan, Lipeng Wang, and Lu Sheng. From parts to whole: A unified reference framework for controllable human image generation, 2024.
[21] Junha Hyung, Jaeyo Shin, and Jaegul Choo. Magicapture: High-resolution multi-concept portrait customization. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2445–2453, 2024.
[22] Jiaxiu Jiang, Yabo Zhang, Kailai Feng, Xiaohe Wu, and Wangmeng Zuo. MC2: Multi-concept guidance for customized multi-concept generation. arXiv preprint arXiv:2404.05268, 2024.
[23] Zhe Kong, Yong Zhang, Tianyu Yang, Tao Wang, Kaihao Zhang, Bizhu Wu, Guanying Chen, Wei Liu, and Wenhan Luo. Omg: Occlusion-friendly personalized multi-concept generation in diffusion models. In European Conference on Computer Vision, pages 253–270. Springer, 2024.
[24] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1931–1941, 2023.
[25] Gihyun Kwon and Jong Chul Ye. Tweediemix: Improving multi-concept fusion for diffusion-based image/video generation, 2025.
[26] Gihyun Kwon, Simon Jenni, Dingzeyu Li, Joon-Young Lee, Jong Chul Ye, and Fabian Caba Heilbron. Concept weaver: Enabling multi-concept fusion in text-to-image models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8880–8889, 2024.
[27] Black Forest Labs. Flux. [Online], 2024. https://github.com/black-forest-labs/flux.
[28] Duong H. Le, Tuan Pham, Sangho Lee, Christopher Clark, Aniruddha Kembhavi, Stephan Mandt, Ranjay Krishna, and Jiasen Lu. One diffusion to generate them all, 2024.
[29] Dongxu Li, Junnan Li, and Steven Hoi. Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing. Advances in Neural Information Processing Systems, 36, 2024.
[30] Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming-Ming Cheng, and Ying Shan. Photomaker: Customizing realistic human photos via stacked id embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8640–8650, 2024.
[31] Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyan Wang, Jinbao Xue, Yangyu Tao, Jianchen Zhu, Kai Liu, Sihuan Lin, Yifu Sun, Yun Li, Dongdong Wang, Mingtao Chen, Zhichao Hu, Xiao Xiao, Yan Chen, Yuhong Liu, Wei Liu, Di Wang, Yong Yang, Jie Jiang, and Qinglin Lu. Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding, 2024.
[32] Zhiheng Liu, Yifei Zhang, Yujun Shen, Kecheng Zheng, Kai Zhu, Ruili Feng, Yu Liu, Deli Zhao, Jingren Zhou, and Yang Cao. Cones 2: Customizable image synthesis with multiple subjects. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pages 57500–57519, 2023.
[33] Jian Ma, Junhao Liang, Chen Chen, and Haonan Lu. Subject-diffusion: Open domain personalized text-to-image generation without test-time fine-tuning. In ACM SIGGRAPH 2024 Conference Papers, pages 1–12, 2024.
[34] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models, 2022.
[35] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. Dinov2: Learning robust visual features without supervision, 2024.
[36] Gaurav Parmar, Or Patashnik, Kuan-Chieh Wang, Daniil Ostashev, Srinivasa Narasimhan, Jun-Yan Zhu, Daniel Cohen-Or, and Kfir Aberman. Object-level visual prompts for compositional image generation. arXiv preprint arXiv:2501.01424, 2025.
[37] Or Patashnik, Rinon Gal, Daniil Ostashev, Sergey Tulyakov, Kfir Aberman, and Daniel Cohen-Or. Nested attention: Semantic-aware attention values for concept personalization. arXiv preprint arXiv:2501.01407, 2025.
[38] Maitreya Patel, Sangmin Jung, Chitta Baral, and Yezhou Yang. λ-eclipse: Multi-concept personalized text-to-image diffusion models by leveraging clip latent space, 2024.
[39] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
[40] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
[41] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.
[42] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
[43] Elad Richardson, Yuval Alaluf, Ali Mahdavi-Amiri, and Daniel Cohen-Or. pops: Photo-inspired diffusion operators. arXiv preprint arXiv:2406.01300, 2024.
[44] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022.
[45] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation, 2015.
[46] Litu Rout, Yujia Chen, Nataniel Ruiz, Constantine Caramanis, Sanjay Shakkottai, and Wen-Sheng Chu. Semantic image inversion and editing using rectified stochastic differential equations. arXiv preprint arXiv:2410.10792, 2024.
[47] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
[48] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
[49] Viraj Shah, Nataniel Ruiz, Forrester Cole, Erika Lu, Svetlana Lazebnik, Yuanzhen Li, and Varun Jampani. Ziplora: Any subject in any style by effectively merging loras. In European Conference on Computer Vision, pages 422–438. Springer, 2024.
[50] Chaehun Shin, Jooyoung Choi, Heeseung Kim, and Sungroh Yoon. Large-scale text-to-image model with inpainting is a zero-shot subject-driven image generator. arXiv preprint arXiv:2411.15466, 2024.
[51] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models, 2022.
[52] Kunpeng Song, Yizhe Zhu, Bingchen Liu, Qing Yan, Ahmed Elgammal, and Xiao Yang. Moma: Multimodal llm adapter for fast personalized image generation. In European Conference on Computer Vision, pages 117–132. Springer, 2024.
[53] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
[54] Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. Ominicontrol: Minimal and universal control for diffusion transformer. arXiv preprint arXiv:2411.15098, 2024.
[55] Kolors Team. Kolors: Effective training of diffusion model for photorealistic text-to-image synthesis. arXiv preprint, 2024.
[56] Yoad Tewel, Rinon Gal, Gal Chechik, and Yuval Atzmon. Key-locked rank one editing for text-to-image personalization, 2024.
[57] Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, and Yuval Atzmon. Training-free consistent text-to-image generation. ACM Transactions on Graphics (TOG), 43(4):1–18, 2024.
[58] Andrey Voynov, Qinghao Chu, Daniel Cohen-Or, and Kfir Aberman. P+: Extended textual conditioning in text-to-image generation, 2023.
[59] Jinglong Wang, Xiawei Li, Jing Zhang, Qingyuan Xu, Qin Zhou, Qian Yu, Lu Sheng, and Dong Xu. Diffusion model is secretly a training-free open vocabulary semantic segmenter. arXiv preprint arXiv:2309.02773, 2023.
[60] Jiangshan Wang, Junfu Pu, Zhongang Qi, Jiayi Guo, Yue Ma, Nisha Huang, Yuxin Chen, Xiu Li, and Ying Shan. Taming rectified flow for inversion and editing. arXiv preprint arXiv:2411.04746, 2024.
[61] Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, Anthony Chen, Huaxia Li, Xu Tang, and Yao Hu. Instantid: Zero-shot identity-preserving generation in seconds. arXiv preprint arXiv:2401.07519, 2024.
[62] Qinghe Wang, Baolu Li, Xiaomin Li, Bing Cao, Liqian Ma, Huchuan Lu, and Xu Jia. Characterfactory: Sampling consistent characters with gans for diffusion models. arXiv preprint arXiv:2404.15677, 2024.
[63] X. Wang, Siming Fu, Qihan Huang, Wanggui He, and Hao Jiang. Ms-diffusion: Multi-subject zero-shot image personalization with layout guidance, 2025.
[64] Peng Xing, Haofan Wang, Yanpeng Sun, Qixun Wang, Xu Bai, Hao Ai, Renyuan Huang, and Zechao Li. Csgo: Content-style composition in text-to-image generation, 2024.
[65] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.
[66] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric, 2018.
[67] Shilong Zhang, Lianghua Huang, Xi Chen, Yifei Zhang, Zhi-Fan Wu, Yutong Feng, Wei Wang, Yujun Shen, Yu Liu, and Ping Luo. Flashface: Human image personalization with high-fidelity identity preservation. arXiv preprint arXiv:2403.17008, 2024.
[68] Yuxuan Zhang, Yiren Song, Jiaming Liu, Rui Wang, Jinpeng Yu, Hao Tang, Huaxia Li, Xu Tang, Yao Hu, Han Pan, et al. Ssr-encoder: Encoding selective subject representation for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8069–8078, 2024.
[69] Yuxin Zhang, Minyan Luo, Weiming Dong, Xiao Yang, Haibin Huang, Chongyang Ma, Oliver Deussen, Tong-Yee Lee, and Changsheng Xu. Bringing characters to new stories: Training-free theme-specific image generation via dynamic visual prompting. arXiv preprint arXiv:2501.15641, 2025.
[70] Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, and Qibin Hou. Storydiffusion: Consistent self-attention for long-range image and video generation. arXiv preprint arXiv:2405.01434, 2024.
[71] Chenyang Zhu, Kai Li, Yue Ma, Chunming He, and Li Xiu. Multibooth: Towards generating all your concepts in an image from text. arXiv preprint arXiv:2404.14239, 2024.
[72] Zhuofan Zong, Dongzhi Jiang, Bingqi Ma, Guanglu Song, Hao Shao, Dazhong Shen, Yu Liu, and Hongsheng Li. Easyref: Omni-generalized group image reference for diffusion models via multimodal llm. arXiv preprint arXiv:2412.09618, 2024.
Personalize Anything for Free with Diffusion Transformer
Supplementary Material

6. Analysis of Position Sensitivity in DiT

We apply the attention sharing mechanism of existing U-Net-based subject personalization methods [5, 57] to diffusion transformers (DiTs) [39]. Specifically, we concatenate position-encoded denoising tokens Ẋ and reference tokens Ẋ_ref into a single sequence [Ẋ; Ẋ_ref] and apply pre-trained multi-modal attention to it. In this process, both denoising and reference tokens retain their original positions (i, j) ∈ [0, w) × [0, h), causing destructive interference in the attention computation and producing ghosting artifacts of the reference subject in the generated image. For quantitative analysis, we calculate the denoising tokens' attention scores on the reference tokens at the same positions. By averaging these scores across 100 samples, we obtain a mean value of 0.4294. Subsequently, we apply the same procedure to U-Net, which encodes positional information through convolution layers instead of explicit positional encoding, yielding an average attention score of 0.0522. Comparatively, the average attention score in DiT is 723% higher than that in U-Net. This quantitative analysis demonstrates the position sensitivity of DiT, where explicit positional encoding significantly influences the attention mechanism, in contrast with U-Net's implicit position handling.
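The same-position attention score can be measured as sketched below. This is our own illustration of the described procedure (averaging the diagonal of the denoising-to-reference attention block); single-head queries/keys and softmax normalization are assumptions.

```python
# Sketch of measuring how strongly denoising tokens attend to reference tokens at the same position.
import torch

def same_position_attention(q: torch.Tensor, k: torch.Tensor) -> float:
    """q: (N, d) position-encoded denoising queries; k: (2N, d) keys for the
    concatenated sequence [X_dot; X_ref_dot] (reference keys are the last N rows)."""
    n, d = q.shape
    attn = torch.softmax(q @ k.T / d ** 0.5, dim=-1)    # (N, 2N) attention over the joint sequence
    return attn[:, n:].diagonal().mean().item()         # denoising token i -> reference token i
```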
7. User Study

To further evaluate model performance, we conducted a user study involving 48 participants, with ages evenly distributed between 15 and 60 years. Each participant was asked to answer 15 questions, resulting in a total of 720 valid responses. For the single-subject and multi-subject personalization tasks, users were required to select the optimal model along three dimensions: textual alignment, identity preservation, and image quality. In the subject-scene composition task, we substituted textual alignment with scene consistency to assess subject-scene coordination. The results are presented in Figs. 11 to 13, which further corroborate our qualitative findings, as our method outperforms other state-of-the-art methods across all metrics.

Figure 11. User study results on single-subject personalization (comparing OneDiffusion, OminiControl, and Ours).
Figure 12. User study results on multi-subject personalization (comparing MIP-Adapter, MS-Diffusion, and Ours).
Figure 13. User study results on subject-scene composition (comparing AnyDoor and Ours).

8. More Results

We present more qualitative comparison results in Fig. 14. As demonstrated, our method consistently achieves exceptional textual alignment and subject consistency. Specifically, our approach excels in preserving fine-grained details of subjects (e.g., the number "3" on the clock) while maintaining high fidelity to the textual description.
Figure 14. Full qualitative comparisons on single-subject personalization. [Comparison grids over prompts such as "a dog on the beach", "a pair of sunglasses on the wooden floor", "a clock in the forest", "a boot on the grass", and "a monster toy on the floor", against EZIGen, MS-Diffusion, OneDiffusion, OminiControl, ConsiStory, DreamBooth, IP-Adapter, SSR-Encoder, BLIP-Diffusion, and λ-ECLIPSE, alongside Ours (HunyuanDiT) and Ours (FLUX).]
