
Expressive Text-to-Image Generation with Rich Text

Songwei Ge1  Taesung Park2  Jun-Yan Zhu3  Jia-Bin Huang1
1University of Maryland, College Park   2Adobe Research   3Carnegie Mellon University
https://rich-text-to-image.github.io/
arXiv:2304.06720v2 [cs.CV] 30 Aug 2023

[Figure 1 examples. Plain-text vs. rich-text prompts:
"A night sky filled with stars above a turbulent sea with giant waves." Styles: Van Gogh, Hokusai.
"A pizza with pineapples, pepperonis, and mushrooms on the top, 4k, photorealism."
"A marble statue of a wolf's head and shoulder, surrounded by colorful flowers."
"A close-up of a cat1 riding a scooter. Tropical trees in the background." 1A cat wearing sunglasses and has a bandana around its neck. Style: Claude Monet.
"A young woman1 sits at a table in a beautiful, lush garden, reading a book on the table." 1Girl with a pearl earring by Johannes Vermeer.
"A nightstand1 next to a bed with pillows on it. Gray wall2 bedroom." 1A nightstand with some books. 2Accent shelf with plants on the gray wall.]

Figure 1. Plain text (left image) vs. Rich text (right image). Our method allows a user to describe an image using a rich text editor that
supports various text attributes such as font family, size, color, and footnote. Given these text attributes extracted from rich text prompts,
our method enables precise control of text-to-image synthesis regarding colors, styles, and object details compared to plain text.

Abstract

Plain text has become a prevalent interface for text-to-image synthesis. However, its limited customization options hinder users from accurately describing desired outputs. For example, plain text makes it hard to specify continuous quantities, such as the precise RGB color value or importance of each word. Furthermore, creating detailed text prompts for complex scenes is tedious for humans to write and challenging for text encoders to interpret. To address these challenges, we propose using a rich-text editor supporting formats such as font style, size, color, and footnote. We extract each word's attributes from rich text to enable local style control, explicit token reweighting, precise color rendering, and detailed region synthesis. We achieve these capabilities through a region-based diffusion process. We first obtain each word's region based on attention maps of a diffusion process using plain text. For each region, we enforce its text attributes by creating region-specific detailed prompts and applying region-specific guidance, and maintain its fidelity against plain-text generation through region-based injections. We present various examples of image generation from rich text and demonstrate that our method outperforms strong baselines with quantitative evaluations.

1. Introduction

The development of large-scale text-to-image generative models [52, 56, 54, 28] has propelled image generation to an unprecedented era. The great flexibility of these large-scale models further offers users powerful control of the generation through visual cues [4, 17, 77] and textual inputs [7, 19]. Without exception, existing studies use plain text encoded by a pretrained language model to guide the generation. However, in our daily life, it is rare to use only plain text when working on text-based tasks such as writing blogs or editing essays. Instead, a rich text editor [68, 71] is the more popular choice, providing versatile formatting options for writing and editing text. In this paper, we seek to introduce accessible and precise textual control from rich text editors to text-to-image synthesis.

Rich text editors offer unique solutions for incorporating conditional information separate from the text. For example, using the font color, one can indicate an arbitrary color. In contrast, describing a precise color with plain text proves more challenging, as general text encoders do not understand RGB or Hex triplets, and many color names, such as 'olive' and 'orange', have ambiguous meanings. This font color information can be used to define the color of generated objects. For example, in Figure 1, a specific yellow can be selected to instruct the generation of a marble statue with that exact color.

Beyond providing precise color information, various font formats make it simple to augment the word-level information. For example, reweighting token influence [19] can be implemented using the font size, a task that is difficult to achieve with existing visual or textual interfaces. Rich text editors offer more options than font size: similar to how font style distinguishes the styles of individual text elements, we propose using it to capture the artistic style of specific regions. Another option is using footnotes to provide supplementary descriptions for selected words, simplifying the process of creating complex scenes.

But how can we use rich text? A straightforward implementation is to convert a rich-text prompt with detailed attributes into lengthy plain text and feed it directly into existing methods [54, 19, 7]. Unfortunately, these methods struggle to synthesize images corresponding to lengthy text prompts involving multiple objects with distinct visual attributes, as noted in a recent study [12]. They often mix styles and colors, applying a uniform style to the entire image. Furthermore, the lengthy prompt introduces extra difficulty for text encoders to interpret accurate information, making generating intricate details more demanding.

To address these challenges, our insight is to decompose a rich-text prompt into two components: (1) a short plain-text prompt (without formatting) and (2) multiple region-specific prompts that include text attributes, as shown in Figure 2. First, we obtain the self- and cross-attention maps using a vanilla denoising process with the short plain-text prompt to associate each word with a specific region. Second, we create a prompt for each region using the attributes derived from the rich-text prompt. For example, we use "mountain in the style of Ukiyo-e" as the prompt for the region corresponding to the word "mountain" with the attribute "font style: Ukiyo-e". For RGB font colors that cannot be converted to prompts, we iteratively update the region with region-based guidance to match the target color. We apply a separate denoising process for each region and fuse the predicted noises to get the final update. During this process, regions associated with tokens that do not have any formats are supposed to look the same as the plain-text results. Also, the overall shape of the objects should stay unchanged in cases where only the color is changed. To this end, we propose to use region-based injection approaches.

We demonstrate qualitatively and quantitatively that our method generates more precise colors, distinct styles, and accurate details compared to plain-text-based methods.

2. Related Work

Text-to-image models. Text-to-image systems aim to synthesize realistic images according to descriptions [82, 42]. Fueled by large-scale text-image datasets [60, 8], various training and inference techniques [20, 62, 21, 22], and scalability [51], significant progress has been made in text-to-image generation using diffusion models [4, 51, 45, 56, 17], autoregressive models [52, 76, 11, 15], GANs [59, 28], and their hybrids [54]. Our work focuses on making these models more accessible and providing precise controls. In contrast to existing work that uses plain text, we use a rich text editor with various formatting options.

Controllable image synthesis with diffusion models. A wide range of image generation and editing applications are achieved through either fine-tuning pre-trained diffusion models [55, 32, 77, 3, 72, 30, 41, 35] or modifying the denoising process [43, 13, 19, 46, 5, 12, 2, 4, 26, 6, 58, 78, 9, 48, 73, 16]. For example, Prompt-to-Prompt [19] uses attention maps from the original prompt to guide the spatial structure of the target prompt. Although these methods can be applied to some rich-text-to-image applications, the results often fall short, as shown in Section 4. Concurrent with our work, Mixture-of-Diffusers [26] and MultiDiffusion [6] propose merging multiple diffusion-denoising processes in different image regions through linear blending. Instead of relying on user-provided regions, we automatically compute the regions of selected tokens using attention maps. Gradient [24] and Universal [5] guidance control the generation by optimizing the denoised generation at each time step. We apply them to precise color generation by designing an objective on the target region to be optimized.

Attention in diffusion models. The attention mechanism has been used in various diffusion-based applications such as view synthesis [37, 66, 70], image editing [19, 12, 47, 46, 32], and video editing [38, 49, 10, 40]. We also leverage the spatial structure in self-attention maps and the alignment information between texts and regions in cross-attention maps for rich-text-to-image generation.

Rich text modeling and application. Exploiting information beyond the intrinsic meanings of the texts has been previously studied [44, 63, 75, 34]. For example, visual information, such as underlining and bold type, has also been extracted for various document understanding tasks [75, 34]. To our knowledge, we are the first to leverage rich text information for text-to-image synthesis.

Image stylization and colorization. Style transfer [18, 81, 39] and colorization [53, 64, 74, 33, 79, 80] for editing real images have also been extensively studied. In contrast, our work focuses on local style and precise color control for generating images from text-to-image models.
[Figure 2 contents. Plain-text input: "a church surrounded by a beautiful garden, a snowy mountain range in the distance" is processed by vanilla diffusion to collect cross-attention maps, self-attention maps, feature maps, and the noised sample; token maps are derived for "church", "garden", "a snowy mountain ...", and the other tokens. Rich-text input (as JSON): "church": {"color": "#FF9900"}, "snowy mountain range in the distance": {"font": "Ukiyo-e"}, "garden": {"footnote": "a garden filled with colorful wildflowers"}, with the remaining spans unformatted. The token maps and attributes drive the region-based diffusion that produces the output.]
Figure 2. Rich-text-to-image framework. First, the plain-text prompt is processed by a diffusion model to collect self- and cross-attention
maps, noised generation, and residual feature maps at certain steps. The token maps of the input prompt are constructed by first creating a
segmentation using the self-attention maps and then labeling each segment using the cross-attention maps. Then the rich texts are processed
as JSON to provide attributes for each token span. The resulting token maps and attributes are used to guide our region-based control. We
inject the self-attention maps, noised generation, and feature maps to improve fidelity to the plain-text generation.

3. Rich Text to Image Generation

From writing messages on communication apps and designing websites [57] to collaboratively editing a document [36, 25], a rich text editor is often the primary interface for editing text on digital devices. Nonetheless, only plain text has been used in text-to-image generation. To use the formatting options in rich-text editors for more precise control over the black-box generation process [1], we first introduce a problem setting called rich-text-to-image generation. We then discuss our approach to this task.

3.1. Problem Setting

As shown in Figure 2, a rich text editor supports various formatting options, such as font styles, font size, color, and more. We leverage these text attributes as extra information to increase control of text-to-image generation. We interpret the rich-text prompt as JSON, where each text element consists of a span of tokens $e_i$ (e.g., 'church') and attributes $a_i$ describing the span (e.g., 'color:#FF9900'). Note that some tokens $e_U$ may not have any attributes. Using these annotated prompts, we explore four applications: 1) local style control using font style, 2) precise color control using font color, 3) detailed region description using footnotes, and 4) explicit token reweighting with font sizes.

Font style is used to apply a specific artistic style $a_i^s$, e.g., $a_i^s$ = 'Ukiyo-e', to the synthesis of the span of tokens $e_i$. For instance, in Figure 1, we apply the Ukiyo-e painting style to the ocean waves and the style of Van Gogh to the sky, enabling the application of localized artistic styles. This task presents a unique challenge for existing text-to-image models, as there are limited training images featuring multiple artistic styles. Consequently, existing models tend to generate a uniform mixed style across the entire image rather than distinct local styles.

Font color indicates a specific color of the modified text span. Given the prompt "a red toy", existing text-to-image models generate toys in various shades of red, such as light red, crimson, or maroon. The color attribute provides a way of specifying a precise color in the RGB color space, denoted as $a_i^c$. For example, to generate a toy in fire brick red, one can change the font color of "a toy", where the word "toy" is associated with the attribute $a_i^c = [178, 34, 34]$. However, as shown in the experiment section, the pretrained text encoder cannot interpret RGB values and has difficulty understanding obscure color names, such as lime and orange.

Footnote provides supplementary explanations of the target span without hindering readability with lengthy sentences. Writing detailed descriptions of complex scenes is tedious work, and it inevitably creates lengthy prompts [29, 27]. Additionally, existing text-to-image models are prone to ignoring some objects when multiple objects are present [12], especially with long prompts. Moreover, excess tokens are discarded when the prompt's length surpasses the text encoder's maximum length, e.g., 77 tokens for CLIP models [50]. We aim to mitigate these issues using a footnote string $a_i^f$.

Font size can be employed to indicate the importance, quantity, or size of an object. We use a scalar $a_i^w$ to denote the weight of each token.
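To make the problem setting concrete, below is a minimal sketch of how such an annotated prompt could be serialized. The field names ("text", "attributes") and the overall schema are illustrative assumptions, not the paper's released format; only the span/attribute idea and the Figure 2 example are taken from the text.

```python
import json

# Hypothetical serialization of the Figure 2 rich-text prompt: each span e_i
# carries an optional attribute dictionary a_i (font color, font name used as
# a style, a footnote string, or a font-size weight). Spans with an empty
# attribute dictionary correspond to the unformatted tokens e_U.
rich_text_json = """
[
  {"text": "a ", "attributes": {}},
  {"text": "church", "attributes": {"color": "#FF9900"}},
  {"text": " surrounded by a beautiful ", "attributes": {}},
  {"text": "garden", "attributes": {"footnote": "a garden filled with colorful wildflowers"}},
  {"text": ", a ", "attributes": {}},
  {"text": "snowy mountain range in the distance", "attributes": {"font": "Ukiyo-e"}}
]
"""

spans = json.loads(rich_text_json)
# The short plain-text prompt used in Step 1 is the concatenation of all spans.
plain_prompt = "".join(s["text"] for s in spans)
# The formatted spans drive the region-specific prompts and guidance in Step 2.
formatted_spans = [s for s in spans if s["attributes"]]
print(plain_prompt)
print([s["text"] for s in formatted_spans])
```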
[Figure 3 contents: for the span "church" with attribute {"color": "#FF9900"} (RGB (255, 153, 0)), the region prompt "an orange church" is fed to the diffusion UNet together with a guidance loss on its token map; for the span "garden" with the footnote "a garden filled with colorful wildflowers", the footnote itself is used as the region prompt; self-attention and feature injection are applied to both processes, and the per-region predicted noises are combined with the token maps.]

Figure 3. Region-based diffusion. For each element of the rich-text input, we apply a separate diffusion process to its region. The attributes are either decoded as a region-based guidance target (e.g., re-coloring the church) or as a textual input to the diffusion UNet (e.g., handling the footnote to the garden). The self-attention maps and feature maps extracted from the plain-text generation process are injected to help preserve the structure. The predicted noise $\epsilon_{t,e_i}$, weighted by the token map $M_{e_i}$, and the guidance gradient $\partial L / \partial x_t$ are used to denoise and update the previous generation $x_t$ to $x_{t-1}$. The noised plain-text generation $x_t^{\text{plain}}$ is blended with the current generation to preserve the exact content in those regions of the unformatted tokens.

3.2. Method

To utilize rich text annotations, our method consists of two steps, as shown in Figure 2. First, we compute the spatial layouts of individual token spans. Second, we use a new region-based diffusion to render each region's attributes into a globally coherent image.

Step 1. Token maps for spatial layout. Several works [65, 40, 4, 19, 12, 47, 67] have discovered that the attention maps in the self- and cross-attention layers of the diffusion UNet characterize the spatial layout of the generation. Therefore, we first use the plain text as the input to the diffusion model and collect self-attention maps of size $32 \times 32 \times 32 \times 32$ across different heads, layers, and time steps. We take the average across all the extracted maps and reshape the result into $1024 \times 1024$. Note that the value at the $i$-th row and $j$-th column of the map indicates the probability of pixel $i$ attending to pixel $j$. We average the map with its transpose to convert it to a symmetric matrix. It is used as a similarity map to perform spectral clustering [61, 69] and obtain the binary segmentation maps $\hat{M}$ of size $K \times 32 \times 32$, where $K$ is the number of segments.

To associate each segment with a textual span, we also extract cross-attention maps for each token $w_j$:

$$ m_j = \frac{\exp(s_j)}{\sum_k \exp(s_k)}, \qquad (1) $$

where $s_j$ is the attention score. We first interpolate each cross-attention map $m_j$ to the same resolution as $\hat{M}$ of $32 \times 32$. Similar to the processing steps of the self-attention maps, we compute the mean across heads, layers, and time steps to get the averaged map $\hat{m}_j$. We associate each segment with a text span $e_i$ following Patashnik et al. [47]:

$$ M_{e_i} = \left\{ \hat{M}_k \;\middle|\; \left\| \hat{M}_k \cdot \frac{\hat{m}_j - \min(\hat{m}_j)}{\max(\hat{m}_j) - \min(\hat{m}_j)} \right\|_1 > \epsilon, \;\; \forall j \text{ s.t. } w_j \in e_i \right\}, \qquad (2) $$

where $\epsilon$ is a hyperparameter that controls the labeling threshold; that is, the segment $\hat{M}_k$ is assigned to the span $e_i$ if the normalized attention score of any token in this span is higher than $\epsilon$. We associate the segments that are not assigned to any formatted spans with the unformatted tokens $e_U$. Finally, we obtain the token map in Figure 2 as below:

$$ M_{e_i} = \frac{\sum_{\hat{M}_j \in M_{e_i}} \hat{M}_j}{\sum_i \sum_{\hat{M}_j \in M_{e_i}} \hat{M}_j}. \qquad (3) $$

Step 2. Region-based denoising and guidance. As shown in Figure 2, given the text attributes and token maps, we divide the overall image synthesis into several region-based denoising and guidance processes to incorporate each attribute, similar to an ensemble of diffusion models [32, 6]. More specifically, given the span $e_i$, the region defined by its token map $M_{e_i}$, and the attribute $a_i$, the predicted noise $\epsilon_t$ for the noised generation $x_t$ at time step $t$ is

$$ \epsilon_t = \sum_i M_{e_i} \cdot \epsilon_{t,e_i} = \sum_i M_{e_i} \cdot D(x_t, f(e_i, a_i), t), \qquad (4) $$

where $D$ is the pretrained diffusion model, and $f(e_i, a_i)$ is a plain-text representation derived from the text span $e_i$ and attributes $a_i$ using the four-step process listed below (after Figure 4).

[Figure 4 panels: Ours, InstructPix2Pix, and Prompt-to-Prompt results for the prompts "a church with beautiful landscape", "a car in the street", and "a woman wearing pants" with font colors including green, black, pink, olive yellow, dodger blue, plum purple, and RGB triplets such as (48, 131, 172), (105, 28, 226), and (211, 22, 52).]

Figure 4. Qualitative comparison on precise color generation. We show images generated by Prompt-to-Prompt [19], InstructPix2Pix [7], and our method using prompts with font colors. Our method generates precise colors according to either color names or RGB values. Both baselines generate plausible but inaccurate colors given color names, while neither understands the color defined by RGB values. InstructPix2Pix tends to apply the color globally, even outside the target object.
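As a concrete illustration of Step 1 and of the noise fusion in Eq. (4), here is a minimal sketch, assuming the averaged self- and cross-attention maps have already been collected from the plain-text denoising run. The array shapes, the scikit-learn spectral-clustering call, and the segment-assignment rule (a thresholded mean over each segment, used here as an approximation of Eq. (2)) are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def compute_token_maps(self_attn, cross_attn, spans, eps=0.3, n_segments=15):
    """self_attn: (N, 1024, 1024) maps; cross_attn: (N, 1024, T) maps (softmax over T tokens);
    spans: list of token-index lists, one per formatted span e_i."""
    sim = self_attn.mean(axis=0)                 # average over heads/layers/steps
    sim = 0.5 * (sim + sim.T)                    # symmetrize into a similarity matrix
    labels = SpectralClustering(n_clusters=n_segments,
                                affinity="precomputed").fit_predict(sim)
    segments = np.stack([(labels == k).astype(float) for k in range(n_segments)])

    attn = cross_attn.mean(axis=0)               # (1024, T) averaged cross-attention
    maps, assigned = {}, np.zeros(n_segments, dtype=bool)
    for i, token_ids in enumerate(spans):
        m = np.zeros(1024)
        for j in token_ids:
            a = attn[:, j]
            a = (a - a.min()) / (a.max() - a.min() + 1e-8)   # normalized score, cf. Eq. (2)
            for k in range(n_segments):
                # assign segment k to span i if its mean normalized score exceeds eps
                if (segments[k] * a).sum() / (segments[k].sum() + 1e-8) > eps:
                    m = np.maximum(m, segments[k])
                    assigned[k] = True
        maps[i] = m
    maps["unformatted"] = segments[~assigned].sum(axis=0)    # leftover segments -> e_U
    total = sum(maps.values()) + 1e-8                        # Eq. (3): per-pixel normalization
    return {key: val / total for key, val in maps.items()}

def fuse_noise(eps_regions, token_maps):
    """Eq. (4): eps_regions[key] = D(x_t, f(e_i, a_i), t), e.g. a (C, 32, 32) array per region."""
    return sum(token_maps[key].reshape(32, 32) * eps_regions[key] for key in eps_regions)
```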

The plain-text representation $f(e_i, a_i)$ is constructed as follows:

1. Initially, we set $f(e_i, a_i) = e_i$.
2. If a footnote $a_i^f$ is available, we set $f(e_i, a_i) = a_i^f$.
3. The style $a_i^s$ is appended if it exists: $f(e_i, a_i) = f(e_i, a_i) \,+$ 'in the style of' $+\, a_i^s$.
4. The closest color name (string) $\hat{a}_i^c$ of the font color from a predefined set $C$ is prepended: $f(e_i, a_i) = \hat{a}_i^c + f(e_i, a_i)$. For example, $\hat{a}_i^c$ = 'brown' for the RGB color $a_i^c = [136, 68, 20]$.

We use $f(e_i, a_i)$ as the original plain-text prompt of Step 1 for the unformatted tokens $e_U$. This helps us generate a coherent image, especially around region boundaries.

Guidance. By default, we use classifier-free guidance [23] for each region to better match the prompt $f(e_i, a_i)$. In addition, if the font color is specified, to further exploit the RGB value information, we apply gradient guidance [24, 14, 5] on the current clean image prediction:

$$ \hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon_t}{\sqrt{\bar{\alpha}_t}}, \qquad (5) $$

where $x_t$ is the noisy image at time step $t$, and $\bar{\alpha}_t$ is the coefficient defined by the noise scheduling strategy [20]. Here, we compute an MSE loss $L$ between the average color of $\hat{x}_0$ weighted by the token map $M_{e_i}$ and the RGB triplet $a_i^c$. The gradient is calculated as

$$ \frac{dL}{dx_t} = \frac{d \left\| \sum_p (M_{e_i} \cdot \hat{x}_0) / \sum_p M_{e_i} - a_i^c \right\|_2^2}{\sqrt{\bar{\alpha}_t}\; d\hat{x}_0}, \qquad (6) $$

where the summation is over all pixels $p$. We then update $x_t$ with the following equation:

$$ x_t \leftarrow x_t - \lambda \cdot M_{e_i} \cdot \frac{dL}{dx_t}, \qquad (7) $$

where $\lambda$ is a hyperparameter to control the strength of the guidance. We use $\lambda = 1$ unless denoted otherwise.

Token reweighting with font size. Last, to re-weight the impact of the token $w_j$ according to the font size $a_j^w$, we modify its cross-attention maps $m_j$. However, instead of applying direct multiplication as in Prompt-to-Prompt [19], where $\sum_j a_j^w m_j \neq 1$, we find that it is critical to preserve the probability property of $m_j$. We thus propose the following reweighting approach:

$$ \hat{m}_j = \frac{a_j^w \exp(s_j)}{\sum_k a_k^w \exp(s_k)}. \qquad (8) $$

We can compute the token map (Equation 3) and predict the noise (Equation 4) with the reweighted attention map.

Preserve the fidelity against plain-text generation. Although our region-based method naturally maintains the layout, there is no guarantee that the details and shape of the objects are retained when no rich-text attributes or only the color is specified, as shown in Figure 12. To this end, we follow Plug-and-Play [67] to inject the self-attention maps and the residual features extracted from the plain-text generation process when $t > T_{\text{pnp}}$ to improve the structure fidelity. In addition, for the regions associated with the unformatted tokens $e_U$, stronger content preservation is desired. Therefore, at certain $t = T_{\text{blend}}$, we blend the noised sample $x_t^{\text{plain}}$ based on the plain text into those regions:

$$ x_t \leftarrow M_{e_U} \cdot x_t^{\text{plain}} + (1 - M_{e_U}) \cdot x_t. \qquad (9) $$
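The pieces of Step 2 above reduce to a few small tensor operations. The following PyTorch sketch mirrors Eqs. (5)-(9) under assumed tensor shapes, with the noise-schedule coefficient passed in as a tensor; it is an illustration rather than the released code.

```python
import torch

def reweight_cross_attention(scores, weights):
    # Eq. (8): font-size weights a_j^w rescale the softmax while keeping each
    # cross-attention map a probability distribution over the T tokens.
    e = weights.view(1, -1) * torch.exp(scores)          # scores: (P, T) attention logits
    return e / e.sum(dim=-1, keepdim=True)

def color_guidance_step(x_t, eps_t, alpha_bar_t, token_map, target_rgb, lam=1.0):
    # Eqs. (5)-(7): push the region's average color in the clean-image estimate
    # toward the target RGB triplet a_i^c.
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = (x_t - (1 - alpha_bar_t).sqrt() * eps_t) / alpha_bar_t.sqrt()   # Eq. (5)
    m = token_map.unsqueeze(0)                           # (1, H, W); x0_hat: (3, H, W)
    avg_color = (m * x0_hat).sum(dim=(1, 2)) / m.sum()
    loss = ((avg_color - target_rgb) ** 2).sum()         # MSE color loss L
    # Differentiating through Eq. (5) supplies the 1/sqrt(alpha_bar_t) factor of
    # Eq. (6) automatically, since eps_t is treated as a constant here.
    grad = torch.autograd.grad(loss, x_t)[0]
    return (x_t - lam * token_map * grad).detach()       # Eq. (7), lambda = 1 by default

def blend_unformatted(x_t, x_t_plain, map_unformatted):
    # Eq. (9): copy the plain-text generation into unformatted-token regions at t = T_blend.
    return map_unformatted * x_t_plain + (1 - map_unformatted) * x_t
```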
[Figure 5 panels: Ours, Prompt-to-Prompt, InstructPix2Pix-para, and InstructPix2Pix-seq, for the prompts "A night sky filled with stars (1st region: Van Gogh) above a turbulent sea with giant waves (2nd region: Ukiyo-e)" and "The awe-inspiring sky and sea (1st region: J.M.W. Turner) by a coast with flowers and grasses in spring (2nd region: Monet)".]

Figure 5. Qualitative comparison on style control. We show images generated by Prompt-to-Prompt, InstructPix2Pix, and our method using prompts with multiple styles. Only our method can generate distinct styles for both regions.

[Figures 6 and 7 plots: Figure 6 reports the CLIP similarity (higher is better) for the 1st region, 2nd region, and both regions across Ours, Prompt-to-Prompt, InstructPix2Pix-seq, and InstructPix2Pix-para; Figure 7 reports the distance to the target color (lower is better) for the Common, HTML, and RGB categories, as minimal and mean distance, for Ours, Prompt-to-Prompt, and InstructPix2Pix.]
Figure 6. Quantitative evaluation of local style control. We report the CLIP similarity between each stylized region and its region prompt. Our method achieves the best stylization.

Figure 7. Quantitative evaluation on precise color generation. Distance against the target color is reported (lower is better). Our method consistently outperforms baselines.
4. Experimental Results

Implementation details. We use Stable Diffusion V1-5 [54] for our experiments. To create the token maps, we use the cross-attention layers in all blocks, excluding the first encoder and last decoder blocks, as the attention maps in these high-resolution layers are often noisy. We discard the maps at the initial denoising steps with $T > 750$. We use $K = 15$, $\epsilon = 0.3$, $T_{\text{pnp}} = 0.3$, $T_{\text{blend}} = 0.3$, and report the results averaged from three random seeds for all quantitative experiments. More details, such as the running time, can be found in Appendix B.

Font style evaluation. We compute CLIP scores [50] for each local region to evaluate the stylization quality. Specifically, we create prompts of two objects and styles. We create combinations using 7 popular styles and 10 objects, resulting in 420 prompts. For each generated image, we mask it by the token maps of each object and attach the masked output to a black background. Then, we compute the CLIP score using the region-specific prompt. For example, for the prompt "a lighthouse (Cyberpunk) among the turbulent waves (Ukiyo-e)", the local CLIP score of the lighthouse region is measured by comparing its similarity with the prompt "lighthouse in the style of cyberpunk." We refer to "lighthouse" as the first region and "waves" as the second region in this example.

Font color evaluation. To evaluate a method's capacity to understand and generate a specific color, we divide colors into three categories. The Common color category contains 17 standard names, such as "red", "yellow", and "pink". The HTML color names are selected from the web color names1 used for website design, such as "sky blue", "lime green", and "violet purple". The RGB color category contains 50 randomly sampled RGB triplets to be used as "color of RGB values [128, 128, 128]".

1 https://simple.wikipedia.org/wiki/Web_color
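For reference, a sketch of how the local CLIP score described above could be computed: the object region is cut out with its token map, pasted on a black background, and scored against its region-specific prompt. The open_clip and PIL libraries are an assumed tooling choice (the paper only specifies a ViT-B/32 CLIP model), so the exact evaluation code may differ.

```python
import numpy as np
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

def local_clip_score(image, token_map, region_prompt):
    # image: (H, W, 3) uint8 array; token_map: (H, W) float mask in [0, 1].
    masked = (image.astype(np.float32) * token_map[..., None]).astype(np.uint8)
    x = preprocess(Image.fromarray(masked)).unsqueeze(0)
    t = tokenizer([region_prompt])
    with torch.no_grad():
        img_f = model.encode_image(x)
        txt_f = model.encode_text(t)
        img_f = img_f / img_f.norm(dim=-1, keepdim=True)
        txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    return (img_f * txt_f).sum().item()          # cosine similarity

# Example usage (hypothetical inputs):
# score = local_clip_score(img, maps["lighthouse"], "lighthouse in the style of cyberpunk")
```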
A coffee table1 sits in front of a sofa2 on a cozy carpet. A painting3 on the wall. cinematic lighting, trending on artstation, 4k, hyperrealistic, focused, extreme details.
1A rustic wooden coffee table adorned with scented candles and many books. 2A plush sofa with a soft blanket and colorful pillows on it.

3A painting of wheat field with a cottage in the distance, close up shot, trending on artstation, HD, calm, complimentary color, realistic lighting, by Albert Bierstadt, Frederic Church.

Stable Diffusion (Plain-Text) Stable Diffusion (Full-Text) Ours

Attend-and-Excite Prompt-to-Prompt InstructPix2Pix

Figure 8. Qualitative comparison on detailed description generation. We show images generated by Attend-and-Excite, Prompt-to-
Prompt, InstructPix2Pix, and our method using complex prompts. Our method is the only one that can generate all the details faithfully.

To create a complete prompt, we use 12 objects exhibiting different colors, such as "flower", "gem", and "house". This gives us a total of 1,200 prompts. We evaluate color accuracy by computing the mean L2 distance between the region and the target RGB values. We also compute the minimal L2 distance, as sometimes the object should contain other colors for fidelity, e.g., the "black tires" of a "yellow car".

Baselines. For font color and style, we quantitatively compare our method with two strong baselines, Prompt-to-Prompt [19] and InstructPix2Pix [7]. When two instructions exist for each image in our font style experiments, we apply them in parallel (InstructPix2Pix-para) and sequential manners (InstructPix2Pix-seq). More details are in Appendix B. We also perform a human evaluation with these two methods in Appendix Table 1. For re-weighting token importance, we visually compare with Prompt-to-Prompt [19] and two heuristic methods, repeating and adding parentheses. For complex scene generation with footnotes, we also compare with Attend-and-Excite [12].

4.1. Quantitative Comparison

We report the local CLIP scores computed by a ViT-B/32 model in Figure 6. Our method achieves the best overall CLIP score compared to the two baselines. This demonstrates the advantage of our region-based diffusion method for localized stylization. To further understand the capacity of each model to generate multiple styles, we report the metric on each region. Prompt-to-Prompt and InstructPix2Pix-para achieve a decent score on the 1st Region, i.e., the region that occurs first in the sentence. However, they often fail to fulfill the style in the 2nd Region. We conjecture that the Stable Diffusion model tends to generate a uniform style for the entire image, which can be attributed to single-style training images. Furthermore, InstructPix2Pix-seq performs the worst in the 2nd Region. This is because the first instruction contains no information about the second region, and the second region's content could be compromised when we apply the first instruction.
[Figure 9 panels: Ours and Prompt-to-Prompt using the reweighted prompt "A pizza with pineapples, pepperonis, and mushrooms", plus the heuristic prompts that repeat "mushrooms" several times or wrap it in parentheses, e.g., "(((((mushrooms)))))".]

Figure 9. Qualitative comparison on token reweighting. We show images generated by our method and Prompt-to-Prompt using a token weight of 13 for 'mushrooms'. Prompt-to-Prompt suffers from artifacts due to the large weight. Heuristic methods like repeating and parentheses do not work well.

Figure 10. Ablation of token maps. Using solely cross-attention maps to create token maps leads to inaccurate segmentations, causing the background to be colored in an undesired way.

We show quantitative results of precise color generation in Figure 7. The distance of HTML colors is generally the lowest for the baseline methods, as they provide the most interpretable textual information for text encoders. This aligns with our expectation that the diffusion model can handle simple color names, whereas it struggles to handle RGB triplets. Our rich-text-to-image generation method consistently improves on the three categories and two metrics over the baselines.

4.2. Visual Comparison

Precise color generation. We show a qualitative comparison on precise color generation in Figure 4. InstructPix2Pix [7] is prone to creating global color effects rather than accurate local control. For example, in the flower results, both the vase and background are changed to the target colors. Prompt-to-Prompt [19] provides more precise control over the target region. However, both Prompt-to-Prompt and InstructPix2Pix fail to generate precise colors. In contrast, our method can generate precise colors for all categories and prompts.

Local style generation. Figure 5 shows a visual comparison of local style generation. When applying InstructPix2Pix-seq, the style in the first instruction dominates the entire image and undermines the second region. Figure 13 in the Appendix shows that this cannot be fully resolved using different hyperparameters of classifier-free guidance. Similar to our observation in the quantitative evaluation, our baselines tend to generate the image in a globally uniform style instead of distinct local styles for each region. In contrast, our method synthesizes the correct styles for both regions. One may suggest applying the baselines with two stylization processes independently and composing the results using token maps. However, as shown in Figure 12 (Appendix), such methods generate artifacts on the region boundaries.

Complex scene generation. Figure 8 shows comparisons on complex scene generation. Attend-and-Excite [12] uses the tokens missing in the full-text generation result as input to fix the missing objects, like the coffee table and carpet in the living room example. However, it still fails to generate all the details correctly, e.g., the books, the painting, and the blanket. Prompt-to-Prompt [19] and InstructPix2Pix [7] can edit the painting accordingly, but many objects, like the colorful pillows and stuff on the table, are still missing. In contrast, our method faithfully synthesizes all these details described in the target region.

Token importance control. Figure 9 shows the qualitative comparison on token reweighting. When using a large weight for 'mushroom,' Prompt-to-Prompt generates clear artifacts as it modifies the attention probabilities to be unbounded and creates out-of-distribution intermediate features. Heuristic methods fail to add more mushrooms, while our method generates more mushrooms and preserves the quality. More results with different font sizes and target tokens are shown in Figures 23-25 in the Appendix.
[Figure 11 contents.
Top left (Stable Diffusion): "A rustic cabin sits on the edge of a giant lake. Wildflowers dot the meadow around the cabin and lake."
Top center (Stable Diffusion, full text): "A rustic orange cabin sits on the edge of a giant, crystal-clear, blueish lake. The lake is glistening in the sunlight. Wildflowers in the style of Claude Monet, Impressionism dot the meadow around the cabin and lake."
Top right (InstructPix2Pix): "Make the cabin orange. Turn the wildflowers into style Claude Monet, Impressionism. Make the lake crystal-clear, blueish, glistening in the sunlight. ..."
Bottom (Ours, three successive rich-text edits): "A rustic cabin sits on the edge of a giant lake. Wildflowers dot the meadow around the cabin and lake."; the same prompt with "Style: Claude Monet, impressionism"; and the same styled prompt with the footnote "1A crystal-clear, blueish lake, glistening in the sunlight." attached to "lake1".]

Figure 11. Our workflow. (top left) A user begins with an initial plain-text prompt and wishes to refine the scene by specifying the color,
details, and styles. (top center) Naively inputting the whole description in plain text does not work. (top right) InstructPix2Pix [7] fails
to make accurate editing. (bottom) Our method supports precise refinement with region-constrained diffusion processes. Moreover, our
framework can naturally be integrated into a rich text editor, enabling a tight, streamlined UI.
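The successive refinements of Figure 11 can be expressed in the same hypothetical span/attribute form sketched in Section 3.1. The snippet below is an assumption-laden illustration: the schema, the exact step at which each attribute is introduced, and the orange hex code are not specified in the paper.

```python
# Three refinement steps of the Figure 11 workflow in the assumed span/attribute form.
# Only the edited spans change between steps, so unformatted regions can be preserved.
step_1 = [
    {"text": "A rustic cabin sits on the edge of a giant lake. "
             "Wildflowers dot the meadow around the cabin and lake.", "attributes": {}},
]
step_2 = [  # add a local style to the wildflowers
    {"text": "A rustic cabin sits on the edge of a giant lake. ", "attributes": {}},
    {"text": "Wildflowers", "attributes": {"font": "Claude Monet, impressionism"}},
    {"text": " dot the meadow around the cabin and lake.", "attributes": {}},
]
step_3 = [  # additionally re-color the cabin and describe the lake in a footnote
    {"text": "A rustic ", "attributes": {}},
    {"text": "cabin", "attributes": {"color": "#FF8C00"}},   # assumed hex for "orange"
    {"text": " sits on the edge of a giant ", "attributes": {}},
    {"text": "lake", "attributes": {"footnote": "A crystal-clear, blueish lake, glistening in the sunlight."}},
    {"text": ". ", "attributes": {}},
    {"text": "Wildflowers", "attributes": {"font": "Claude Monet, impressionism"}},
    {"text": " dot the meadow around the cabin and lake.", "attributes": {}},
]
```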

Interactive editing. In Figure 11, we showcase a sample workflow to illustrate our method's interactive strength and editing capacity over InstructPix2Pix [7].

4.3. Ablation Study

Generating token maps solely from cross-attention. The other straightforward way to create token maps is to use cross-attention maps directly. To ablate this, we first take the average of cross-attention maps across heads, layers, and time steps and then take the maximum across tokens. Finally, we apply softmax across all the spans to normalize the token maps. However, as shown by the example in Figure 10, since the prompt has no correspondence with the background, the token map of "shirt" also covers partial background regions. Note that simple thresholding is ineffective as some regions still have high values, e.g., the right shoulder. As a result, the target color bleeds into the background. Our method obtains more accurate token maps and, consequently, more precise colorization.

Ablation of the injection methods. To demonstrate the effectiveness of our injection method, we compare image generation with and without it in Figure 12. In the font color example, we show that applying the injection effectively preserves the shape and details of the target church and the structure of the sunset in the background. In the footnote example, we show that the injection keeps the look of the black door and the color of the floor.

Figure 12. Ablation of injection method. We show images generated based on plain text and rich text with or without injection methods. Injecting features and noised samples helps preserve the structure of the church and the unformatted token regions.

5. Discussion and Limitations

In this paper, we have expanded the controllability of text-to-image models by incorporating rich-text attributes as the input. We have demonstrated the potential for generating images with local styles, precise colors, different token importance, and complex descriptions. Nevertheless, numerous formatting options remain unexplored, such as bold/italic, hyperlinks, spacing, and bullets/numbering. Also, there are multiple ways to use the same formatting options. For example, one can use font style to characterize the shape of the objects. We hope this paper encourages further exploration of integrating accessible daily interfaces into text-based generation tasks, even beyond images.

Limitations. As we use multiple diffusion processes and two-stage methods, our method can be multiple times slower than the original process. Also, our way of producing token maps relies on a thresholding parameter. More advanced segmentation methods like SAM [31] could be exploited to further improve the accuracy and robustness.

Acknowledgment. We thank Mia Tang, Aaron Hertzmann, Nupur Kumari, Gaurav Parmar, Ruihan Gao, and Aniruddha Mahapatra for their helpful discussion and paper reading. This work is partly supported by NSF grant No. IIS-239076, as well as NSF grants No. IIS-1910132 and IIS-2213335.
References

[1] Maneesh Agrawala. Unpredictable black boxes are terrible interfaces, March 2023. https://magrawala.substack.com/p/unpredictable-black-boxes-are-terrible.
[2] Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended latent diffusion. arXiv preprint arXiv:2206.02779, 2022.
[3] Omri Avrahami, Thomas Hayes, Oran Gafni, Sonal Gupta, Yaniv Taigman, Devi Parikh, Dani Lischinski, Ohad Fried, and Xi Yin. Spatext: Spatio-textual representation for controllable image generation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[4] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
[5] Arpit Bansal, Hong-Min Chu, Avi Schwarzschild, Soumyadip Sengupta, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Universal guidance for diffusion models. arXiv preprint arXiv:2302.07121, 2023.
[6] Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation. arXiv preprint arXiv:2302.08113, 2023.
[7] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[8] Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022.
[9] Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. arXiv preprint arXiv:2304.08465, 2023.
[10] Duygu Ceylan, Chun-Hao Huang, and Niloy J. Mitra. Pix2video: Video editing using image diffusion. arXiv preprint arXiv:2303.12688, 2023.
[11] Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704, 2023.
[12] Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. arXiv preprint arXiv:2301.13826, 2023.
[13] Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune Gwon, and Sungroh Yoon. Ilvr: Conditioning method for denoising diffusion probabilistic models. In IEEE International Conference on Computer Vision (ICCV), 2021.
[14] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In Conference on Neural Information Processing Systems (NeurIPS), volume 34, pages 8780-8794. Curran Associates, Inc., 2021.
[15] Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. Cogview2: Faster and better text-to-image generation via hierarchical transformers. arXiv preprint arXiv:2204.14217, 2022.
[16] Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Reddy Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-free structured diffusion guidance for compositional text-to-image synthesis. In International Conference on Learning Representations (ICLR), 2023.
[17] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. In European Conference on Computer Vision (ECCV), pages 89-106. Springer, 2022.
[18] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[19] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
[20] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Neural Information Processing Systems (NeurIPS), 2020.
[21] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. Journal of Machine Learning Research, 23(47):1-33, 2022.
[22] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
[23] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
[24] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Conference on Neural Information Processing Systems (NeurIPS), 2022.
[25] Claudia-Lavinia Ignat, Luc André, and Gérald Oster. Enhancing rich content wikis with real-time collaboration. Concurrency and Computation: Practice and Experience, 33(8):e4110, 2021.
[26] Álvaro Barbero Jiménez. Mixture of diffusers for scene composition and high resolution image generation. arXiv preprint arXiv:2302.02412, 2023.
[27] Justin Johnson, Andrej Karpathy, and Li Fei-Fei. Densecap: Fully convolutional localization networks for dense captioning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4565-4574, 2016.
[28] Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up gans for text-to-image synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[29] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128-3137, 2015.
[30] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[31] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
[32] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. arXiv preprint arXiv:2212.04488, 2022.
[33] Anat Levin, Dani Lischinski, and Yair Weiss. Colorization using optimization. ACM SIGGRAPH, 2004.
[34] Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, and Furu Wei. Dit: Self-supervised pre-training for document image transformer. In Proceedings of the 30th ACM International Conference on Multimedia, pages 3530-3539, 2022.
[35] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[36] Geoffrey Litt, Sarah Lim, Martin Kleppmann, and Peter van Hardenberg. Peritext: A crdt for collaborative rich text editing. Proceedings of the ACM on Human-Computer Interaction (PACMHCI), 2022.
[37] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. arXiv preprint arXiv:2303.11328, 2023.
[38] Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, and Jiaya Jia. Video-p2p: Video editing with cross-attention control. arXiv preprint arXiv:2303.04761, 2023.
[39] Fujun Luan, Sylvain Paris, Eli Shechtman, and Kavita Bala. Deep photo style transfer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[40] Wan-Duo Kurt Ma, JP Lewis, W Bastiaan Kleijn, and Thomas Leung. Directed diffusion: Direct control of object placement through attention guidance. arXiv preprint arXiv:2302.13153, 2023.
[41] Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Ying Shan, Xiu Li, and Qifeng Chen. Follow your pose: Pose-guided text-to-video generation using pose-free videos, 2023.
[42] Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Generating images from captions with attention. In International Conference on Learning Representations (ICLR), 2016.
[43] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Image synthesis and editing with stochastic differential equations. International Conference on Learning Representations (ICLR), 2022.
[44] Yuxian Meng, Wei Wu, Fei Wang, Xiaoya Li, Ping Nie, Fan Yin, Muyu Li, Qinghong Han, Xiaofei Sun, and Jiwei Li. Glyce: Glyph-vectors for chinese character representations. Neural Information Processing Systems (NeurIPS), 32, 2019.
[45] Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. International Conference on Machine Learning (ICML), pages 16784-16804, 2022.
[46] Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image translation. arXiv preprint arXiv:2302.03027, 2023.
[47] Or Patashnik, Daniel Garibi, Idan Azuri, Hadar Averbuch-Elor, and Daniel Cohen-Or. Localizing object-level shape variations with text-to-image diffusion models. arXiv preprint arXiv:2303.11306, 2023.
[48] Quynh Phung, Songwei Ge, and Jia-Bin Huang. Grounded text-to-image synthesis with attention refocusing. arXiv preprint arXiv:2306.05427, 2023.
[49] Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. Fatezero: Fusing attentions for zero-shot text-based video editing, 2023.
[50] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), pages 8748-8763. PMLR, 2021.
[51] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
[52] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning (ICML), pages 8821-8831, 2021.
[53] E. Reinhard, M. Adhikhmin, B. Gooch, and P. Shirley. Color transfer between images. IEEE Computer Graphics and Applications, 21(5):34-41, 2001.
[54] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[55] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[56] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding, 2022.
[57] Arnaud Sahuguet and Fabien Azavant. Wysiwyg web wrapper factory (w4f), 1999.
[58] Vishnu Sarukkai, Linden Li, Arden Ma, Christopher Ré, and Kayvon Fatahalian. Collage diffusion. arXiv preprint arXiv:2303.00262, 2023.
[59] Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, and Timo Aila. Stylegan-t: Unlocking the power of gans for fast large-scale text-to-image synthesis. arXiv preprint arXiv:2301.09515, 2023.
[60] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Conference on Neural Information Processing Systems (NeurIPS), 2022.
[61] Jianbo Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.
[62] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations (ICLR), 2021.
[63] Zijun Sun, Xiaoya Li, Xiaofei Sun, Yuxian Meng, Xiang Ao, Qing He, Fei Wu, and Jiwei Li. Chinesebert: Chinese pretraining enhanced by glyph and pinyin information. Annual Meeting of the Association for Computational Linguistics (ACL), 2021.
[64] Yu-Wing Tai, Jiaya Jia, and Chi-Keung Tang. Local color transfer via probabilistic segmentation by expectation-maximization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 747-754, 2005.
[65] Raphael Tang, Akshat Pandey, Zhiying Jiang, Gefei Yang, K. V. S. Manoj Kumar, Jimmy Lin, and Ferhan Ture. What the daam: Interpreting stable diffusion using cross attention. arXiv preprint arXiv:2210.04885, 2022.
[66] Hung-Yu Tseng, Qinbo Li, Changil Kim, Suhib Alsisan, Jia-Bin Huang, and Johannes Kopf. Consistent view synthesis with pose-guided diffusion models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[67] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. arXiv preprint arXiv:2211.12572, 2022.
[68] Colorado State University. Tutorial: Rich text format (rtf) from microsoft word - the access project - colorado state university, 2012-07-08.
[69] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17:395-416, 2007.
[70] Daniel Watson, William Chan, Ricardo Martin-Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi. Novel view synthesis with diffusion models, 2022.
[71] Ian H Witten, David Bainbridge, and David M Nichols. How to build a digital library. Morgan Kaufmann, 2009.
[72] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. arXiv preprint arXiv:2212.11565, 2022.
[73] Guangxuan Xiao, Tianwei Yin, William T Freeman, Frédo Durand, and Song Han. Fastcomposer: Tuning-free multi-subject image generation with localized attention. arXiv preprint arXiv:2305.10431, 2023.
[74] Li Xu, Qiong Yan, and Jiaya Jia. A sparse control model for image and video editing. ACM Transactions on Graphics (TOG), 32:1-10, 2013.
[75] Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. Layoutlm: Pre-training of text and layout for document image understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1192-1200, 2020.
[76] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. Transactions on Machine Learning Research, 2022.
[77] Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.
[78] Qinsheng Zhang, Jiaming Song, Xun Huang, Yongxin Chen, and Ming-Yu Liu. Diffcollage: Parallel generation of large content with diffusion models. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[79] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European Conference on Computer Vision (ECCV), 2016.
[80] Richard Zhang, Jun-Yan Zhu, Phillip Isola, Xinyang Geng, Angela S Lin, Tianhe Yu, and Alexei A Efros. Real-time user-guided image colorization with learned deep priors. ACM Transactions on Graphics (TOG), 9(4), 2017.
[81] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision (ICCV), 2017.
[82] Xiaojin Zhu, Andrew B Goldberg, Mohamed Eldawy, Charles R Dyer, and Bradley Strock. A text-to-picture synthesis system for augmenting communication. In AAAI Conference on Artificial Intelligence, 2007.
Expressive Text-to-Image Generation with Rich Text: Appendix
In this appendix, we provide additional experimental results and details. In section A, we show the images generated by
our model, Attend-and-Excite [12], Prompt-to-Prompt [19], and InstructPix2Pix [7] with various RGB colors, local styles,
and detailed descriptions via footnotes. In section B, we provide additional details on the implementation and evaluation.

A. Additional Results
In this section, we first show additional results of rich-text-to-image generation on complex scene synthesis (Figures 15,
16, and 17), precise color rendering (Figures 18, 19, and 20), local style control (Figures 21 and 22), and explicit token re-
weighting (Figures 23, 24, and 25). We also show an ablation study of the averaging and maximizing operations across tokens
to obtain token maps in Figure 26. We present additional results compared with a composition-based baseline in Figure 27.
Last, we show an ablation of the hyperparameters of our baseline method InstructPix2Pix [7] on the local style generation
application in Figure 28.

A car1 driving on the road. A bicycle2 nearby a tree3. A cityscape4 in the background.
1A sleek sports car gleams on the road in the sunlight, with its aerodynamic curves and polished finish catching the light. 2A bicycle with rusted frame and worn tires.
3A dead tree with a few red apples on it. 4A bustling Hongkong cityscape with towering skyscrapers.

Stable Diffusion (Plain-Text) Stable Diffusion (Full-Text) Ours

Attend-and-Excite Prompt-to-Prompt InstructPix2Pix


Figure 13. Additional results of the footnote. We show the generation from a complex description of a street scene. Note that all the methods except for ours fail to generate the accurate details described in the footnotes.
A lush garden1 with a fountain2. A grand mansion3 in the background.
1A garden is full of vibrant colors with a variety of flowers.
2A fountain made of white marble with multiple tiers. The tiers are intricately carved with various designs.
3An impressive two-story mansion with a royal exterior, white columns, and tile-made roof. The mansion has numerous windows, each adorned with white curtains.

[Figure: comparison grid with panels Stable Diffusion (Plain-Text), Stable Diffusion (Full-Text), Ours, Attend-and-Excite, Prompt-to-Prompt, and InstructPix2Pix.]

Figure 14. Additional results of the footnote. We show the generation from a complex description of a garden. Note that all the methods except for ours fail to generate accurate details of the mansion and fountain as described.
A small chair1 sits in front of a table2 on the wooden floor. There is a bookshelf3 nearby the window4.
1A black leather office chair with a high backrest and adjustable arms.
2A large wooden desk with a stack of books on top of it.
3A bookshelf filled with colorful books and binders.
4A window overlooks a stunning natural landscape of snow mountains.

[Figure: comparison grid with panels Stable Diffusion (Plain-Text), Stable Diffusion (Full-Text), Ours, Attend-and-Excite, Prompt-to-Prompt, and InstructPix2Pix.]

Figure 15. Additional results of the footnote. We show the generation from a complex description of an office. Note that all the methods except ours fail to generate the window view and the colorful binders as described.
[Figure: generations of (a) Vegetable, (b) Flower, (c) Shirts, (d) Toy, and (e) Beverage in nine Common colors (Red, Yellow, Green, Blue, Pink, Cyan, Purple, Orange, Black), shown for Ours, Prompt-to-Prompt, and InstructPix2Pix.]

Figure 16. Additional results of the font color. We show the generation of different objects with colors from the Common category. Prompt-to-Prompt frequently fails to respect the given color name, while InstructPix2Pix tends to color the background and irrelevant objects.
[Figure: generations of (a) Vegetable, (b) Flower, (c) Shirts, (d) Toy, and (e) Beverage in HTML colors (Chocolate, Salmon Red, Spring Green, Gold Yellow, Orchid Purple, Floral White, Indigo Purple, Tomato Orange, Navy Blue), shown for Ours, Prompt-to-Prompt, and InstructPix2Pix.]

Figure 17. Additional results of the font color. We show the generation of different objects with colors from the HTML category. Both methods fail to generate the precise color, and InstructPix2Pix tends to color the background and irrelevant objects.
[Figure: generations of (a) Vegetable, (b) Flower, (c) Shirts, (d) Toy, and (e) Beverage in nine colors specified as RGB triplets, e.g., (25, 75, 226) and (222, 80, 195), shown for Ours, Prompt-to-Prompt, and InstructPix2Pix.]

Figure 18. Additional results of the font color. We show the generation of different objects with colors from the RGB category. Both baseline methods cannot interpret the RGB values correctly.
[Figure: 7 × 7 grids for the prompt "a beautiful garden in front of a snow mountain"; rows vary the style of "snow mountain" and columns vary the style of "garden" over (a) Claude Monet, (b) Ukiyo-e, (c) Cyber Punk, (d) Andy Warhol, (e) Vincent Van Gogh, (f) Pixel Art, and (g) Cubism, shown for Ours, Prompt-to-Prompt, and InstructPix2Pix.]

Figure 19. Additional results of the font style. We show images generated with different style combinations and prompt "a beautiful garden in front of a snow mountain". Each row contains "snow mountain" in 7 styles, and each column contains "garden" in 7 styles. Only our method can generate distinct styles for both objects.
[Figure: 7 × 7 grids for the prompt "a small pond surrounded by skyscraper"; rows vary the style of "skyscraper" and columns vary the style of "pond" over (a) Claude Monet, (b) Ukiyo-e, (c) Cyber Punk, (d) Andy Warhol, (e) Vincent Van Gogh, (f) Pixel Art, and (g) Cubism, shown for Ours, Prompt-to-Prompt, and InstructPix2Pix.]

Figure 20. Additional results of the font style. We show images generated with different style combinations and prompt "a small pond surrounded by skyscraper". Each row contains "skyscraper" in 7 styles, and each column contains "pond" in 7 styles. Only our method can generate distinct styles for both objects.
[Figure: rows of generations for token weights 1× to 19× (Ours and Prompt-to-Prompt) and for 1 to 10 parentheses/repetitions (heuristic baselines). Ours and Prompt-to-Prompt: "A pizza with pineapples, pepperonis, and mushrooms."; Parenthesis: "A pizza with pineapples, pepperonis, and ((mushrooms))."; Repeating: "A pizza with pineapples, pepperonis, and mushrooms, mushrooms, mushrooms."]

Figure 21. Additional results of font sizes. We use a token weight evenly sampled from 1 to 20 for the word 'mushrooms' with our method and Prompt-to-Prompt. For the parenthesis and repeating baselines, we add parentheses around or repeat the word 'mushrooms' 1 to 10 times. Prompt-to-Prompt suffers from generating artifacts, and the heuristic methods are not effective.

[Figure: rows of generations for token weights 1× to 19× (Ours and Prompt-to-Prompt) and for 1 to 10 parentheses/repetitions (heuristic baselines). Ours and Prompt-to-Prompt: "A pizza with pineapples, pepperonis, and mushrooms."; Parenthesis: "A pizza with ((pineapples)), pepperonis, and mushrooms."; Repeating: "A pizza with pineapples, pineapples, pineapples, pepperonis, and mushrooms."]

Figure 22. Additional results of font sizes. We use a token weight evenly sampled from 1 to 20 for the word 'pineapples' with our method and Prompt-to-Prompt. For the parenthesis and repeating baselines, we add parentheses around or repeat the word 'pineapples' 1 to 10 times. Prompt-to-Prompt suffers from generating artifacts, and the heuristic methods are not effective.
[Figure: rows of generations for token weights 1× to 19× (Ours and Prompt-to-Prompt) and for 1 to 10 parentheses/repetitions (heuristic baselines). Ours and Prompt-to-Prompt: "A pizza with pineapples, pepperonis, and mushrooms."; Parenthesis: "A pizza with pineapples, ((pepperonis)), and mushrooms."; Repeating: "A pizza with pineapples, pepperonis, pepperonis, pepperonis, and mushrooms."]

Figure 23. Additional results of font sizes. We use a token weight evenly sampled from 1 to 20 for the word 'pepperonis' with our method and Prompt-to-Prompt. For the parenthesis and repeating baselines, we add parentheses around or repeat the word 'pepperonis' 1 to 10 times. Prompt-to-Prompt suffers from generating artifacts, and the heuristic methods are not effective.
[Figure: results with different random seeds for Ours, Prompt-to-Prompt, and InstructPix2Pix on the prompts "a cat (Pixel Art) sitting on a meadow (Van Gogh)." and "A stream train (Ukiyo-e) on the mountain side (Claude Monet)."]

Figure 24. Comparison with a simple composition-based method using different random seeds. Since methods like Prompt-to-Prompt [19] cannot generate multiple styles in a single image, a simple fix is to apply such a method to the two regions separately and compose the results using the token maps. However, we show that this leads to sharp changes and artifacts at the region boundaries.
[Figure: InstructPix2Pix results after the first and second editing steps for the prompt "A camel (Cyber Punk, futuristic) in the desert (Vincent Van Gogh).", swept over the text classifier-free guidance weight (1.5 to 9.5) and the image classifier-free guidance weight (0.5 to 7.5), alongside the no-style generation and the results of Prompt-to-Prompt and Ours.]

Figure 25. Ablation of the classifier-free guidance of InstructPix2Pix. We show that InstructPix2Pix fails to generate both styles across different image and text classifier-free guidance (cfg) weights. When the image-cfg is low, the desert is lost after the first editing step. We use image-cfg = 1.5 and text-cfg = 7.5 in our experiments.
[Figure: plots of the minimal and mean distance to the target color (↓) versus CLIP similarity (roughly 0.26 to 0.27).]

Figure 26. Ablation on the hyperparameter λ in Equation (7). We report the trade-off of CLIP similarity and color distance achieved by sweeping the strength of color optimization λ.

[Figure: results for the prompt "a garden (Claude Monet) in front of a snow mountain (Ukiyo-e)" with No Style, Prompt-to-Prompt, Improved Prompt-to-Prompt, and Ours.]

Figure 27. Improved Prompt-to-Prompt. Further constraining the attention maps for styles does not resolve the mixed style issue.

Ablation of the color guidance weight. Changing the guidance strength λ allows us to control the trade-off between fidelity and color precision. To evaluate image fidelity, we compute the CLIP score between the generation and the plain-text prompt. We plot CLIP similarity against color distance in Figure 26 by sweeping λ from 0 to 20. Increasing the strength consistently reduces the CLIP similarity, as details are removed to satisfy the color objective. We also find that a larger λ first reduces and then increases the color distance, as the optimization eventually diverges.
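As a rough illustration of how this sweep could be scored, the following sketch computes the mean and minimal RGB distance to the target color inside the object region, mirroring the metrics in Figure 26. Here generate_with_color_guidance and clip_similarity are hypothetical helpers standing in for our sampling and CLIP scoring code, prompt, plain_text_prompt, and target_rgb are assumed to be given, and the normalization by the RGB-cube diagonal is an assumed convention.

import numpy as np

def color_distances(region_pixels, target_rgb):
    # region_pixels: (N, 3) array of RGB values inside the object's token map.
    # Returns the (mean, minimal) Euclidean distance to the target color,
    # normalized by the length of the RGB-cube diagonal (an assumed convention).
    diff = np.linalg.norm(region_pixels.astype(float) - np.asarray(target_rgb, dtype=float), axis=1)
    max_dist = np.linalg.norm([255.0, 255.0, 255.0])
    return diff.mean() / max_dist, diff.min() / max_dist

for lam in [0.0, 0.5, 1.0, 2.0, 5.0, 10.0, 20.0]:  # sweep of the color guidance weight
    image, token_map = generate_with_color_guidance(prompt, target_rgb, weight=lam)  # hypothetical helper
    mean_d, min_d = color_distances(image[token_map], target_rgb)  # token_map: boolean object mask
    fidelity = clip_similarity(image, plain_text_prompt)  # hypothetical CLIP-score helper
    print(f"lambda={lam}: CLIP={fidelity:.3f}, mean_dist={mean_d:.3f}, min_dist={min_d:.3f}")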
Constrained Prompt-to-Prompt. The original Attention Refinement proposed in Prompt-to-Prompt [19] does not constrain the attention maps of newly added tokens, which may explain why it fails to generate distinct styles. We therefore attempt to improve Prompt-to-Prompt by injecting cross-attention maps for the newly added style tokens. For example, in Figure 27, we use the cross-attention map of "garden" for the style "Claude Monet". However, the method still produces a uniform style.
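To make the variant concrete, the sketch below illustrates the injection on a simplified representation in which per-token cross-attention maps are stored in a dictionary from token strings to spatial maps. This is an illustration of the idea rather than the actual Prompt-to-Prompt implementation, and the tensor shapes and function names are assumptions.

import torch

def constrain_style_attention(attn_maps, style_to_object):
    # attn_maps: dict mapping each prompt token to its (H, W) cross-attention map.
    # style_to_object: maps each newly added style token to the object token whose
    # attention map it should reuse, e.g., {'Claude Monet': 'garden'}.
    constrained = dict(attn_maps)
    for style_token, object_token in style_to_object.items():
        constrained[style_token] = attn_maps[object_token].clone()
    return constrained

# Example with dummy 16x16 maps: the style tokens inherit the objects' spatial support.
maps = {'garden': torch.rand(16, 16), 'snow mountain': torch.rand(16, 16)}
maps = constrain_style_attention(maps, {'Claude Monet': 'garden', 'Ukiyo-e': 'snow mountain'})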
Human evaluation. We conduct a user study on crowdsourcing platforms. We show human annotators a pair of generated images and ask which image more accurately expresses the reference color, artistic style, or supplementary description. To compare ours with each baseline, we show 135 font color pairs, 167 font style pairs, and 21 footnote pairs to three individuals, receiving 1938 responses in total. As shown in Table 1, our method is chosen more than 80% of the time over both baselines for producing more precise colors and for better reflecting the content of long prompts, and more than 65% of the time for rendering more accurate artistic styles. We will include a similar study at a larger scale in our revision.

Table 1. Human evaluation results: percentage of responses preferring our method over each baseline.

                              Color    Style    Footnote
Ours vs. Prompt-to-Prompt     88.2%    65.2%    84.1%
Ours vs. InstructPix2Pix      80.7%    69.8%    87.3%
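For reference, the total response count reported above follows directly from the study design (three annotators per image pair and two baseline comparisons); a minimal check:

pairs = {'color': 135, 'style': 167, 'footnote': 21}
annotators, baseline_comparisons = 3, 2
total_responses = sum(pairs.values()) * annotators * baseline_comparisons
assert total_responses == 1938  # matches the number of responses reported above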
B. Additional Details
This section details our quantitative evaluation of the font style and font color experiments.
Font style evaluation. To compute a local CLIP score in each region and evaluate stylization quality, we need test prompts that contain multiple objects and styles. We use the seven popular styles listed below to describe the artistic style of the generation. Note that, to achieve the best quality for each style, we also include complementary information, such as the name of a famous artist, in addition to the style name.
styles = [
’Claud Monet, impressionism, oil on canvas’,
’Ukiyoe’,
’Cyber Punk, futuristic’,
’Pop Art, masterpiece, Andy Warhol’,
’Vincent Van Gogh’,
’Pixel Art, 8 bits, 16 bits’,
’Abstract Cubism, Pablo Picasso’
]
We also manually create a set of base prompts, each containing a combination of two objects, for stylization, resulting in 420 styled prompts in total. We verified that Stable Diffusion [54] can generally generate the correct object combinations, as our goal is not to evaluate compositionality as in DrawBench [56]. The base prompts and the object tokens used for our method are listed below.
candidate_prompts = {
    'A garden with a mountain in the distance.': ['garden', 'mountain'],
    'A fountain in front of an castle.': ['fountain', 'castle'],
    'A cat sitting on a meadow.': ['cat', 'meadow'],
    'A lighthouse among the turbulent waves in the night.': ['lighthouse', 'turbulent waves'],
    'A stream train on the mountain side.': ['stream train', 'mountain side'],
    'A cactus standing in the desert.': ['cactus', 'desert'],
    'A dog sitting on a beach.': ['dog', 'beach'],
    'A solitary rowboat tethered on a serene pond.': ['rowboat', 'pond'],
    'A house on a rocky mountain.': ['house', 'mountain'],
    'A rustic windmill on a grassy hill.': ['rustic', 'hill'],
}
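For concreteness, the 420 evaluation prompts correspond to assigning every ordered pair of distinct styles to the two objects of each base prompt (10 prompts × 7 × 6 = 420). A minimal sketch reusing the styles and candidate_prompts defined above; the bracketed annotation is only a placeholder for how the style attribute would be attached in the rich-text editor, not our actual prompt format.

from itertools import permutations

eval_prompts = []
for base_prompt, (obj_a, obj_b) in candidate_prompts.items():
    for style_a, style_b in permutations(styles, 2):  # ordered pairs of distinct styles
        eval_prompts.append(f"{base_prompt} [{obj_a}: {style_a}] [{obj_b}: {style_b}]")

assert len(eval_prompts) == len(candidate_prompts) * 7 * 6  # 10 * 42 = 420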

Font color evaluation. To evaluate the capacity for precise color generation, we create a set of prompts with colored objects. We divide the candidate colors into three levels according to how difficult they are for text-to-image generation models to interpret. The easy level contains 17 basic color names that these models generally understand. The complete set is listed below.
COLORS_easy = {
’brown’: [165, 42, 42],
’red’: [255, 0, 0],
’pink’: [253, 108, 158],
’orange’: [255, 165, 0],
’yellow’: [255, 255, 0],
’purple’: [128, 0, 128],
’green’: [0, 128, 0],
’blue’: [0, 0, 255],
’white’: [255, 255, 255],
’gray’: [128, 128, 128],
’black’: [0, 0, 0],
’crimson’: [220, 20, 60],
’maroon’: [128, 0, 0],
’cyan’: [0, 255, 255],
’azure’: [240, 255, 255],
’turquoise’: [64, 224, 208],
’magenta’: [255, 0, 255],
}

The medium-level set contains color names selected from the HTML color names (https://simple.wikipedia.org/wiki/Web_color). These colors are standard in website design, but their names occur less often in image captions, making them harder for a text-to-image model to interpret. To mitigate this, we append the coarse color category when possible, e.g., "Chocolate" becomes "Chocolate brown". The complete list is below.

COLORS_medium = {
’Fire Brick red’: [178, 34, 34],
’Salmon red’: [250, 128, 114],
’Coral orange’: [255, 127, 80],
’Tomato orange’: [255, 99, 71],
’Peach Puff orange’: [255, 218, 185],
’Moccasin orange’: [255, 228, 181],
’Goldenrod yellow’: [218, 165, 32],
’Olive yellow’: [128, 128, 0],
’Gold yellow’: [255, 215, 0],
’Lavender purple’: [230, 230, 250],
’Indigo purple’: [75, 0, 130],
’Thistle purple’: [216, 191, 216],
’Plum purple’: [221, 160, 221],
’Violet purple’: [238, 130, 238],
’Orchid purple’: [218, 112, 214],
’Chartreuse green’: [127, 255, 0],
’Lawn green’: [124, 252, 0],
’Lime green’: [50, 205, 50],
’Forest green’: [34, 139, 34],
’Spring green’: [0, 255, 127],
’Sea green’: [46, 139, 87],
’Sky blue’: [135, 206, 235],
’Dodger blue’: [30, 144, 255],
’Steel blue’: [70, 130, 180],
’Navy blue’: [0, 0, 128],
’Slate blue’: [106, 90, 205],
’Wheat brown’: [245, 222, 179],
’Tan brown’: [210, 180, 140],
’Peru brown’: [205, 133, 63],
’Chocolate brown’: [210, 105, 30],
’Sienna brown’: [160, 82, 4],
’Floral White’: [255, 250, 240],
’Honeydew White’: [240, 255, 240],
}

The hard-level set contains 50 randomly sampled RGB triplets, as we aim to generate objects with arbitrary colors specified in the rich text, e.g., colors selected with an RGB slider.

COLORS_hard = {
’color of RGB values [68, 17, 237]’: [68, 17, 237],
’color of RGB values [173, 99, 227]’: [173, 99, 227],
’color of RGB values [48, 131, 172]’: [48, 131, 172],
’color of RGB values [198, 234, 45]’: [198, 234, 45],
’color of RGB values [182, 53, 74]’: [182, 53, 74],
’color of RGB values [29, 139, 118]’: [29, 139, 118],
’color of RGB values [105, 96, 172]’: [105, 96, 172],
’color of RGB values [216, 118, 105]’: [216, 118, 105],
’color of RGB values [88, 119, 37]’: [88, 119, 37],
’color of RGB values [189, 132, 98]’: [189, 132, 98],
’color of RGB values [78, 174, 11]’: [78, 174, 11],
’color of RGB values [39, 126, 109]’: [39, 126, 109],
’color of RGB values [236, 81, 34]’: [236, 81, 34],
’color of RGB values [157, 69, 64]’: [157, 69, 64],
’color of RGB values [67, 192, 60]’: [67, 192, 60],
’color of RGB values [181, 57, 181]’: [181, 57, 181],
’color of RGB values [71, 240, 139]’: [71, 240, 139],
’color of RGB values [34, 153, 226]’: [34, 153, 226],
’color of RGB values [47, 221, 120]’: [47, 221, 120],
’color of RGB values [219, 100, 27]’: [219, 100, 27],
’color of RGB values [228, 168, 120]’: [228, 168, 120],
’color of RGB values [195, 31, 8]’: [195, 31, 8],
’color of RGB values [84, 142, 64]’: [84, 142, 64],
’color of RGB values [104, 120, 31]’: [104, 120, 31],
’color of RGB values [240, 209, 78]’: [240, 209, 78],
’color of RGB values [38, 175, 96]’: [38, 175, 96],
’color of RGB values [116, 233, 180]’: [116, 233, 180],
’color of RGB values [205, 196, 126]’: [205, 196, 126],
’color of RGB values [56, 107, 26]’: [56, 107, 26],
’color of RGB values [200, 55, 100]’: [200, 55, 100],
’color of RGB values [35, 21, 185]’: [35, 21, 185],
’color of RGB values [77, 26, 73]’: [77, 26, 73],
’color of RGB values [216, 185, 14]’: [216, 185, 14],
’color of RGB values [53, 21, 50]’: [53, 21, 50],
’color of RGB values [222, 80, 195]’: [222, 80, 195],
’color of RGB values [103, 168, 84]’: [103, 168, 84],
’color of RGB values [57, 51, 218]’: [57, 51, 218],
’color of RGB values [143, 77, 162]’: [143, 77, 162],
’color of RGB values [25, 75, 226]’: [25, 75, 226],
’color of RGB values [99, 219, 32]’: [99, 219, 32],
’color of RGB values [211, 22, 52]’: [211, 22, 52],
’color of RGB values [162, 239, 198]’: [162, 239, 198],
’color of RGB values [40, 226, 144]’: [40, 226, 144],
’color of RGB values [208, 211, 9]’: [208, 211, 9],
’color of RGB values [231, 121, 82]’: [231, 121, 82],
’color of RGB values [108, 105, 52]’: [108, 105, 52],
’color of RGB values [105, 28, 226]’: [105, 28, 226],
’color of RGB values [31, 94, 190]’: [31, 94, 190],
’color of RGB values [116, 6, 93]’: [116, 6, 93],
’color of RGB values [61, 82, 239]’: [61, 82, 239],
}

To form complete prompts, we create a list of 12 objects and simple template prompts containing them, listed below. These objects naturally exhibit a variety of colors in practice, such as "flower", "gem", and "church".

candidate_prompts = {
    'a man wearing a shirt': 'shirt',
    'a woman wearing pants': 'pants',
    'a car in the street': 'car',
    'a basket of fruit': 'fruit',
    'a bowl of vegetable': 'vegetable',
    'a flower in a vase': 'flower',
    'a bottle of beverage on the table': 'bottle beverage',
    'a plant in the garden': 'plant',
    'a candy on the table': 'candy',
    'a toy on the floor': 'toy',
    'a gem on the ground': 'gem',
    'a church with beautiful landscape in the background': 'church',
}
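As an illustration, the full color-evaluation set can be formed by pairing every color with every object prompt above; how the target color is attached to the object token (as a font-color attribute in the rich text) is left abstract here, and the variable names are ours.

all_colors = {**COLORS_easy, **COLORS_medium, **COLORS_hard}

# Each case pairs a target RGB value with a base prompt and the object token
# whose font color is set to that value in the rich-text prompt.
color_eval_cases = [
    (color_name, rgb, base_prompt, obj)
    for color_name, rgb in all_colors.items()
    for base_prompt, obj in candidate_prompts.items()
]

print(len(color_eval_cases))  # (17 + 33 + 50) colors x 12 object prompts = 1200 cases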

Baseline. We compare our method quantitatively with two strong baselines, Prompt-to-Prompt [19] and InstructPix2Pix [7]. The prompt refinement application of Prompt-to-Prompt allows new tokens to be added to a prompt; we use the plain text as the base prompt and add the color or style to create the modified prompt. InstructPix2Pix [7] edits an image according to an instruction; we use the image generated from the plain text as the input and create the instructions from the templates "turn the [object] into the style of [style]" or "make the color of [object] to be [color]". For the stylization experiment, we apply the two instructions either in parallel (InstructPix2Pix-para) or in sequence (InstructPix2Pix-seq). We tune both baselines on a separate set of manually created prompts to find the best hyperparameters; in contrast, our method does not require hyperparameter tuning.
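A minimal sketch of how the baseline inputs described above could be constructed; the template strings follow the description in this paragraph, while the function names and the simple string substitution are placeholders for illustration.

def prompt_to_prompt_pair(plain_text, obj, attribute):
    # Prompt refinement: the plain text is the base prompt, and the modified
    # prompt adds the color or style word in front of the object token.
    return plain_text, plain_text.replace(obj, f"{attribute} {obj}")

def instruct_pix2pix_instruction(obj, style=None, color=None):
    # Instruction templates used to edit the plain-text generation.
    if style is not None:
        return f"turn the {obj} into the style of {style}"
    return f"make the color of {obj} to be {color}"

print(prompt_to_prompt_pair('a flower in a vase', 'flower', 'crimson'))
print(instruct_pix2pix_instruction('garden', style='Claude Monet'))
print(instruct_pix2pix_instruction('shirt', color='navy blue'))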
Running time. The inference time of our method depends on the number of attributes added to the rich text, since we implement each attribute with an independent diffusion process. In practice, we always use a batch size of 1 to keep the code compatible with low-resource devices. In our experiments on an NVIDIA RTX A6000 GPU, sampling from the plain text takes around 5.06 seconds, sampling an image with two styles takes around 8.07 seconds, and sampling an image with our color optimization takes around 13.14 seconds.
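For a quick sense of the relative overhead implied by these measurements, under the assumption that the reported timings are representative:

plain_text_s = 5.06   # single diffusion process (plain text)
two_styles_s = 8.07   # sampling with two region-specific styles
color_opt_s = 13.14   # sampling with the color optimization

print(f"two-style sampling: {two_styles_s / plain_text_s:.2f}x the plain-text time")
print(f"color-optimized sampling: {color_opt_s / plain_text_s:.2f}x the plain-text time")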
