(ICCV 2023) Expressive Text-to-Image Generation with Rich Text
Example rich-text prompts in Figure 1: (1) "A close-up of a cat1 riding a scooter. Tropical trees in the background." 1A cat wearing sunglasses and has a bandana around its neck. Style: Claude Monet. (2) "A young woman1 sits at a table in a beautiful, lush garden, reading a book on the table." 1Girl with a pearl earring by Johannes Vermeer. (3) "A nightstand1 next to a bed with pillows on it. Gray wall2 bedroom." 1A nightstand with some books. 2Accent shelf with plants on the gray wall.
Figure 1. Plain text (left image) vs. Rich text (right image). Our method allows a user to describe an image using a rich text editor that
supports various text attributes such as font family, size, color, and footnote. Given these text attributes extracted from rich text prompts,
our method enables precise control of text-to-image synthesis regarding colors, styles, and object details compared to plain text.
Figure 2 (overview): plain-text input "a church surrounded by a beautiful garden, a snowy mountain range in the distance"; token maps for "church", "garden", "a snowy mountain ...", and the other tokens; vanilla diffusion baseline.
3. Rich Text to Image Generation

From writing messages on communication apps and designing websites [57] to collaboratively editing a document [36, 25], a rich text editor is often the primary interface for editing text on digital devices. Nonetheless, only plain text has been used in text-to-image generation. To use the formatting options of rich-text editors for more precise control over the black-box generation process [1], we first introduce a problem setting called rich-text-to-image generation. We then discuss our approach to this task.

3.1. Problem Setting

As shown in Figure 2, a rich text editor supports various formatting options, such as font styles, font size, color, and more. We leverage these text attributes as extra information to increase control of text-to-image generation. We interpret the rich-text prompt as JSON, where each text element consists of a span of tokens e_i (e.g., 'church') and attributes a_i describing the span (e.g., 'color:#FF9900'). Note that some tokens e_U may not have any attributes. Using these annotated prompts, we explore four applications: 1) local style control using font style, 2) precise color control using font color, 3) detailed region description using footnotes, and 4) explicit token reweighting with font sizes.

Font style is used to apply a specific artistic style a^s_i, e.g., a^s_i = 'Ukiyo-e', to the synthesis of the span of tokens e_i. For instance, in Figure 1, we apply the Ukiyo-e painting style to the ocean waves and the style of Van Gogh to the sky, enabling the application of localized artistic styles. This task presents a unique challenge for existing text-to-image models, as there are limited training images featuring multiple artistic styles. Consequently, existing models tend to generate a uniform mixed style across the entire image rather than distinct local styles.

Font color indicates a specific color for the modified text span. Given the prompt "a red toy", existing text-to-image models generate toys in various shades of red, such as light red, crimson, or maroon. The color attribute provides a way to specify a precise color in the RGB color space, denoted as a^c_i. For example, to generate a toy in fire brick red, one can change the font color of "a toy", so that the word "toy" is associated with the attribute a^c_i = [178, 34, 34]. However, as shown in the experiment section, the pretrained text encoder cannot interpret RGB values and has difficulty understanding obscure color names, such as lime and orange.

Footnote provides supplementary explanations of the target span without hindering readability with lengthy sentences. Writing detailed descriptions of complex scenes is tedious work, and it inevitably creates lengthy prompts [29, 27]. Additionally, existing text-to-image models are prone to ignoring some objects when multiple objects are present [12], especially with long prompts. Moreover, excess tokens are discarded when the prompt's length surpasses the text encoder's maximum length, e.g., 77 tokens for CLIP models [50]. We aim to mitigate these issues using a footnote string a^f_i.

Font size can be employed to indicate the importance, quantity, or size of an object. We use a scalar a^w_i to denote the weight of each token.
Figure 3. Region-based diffusion. For each element of the rich-text input, we apply a separate diffusion process to its region. The attributes are either decoded as a region-based guidance target (e.g., re-coloring the church), or as a textual input to the diffusion UNet (e.g., handling the footnote to the garden). The self-attention maps and feature maps extracted from the plain-text generation process are injected to help preserve the structure. The predicted noise ϵ_{t,e_i}, weighted by the token map M_{e_i}, and the guidance gradient ∂L/∂x_t are used to denoise and update the previous generation x_t to x_{t-1}. The noised plain-text generation x^plain_t is blended with the current generation to preserve the exact content in those regions of the unformatted tokens.
3.2. Method

To utilize rich text annotations, our method consists of two steps, as shown in Figure 2. First, we compute the spatial layouts of individual token spans. Second, we use a new region-based diffusion process to render each region's attributes into a globally coherent image.

Step 1. Token maps for spatial layout. Several works [65, 40, 4, 19, 12, 47, 67] have discovered that the attention maps in the self- and cross-attention layers of the diffusion UNet characterize the spatial layout of the generation. Therefore, we first use the plain text as the input to the diffusion model and collect self-attention maps of size 32 × 32 × 32 × 32 across different heads, layers, and time steps. We take the average across all the extracted maps and reshape the result into 1024 × 1024. Note that the value at the i-th row and j-th column of the map indicates the probability of pixel i attending to pixel j. We average the map with its transpose to convert it into a symmetric matrix, which is used as a similarity map to perform spectral clustering [61, 69] and obtain the binary segmentation maps M̂ of size K × 32 × 32, where K is the number of segments.

To associate each segment with a textual span, we also extract cross-attention maps for each token w_j:

m_j = \frac{\exp(s_j)}{\sum_k \exp(s_k)},  (1)

where s_j is the attention score. We first interpolate each cross-attention map m_j to the same resolution as M̂, i.e., 32 × 32. Similar to the processing steps of the self-attention maps, we compute the mean across heads, layers, and time steps to get the averaged map m̂_j. We associate each segment with a textual span e_i following Patashnik et al. [47]:

M_{e_i} = \left\{ \hat{M}_k \;\middle|\; \left\| \hat{M}_k \cdot \frac{\hat{m}_j - \min(\hat{m}_j)}{\max(\hat{m}_j) - \min(\hat{m}_j)} \right\|_1 > \epsilon, \ \forall j \ \text{s.t.}\ w_j \in e_i \right\},  (2)

where ϵ is a hyperparameter that controls the labeling threshold; that is, the segment M̂_k is assigned to the span e_i if the normalized attention score of any token in this span is higher than ϵ. We associate the segments that are not assigned to any formatted span with the unformatted tokens e_U. Finally, we obtain the token map in Figure 2 as

M_{e_i} = \frac{\sum_{\hat{M}_j \in M_{e_i}} \hat{M}_j}{\sum_i \sum_{\hat{M}_j \in M_{e_i}} \hat{M}_j}.  (3)
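To make Step 1 concrete, here is a minimal numpy/scikit-learn sketch under the shapes stated above; the function name, the use of sklearn's SpectralClustering, and the threshold value are our own illustrative choices rather than the paper's released implementation.

import numpy as np
from sklearn.cluster import SpectralClustering

def token_maps(self_attn, cross_attn, spans, num_segments, eps=0.3):
    # self_attn: (1024, 1024) self-attention averaged over heads, layers, and time steps
    # cross_attn: (num_tokens, 32, 32) averaged cross-attention maps m_hat_j
    # spans: list of token-index lists, one per formatted span e_i; eps is illustrative
    affinity = (self_attn + self_attn.T) / 2                      # symmetrize into a similarity map
    labels = SpectralClustering(n_clusters=num_segments,
                                affinity='precomputed').fit_predict(affinity)
    segments = np.stack([(labels == k).reshape(32, 32).astype(float)
                         for k in range(num_segments)])           # binary maps M_hat, (K, 32, 32)

    lo = cross_attn.min(axis=(1, 2), keepdims=True)
    hi = cross_attn.max(axis=(1, 2), keepdims=True)
    m = (cross_attn - lo) / (hi - lo + 1e-8)                      # normalize each map to [0, 1] (Eq. 2)

    raw = []
    for token_ids in spans:
        mass = (segments[:, None] * m[None, token_ids]).sum(axis=(2, 3))  # ||M_hat_k * m_hat_j||_1
        keep = (mass > eps).all(axis=1)                            # Eq. (2): threshold for every token in e_i
        raw.append(segments[keep].sum(axis=0))                     # union of assigned segments
    # segments assigned to no formatted span are associated with e_U (omitted here)
    raw = np.stack(raw)
    return raw / (raw.sum(axis=0, keepdims=True) + 1e-8)          # Eq. (3): normalize across spans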
Step 2. Region-based denoising and guidance. As shown in Figure 2, given the text attributes and token maps, we divide the overall image synthesis into several region-based denoising and guidance processes to incorporate each attribute, similar to an ensemble of diffusion models [32, 6]. More specifically, given the span e_i, the region defined by its token map M_{e_i}, and the attribute a_i, the predicted noise ϵ_t for the noised generation x_t at time step t is

\epsilon_t = \sum_i M_{e_i} \cdot \epsilon_{t,e_i} = \sum_i M_{e_i} \cdot D(x_t, f(e_i, a_i), t),  (4)

where D is the pretrained diffusion model, and f(e_i, a_i) is a plain text representation derived from the text span e_i and attributes a_i using the following process:

1. Initially, we set f(e_i, a_i) = e_i.
2. If a footnote a^f_i is available, we set f(e_i, a_i) = a^f_i.
3. The style a^s_i is appended if it exists: f(e_i, a_i) = f(e_i, a_i) + 'in the style of' + a^s_i.
4. The closest color name (string) â^c_i of the font color, chosen from a predefined set C, is prepended: f(e_i, a_i) = â^c_i + f(e_i, a_i). For example, â^c_i = 'brown' for the RGB color a^c_i = [136, 68, 20].
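The prompt-construction rules above can be summarized in a few lines of Python. The sketch below is illustrative rather than the released implementation: the attribute dictionary keys ('footnote', 'style', 'color') and the toy palette standing in for the predefined set C are our assumptions.

import numpy as np

def closest_color_name(rgb, palette):
    # nearest color name in the predefined set C by Euclidean distance in RGB space
    names = list(palette)
    distances = [np.linalg.norm(np.array(rgb) - np.array(palette[n])) for n in names]
    return names[int(np.argmin(distances))]

def region_prompt(span, attrs, palette):
    prompt = span                                          # step 1: start from the raw span e_i
    if 'footnote' in attrs:                                # step 2: the footnote replaces the span
        prompt = attrs['footnote']
    if 'style' in attrs:                                   # step 3: append the artistic style
        prompt = prompt + ' in the style of ' + attrs['style']
    if 'color' in attrs:                                   # step 4: prepend the closest color name
        prompt = closest_color_name(attrs['color'], palette) + ' ' + prompt
    return prompt

# toy palette standing in for the predefined set C
palette = {'brown': [165, 42, 42], 'red': [255, 0, 0], 'green': [0, 128, 0]}
print(region_prompt('a church', {'color': [136, 68, 20], 'style': 'Ukiyo-e'}, palette))
# -> 'brown a church in the style of Ukiyo-e'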
Figure 4. Qualitative comparison on precise color generation. We show images generated by Prompt-to-Prompt [19], InstructPix2Pix [7], and our method using prompts with font colors. Our method generates precise colors according to either color names or RGB values. Both baselines generate plausible but inaccurate colors given color names, while neither understands the color defined by RGB values. InstructPix2Pix tends to apply the color globally, even outside the target object.
We use f(e_i, a_i) as the original plain text prompt of Step 1 for the unformatted tokens e_U. This helps us generate a coherent image, especially around region boundaries.

Guidance. By default, we use classifier-free guidance [23] for each region to better match the prompt f(e_i, a_i). In addition, if the font color is specified, we apply gradient guidance [24, 14, 5] on the current clean image prediction to further exploit the RGB value information:

\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_t}{\sqrt{\bar{\alpha}_t}},  (5)

where x_t is the noisy image at time step t, and ᾱ_t is the coefficient defined by the noise scheduling strategy [20]. Here, we compute an MSE loss L between the average color of x̂_0 weighted by the token map M_{e_i} and the RGB triplet a^c_i. The gradient is calculated as

\frac{dL}{dx_t} = \frac{d \left\| \sum_p (M_{e_i} \cdot \hat{x}_0) / \sum_p M_{e_i} - a^c_i \right\|_2^2}{\sqrt{\bar{\alpha}_t}\, d\hat{x}_0},  (6)

where the summation is over all pixels p. We then update x_t with the following equation:

x_t \leftarrow x_t - \lambda \cdot M_{e_i} \cdot \frac{dL}{dx_t},  (7)

where λ is a hyperparameter to control the strength of the guidance. We use λ = 1 unless denoted otherwise.
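The gradient in Eqs. (5)-(7) can be obtained with automatic differentiation. The following PyTorch-style sketch is a minimal illustration under simplifying assumptions: it treats x_t as an RGB image rather than a Stable Diffusion latent, and the tensor shapes and function name are ours.

import torch

def color_guidance_step(x_t, eps_t, alpha_bar_t, token_map, target_rgb, lam=1.0):
    # x_t: (1, 3, H, W) noisy sample; eps_t: predicted noise from Eq. (4)
    # token_map: (1, 1, H, W) weights M_{e_i}; target_rgb: (3,) tensor a^c_i in the value range of x_t
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = (x_t - (1 - alpha_bar_t) ** 0.5 * eps_t) / alpha_bar_t ** 0.5          # Eq. (5)
    avg_color = (token_map * x0_hat).sum(dim=(2, 3)) / token_map.sum(dim=(2, 3))    # weighted average color
    loss = ((avg_color - target_rgb.view(1, 3)) ** 2).sum()                          # MSE loss L
    grad = torch.autograd.grad(loss, x_t)[0]                                         # Eq. (6) via autograd
    return x_t.detach() - lam * token_map * grad                                     # Eq. (7), lambda = 1 by default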
Token reweighting with font size. Last, to re-weight the impact of the token w_j according to the font size a^w_j, we modify its cross-attention maps m_j. However, instead of applying a direct multiplication as in Prompt-to-Prompt [19], where Σ_j a^w_j m_j ≠ 1, we find that it is critical to preserve the probability property of m_j. We thus propose the following reweighting approach:

\hat{m}_j = \frac{a^w_j \exp(s_j)}{\sum_k a^w_k \exp(s_k)}.  (8)

We can then compute the token map (Equation 3) and predict the noise (Equation 4) with the reweighted attention map.
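Equation (8) amounts to folding the per-token weights into the attention softmax. A small numpy sketch, with names of our own choosing:

import numpy as np

def reweighted_attention(scores, weights):
    # scores: (num_tokens,) attention scores s_j for one query pixel
    # weights: (num_tokens,) font-size weights a^w_j (1.0 means unchanged)
    logits = np.asarray(weights, dtype=float) * np.exp(np.asarray(scores, dtype=float))
    return logits / logits.sum()            # Eq. (8): still a valid probability distribution

scores = np.array([2.0, 1.0, 0.5])
print(reweighted_attention(scores, [1.0, 1.0, 1.0]))   # plain softmax
print(reweighted_attention(scores, [1.0, 3.0, 1.0]))   # second token up-weighted, still sums to 1

In contrast, multiplying the softmax output directly, as in Prompt-to-Prompt, yields weights that no longer sum to one.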
Preserve the fidelity against plain-text generation. Although our region-based method naturally maintains the layout, there is no guarantee that the details and shape of the objects are retained when no rich-text attributes or only the color is specified, as shown in Figure 12. To this end, we follow Plug-and-Play [67] to inject the self-attention maps and the residual features extracted from the plain-text generation process when t > T_pnp to improve the structure fidelity. In addition, for the regions associated with the unformatted tokens e_U, stronger content preservation is desired. Therefore, at a certain t = T_blend, we blend the noised sample x^plain_t based on the plain text into those regions:

x_t \leftarrow M_{e_U} \cdot x_t^{\text{plain}} + (1 - M_{e_U}) \cdot x_t.  (9)
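Combining Eq. (4) with the blending of Eq. (9), a single denoising step of the region-based process can be sketched as below; predict_noise and scheduler_step are placeholders standing in for the diffusion UNet call (with the per-region prompt f(e_i, a_i)) and the sampler update, so this is an outline under our assumptions rather than the released implementation.

def region_based_step(x_t, t, token_maps, region_prompts, x_plain_t, m_unformatted,
                      predict_noise, scheduler_step, t_blend):
    if t == t_blend:                                   # Eq. (9): restore plain-text content in unformatted regions
        x_t = m_unformatted * x_plain_t + (1 - m_unformatted) * x_t
    # Eq. (4): ensemble of per-region noise predictions weighted by the token maps M_{e_i}
    eps_t = sum(M * predict_noise(x_t, prompt, t)
                for M, prompt in zip(token_maps, region_prompts))
    return scheduler_step(x_t, eps_t, t)               # standard diffusion update x_t -> x_{t-1}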
4. Experimental Results

Implementation details. We use Stable Diffusion V1-5 [54] for our experiments. To create the token maps, we use the cross-attention layers in all blocks, excluding the first encoder and last decoder blocks, as the attention maps in these high-resolution layers are often noisy. We discard
Prompts used in Figure 5: "A night sky filled with stars (1st Region: Van Gogh) above a turbulent sea with giant waves (2nd Region: Ukiyo-e)" and "The awe-inspiring sky and sea (1st Region: J.M.W. Turner) by a coast with flowers and grasses in spring (2nd Region: Monet)."
Figure 5. Qualitative comparison on style control. We show images generated by Prompt-to-Prompt, InstructPix2Pix, and our method
using prompts with multiple styles. Only our method can generate distinct styles for both regions.
Figure 6. CLIP Similarity (↑) of each method (bar chart).
Figure 8. Qualitative comparison on detailed description generation. We show images generated by Attend-and-Excite, Prompt-to-
Prompt, InstructPix2Pix, and our method using complex prompts. Our method is the only one that can generate all the details faithfully.
prompt, we use 12 objects exhibiting different colors, such as "flower", "gem", and "house". This gives us a total of 1,200 prompts. We evaluate color accuracy by computing the mean L2 distance between the region and the target RGB values. We also compute the minimal L2 distance, as sometimes the object should contain other colors for fidelity, e.g., the "black tires" of a "yellow car".
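One plausible implementation of these two metrics, assuming the object region is available as a binary mask over the generated image (the exact value range and any normalization used for reporting are details not specified here):

import numpy as np

def color_distances(image, mask, target_rgb):
    # image: (H, W, 3) RGB array; mask: (H, W) boolean region of the target object
    # target_rgb: (3,) target color a^c_i
    pixels = image[mask].astype(float)                                   # (N, 3) pixels inside the region
    d = np.linalg.norm(pixels - np.asarray(target_rgb, dtype=float), axis=1)
    return d.mean(), d.min()   # mean distance, and minimal distance (e.g. tolerating black tires on a yellow car)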
Baselines. For font color and style, we quantitatively compare our method with two strong baselines, Prompt-to-Prompt [19] and InstructPix2Pix [7]. When two instructions exist for each image in our font style experiments, we apply them in parallel (InstructPix2Pix-para) and sequential (InstructPix2Pix-seq) manners. More details are in Appendix B. We also perform a human evaluation with these two methods in Appendix Table 1. For re-weighting token importance, we visually compare with Prompt-to-Prompt [19] and two heuristic methods, repeating and adding parentheses. For complex scene generation with footnotes, we also compare with Attend-and-Excite [12].

4.1. Quantitative Comparison

We report the local CLIP scores computed by a ViT-B/32 model in Figure 6. Our method achieves the best overall CLIP score compared to the two baselines. This demonstrates the advantage of our region-based diffusion method for localized stylization. To further understand the capacity of each model to generate multiple styles, we report the metric on each region. Prompt-to-Prompt and InstructPix2Pix-para achieve a decent score on the 1st Region, i.e., the region that occurs first in the sentence. However, they often fail to fulfill the style in the 2nd Region. We conjecture that the Stable Diffusion model tends to generate a uniform style for the entire image, which can be attributed to single-style training images. Furthermore, InstructPix2Pix-seq performs the worst in the 2nd Region. This is because the first instruction contains no information about the second region, and the
Figure 9. Qualitative comparison on token reweighting. We show images generated by our method and Prompt-to-Prompt using a token weight of 13 for 'mushrooms'. Prompt-to-Prompt suffers from artifacts due to the large weight. Heuristic methods such as repeating the word or adding parentheses do not work well.
Precise color generation. We show a qualitative comparison on precise color generation in Figure 4. InstructPix2Pix [7] is prone to creating global color effects rather than accurate local control. For example, in the flower results, both the vase and the background are changed to the target colors. Prompt-to-Prompt [19] provides more precise control over the target region. However, both Prompt-to-Prompt and InstructPix2Pix fail to generate precise colors. In contrast, our method can generate precise colors for all

Token importance control. Figure 9 shows the qualitative comparison on token reweighting. When using a large weight for 'mushroom,' Prompt-to-Prompt generates clear artifacts as it modifies the attention probabilities to be unbounded and creates out-of-distribution intermediate features. Heuristic methods fail to add more mushrooms, while our method generates more mushrooms and preserves the quality. More results with different font sizes and target tokens are shown in Figures 23-25 in the Appendix.
Stable Diffusion: "A rustic cabin sits on the edge of a giant lake. Wildflowers dot the meadow around the cabin and lake."
Stable Diffusion (full text): "A rustic orange cabin sits on the edge of a giant, crystal-clear, blueish lake. The lake is glistening in the sunlight. Wildflowers in the style of Claude Monet, Impressionism dot the meadow around the cabin and lake."
InstructPix2Pix: "Make the cabin orange. Turn the wildflowers into style Claude Monet, Impressionism. Make the lake crystal-clear, blueish, glistening in the sunlight."
Figure 11. Our workflow. (top left) A user begins with an initial plain-text prompt and wishes to refine the scene by specifying the color, details, and styles. (top center) Naively inputting the whole description in plain text does not work. (top right) InstructPix2Pix [7] fails to make accurate edits. (bottom) Our method supports precise refinement with region-constrained diffusion processes. Moreover, our framework can naturally be integrated into a rich text editor, enabling a tight, streamlined UI.
A. Additional Results
In this section, we first show additional results of rich-text-to-image generation on complex scene synthesis (Figures 15,
16, and 17), precise color rendering (Figures 18, 19, and 20), local style control (Figures 21 and 22), and explicit token re-
weighting (Figures 23, 24, and 25). We also show an ablation study of the averaging and maximizing operations across tokens
to obtain token maps in Figure 26. We present additional results compared with a composition-based baseline in Figure 27.
Last, we show an ablation of the hyperparameters of our baseline method InstructPix2Pix [7] on the local style generation
application in Figure 28.
A car1 driving on the road. A bicycle2 nearby a tree3. A cityscape4 in the background.
1A sleek sports car gleams on the road in the sunlight, with its aerodynamic curves and polished finish catching the light. 2A bicycle with rusted frame and worn tires.
3A dead tree with a few red apples on it. 4A bustling Hongkong cityscape with towering skyscrapers.
Figure 14. Additional results of the footnote. We show the generation from a complex description of a garden. Note that all the methods
except for ours fail to generate accurate details of the mansion and fountain as described.
A small chair1 sits in front of a table2 on the wooden floor. There is a bookshelf3 nearby the window4.
1A black leather office chair with a high backrest and adjustable arms.
2A large wooden desk with a stack of books on top of it.
3A bookshelf filled with colorful books and binders.
4A window overlooks a stunning natural landscape of snow mountains.
Figure 15. Additional results of the footnote. We show the generation from a complex description of an office. Note that all the methods except ours fail to generate the described window view and colorful binders accurately.
Figure 16. Additional results of the font color. We show the generation of different objects with colors from the Common category.
Prompt-to-Prompt has a large failure rate of respecting the given color name, while InstructPix2Pix tends to color the background and
irrelevant objects.
Figure 17. Additional results of the font color. We show the generation of different objects with colors from the HTML category. Both
methods fail to generate the precise color, and InstructPix2Pix tends to color the background and irrelevant objects.
Figure 18. Additional results of the font color. We show the generation of different objects with colors from the RGB category. Both
baseline methods cannot interpret the RGB values correctly.
Figure 19. Additional results of the font style. We show images generated with different style combinations and prompt “a beautiful
garden in front of a snow mountain”. Each row contains “snow mountain” in 7 styles, and each column contains “garden” in 7 styles. Only
our method can generate distinct styles for both objects.
Figure 20. Additional results of the font style. We show images generated with different style combinations and prompt “a small pond
surrounded by skyscraper”. Each row contains “skyscraper” in 7 styles, and each column contains “pond” in 7 styles. Only our method
can generate distinct styles for both objects.
Prompt used in Figure 25: "A camel (Cyber Punk, futuristic) in the desert (Vincent Van Gogh)."
Figure 25. Ablation of the classifier-free guidance of InstructPix2Pix. We show that InstructPix2Pix fails to generate both styles with different image and text classifier-free guidance (cfg) weights. When the image-cfg is low, the desert is lost after the first editing. We use image-cfg = 1.5 and text-cfg = 7.5 in our experiments.
Figure 26. Ablation on the hyperparameter λ in Equation (7). We report the trade-off of CLIP similarity and color distance achieved
by sweeping the strength of color optimization λ.
Ablation of the color guidance weight. Changing the guidance strength λ allows us to control the trade-off between fidelity
and color precision. To evaluate the fidelity of the image, we compute the CLIP score between the generation and the plain
text prompt. We plot the CLIP similarity vs. color distance in Figure 26 by sweeping λ from 0 to 20. Increasing the strength
always reduces the CLIP similarity as details are removed to satisfy the color objective. We find that larger λ first reduces
and then increases the distances due to the optimization divergence.
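For reference, the CLIP similarity used here can be computed with an off-the-shelf ViT-B/32 CLIP model; the snippet below uses the Hugging Face transformers API, and generate_image is a hypothetical placeholder standing in for our sampling code with a given λ.

import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image, prompt):
    # cosine similarity between the generated image and the plain-text prompt
    inputs = proc(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

# sweep the color guidance strength (grid chosen for illustration) and record the trade-off
# curve = [(lam, clip_similarity(generate_image(plain_prompt, lam), plain_prompt))
#          for lam in (0, 1, 2, 5, 10, 20)]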
Constrained Prompt-to-Prompt. The original Attention Refinement proposed in Prompt-to-Prompt [19] does not apply any constraint to the attention maps of newly added tokens, which may be why it fails to generate distinct styles. Therefore, we attempt to improve Prompt-to-Prompt by injecting the cross-attention maps for the newly added style tokens. For example, in Figure 27, we use the cross-attention map of "garden" for the style "Claude Monet". However, the method still produces a uniform style.
Human Evaluation. We conduct a user study on crowdsourcing platforms. We show human annotators a pair of generated images and ask them which image more accurately expresses the reference color, artistic styles, or supplementary descriptions. To compare ours with each baseline, we show 135 font color pairs, 167 font style pairs, and 21 footnote pairs to three individuals and receive 1,938 responses. As shown in the table below, our method is chosen more than 80% of the time over both baselines for producing more precise colors and for reflecting the content of long prompts, and more than 65% of the time for rendering more accurate artistic styles. We will include a similar study at a larger scale in our revision.
Font color evaluation. To evaluate the precise color generation capacity, we create a set of prompts with colored objects. We divide the potential colors into three levels according to how difficult they are for text-to-image generation models to interpret. The easy-level color set contains 17 basic color names that these models generally understand. The complete set is listed below.
COLORS_easy = {
’brown’: [165, 42, 42],
’red’: [255, 0, 0],
’pink’: [253, 108, 158],
’orange’: [255, 165, 0],
’yellow’: [255, 255, 0],
’purple’: [128, 0, 128],
’green’: [0, 128, 0],
’blue’: [0, 0, 255],
’white’: [255, 255, 255],
’gray’: [128, 128, 128],
’black’: [0, 0, 0],
’crimson’: [220, 20, 60],
’maroon’: [128, 0, 0],
’cyan’: [0, 255, 255],
’azure’: [240, 255, 255],
’turquoise’: [64, 224, 208],
’magenta’: [255, 0, 255],
}
The medium-level set contains color names selected from the HTML color names2. These colors are also standard for website design. However, their names occur less often in image captions, making them challenging for a text-to-image model to interpret. To address this issue, we also append the coarse color category when possible, e.g., "Chocolate" becomes "Chocolate brown". The complete list is below.

2 https://simple.wikipedia.org/wiki/Web_color
COLORS_medium = {
’Fire Brick red’: [178, 34, 34],
’Salmon red’: [250, 128, 114],
’Coral orange’: [255, 127, 80],
’Tomato orange’: [255, 99, 71],
’Peach Puff orange’: [255, 218, 185],
’Moccasin orange’: [255, 228, 181],
’Goldenrod yellow’: [218, 165, 32],
’Olive yellow’: [128, 128, 0],
’Gold yellow’: [255, 215, 0],
’Lavender purple’: [230, 230, 250],
’Indigo purple’: [75, 0, 130],
’Thistle purple’: [216, 191, 216],
’Plum purple’: [221, 160, 221],
’Violet purple’: [238, 130, 238],
’Orchid purple’: [218, 112, 214],
’Chartreuse green’: [127, 255, 0],
’Lawn green’: [124, 252, 0],
’Lime green’: [50, 205, 50],
’Forest green’: [34, 139, 34],
’Spring green’: [0, 255, 127],
’Sea green’: [46, 139, 87],
’Sky blue’: [135, 206, 235],
’Dodger blue’: [30, 144, 255],
’Steel blue’: [70, 130, 180],
’Navy blue’: [0, 0, 128],
’Slate blue’: [106, 90, 205],
’Wheat brown’: [245, 222, 179],
’Tan brown’: [210, 180, 140],
’Peru brown’: [205, 133, 63],
’Chocolate brown’: [210, 105, 30],
’Sienna brown’: [160, 82, 45],
’Floral White’: [255, 250, 240],
’Honeydew White’: [240, 255, 240],
}
The hard-level set contains 50 randomly sampled RGB triplets as we aim to generate objects with arbitrary colors indicated
in rich texts. For example, the color can be selected by an RGB slider.
COLORS_hard = {
’color of RGB values [68, 17, 237]’: [68, 17, 237],
’color of RGB values [173, 99, 227]’: [173, 99, 227],
’color of RGB values [48, 131, 172]’: [48, 131, 172],
’color of RGB values [198, 234, 45]’: [198, 234, 45],
’color of RGB values [182, 53, 74]’: [182, 53, 74],
’color of RGB values [29, 139, 118]’: [29, 139, 118],
’color of RGB values [105, 96, 172]’: [105, 96, 172],
’color of RGB values [216, 118, 105]’: [216, 118, 105],
’color of RGB values [88, 119, 37]’: [88, 119, 37],
’color of RGB values [189, 132, 98]’: [189, 132, 98],
’color of RGB values [78, 174, 11]’: [78, 174, 11],
’color of RGB values [39, 126, 109]’: [39, 126, 109],
’color of RGB values [236, 81, 34]’: [236, 81, 34],
’color of RGB values [157, 69, 64]’: [157, 69, 64],
’color of RGB values [67, 192, 60]’: [67, 192, 60],
’color of RGB values [181, 57, 181]’: [181, 57, 181],
’color of RGB values [71, 240, 139]’: [71, 240, 139],
’color of RGB values [34, 153, 226]’: [34, 153, 226],
’color of RGB values [47, 221, 120]’: [47, 221, 120],
’color of RGB values [219, 100, 27]’: [219, 100, 27],
’color of RGB values [228, 168, 120]’: [228, 168, 120],
’color of RGB values [195, 31, 8]’: [195, 31, 8],
’color of RGB values [84, 142, 64]’: [84, 142, 64],
’color of RGB values [104, 120, 31]’: [104, 120, 31],
’color of RGB values [240, 209, 78]’: [240, 209, 78],
’color of RGB values [38, 175, 96]’: [38, 175, 96],
’color of RGB values [116, 233, 180]’: [116, 233, 180],
’color of RGB values [205, 196, 126]’: [205, 196, 126],
’color of RGB values [56, 107, 26]’: [56, 107, 26],
’color of RGB values [200, 55, 100]’: [200, 55, 100],
’color of RGB values [35, 21, 185]’: [35, 21, 185],
’color of RGB values [77, 26, 73]’: [77, 26, 73],
’color of RGB values [216, 185, 14]’: [216, 185, 14],
’color of RGB values [53, 21, 50]’: [53, 21, 50],
’color of RGB values [222, 80, 195]’: [222, 80, 195],
’color of RGB values [103, 168, 84]’: [103, 168, 84],
’color of RGB values [57, 51, 218]’: [57, 51, 218],
’color of RGB values [143, 77, 162]’: [143, 77, 162],
’color of RGB values [25, 75, 226]’: [25, 75, 226],
’color of RGB values [99, 219, 32]’: [99, 219, 32],
’color of RGB values [211, 22, 52]’: [211, 22, 52],
’color of RGB values [162, 239, 198]’: [162, 239, 198],
’color of RGB values [40, 226, 144]’: [40, 226, 144],
’color of RGB values [208, 211, 9]’: [208, 211, 9],
’color of RGB values [231, 121, 82]’: [231, 121, 82],
’color of RGB values [108, 105, 52]’: [108, 105, 52],
’color of RGB values [105, 28, 226]’: [105, 28, 226],
’color of RGB values [31, 94, 190]’: [31, 94, 190],
’color of RGB values [116, 6, 93]’: [116, 6, 93],
’color of RGB values [61, 82, 239]’: [61, 82, 239],
}
To write a complete prompt, we create a list of 12 objects and simple prompts containing them as below. The objects
would naturally exhibit different colors in practice, such as “flower”, “gem”, and “house”.
candidate_prompts = {
’a man wearing a shirt’: ’shirt’,
’a woman wearing pants’: ’pants’,
’a car in the street’: ’car’,
’a basket of fruit’: ’fruit’,
’a bowl of vegetable’: ’vegetable’,
’a flower in a vase’: ’flower’,
’a bottle of beverage on the table’: ’bottle beverage’,
’a plant in the garden’: ’plant’,
’a candy on the table’: ’candy’,
’a toy on the floor’: ’toy’,
’a gem on the ground’: ’gem’,
’a church with beautiful landscape in the background’: ’church’,
}
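For completeness, the 1,200 evaluation prompts can be assembled by crossing the 100 colors above (17 easy, 33 medium, and 50 hard) with the 12 object prompts. The exact phrasing used to insert the color into each template is not specified above, so the insertion scheme below (placing the color name directly before the object word) is our assumption.

colors = {**COLORS_easy, **COLORS_medium, **COLORS_hard}        # 17 + 33 + 50 = 100 color names
eval_prompts = []
for prompt, obj in candidate_prompts.items():
    target = obj.split()[-1]                                    # e.g. 'beverage' for 'bottle beverage'
    for color_name in colors:
        # e.g. 'a car in the street' -> 'a Navy blue car in the street' (insertion scheme assumed)
        eval_prompts.append(prompt.replace(target, color_name + ' ' + target, 1))
print(len(eval_prompts))                                        # 12 x 100 = 1200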
Baseline. We compare our method quantitatively with two strong baselines, Prompt-to-Prompt [19] and InstructPix2Pix [7].
The prompt refinement application of Prompt-to-Prompt allows adding new tokens to the prompt. We use plain text as the
base prompt and add color or style to create the modified prompt. InstructPix2Pix [7] allows using instructions to edit the
image. We use the image generated by the plain text as the input image and create the instructions using templates “turn the
[object] into the style of [style],” or “make the color of [object] to be [color]”. For the stylization experiment, we apply two
instructions in both parallel (InstructPix2Pix-para) and sequence (InstructPix2Pix-seq). We tune both methods on a separate
set of manually created prompts to find the best hyperparameters. In contrast, it is worth noting that our method does not
require hyperparameter tuning.
Running time. The inference time of our models depends on the number of attributes added to the rich text since we
implement each attribute with an independent diffusion process. In practice, we always use a batch size of 1 to make the
code compatible with low-resource devices. In our experiments on an NVIDIA RTX A6000 GPU, each sampling based on
the plain text takes around 5.06 seconds, while sampling an image with two styles takes around 8.07 seconds, and sampling
an image with our color optimization takes around 13.14 seconds.