Let's Go Shopping: Web-Scale Image-Text Dataset for Visual Concept Understanding
Yatong Bai1∗ Utsav Garg2 Apaar Shanker2 Haoming Zhang2 Samyak Parajuli2
Erhan Bas2 Isidora Filipovic3 Amelia N. Chu3 Eugenia D. Fomitcheva3 Elliot Branson2
Aerin Kim2 Somayeh Sojoudi1 Kyunghyun Cho3

1 University of California, Berkeley   2 Scale AI   3 New York University

∗ Work done during an internship at Scale. Correspondence to [email protected], [email protected].
Abstract

Vision and vision-language applications of neural networks, such as image classification and captioning, rely on large-scale annotated datasets that require non-trivial data-collecting processes. This time-consuming endeavor hinders the emergence of large-scale datasets, limiting researchers and practitioners to a small number of choices. Therefore, we seek more efficient ways to collect and annotate images. Previous initiatives have gathered captions from HTML alt-texts and crawled social media postings, but these data sources suffer from noise, sparsity, or subjectivity. For this reason, we turn to commercial shopping websites whose data meet three criteria: cleanliness, informativeness, and fluency. We introduce the Let's Go Shopping (LGS) dataset, a large-scale public dataset with 15 million image-caption pairs from publicly available e-commerce websites. When compared with existing general-domain datasets, the LGS images focus on the foreground object and have less complex backgrounds. Our experiments on LGS show that the classifiers trained on existing benchmark datasets do not readily generalize to e-commerce data, while specific self-supervised visual feature extractors can generalize better. Furthermore, LGS's high-quality e-commerce-focused images and bimodal nature make it advantageous for vision-language bi-modal tasks: LGS enables image-captioning models to generate richer captions and helps text-to-image generation models achieve e-commerce style transfer.

1. Introduction

Computer vision (CV) and natural language processing (NLP) tasks increasingly rely on pre-trained representations. While NLP representations can be trained on unannotated raw text, vision applications often consider pre-training using large-scale datasets with discrete class labels annotated by humans, such as ImageNet [9, 51] or OpenImages [30]. Vision-language bimodal applications, such as image captioning or visual question answering, similarly rely on large amounts of annotated data. Unfortunately, many of the large-scale bi-modal datasets now in existence, such as CLIP [48], ALIGN [25], and JFT300M [8, 19], are not publicly accessible. As a result, research has been constrained to a few selected large datasets, such as Conceptual Captions [5] and COCO [6]. This shortage of available public datasets can be attributed in part to the time and effort required to gather, clean, and annotate large datasets. Therefore, we adopt a more efficient and scalable high-quality data collection pipeline to acquire image-text pairs easily available on e-commerce websites. While some existing datasets use public websites as annotation sources, most of them use social media websites (RedCaps [11]) or alt-texts¹ (Conceptual Captions [53]) for this purpose. Nevertheless, social media data suffer from subjectivity. On the other hand, alt-texts can be unacceptably noisy, sometimes merely including uninformative texts such as "alt img", as shown in Figure 1.

¹ Alt-texts are short descriptions of HTML website images. When an image cannot be rendered, the website displays its alt-text as a surrogate.
[Figure 1: examples of uninformative alt-texts, e.g., "Shaun the Sheep Coloring Pages" (unclear description) and "Ju-Ni San Francisco | A San Francisco Food Restaurant Review" (does not mention the pictured tuna maki).]
As a result, we gravitate to e-commerce websites, where clean images with objective, accurate, succinct, and informative descriptions are abundant, as illustrated in Figure 2. The Let's Go Shopping (LGS) dataset collects 15 million image-description pairs from approximately 10,000 e-commerce sites selling a wide range of products. Due to the nature of e-commerce data, the majority of LGS images have a clear background and a static focus on the stated object. On the captions front, LGS provides precise and elaborative captions. We show how highly precise information can be extracted from captions for vision-language fine-tuning.

On the other hand, ImageNet-1k has served as the ubiquitous go-to pre-training and evaluation dataset for vision-only applications. While ImageNet covers a wide range of domains, the diversity of angles and arrangements is restricted. As a result, the literature has shown that ImageNet models do not generalize well to deliberately constructed out-of-distribution (OOD) scenarios [3]. This work uses image classification experiments to demonstrate that such OOD data is ubiquitous in e-commerce applications. We then show that models can benefit from the unique e-commerce distribution in image classification, reconstruction, captioning, and generation tasks.

Specifically, we convert the LGS captions into taxonomies and labels and demonstrate a large disparity between the label distributions of LGS and ImageNet: even with best efforts, only 17.6% of the concepts are shared between popular ImageNet-1k synsets and the e-commerce corpus (more details in Section 3.4). Even for those shared classes, the performance of ImageNet models degrades significantly. By verifying that the LGS classes are well-separable, we conclude that this performance degradation can be mostly attributed to the distributional disparity. To separate the effects of labels and captions and isolate the distribution shift of the images, we consider Masked AutoEncoder (MAE) [17], a self-supervised pre-training method that does not rely on labels. We show that an MAE model trained on ImageNet-1k can reconstruct LGS images well, but adding LGS to the training data improves the performance on LGS and generalizes better to COCO.

The above results demonstrate that while the e-commerce images are from a distribution that is distinct from current benchmark datasets, the feature extractors can be shared. Moreover, we illustrate additional merits of LGS that qualify it as a pre-training dataset. Specifically, the models learned on both LGS and ImageNet have improved linear probing performance on common downstream tasks such as CIFAR-100 [29] and Fashion MNIST [63], compared with the ImageNet-only counterparts.

The distinctive distribution of LGS also benefits vision-language bimodal tasks. For caption generation tasks, we train an OFA model [61] on LGS to demonstrate that the more prominent image foreground, cleaner image background, and the highly descriptive captions of LGS enable the model to produce "attribute-rich" image captions, which models trained on traditional datasets fail to produce.

Finally, for text-to-image generation tasks, diffusion models [2, 20, 54] are currently the most popular family of methods. To illustrate the efficacy of LGS in this setting, we use Stable Diffusion (SD) [49] and fine-tune it in both general and fine-grained settings on subsets of the LGS dataset.
We demonstrate promising qualitative and quantitative results on adapting existing text-to-image models using LGS for e-commerce-related generations. Furthermore, with the help of its distinct image style and descriptive captions, LGS can help the SD model generate e-commerce-styled images.

To make LGS available to the public, we will share the filtered links to the image-caption pairs under the "BSD 3-Clause" license (also used by common datasets such as ImageNet). We will also share the downloader so that the exact same dataset can be reproduced.

2. Related Work

2.1. Unimodal Pre-Training Datasets

Prior to the popularization of bi-modal training, unimodal data (vision-only or language-only) have been the workhorses for pre-training tasks. On the vision side, ImageNet-1k and ImageNet-22k are still some of the most prevalent examples, alongside the larger JFT-300M dataset. For the e-commerce domain, Fashion MNIST, Clothing1M [64], Fashion200k [16], and FashionIQ [62] have been proposed to analyze the effects of noisy labels. Some of the most common datasets used as general wide-domain downstream tasks include CIFAR-10, CIFAR-100, MNIST [32], SVHN [44], and Tiny ImageNet [31].

2.2. Vision-and-Language Pre-Training Datasets

The literature has shown that image-text data from COCO can be used to learn visual features that are competitive with supervised pre-training [18] on ImageNet when transferred to downstream tasks [4, 10, 13, 15, 38, 60, 67]. More recently, CLIP and ALIGN scaled up to 400M and 1B+ web-curated image-text pairs, enabling zero-shot visual recognition on downstream tasks.

Originally intended for image-text retrieval and image captioning, bi-modal datasets are now widely used for training cross-modal representations [7, 22, 27, 34, 35, 37, 40, 45, 53, 56, 58, 68] that transfer to downstream tasks, such as visual question answering [1, 23, 69], referring expressions [26], and visual reasoning [57, 66]. In light of these novel training paradigms, more recent works build larger datasets specifically for vision-and-language pre-training. Examples include LAIT [47], Conceptual Captions-12M, Wikipedia-ImageText (WIT) [55], Localized Narratives [46], Visual Genome [28], and YFCC100M [59]. Similar to these datasets, LGS offers rich semantic data for pre-training applications. However, our choice of e-commerce data source is unique, leading to a distinctive data distribution.

Image-text datasets are also used for learning visual features. The work [33] has proposed to train visual n-gram models on YFCC100M, whereas other methods [4, 10] aim to learn features from the captions of the COCO dataset [6]. The quality of the resulting features is competitive with supervised ImageNet training [18] on many downstream tasks [13, 15, 38, 51, 60]. Moreover, the image-text pre-training schemes scale up to non-public datasets that are even larger than LGS [25, 48].

A core motivation for collecting image-text pairs from the internet is the possibility of scaling up the data size without bearing the prohibitively expensive annotation costs. In light of this motivation, there have been multiple efforts to collect large quantities of noisy labels associated with online images, leading to datasets such as WebVision [36], YFCC100M, JFT-300M, and Instagram-3.5B [41].

Existing multi-modal e-commerce-inspired datasets include M5Product [12] and DeepFashion [39]. With 6 million instances, M5Product's size is around half of LGS's. While M5Product focuses on demonstrating the effectiveness of multi-modal training, this paper emphasizes analyzing the e-commerce data distribution and how it generalizes to general wide-domain datasets in a pre-training setting.

Table 1. The instance count of LGS compared with existing bi-modal datasets.

Dataset                                    Instances
Let's Go Shopping (this paper)             14,847,764
YFCC100M (Yahoo)                           100 million
RedCaps (University of Michigan)           12,011,111
Conceptual Captions 12M (Google)           12,423,374
WIT-English (Google)                       5,500,746
Localized Narratives (Google)              849,000
COCO (Microsoft)                           328,000
Visual Genome (Stanford)                   108,077
CLIP (OpenAI)                              400M
ALIGN (Google)                             1.8B
3. The Let's Go Shopping (LGS) Dataset

With 14,847,764 image-text pairs, the LGS dataset has a size advantage over many publicly available bi-modal datasets, as presented in Table 1. In this section, we offer additional analysis of the LGS data. For all analysis and experiments in the paper, we use a subset of 13 million instances, as the rest of the dataset was constructed in parallel with the experiments.

3.1. Data Collection

To create training data that is truly representative of e-commerce data as a whole, we include a wide range of commerce websites with various product kinds, such as infant products, sporting goods, bridal jewelry, etc. The collection pipeline starts with a set of heuristic rules to isolate the product pages from the non-product pages of an e-commerce website. Then, our automated extractor obtains relevant information on each product page, including the product title, the description, and the first listed image. Some products may include numerous variants (e.g., different colors for a type of T-shirt), and we collect all variants. We avoid crawling information that the sellers are unwilling to share. Specifically, the extractor is forbidden from crawling pages with a 'Disallow' extension. Finally, we use strict automated tests to filter out the instances with potential quality issues. Examples of the tests include confirming that the price is a number, certifying that the images are valid, and ensuring that the product title exists and contains no unexpected characters.

3.2. Characteristics of LGS Images

In general-domain image-caption datasets, the images usually consist of one or more subjects juxtaposed against a rich background, and their captions often mention the background. In contrast, e-commerce product thumbnails in LGS often depict only one inanimate item that occupies the foreground without any association with the background. The background is also often a single color, with some examples shown in Fig. 3. These clear backgrounds make it easier for models to locate the patterns that correspond to their tasks.

3.3. Characteristics of LGS Captions

Table 2. Comparing the word count statistics of the LGS and COCO captions.

Dataset   Min   Max    Mean    Median   Skew
LGS       2     3642   89.58   67       3.44
COCO      5     50     10.56   10       2.76

Table 3. The POS's that occur at least ten times.

Dataset   C. Nouns   P. Nouns   Adjectives   Verbs
LGS       158,479    139,174    48,907       57,481
COCO      10,403     1,655      3,053        4,961

Table 2 shows some statistics of the word distribution of the captions, showing that both LGS and COCO have highly positively skewed distributions, with LGS having a longer tail. Since LGS incorporates data from a large variety of e-commerce websites, the descriptions can include rich information. In the subsequent sections, we show that while the raw captions of LGS are diverse, clear structural information can be extracted from the LGS captions for fine-tuning purposes.

Additionally, we use the part-of-speech (POS) tagging method from the Spacy library [21] to analyze the linguistic statistics of the LGS captions, comparing common nouns, proper nouns, adjectives, and verbs. Table 3 shows that LGS has at least 10x more words per POS compared with COCO, whereas Figures Supp-5 and Supp-6 in the supplementary materials provide further insights into the composition of each word type. Due to the e-commerce nature of LGS, a large portion of the instances is clothing and other wearable items. Thus, within LGS, the proper nouns often present the brand names and sizes, the common nouns often describe the materials, and the adjectives and verbs often characterize the product-specific descriptions and actions, making the LGS captions highly descriptive.
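The POS statistics in Table 3 can be reproduced with a straightforward spaCy pass over the captions. A minimal sketch (the caption list and the pipeline name are placeholders, not the released tooling):

    import spacy
    from collections import Counter

    nlp = spacy.load("en_core_web_sm")  # assumption: any English spaCy pipeline works here
    captions = [
        "Terez navy camo stripe leggings with hi-shine fabric",
        "Ana Silver Co. apatite ring, size 8.25, 925 sterling silver",
    ]  # placeholder captions

    freq = Counter()  # (POS tag, lowercased word) -> occurrence count
    for doc in nlp.pipe(captions):
        for tok in doc:
            if tok.pos_ in {"NOUN", "PROPN", "ADJ", "VERB"}:
                freq[(tok.pos_, tok.text.lower())] += 1

    # Number of distinct words per POS that occur at least ten times (cf. Table 3).
    per_pos = Counter(pos for (pos, word), count in freq.items() if count >= 10)
    print(per_pos)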
[Figure 3: example LGS images with end-leaf labels such as Socks, Bracelets, Outfit Sets, Shoulder Bags, Roller Skates, and Sweatshirts and Hoodies.]
3.4. Taxonomy and Classification Labels

The leaf of each taxonomy tree is then used as the label, with some examples displayed in Figure 3. The taxonomy tree can also be used to generate summarized image captions that include the product title, the product brand name, and a number of "bullet strings" describing specific product attributes. The bullet strings include examples such as Nylon fabric, Classic collar, and Front zipper fastening. The LGS leaves form a long-tailed distribution that emphasizes common daily commodities, with the five most common leaves being Tops and T-shirts, Dresses, Rings, T-shirts, and Sweatshirts and Hoodies. For each of the three classification variants, we further clean the end leaves, with details provided in the two following paragraphs. In Figure Supp-1 in the supplementary materials, we provide a histogram of the end leaf distribution.

LGS-117 and LGS-710 are designed as pre-training datasets. Within all raw labels generated by the taxonomy model, there are synonyms and overlaps that should be unified. After manually merging the synonyms among the most popular classes, we observe 117 classes that contain at least 10k images. We select 10k images from each class, forming the balanced LGS-117 dataset. LGS-710 is an unbalanced dataset that includes more scarce classes. To accelerate label engineering, we use a semi-automated pipeline. First, we remove uninformative words like "other" and parse juxtaposed nouns by commas and "and". Next, we use a pre-trained language model to extract the embedding of each parsed noun. As an example, for the leaf Tops and T-shirts, we embed both tops and t-shirts. We then consider the "similarity" between two classes to be the maximum cosine similarity between all pairs of corresponding nouns. Very close classes are merged based on a similarity threshold of 0.92, which is determined by manually inspecting the merged classes.

LGS-Overlap is proposed as an out-of-distribution test set for models trained on ImageNet-1k, one of the most widely used benchmarking datasets. We use a similar semi-automated pipeline to merge LGS classes with ImageNet synsets [9, 43]. We optimize the pipeline by adjusting the similarity threshold to 0.90 and including additional pre-processing steps such as singularization and keyword merging. Note that polysemous words in the labels can refer to different objects in LGS and ImageNet. For example, "cricket" in LGS refers to sports equipment but refers to the insect species in ImageNet. Thus, a manual inspection of the merged classes is performed. After discarding classes with fewer than 20 instances, we gather the remaining 176 ImageNet synsets that align with the LGS end leaves and use them as the LGS-Overlap dataset. The fact that only 17.6% of the ImageNet synsets are matched shows a significant label distribution difference between e-commerce applications and common pre-training datasets. Since a higher level of label-space alignment is essential for more effective pre-training [41], LGS forms a representative benchmark and a pre-training dataset for downstream tasks that see distributions close to e-commerce.
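The merging rule of the semi-automated pipeline reduces to a maximum cosine similarity over the embedded nouns of two leaf names. A minimal sketch, where embed stands in for the unspecified pre-trained language model:

    import numpy as np

    rng = np.random.default_rng(0)
    _cache = {}

    def embed(word):
        # Placeholder embedding; the paper uses an unspecified pre-trained language model.
        if word not in _cache:
            _cache[word] = rng.standard_normal(64)
        return _cache[word]

    def parse_leaf(leaf):
        # Drop uninformative words like "other"; split juxtaposed nouns on commas and "and".
        text = leaf.lower().replace(",", " and ")
        nouns = [n.strip() for n in text.split(" and ")]
        return [n for n in nouns if n and n != "other"]

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def leaf_similarity(leaf_a, leaf_b):
        # Class similarity = maximum cosine similarity over all pairs of parsed nouns.
        return max(cosine(embed(a), embed(b))
                   for a in parse_leaf(leaf_a) for b in parse_leaf(leaf_b))

    def should_merge(leaf_a, leaf_b, threshold=0.92):
        # 0.92 for LGS-117/LGS-710 label engineering; 0.90 when aligning with ImageNet synsets.
        return leaf_similarity(leaf_a, leaf_b) >= threshold

    print(should_merge("Tops and T-shirts", "T-Shirts"))  # True: the shared noun matches exactly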
4. Experiments

4.1. Image Classification and Reconstruction

In this subsection, we use image classification and reconstruction tasks to characterize the distributional difference between LGS and ImageNet. We consider the distributions of images as well as the labels.

4.1.1 ImageNet Classifiers Do Not Readily Generalize to E-commerce

The existing literature has shown that carefully constructed images collected in a bias-controlled manner can elicit a significant performance degradation on classifiers trained on ImageNet [3]. By applying pre-trained ImageNet classification models to the LGS-Overlap dataset without further training, we show that such out-of-distribution examples naturally exist in the e-commerce domain. Specifically, we use publicly available weights of a ResNet-50 model and a ConvNeXT-Base model. The ResNet-50 achieves a 74% average recall across the 176 overlapping synsets over the ImageNet images, but the number noticeably reduces to 46.43% on LGS-Overlap. The ConvNeXT-Base obtains 79.00% and 50.14% on ImageNet and LGS-Overlap, respectively. This difference highlights that existing ImageNet models do not readily transfer to LGS instances. In addition to having a different label distribution, the e-commerce domain forms a natural distribution shift even for the classes that also exist in ImageNet. While taxonomy standardization techniques exist, aligning and merging the label space is still hard in general. Thus, a pre-training dataset that is more aligned with e-commerce is necessary, and LGS fulfills this role.

We further show that LGS end leaves are well-separable, verifying that the performance degradation of ImageNet models is caused by the distribution mismatch and not the ambiguity of the LGS classes. Note that Table 4 shows that the models learned on LGS-117 / LGS-710 can achieve high accuracy on LGS-117 / LGS-710. Specifically, we consider the "linear probing followed by fine-tuning" training schedule, a transfer learning scheme that has been shown to improve the robustness against distribution shift by avoiding significant distortions of the pre-trained weights.

Table 4. The classification accuracy of models trained on LGS shows that the LGS end leaves are well-separable.

LGS Accuracy           LGS-117 (from scratch)   LGS-117 (IN-pretrained)   LGS-710 (IN-pretrained, Top-1)   LGS-710 (IN-pretrained, Top-5)
After linear probing   –                        69.58 %                   60.72 %                          81.16 %
After fine-tuning      97.89 %                  98.16 %                   77.27 %                          89.09 %

4.1.2 Non-classification Visual Feature Extractors Can Generalize

Since the image-label correspondence is different between LGS and ImageNet, we use self-supervised training to isolate this mismatch and focus on the distribution of images. In the context of transfer learning, since self-supervised training does not use labels, it circumvents the issue of label space mismatch between target and source domains, which has been shown to undermine the quality of transfer learning. Masked AutoEncoder (MAE) [17] is a self-supervised method designed for pre-training. Thus, we compare the performance of an MAE trained on ImageNet only with an MAE trained on ImageNet and LGS-710. Figure 4 shows that the MAE trained on ImageNet can reconstruct a reasonable LGS image, but the reconstruction quality of the ImageNet+LGS model is better, demonstrating that LGS can be used to learn e-commerce visual features.

To quantitatively demonstrate the generalizability of the vision feature extractors, we evaluate the reconstruction performance of the MAE models trained on LGS and ImageNet on COCO. The qualities of the raw reconstructions obtained by the models are presented in Table 5. While LGS is more domain-specific compared with ImageNet and COCO (both of which cover a wide range of domains), the MAE trained on LGS is able to generate COCO images with higher quality compared with the ImageNet model. Furthermore, Table 6 shows that, upon the visual embeddings learned jointly on ImageNet and LGS, a linear classifier with satisfactory performance can be learned on both ImageNet and LGS. The above results verify that the feature extractors can generalize between LGS and general-domain datasets, despite the separation of the intermediate visual embeddings (which are visualized in Appendix A.2).

Table 5. The reconstruction quality of the MAE models trained on LGS and ImageNet, evaluated on COCO. The symbol ↑ denotes "higher is better" while ↓ means "lower is better".

Training Dataset       Inception (↑)   FID (↓)
ImageNet-1k            9.2930          114.60
IN pretrain→IN+LGS     9.1906          115.48
LGS                    10.187          91.387

Table 6. Linear probing accuracy of the self-supervised MAE models with three different initializations. A: baseline ImageNet MAE model [17], B: LGS MAE model, C: LGS+ImageNet MAE model. Specifically, C is initialized with A followed by 150 epochs on mixed ImageNet and LGS-710 data (1:1 ratio). Fine-tuning on the LGS-117 and ImageNet datasets used 40 and 60 epochs, respectively.

Linear probing dataset      A         B         C
LGS-117 (40 epochs)         72.98 %   76.37 %   76.87 %
ImageNet-1k (60 epochs)     67.78 %   46.37 %   65.29 %
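The linear probing reported in Table 6 amounts to training a single linear layer on frozen encoder features. A minimal PyTorch sketch, assuming the features have already been extracted and cached:

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    # Placeholder tensors standing in for cached encoder features and labels.
    features = torch.randn(1000, 768)
    labels = torch.randint(0, 117, (1000,))  # e.g., the 117 classes of LGS-117

    probe = nn.Linear(features.shape[1], 117)  # only this layer is trained
    optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
    loader = DataLoader(TensorDataset(features, labels), batch_size=256, shuffle=True)

    for epoch in range(40):  # Table 6 probes LGS-117 for 40 epochs
        for x, y in loader:
            loss = nn.functional.cross_entropy(probe(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    with torch.no_grad():
        accuracy = (probe(features).argmax(dim=1) == labels).float().mean()
    print(f"linear probing accuracy: {accuracy:.3f}")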
[Figure 4 and 5 panels, left to right: original, masked, ImageNet MAE reconstruction, ImageNet MAE reconstruction + visible patches, ImageNet+LGS MAE reconstruction, ImageNet+LGS MAE reconstruction + visible patches.]
Figure 4. While an MAE trained on ImageNet can reasonably reconstruct an LGS image, adding LGS instances to the
training improves the reconstruction quality.
Figure 5. Adding LGS instances to the training also improves the reconstruction on some ImageNet instances.
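The reconstructions in Figures 4 and 5 come from MAE's random patch masking. The masking step alone can be illustrated as below (a toy sketch of the mechanism, not the training code used for these figures):

    import torch

    image = torch.randn(1, 3, 224, 224)  # placeholder image batch
    patch = 16
    patches = image.unfold(2, patch, patch).unfold(3, patch, patch)          # (1, 3, 14, 14, 16, 16)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, -1)      # (1, 196, 768)

    mask_ratio = 0.75                    # MAE hides most patches and reconstructs them
    num_keep = int(patches.shape[1] * (1 - mask_ratio))
    noise = torch.rand(1, patches.shape[1])
    keep_idx = noise.argsort(dim=1)[:, :num_keep]                            # random visible subset
    visible = torch.gather(
        patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, patches.shape[-1]))

    print(visible.shape)  # (1, 49, 768): only 25% of the patches reach the encoder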
Based on the above observations, we infer that the e-commerce data distribution, represented by the LGS dataset, significantly differs from existing general datasets in the label space, while visual features can generalize. Thus, LGS is an ideal pre-training dataset for downstream tasks whose class distributions align with the e-commerce domain.

4.1.3 LGS Supplements ImageNet as a Pre-training Dataset

LGS can also widen the span of the pre-training distribution when used in conjunction with ImageNet, acting as a bridge between general visual features and domain-specific applications. Specifically, Table 7 shows that a two-phase ImageNet→LGS-710 weakly-supervised pre-training scheme produces features more suitable for fine-tuning on common downstream tasks. On e-commerce-related downstream datasets such as Clothing1M, the models pre-trained on LGS also excel in both linear probing and end-to-end settings.

Table 7. ImageNet→LGS-710 two-phase pre-training improves linear probing accuracy for downstream tasks including CIFAR-100, Fashion MNIST, and Clothing1M. On Clothing1M, whose data also comes from the e-commerce domain, the LGS-pre-trained features also improve end-to-end fine-tuning performance. For Clothing1M, we only use its clean training set, whereas Clothing1M (10%) is a few-shot setup that trains on a 10% subset of the clean training set.

In linear probing experiments, we observe that incorporating in-domain pre-training (both LGS-117 and LGS-710) results in better performance (2% absolute) compared to ImageNet pre-training. Moreover, in limited-data settings, we observe less model regression compared to the full-data setups. For example, for fine-tuning a linear classifier on 10% of the Clothing1M-clean dataset, the ImageNet pre-trained model regresses more (11.5% relative) compared to the LGS-117 and LGS-710 pre-trained models (7.4% and 8.4% relative, respectively). When models are trained end-to-end, we observe that the pre-training setup is less critical when fine-tuning on the full Clothing1M-clean training dataset. However, for limited-data experiments, filtering out under-represented classes (LGS-117) in pre-training helps with the downstream fine-tuning results (2% absolute) compared to both the ImageNet and LGS-710 datasets.
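The zero-shot evaluation of Section 4.1.1 reduces to averaging per-class recall of a pre-trained ImageNet classifier over the mapped synsets. A minimal sketch, where the overlap data is a placeholder:

    import torch
    from torchvision.models import resnet50, ResNet50_Weights

    weights = ResNet50_Weights.IMAGENET1K_V1
    model = resnet50(weights=weights).eval()
    preprocess = weights.transforms()

    # Placeholder: (PIL image, mapped ImageNet class index) pairs from LGS-Overlap.
    overlap_pairs = []

    hits, totals = {}, {}
    with torch.no_grad():
        for image, target in overlap_pairs:
            pred = model(preprocess(image).unsqueeze(0)).argmax(dim=1).item()
            totals[target] = totals.get(target, 0) + 1
            hits[target] = hits.get(target, 0) + int(pred == target)

    # Average recall over the overlapping synsets (176 classes for LGS-Overlap).
    recalls = [hits[c] / totals[c] for c in totals]
    print(sum(recalls) / max(len(recalls), 1))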
In Appendix A.3 in the supplementary materials, we use GradCam [14, 52] to visualize the representations learned by the classification models, demonstrating that the LGS models look for much more localized patterns that are relevant to e-commerce classification.

4.2. Caption Generation

In this section, we illustrate that the distinct distribution of LGS benefits vision-language bi-modal tasks. Specifically, we study the efficacy of image-captioning (IC) models trained on traditional datasets in predicting LGS-type descriptions. We also evaluate the performance of LGS-trained models in generating attribute-rich image captions that would otherwise not be possible for models trained on more traditional datasets.

In this experiment, we utilize a bi-modal modeling framework based on OFA [61], a recently proposed encoder-decoder architecture that has achieved state-of-the-art performance in many language-vision tasks. For each LGS image, the corresponding caption can be constructed by concatenating the "product description" strings in various orders. Specifically, we create three types of captions:

1. LGS-title: title and brand name;
2. LGS-taxonomy: product taxonomy;
3. LGS-description: concatenated bullet strings.

The OFA IC model was trained on the three types of LGS inputs as well as on the traditional COCO dataset. The IC model performance in terms of its ability to predict the appropriate target string is tabulated in Table 8.

Table 8. IC model performance on the image-captioning task, evaluated on different combinations of training and evaluation datasets.

Training Set      Test Set       METEOR (↑)
LGS-title         LGS-title      0.184
LGS-description   LGS-title      0.161
LGS-taxonomy      LGS-taxonomy   0.584
COCO              LGS-title      0.069

[Figure 6 panels: (a) Input Prompt, (b) Vanilla SD, (c) LGS-117. LGS example prompts: "a photo of terez leggings navy camo stripe hi-shine bump squad leggings, e-commerce"; "a photo of ana silver co. rings apatite ring size 8.25 (925 sterling silver) ring81016, e-commerce". DeepFashion InShop example prompts: "a photo of Dresses Ivory-navy Forever 21 Contemporary - Show the perfect amount of skin in this sleek, sophisticated surplice dress, e-commerce"; "a photo of Pants Black-grey This jogger's easy, slouchy silhouette gets a little grit courtesy of its eye-popping print of photorealistic roses, e-commerce".]

Figure 6. Qualitative comparisons of the generations of the Vanilla and the LGS-117-fine-tuned SD models in the general setting. The fine-tuned model generates more visually appealing images.

Table 9. Comparing the Vanilla SD and the LGS-117 fine-tuned model on LGS and DeepFashion datasets.

Model               Test Set      FID (↓)
Vanilla             LGS Val       25.3498
Vanilla + LGS-117   LGS Val       24.1952
Vanilla             DeepFashion   62.9269
Vanilla + LGS-117   DeepFashion   74.0185
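The three LGS caption variants used in Section 4.2 are simple concatenations of scraped product fields. A minimal construction sketch, with hypothetical field names rather than the released schema:

    def build_captions(product):
        """Assemble the three caption variants from a scraped product record."""
        title = product["title"]
        brand = product["brand"]
        taxonomy = " > ".join(product["taxonomy_path"])   # root to end leaf
        bullets = ", ".join(product["bullet_strings"])
        return {
            "LGS-title": f"{brand} {title}",
            "LGS-taxonomy": taxonomy,
            "LGS-description": bullets,
        }

    example = {
        "title": "navy camo stripe leggings",
        "brand": "Terez",
        "taxonomy_path": ["Women's Clothing", "Leggings"],
        "bullet_strings": ["Nylon fabric", "Classic collar", "Front zipper fastening"],
    }
    print(build_captions(example))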
4.3. Text-to-Image Generation

The choice of the domain identifier is crucial, as [50] shows that a domain identifier with a strong prior should be avoided. For example, the word retail has a strong prior, and the pre-trained "Vanilla" SD model confidently associates it with (physical) retail stores. This behavior is undesirable for the goal of e-commerce style transfer. By analyzing the effects of various domain identifiers on the generations of the pre-trained SD model, we determine that the word "e-commerce" gives a weak prior and is a suitable identifier. We then construct the ground-truth training prompts for the LGS images in the format of a photo of <brand> <end_leaf> <title>, e-commerce, where the <end_leaf> refers to the end leaf of the taxonomy tree introduced in Section 3.4. The "Vanilla" SD checkpoint is fine-tuned on one million LGS image-prompt pairs for 100k steps with a batch size of 24. Table 9 displays the quantitative results on an unseen validation set (5K image-prompt pairs) from LGS and a subset of the DeepFashion InShop dataset. The fine-tuning process enhances the performance of SD on LGS as expected. While the FID scores on DeepFashion are worse, the generations of the LGS-117-fine-tuned model are aesthetically more appealing. At present, there are no quantitative metrics that directly measure aesthetic superiority. Thus, we present Figure 6 and the additional examples in Appendix A.6 in the supplementary materials (Figures Supp-8 and Supp-9) to demonstrate the aesthetic improvement qualitatively. The worse FID scores may indicate a distribution shift between LGS and DeepFashion images.

For the fine-grained setting, we use data belonging to only a particular end leaf, using the same prompt format without the additional identifier. The checkpoint is fine-tuned with 10k image-prompt pairs for 25k steps with a batch size of 6. We use the "athletic shoes" end leaf as an example and compare the generations before and after LGS fine-tuning under the fine-grained setting in Figure 7. Like the general-setting results, the fine-grained examples also indicate that LGS helps adapt text-to-image models to e-commerce scenarios and improves image quality and aesthetics.

[Figure 7 prompts: "a photo of new balance men white running shoes"; "a photo of on running athletic shoes on running women's cloudswift road shoe 41.99578 in lake sky".]

Figure 7. The LGS-117-fine-tuned SD model also generates more visually appealing images in the fine-grained setting. The prompts are from LGS.

5. Conclusion

The Let's Go Shopping (LGS) dataset consists of 15 million pairs of publicly accessible diverse images and descriptive captions from e-commerce websites. Our efficient semi-automated gathering and annotation pipeline ensures scalable data collection. We then use LGS to show that while the categories associated with e-commerce data may not align with the general-domain pre-training datasets, visual feature extractors can be shared. Finally, we show that the distinct distribution offered by LGS and LGS's bi-modal nature can be beneficial for applications including image classification, image reconstruction, bi-modal representation learning, and text-to-image generation.

6. Acknowledgments

We would like to thank Matteo Bruno, James Caulkins, Xiaowen Dong, Peter Grindrod, Jack Hessel, Sean Holloway, Dorothy Nicholas, Janet Pierrehumbert, and Julian Winkler for their valuable help and fruitful discussions.

We also appreciate Baillie Gifford, the Institute for New Economic Thinking at the Oxford Martin School, and the UK Engineering and Physical Science Research Council for funding our work at the University of Oxford. This work was supported in part through the NYU IT High-Performance Computing resources, services, and staff expertise.
References

[1] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In ICCV, 2015.
[2] Yatong Bai, Trung Dang, Dung Tran, Kazuhito Koishida, and Somayeh Sojoudi. Accelerating diffusion-based text-to-audio generation with consistency distillation. arXiv preprint arXiv:2309.10740, 2023.
[3] Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. In NeurIPS, 2019.
[4] Mert Bulent Sariyildiz, Julien Perez, and Diane Larlus. Learning visual representations with caption annotations. In ECCV, 2020.
[5] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, 2021.
[6] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
[7] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: Learning universal image-text representations. arXiv preprint arXiv:1909.11740, 2019.
[8] Francois Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[10] Karan Desai and Justin Johnson. VirTex: Learning visual representations from textual annotations. In CVPR, 2020.
[11] Karan Desai, Gaurav Kaul, Zubin Aysola, and Justin Johnson. RedCaps: Web-curated image-text data created by the people, for the people. arXiv, abs/2111.11431, 2021.
[12] Xiao Dong, Xunlin Zhan, Yangxin Wu, Yunchao Wei, Michael C Kampffmeyer, Xiaoyong Wei, Minlong Lu, Yaowei Wang, and Xiaodan Liang. M5Product: Self-harmonized contrastive learning for e-commercial multi-modal pretraining. In CVPR, 2022.
[13] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John M. Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 2009.
[14] Jacob Gildenblat and contributors. PyTorch library for CAM methods. https://github.com/jacobgil/pytorch-grad-cam, 2021.
[15] Agrim Gupta, Piotr Dollar, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. In CVPR, 2019.
[16] Xintong Han, Zuxuan Wu, Phoenix X Huang, Xiao Zhang, Menglong Zhu, Yuan Li, Yang Zhao, and Larry S Davis. Automatic spatially-aware fashion concept discovery. In ICCV, 2017.
[17] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[19] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. NeurIPS Deep Learning and Representation Learning Workshop, 2015.
[20] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
[21] Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. spaCy: Industrial-strength natural language processing in Python. 2020.
[22] Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849, 2020.
[23] Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019.
[24] Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. In NeurIPS, 2019.
[25] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918, 2021.
[26] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, 2014.
[27] Wonjae Kim, Bokyung Son, and Ildoo Kim. ViLT: Vision-and-language transformer without convolution or region supervision. In ICML, 2021.
[28] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, Michael S Bernstein, and Li Fei-Fei. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017.
[29] Alex Krizhevsky. Learning multiple layers of features from tiny images, 2012.
[30] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The Open Images dataset v4. International Journal of Computer Vision, 128(7):1956–1981, 2020.
[31] Ya Le and Xuan S. Yang. Tiny ImageNet visual recognition challenge. 2015.
[32] Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. ATT Labs, 2, 2010.
[33] Ang Li, Allan Jabri, Armand Joulin, and Laurens van der Maaten. Learning visual n-grams from web data. In ICCV, 2017.
[34] Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. AAAI, 2020.
[35] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
[36] Wen Li, Limin Wang, Wei Li, Eirikur Agustsson, and Luc Van Gool. WebVision database: Visual learning and understanding from web data. arXiv preprint arXiv:1708.02862, 2017.
[37] Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV, 2020.
[38] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[39] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In CVPR, 2016.
[40] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, 2019.
[41] Dhruv Kumar Mahajan, Ross B. Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In ECCV, 2018.
[42] Leland McInnes, John Healy, Nathaniel Saul, and Lukas Großberger. UMAP: Uniform manifold approximation and projection. Journal of Open Source Software, 3(29):861, 2018.
[43] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
[44] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[45] Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. Im2Text: Describing images using 1 million captioned photographs. In NeurIPS, 2011.
[46] Jordi Pont-Tuset, Jasper Uijlings, Soravit Changpinyo, Radu Soricut, and Vittorio Ferrari. Connecting vision and language with localized narratives. In ECCV, 2020.
[47] Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, and Arun Sacheti. ImageBERT: Cross-modal pre-training with large-scale weak-supervised image-text data. arXiv preprint arXiv:2001.07966, 2020.
[48] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
[49] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
[50] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. 2022.
[51] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 2015.
[52] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
[53] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.
[54] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
[55] Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Bendersky, and Marc Najork. WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning. arXiv preprint arXiv:2103.01913, 2021.
[56] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT: Pre-training of generic visual-linguistic representations. In ICLR, 2020.
[57] Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus for reasoning about natural language grounded in photographs. In ACL, 2019.
[58] Hao Tan and Mohit Bansal. LXMERT: Learning cross-modality encoder representations from transformers. In EMNLP, 2019.
[59] Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.
[60] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The iNaturalist species classification and detection dataset. In CVPR, 2018.
[61] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In ICML, 2022.
[62] Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. The Fashion IQ dataset: Retrieving images by combining side information and relative natural language feedback. CVPR, 2021.
[63] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms, 2017.
[64] Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive noisy labeled data for image classification. In CVPR, 2015.
[65] Chenfeng Xu, Shijia Yang, Tomer Galanti, Bichen Wu, Xiangyu Yue, Bohan Zhai, Wei Zhan, Peter Vajda, Kurt Keutzer, and Masayoshi Tomizuka. Image2Point: 3D point-cloud understanding with 2D image pretrained models. In ECCV, 2022.
[66] Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In CVPR, 2019.
[67] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using Places database. In NeurIPS, 2014.
[68] Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J Corso, and Jianfeng Gao. Unified vision-language pre-training for image captioning and VQA. AAAI, 2020.
[69] Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7W: Grounded question answering in images. In CVPR, 2016.
[Figure Supp-1: histogram of LGS end-leaf instance counts (y-axis 0 to 600,000). The x-axis lists end leaves such as Pants, Area Rugs, Tops and T-Shirts, Socks, Shoulder Bags, T-Shirts, Boots, Jeans, Sweatshirts and Hoodies, and Athletic Shoes, with the remaining classes grouped as "(Other end leaves)".]
[Figure Supp-2 panels: ResNet50 block 0, block 1, block 2, block 3, and block 4 features, with LGS and ImageNet points shown in each panel.]
Figure Supp-2. UMAP visualization of the ImageNet and LGS features extracted on a ResNet50 model trained on
ImageNet and LGS.
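Projections like Figure Supp-2 can be produced with umap-learn on the pooled block features. A minimal sketch with random placeholder feature arrays:

    import numpy as np
    import umap  # umap-learn
    import matplotlib.pyplot as plt

    # Placeholders for pooled ResNet-50 block outputs computed on each dataset.
    lgs_feats = np.random.randn(500, 2048)
    imagenet_feats = np.random.randn(500, 2048)

    features = np.concatenate([lgs_feats, imagenet_feats])
    labels = np.array([0] * len(lgs_feats) + [1] * len(imagenet_feats))

    embedding = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(features)

    plt.scatter(embedding[labels == 0, 0], embedding[labels == 0, 1], s=2, label="LGS")
    plt.scatter(embedding[labels == 1, 0], embedding[labels == 1, 1], s=2, label="ImageNet")
    plt.legend()
    plt.savefig("umap_block_features.png")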
A. Additional Analyses
A.1. LGS End Leaf Histogram
The instance counts of the LGS end leaves are displayed in Figure Supp-1. The top 80 most popular end leaves
encompass 83.28% of the total instances, with the most popular Tops and T-shirts containing 16.23% of the
instances.
[Figure Supp-3 grids: (a) ImageNet examples. (b) LGS examples.]

Figure Supp-3. The saliency map of the LGS-ImageNet binary classifier.

is more important in the second LGS instance. This observation aligns with the findings of [24], which states that deep neural networks are not always understandable by humans.
[Figure Supp-4 columns: Input, ImageNet RN50, LGS-117 RN50, LGS-710 RN50.]
Figure Supp-4. GradCam visualizations show that LGS classification models look for much more localized patterns.
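The cited pytorch-grad-cam library [14] packages the standard Grad-CAM recipe; the core computation can also be written directly with hooks. A rough sketch on a torchvision ResNet-50 (not the authors' visualization code):

    import torch
    from torchvision.models import resnet50, ResNet50_Weights

    model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1).eval()
    acts, grads = {}, {}

    layer = model.layer4[-1]  # last convolutional block
    layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

    x = torch.randn(1, 3, 224, 224)            # placeholder preprocessed image
    score = model(x)[0].max()                  # logit of the predicted class
    score.backward()

    weights = grads["g"].mean(dim=(2, 3), keepdim=True)       # pooled gradients per channel
    cam = torch.relu((weights * acts["a"]).sum(dim=1))        # weighted sum of activation maps
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
    print(cam.shape)                                          # (1, 7, 7); upsample to overlay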
When the batch normalization (BN) layers are jointly optimized alongside the first and the last layer, this modified "linear" probing scheme can achieve a performance that is comparable to end-to-end training [65]. Specifically, with learnable BN, a ResNet-50 model pre-trained on ImageNet→LGS-710→ImageNet achieves an accuracy of 71.41% on CIFAR-100, compared with 69.47% for an ImageNet-only model.
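The modified probing scheme of [65] keeps the batch-normalization layers trainable while the remaining weights stay frozen. A minimal sketch of the parameter selection:

    import torch
    from torch import nn
    from torchvision.models import resnet50

    model = resnet50()  # assume weights pre-trained as described above
    model.fc = nn.Linear(model.fc.in_features, 100)  # new head, e.g., for CIFAR-100

    # Freeze everything, then re-enable the BN layers plus the first and last layers.
    for p in model.parameters():
        p.requires_grad = False
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            for p in m.parameters():
                p.requires_grad = True
    for p in list(model.conv1.parameters()) + list(model.fc.parameters()):
        p.requires_grad = True

    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(trainable, lr=0.01, momentum=0.9)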
Figure Supp-5. Top 20 most common words per POS for LGS.
Figure Supp-6. Top 20 most common words per POS for COCO.
Table Supp-1. Comparing the n-gram statistics of LGS with that of COCO.
COCO's frequent tri-grams describe a group of objects and the relative position of the objects, whereas the LGS tri-grams encompass inherent properties of the commodities, including the size and the nature of each item.
In addition to the part-of-speech (POS) results presented in Section 3.3, we use Figures Supp-5 and Supp-6 to
present the most common words per POS for LGS and COCO, respectively.
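The n-gram statistics of Table Supp-1 can be gathered with a plain counter over the tokenized captions. A minimal sketch with placeholder captions:

    from collections import Counter

    def ngrams(tokens, n=3):
        return zip(*(tokens[i:] for i in range(n)))

    captions = [
        "nylon fabric classic collar front zipper fastening",
        "a group of people standing on top of a beach",
    ]  # placeholder LGS-like and COCO-like captions

    trigram_counts = Counter()
    for caption in captions:
        trigram_counts.update(ngrams(caption.lower().split()))

    print(trigram_counts.most_common(5))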
Table Supp-2. The FID scores across prompts using a subset (n = 5000) of the LGS and DeepFashion InShop datasets.

Prompt ID   Model               LGS FID (↓)   DeepFashion FID (↓)
1           Vanilla             40.4437       61.8519
1           Vanilla + LGS-117   42.7328       74.4327
2           Vanilla             42.1081       63.2344
2           Vanilla + LGS-117   42.0529       77.7190
3           Vanilla             36.7157       58.2189
3           Vanilla + LGS-117   36.1946       79.3607
4           Vanilla             38.4101       62.9269
4           Vanilla + LGS-117   38.4100       74.0185
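FID numbers such as those above can be computed with, for example, torchmetrics' FrechetInceptionDistance. A minimal sketch with random placeholder images in place of real validation images and SD generations:

    import torch
    from torchmetrics.image.fid import FrechetInceptionDistance

    fid = FrechetInceptionDistance(feature=2048)

    real_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)  # placeholder
    fake_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)  # placeholder

    fid.update(real_images, real=True)
    fid.update(fake_images, real=False)
    print(fid.compute())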
Table Supp-3. Prompts evaluated for text-to-image generation experiment. Prompt structures varied slightly due to
available metadata across datasets.
Prompt 4: end_leaf: Jackets Vests; gender_category: Men; color: Khaki; first sentence of description: Made in a cotton-nylon blend with a modified collar and partial mesh lining, this baseball jacket is the slickest iteration of the style yet.
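The training prompts follow the a photo of <brand> <end_leaf> <title>, e-commerce template from the main text. A minimal sketch assembling a prompt from metadata records like the one above (field names are illustrative):

    def build_prompt(record, domain_identifier="e-commerce"):
        pieces = [record.get("brand", ""), record["end_leaf"], record.get("title", "")]
        body = " ".join(p for p in pieces if p)
        return f"a photo of {body}, {domain_identifier}"

    record = {
        "brand": "terez",
        "end_leaf": "Leggings",
        "title": "navy camo stripe hi-shine bump squad leggings",
    }
    print(build_prompt(record))
    # a photo of terez Leggings navy camo stripe hi-shine bump squad leggings, e-commerce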
[Figure Supp-8 prompts: "a photo of Blouses Shirts Tomato Love 21 - A woven cami featuring a pleated front and crossback strap detail in the back, e-commerce"; "a photo of Jackets Vests Khaki Made in a cotton-nylon blend with a modified collar and partial mesh lining, this baseball jacket is the slickest iteration of the style yet, e-commerce".]
Figure Supp-8. Additional qualitative examples of the Vanilla SD vs LGS-117 fine-tuned SD model on DeepFashion
InShop dataset.
[Figure Supp-9 prompts: "a photo of ana silver co. earrings rainbow moonstone earrings 3/4" (925 sterling silver) earr415021"; "a photo of myconquering conquering unisex black joggers"; "a photo of wristwatchstraps.co smart watch accessories bumper cover+glass for apple watch - lilac 21 - 38mm".]
Figure Supp-9. Additional qualitative examples of the Vanilla SD vs LGS-117 fine-tuned SD model on the LGS dataset.