
Let's Go Shopping (LGS) – Web-Scale Image-Text Dataset for Visual Concept Understanding

Yatong Bai1∗ Utsav Garg2 Apaar Shanker2 Haoming Zhang2 Samyak Parajuli2
Erhan Bas2 Isidora Filipovic3 Amelia N. Chu3 Eugenia D Fomitcheva3 Elliot Branson2
Aerin Kim2 Somayeh Sojoudi1 Kyunghyun Cho3
arXiv:2401.04575v1 [cs.CV] 9 Jan 2024

1 University of California, Berkeley    2 Scale AI    3 New York University

Work done during internship at Scale. Correspondences to [email protected], [email protected].

– Abstract –

Vision and vision-language applications of neural networks, such as image classification and captioning, rely on large-scale annotated datasets that require non-trivial data-collecting processes. This time-consuming endeavor hinders the emergence of large-scale datasets, limiting researchers and practitioners to a small number of choices. Therefore, we seek more efficient ways to collect and annotate images. Previous initiatives have gathered captions from HTML alt-texts and crawled social media postings, but these data sources suffer from noise, sparsity, or subjectivity. For this reason, we turn to commercial shopping websites whose data meet three criteria: cleanliness, informativeness, and fluency. We introduce the Let's Go Shopping (LGS) dataset, a large-scale public dataset with 15 million image-caption pairs from publicly available e-commerce websites. When compared with existing general-domain datasets, the LGS images focus on the foreground object and have less complex backgrounds. Our experiments on LGS show that the classifiers trained on existing benchmark datasets do not readily generalize to e-commerce data, while specific self-supervised visual feature extractors can better generalize. Furthermore, LGS's high-quality e-commerce-focused images and bimodal nature make it advantageous for vision-language bi-modal tasks: LGS enables image-captioning models to generate richer captions and helps text-to-image generation models achieve e-commerce style transfer.

1. Introduction

Computer vision (CV) and natural language processing (NLP) tasks increasingly rely on pre-trained representations. While NLP representations can be trained on unannotated raw text, vision applications often consider pre-training using large-scale datasets with discrete class labels annotated by humans, such as ImageNet [9, 51] or OpenImages [30]. Vision-language bimodal applications, such as image captioning or visual question answering, similarly rely on large amounts of annotated data. Unfortunately, many of the large-scale bi-modal datasets now in existence, such as CLIP [48], ALIGN [25], and JFT300M [8, 19], are not publicly accessible. As a result, research has been constrained to a few selected large datasets, such as Conceptual Captions [5] and COCO [6]. This shortage of available public datasets can be attributed in part to the time and effort required to gather, clean, and annotate large datasets. Therefore, we adopt a more efficient and scalable high-quality data collection pipeline to acquire image-text pairs easily available on e-commerce websites. While some existing datasets use public websites as annotation sources, most of them use social media websites (RedCaps [11]) or alt-texts¹ (Conceptual Captions [53]) for this purpose. Nevertheless, social media data suffer from subjectivity. On the other hand, alt-texts can be unacceptably noisy, sometimes merely including uninformative texts such as "alt img", as shown in Figure 1. As a result, we gravitate to e-commerce websites, where clean images with objective, accurate, succinct, and informative descriptions are abundant, as illustrated in Figure 2.

¹ Alt-texts are short descriptions of HTML website images. When an image cannot be rendered, the website displays its alt-text as a surrogate.

Figure 1. In comparison to e-commerce product descriptions, alt-text is usually less informative, sometimes too broad, or even irrelevant. Examples: "Shaun the Sheep Coloring Pages" (unclear description); "Ju-Ni San Francisco | A San Francisco Food Restaurant Review" (does not mention the tuna maki shown).

Figure 2. An e-commerce-based LGS sample instance with image, title, and description.

The Let's Go Shopping (LGS) dataset collects 15 million image-description pairs from approximately 10,000 e-commerce sites selling a wide range of products. Due to the nature of e-commerce data, the majority of LGS images have a clear background and a static focus on the stated object. On the captions front, LGS provides precise and elaborative captions. We show how highly precise information can be extracted from captions for vision-language fine-tuning.

On the other hand, ImageNet-1k has served as the ubiquitous go-to pre-training and evaluation dataset for vision-only applications. While ImageNet covers a wide range of domains, the diversity of angles and arrangements is restricted. As a result, the literature has shown that ImageNet models do not generalize well to deliberately constructed out-of-distribution (OOD) scenarios [3]. This work uses image classification experiments to demonstrate that such OOD data is ubiquitous in e-commerce applications. We then show that models can benefit from the unique e-commerce distribution in image classification, reconstruction, captioning, and generation tasks.

Specifically, we convert the LGS captions into taxonomies and labels and demonstrate a large disparity between the label distributions of LGS and ImageNet: even with best efforts, only 17.6% of the concepts are shared between popular ImageNet-1k synsets and the e-commerce corpus (more details in Section 3.4). Even for those shared classes, the performance of ImageNet models degrades significantly. By verifying that the LGS classes are well-separable, we conclude that this performance degradation can be mostly attributed to the distributional disparity. To separate the effects of labels and captions and isolate the distribution shift of the images, we consider Masked AutoEncoder (MAE) [17], a self-supervised pre-training method that does not rely on labels. We show that an MAE model trained on ImageNet-1k can reconstruct LGS images well, but adding LGS to the training data improves the performance on LGS and generalizes better to COCO.

The above results demonstrate that while the e-commerce images are from a distribution that is distinct from current benchmark datasets, the feature extractors can be shared. Moreover, we illustrate additional merits of LGS that qualify it as a pre-training dataset. Specifically, the models learned on both LGS and ImageNet have improved linear probing performance on common downstream tasks such as CIFAR-100 [29] and Fashion MNIST [63], compared with the ImageNet-only counterparts.

The distinctive distribution of LGS also benefits vision-language bimodal tasks. For caption generation tasks, we train an OFA model [61] on LGS to demonstrate that the more prominent image foreground, cleaner image background, and the highly descriptive captions of LGS enable the model to produce "attribute-rich" image captions, which models trained on traditional datasets fail to produce.

Finally, for text-to-image generation tasks, diffusion models [2, 20, 54] are currently the most popular family of methods. To illustrate the efficacy of LGS in this setting, we use Stable Diffusion (SD) [49] and fine-tune it in both general and fine-grained settings on subsets of the LGS dataset.

We demonstrate promising qualitative and quantitative results on adapting existing text-to-image models using LGS for e-commerce-related generations. Furthermore, with the help of its distinct image style and descriptive captions, LGS can help the SD model generate e-commerce-styled images.

To make LGS available to the public, we will share the filtered links to the image-caption pairs under the "BSD 3-Clause" license (also used in common datasets such as ImageNet). We will also share the downloader so that the exact same dataset can be reproduced.

Table 1. The instance count of LGS compared with existing bi-modal datasets.

Datasets | Instances
Let's Go Shopping (this paper) | 14,847,764
YFCC100M (Yahoo) | 100 million
RedCaps (University of Michigan) | 12,011,111
Conceptual Captions 12M (Google) | 12,423,374
WIT-English (Google) | 5,500,746
Localized Narratives (Google) | 849,000
COCO (Microsoft) | 328,000
Visual Genome (Stanford) | 108,077
CLIP (OpenAI) | 400M
ALIGN (Google) | 1.8B
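The paper states that the filtered image-caption links and a downloader will be released. As a rough illustration of what reproducing such a dataset from released links could look like, the sketch below assumes a hypothetical CSV of (image_url, caption) rows; it is not the official LGS downloader, and the actual release format may differ.

```python
# Minimal sketch of rebuilding an image-caption dataset from a list of released links.
# The CSV layout ("image_url", "caption") is a hypothetical assumption.
import csv
import os
import requests

def download_pairs(links_csv: str, out_dir: str, timeout: float = 10.0) -> None:
    os.makedirs(out_dir, exist_ok=True)
    with open(links_csv, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f)):
            try:
                resp = requests.get(row["image_url"], timeout=timeout)
                resp.raise_for_status()
            except requests.RequestException:
                continue  # skip dead links so the rest of the dataset still downloads
            with open(os.path.join(out_dir, f"{i:09d}.jpg"), "wb") as img:
                img.write(resp.content)
            with open(os.path.join(out_dir, f"{i:09d}.txt"), "w", encoding="utf-8") as cap:
                cap.write(row["caption"])

if __name__ == "__main__":
    download_pairs("lgs_links.csv", "lgs_images")
```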

2. Related Work

2.1. Unimodal Pre-Training Datasets

Prior to the popularization of bi-modal training, unimodal data (vision-only or language-only) have been the workhorses for pre-training tasks. On the vision side, ImageNet-1k and ImageNet-22k are still some of the most prevalent examples, alongside the larger JFT-300M dataset. For the e-commerce domain, Fashion MNIST, Clothing1M [64], Fashion200k [16], and FashionIQ [62] have been proposed to analyze the effects of noisy labels. Some of the most common datasets used as general wide-domain downstream tasks include CIFAR-10, CIFAR-100, MNIST [32], SVHN [44], and Tiny ImageNet [31].

2.2. Vision-and-Language Pre-Training Datasets

The literature has shown that image-text data from COCO can be used to learn visual features that are competitive with supervised pre-training [18] on ImageNet when transferred to downstream tasks [4, 10, 13, 15, 38, 60, 67]. More recently, CLIP and ALIGN scaled up to 400M and 1B+ web-curated image-text pairs, enabling zero-shot visual recognition on downstream tasks.

Originally intended for image-text retrieval and image captioning, bi-modal datasets are now widely used for training cross-modal representations [7, 22, 27, 34, 35, 37, 40, 45, 53, 56, 58, 68] that transfer to downstream tasks, such as visual question answering [1, 23, 69], referring expressions [26], and visual reasoning [57, 66]. In light of these novel training paradigms, more recent works build larger datasets specifically for vision-and-language pre-training. Examples include LAIT [47], Conceptual Captions-12M, Wikipedia-ImageText (WIT) [55], Localized Narratives [46], Visual Genome [28], and YFCC100M [59]. Similar to these datasets, LGS offers rich semantic data for pre-training applications. However, our choice of e-commerce data source is unique, leading toward a distinctive data distribution.

Image-text datasets are also used for learning visual features. The work [33] has proposed to train visual n-gram models on YFCC100M, whereas other methods [4, 10] aim to learn features from the captions of the COCO dataset [6]. The quality of the resulting features is competitive with supervised ImageNet training [18] on many downstream tasks [13, 15, 38, 51, 60]. Moreover, the image-text pre-training schemes scale up to non-public datasets that are even larger than LGS [25, 48].

A core motivation for collecting image-text pairs from the internet is the possibility of scaling up the data size without bearing the prohibitively expensive annotation costs. In light of this motivation, there have been multiple efforts to collect large quantities of noisy labels associated with online images, leading to datasets such as WebVision [36], YFCC100M, JFT-300M, and Instagram-3.5B [41].

Existing multi-modal e-commerce-inspired datasets include M5Product [12] and DeepFashion [39]. With 6 million instances, M5Product's size is around half of LGS's. While M5Product focuses on demonstrating the effectiveness of multi-modal training, this paper emphasizes analyzing the e-commerce data distribution and how it generalizes to general wide-domain datasets in a pre-training setting.

3. The Let's Go Shopping (LGS) Dataset

With 14,847,764 image-text pairs, the LGS dataset has a size advantage over many publicly available bi-modal datasets, as presented in Table 1.

In this section, we offer additional analysis of the LGS data. For all analysis and experiments in the paper, we use a subset with 13 million instances, as the rest of the dataset was constructed in parallel with the experiments.

Table 2. Comparing the word count statistics of the LGS and COCO captions.

Dataset | Min | Max | Mean | Median | Skew
LGS | 2 | 3642 | 89.58 | 67 | 3.44
COCO | 5 | 50 | 10.56 | 10 | 2.76

Table 3. The POS's that occur at least ten times.

Dataset | C. Nouns | P. Nouns | Adjectives | Verbs
LGS | 158,479 | 139,174 | 48,907 | 57,481
COCO | 10,403 | 1,655 | 3,053 | 4,961

3.1. Data Collection

To create training data that is truly representative of e-commerce data as a whole, we include a wide range of commerce websites with various product kinds, such as infant products, sporting goods, bridal jewelry, etc. The collection pipeline starts with a set of heuristic rules to isolate the product pages from the non-product pages of an e-commerce website. Then, our automated extractor obtains relevant information on each product page, including the product title, the description, and the first listed image. Some products may include numerous variants (e.g., different colors for a type of T-shirt), and we collect all variants. We avoid crawling information that the sellers are unwilling to share. Specifically, the extractor is forbidden from crawling pages with a 'Disallow' extension. Finally, we use strict automated tests to filter out the instances with potential quality issues. Examples of the tests include confirming that the price is a number, certifying that the images are valid, and ensuring that the product title exists and contains no unexpected characters.

3.2. Characteristics of LGS Images

In general-domain image-caption datasets, the images usually consist of one or more subjects juxtaposed against a rich background, and their captions often mention the background. In contrast, e-commerce product thumbnails in LGS often depict only one inanimate item that occupies the foreground without any association with the background. The background is also often a single color, with some examples shown in Figure 3. These clear backgrounds make it easier for models to locate the patterns that correspond to their tasks.

3.3. Characteristics of LGS Captions

In this subsection, we analyze the traits of the LGS captions. The LGS dataset has 14,847,764 captions in total, and the words and phrases in LGS captions are diverse. For example, while LGS has around 3x more captions than COCO², its captions possess about 20x more uni-grams, bi-grams, and tri-grams, with more detailed statistics presented in Appendix A.5. Table 2 presents some statistics of the word distribution of the captions, showing that both LGS and COCO have highly positively skewed distributions, with LGS having a longer tail. Since LGS incorporates data from a large variety of e-commerce websites, the descriptions can include rich information. In the subsequent sections, we show that while the raw captions of LGS are diverse, clear structural information can be extracted from the LGS captions for fine-tuning purposes.

Additionally, we use the part-of-speech (POS) tagging method from the spaCy library [21] to analyze the linguistic statistics of the LGS captions, comparing common nouns, proper nouns, adjectives, and verbs. Table 3 shows that LGS has at least 10x more words per POS compared with COCO, whereas Figures Supp-5 and Supp-6 in the supplementary materials provide further insights into the composition of each word type. Due to the e-commerce nature of LGS, a large portion of the instances is clothing and other wearable items. Thus, within LGS, the proper nouns often present the brand names and sizes, the common nouns often describe the materials, and the adjectives and verbs often characterize the product-specific descriptions and actions, making the LGS captions highly descriptive.

² Each COCO instance has five corresponding captions, and we consider each of them separately.

3.4. LGS for Classification

While the raw data format of LGS is image-caption pairs, we also experimented with image classification with LGS by labeling the classes. Specifically, we build three classification variants: LGS-117, LGS-710, and LGS-Overlap. For all three variants, we use a taxonomy generation language model pre-trained in-house to convert each product description into a taxonomy tree, whose nodes are designed to be informative for e-commerce catalog applications.

Figure 3. Examples of LGS images with taxonomy end leaves: Socks, Bracelets, Outfit Sets, Shoulder Bags, Roller Skates, Sweatshirts and Hoodies.

The end leaf of each taxonomy tree is then used as the label, with some examples displayed in Figure 3. The taxonomy tree can also be used to generate summarized image captions that include product title, product brand name, and a number of "bullet strings" describing specific product attributes. The bullet strings include examples such as Nylon fabric, Classic collar, and Front zipper fastening. The LGS leaves form a long-tailed distribution that emphasizes common daily commodities, with the five most common leaves being Tops and T-shirts, Dresses, Rings, T-shirts, and Sweatshirts and Hoodies. For each of the three classification variants, we further clean the end leaves, with details provided in the two following paragraphs. In Figure Supp-1 in the supplementary materials, we provide a histogram of the end leaf distribution.

LGS-117 and LGS-710 are designed as pre-training datasets. Within all raw labels generated by the taxonomy model, there are synonyms and overlaps that should be unified. After manually merging the synonyms among the most popular classes, we observe 117 classes that contain at least 10k images. We select 10k images from each class, forming the balanced LGS-117 dataset. LGS-710 is an unbalanced dataset that includes more scarce classes. To accelerate label engineering, we use a semi-automated pipeline. First, we remove uninformative words like "other" and parse juxtaposed nouns by commas and "and". Next, we use a pre-trained language model to extract the embedding of each parsed noun. As an example, for the leaf Tops and T-shirts, we embed both tops and t-shirts. We then consider the "similarity" between two classes to be the maximum cosine similarity between all pairs of corresponding nouns. Very close classes are merged based on a similarity threshold of 0.92, which is determined by manually inspecting the merged classes.
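A minimal sketch of this merging heuristic is given below. It assumes a generic sentence-transformers embedder as a stand-in for the in-house pre-trained language model, which is not public; the parsing rules and the 0.92 threshold follow the description above.

```python
# Sketch of the Section 3.4 label-merging heuristic: two end leaves are merged when the
# maximum cosine similarity between their parsed nouns exceeds a threshold. The embedder
# below is an assumed substitute for the in-house model.
import re
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

def parse_nouns(leaf: str) -> list[str]:
    # Remove uninformative words and split juxtaposed nouns on commas and "and".
    leaf = re.sub(r"\bother\b", "", leaf.lower())
    return [p.strip() for p in re.split(r",| and ", leaf) if p.strip()]

def similarity(leaf_a: str, leaf_b: str) -> float:
    emb_a = model.encode(parse_nouns(leaf_a))
    emb_b = model.encode(parse_nouns(leaf_b))
    emb_a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    emb_b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return float((emb_a @ emb_b.T).max())  # max cosine similarity over all noun pairs

def should_merge(leaf_a: str, leaf_b: str, threshold: float = 0.92) -> bool:
    return similarity(leaf_a, leaf_b) >= threshold

print(should_merge("Tops and T-shirts", "T-Shirts"))
```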
LGS-Overlap is proposed as an out-of-distribution test set for models trained on ImageNet-1k, one of the most widely used benchmarking datasets. We use a similar semi-automated pipeline to merge LGS classes with ImageNet synsets [9, 43]. We optimize the pipeline by adjusting the similarity threshold to 0.90 and including additional pre-processing steps such as singularization and keyword merging. Note that polysemous words in the labels can refer to different objects in LGS and ImageNet. For example, "cricket" in LGS refers to sports equipment but refers to the insect species in ImageNet. Thus, a manual inspection of the merged classes is performed. After discarding classes with less than 20 instances, we gather the remaining 176 ImageNet synsets that align with the LGS end leaves and use them as the LGS-Overlap dataset. The fact that only 17.6% of the ImageNet synsets are matched shows a significant label distribution difference between e-commerce applications and common pre-training datasets. Since a higher level of label-space alignment is essential for more effective pre-training [41], LGS forms a representative benchmark and a pre-training dataset for downstream tasks that see distributions close to e-commerce.

4. Experiments

4.1. Image Classification and Reconstruction

In this subsection, we use image classification and reconstruction tasks to characterize the distributional difference between LGS and ImageNet. We consider the distributions of images as well as the labels.

4.1.1 ImageNet Classifiers Do Not Readily Generalize to E-commerce

The existing literature has shown that carefully constructed images collected in a bias-controlled manner can elicit a significant performance degradation on classifiers trained on ImageNet [3]. By applying pre-trained ImageNet classification models to the LGS-Overlap dataset without further training, we show that such out-of-distribution examples naturally exist in the e-commerce domain. Specifically, we use publicly available weights of a ResNet-50 model and a ConvNeXT-Base model.

The ResNet-50 achieves a 74% average recall across the 176 overlapping synsets over the ImageNet images, but the number noticeably reduces to 46.43% on LGS-Overlap. The ConvNeXT-Base obtains 79.00% and 50.14% on ImageNet and LGS-Overlap, respectively. This difference highlights that existing ImageNet models do not readily transfer to LGS instances. In addition to having a different label distribution, the e-commerce domain forms a natural distribution shift even for the classes that also exist in ImageNet. While taxonomy standardization techniques exist, aligning and merging the label space is still hard in general. Thus, a pre-training dataset that is more aligned with e-commerce is necessary, and LGS fulfills this role.

Table 4. The classification accuracy of models trained on LGS shows that the LGS end leaves are well-separable.

LGS Accuracy | LGS-117 from scratch | LGS-117 IN-pretrained | LGS-710 IN-pretrained (Top-1) | LGS-710 IN-pretrained (Top-5)
After linear probing | – | 69.58 % | 60.72 % | 81.16 %
After fine-tuning | 97.89 % | 98.16 % | 77.27 % | 89.09 %

Table 5. The reconstruction quality of the MAE models trained on LGS and ImageNet, evaluated on COCO. The symbol ↑ denotes "higher is better" while ↓ means "lower is better".

Training Dataset | Inception (↑) | FID (↓)
ImageNet-1k | 9.2930 | 114.60
IN pretrain→IN+LGS | 9.1906 | 115.48
LGS | 10.187 | 91.387

Table 6. Linear probing accuracy of the self-supervised MAE models with three different initializations. A: baseline ImageNet MAE model [17], B: LGS MAE model, C: LGS+ImageNet MAE model. Specifically, C is initialized with A followed by 150 epochs on mixed ImageNet and LGS-710 data (1:1 ratio). Fine-tuning on LGS-117 and ImageNet datasets used 40 and 60 epochs, respectively.

Linear probing dataset | A | B | C
LGS-117 (40 epochs) | 72.98 % | 76.37 % | 76.87 %
ImageNet-1k (60 epochs) | 67.78 % | 46.37 % | 65.29 %
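The macro-averaged recall reported above can be computed along the following lines. The sketch assumes a data loader for LGS-Overlap whose labels are already mapped into ImageNet-1k indices and a list of the 176 shared class indices; it is not the authors' evaluation script.

```python
# Sketch of the Section 4.1.1 evaluation: per-class recall of an ImageNet-pretrained
# classifier, averaged over the overlapping synsets. `loader` and `overlap_ids` are
# assumed inputs (LGS-Overlap batches and the shared ImageNet-1k class indices).
import torch
from torchvision.models import resnet50, ResNet50_Weights

@torch.no_grad()
def average_recall(loader, overlap_ids):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1).to(device).eval()
    correct = {c: 0 for c in overlap_ids}
    total = {c: 0 for c in overlap_ids}
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        for pred, label in zip(preds.tolist(), labels.tolist()):
            if label in total:            # only score the shared synsets
                total[label] += 1
                correct[label] += int(pred == label)
    recalls = [correct[c] / total[c] for c in overlap_ids if total[c] > 0]
    return sum(recalls) / len(recalls)    # macro-averaged recall over classes
```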
We further show that LGS end leaves are well-separable, verifying that the performance degradation of ImageNet models is caused by the distribution mismatch and not the ambiguity of the LGS classes. Note that Table 4 shows that the models learned on LGS-117 / LGS-710 can achieve high accuracy on LGS-117 / LGS-710. Specifically, we consider the "linear probing followed by fine-tuning" training schedule, a transfer learning scheme that has been shown to improve the robustness against distribution shift by avoiding significant distortions of the pre-trained weights.

4.1.2 Non-classification Visual Feature Extractors Can Generalize

Since the image-label correspondence is different between LGS and ImageNet, we use self-supervised training to isolate this mismatch and focus on the distribution of images. In the context of transfer learning, since self-supervised training does not use labels, it circumvents the issue of label space mismatch between target and source domains, which has been shown to undermine the quality of transfer learning. Masked AutoEncoder (MAE) [17] is a self-supervised method designed for pre-training. Thus, we compare the performance of an MAE trained on ImageNet only with an MAE trained on ImageNet and LGS-710. Figure 4 shows that the MAE trained on ImageNet can reconstruct a reasonable LGS image, but the reconstruction quality of the ImageNet+LGS model is better, demonstrating that LGS can be used to learn e-commerce visual features.

To quantitatively demonstrate the generalizability of the vision feature extractors, we evaluate the reconstruction performance of the MAE models trained on LGS and ImageNet on COCO. The qualities of the raw reconstructions obtained by the models are presented in Table 5. While LGS is more domain-specific compared with ImageNet and COCO (both of which cover a wide range of domains), the MAE trained on LGS is able to generate COCO images with higher qualities compared with the ImageNet model. Furthermore, Table 6 shows that upon the visual embeddings learned jointly on ImageNet and LGS, a linear classifier with a satisfactory performance can be learned on both ImageNet and LGS. The above results verify that the feature extractors can generalize between LGS and general-domain datasets, despite the separation of the intermediate visual embeddings (which are visualized in Appendix A.2).

Based on the above observations, we infer that the e-commerce data distribution, represented by the LGS dataset, significantly differs from existing general datasets in the label space, while visual features can generalize.

Figure 4. While an MAE trained on ImageNet can reasonably reconstruct an LGS image, adding LGS instances to the training improves the reconstruction quality. (Panels: original, masked, ImageNet MAE reconstruction, ImageNet MAE reconstruction + visible patches, ImageNet+LGS MAE reconstruction, ImageNet+LGS MAE reconstruction + visible patches.)

Figure 5. Adding LGS instances to the training also improves the reconstruction on some ImageNet instances. (Same panel layout as Figure 4.)
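Reconstruction-quality numbers such as the FID values in Table 5 can be obtained with an off-the-shelf implementation. The sketch below uses torchmetrics (with its image extras installed), which is an assumption; the paper does not state which Inception/FID implementation was used.

```python
# Sketch of an FID computation between reconstructed and reference images, in the spirit
# of Table 5. The torchmetrics implementation is an assumed choice.
from torchmetrics.image.fid import FrechetInceptionDistance

def fid_score(real_batches, fake_batches) -> float:
    # Both iterables must yield uint8 image tensors of shape (N, 3, H, W) in [0, 255];
    # a few thousand images per side are typically needed for a stable estimate.
    fid = FrechetInceptionDistance(feature=2048)
    for real in real_batches:
        fid.update(real, real=True)       # reference images (e.g., COCO)
    for fake in fake_batches:
        fid.update(fake, real=False)      # MAE reconstructions
    return fid.compute().item()

# Usage: fid_score(coco_image_batches, mae_reconstruction_batches)
```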

Table 7. ImageNet→LGS-710 two-phase pre-training improves downstream linear probing accuracy for downstream tasks
including CIFAR-100, Fashion MNIST, and Clothing1M. On Clothing1M, whose data also comes from the e-commerce
domain, the LGS-pre-trained features also improve end-to-end fine-tuning performance. For Clothing1M, we only use its
clean training set, whereas Clothing1M (10%) is a few-shot setup that trains on a 10% subset of the clean training set.

Pre-training Setup | CIFAR-10 (LP) | CIFAR-100 (LP) | Fashion MNIST (LP) | Clothing1M 10 % (LP) | Clothing1M 100 % (LP) | Clothing1M 10 % (E2E) | Clothing1M 100 % (E2E)
ImageNet | 61.97 | 40.46 | 79.68 | 59.74 | 67.57 | 65.69 | 74.81
ImageNet→LGS-117 | 59.83 | 35.57 | 80.39 | 64.48 | 69.67 | 68.16 | 75.47
ImageNet→LGS-710 | 58.81 | 42.21 | 82.18 | 64.16 | 70.06 | 65.85 | 74.51

(LP = linear probing; E2E = end-to-end training.)

Thus, LGS is an ideal pre-training dataset for downstream tasks whose class distributions align with the e-commerce domain.

4.1.3 LGS Supplements ImageNet as a Pre-training Dataset

LGS can also widen the span of the pre-training distribution when used in conjunction with ImageNet, acting as a bridge between general visual features and domain-specific applications. Specifically, Table 7 shows that a two-phase ImageNet→LGS-710 weakly-supervised pre-training scheme produces features more suitable for fine-tuning on common downstream tasks. On e-commerce-related downstream datasets such as Clothing1M, the models pre-trained on LGS also excel in both linear probing and end-to-end settings.

In linear probing experiments, we observe that incorporating in-domain pre-training (both LGS-117 and LGS-710) results in better performance (2% absolute) compared to ImageNet pre-training. Moreover, in limited-data settings, we observe less model regression compared to the full-data setups. For example, for fine-tuning a linear classifier on 10% of the Clothing1M-clean dataset, the ImageNet pre-trained model regresses more (11.5% relative) compared to LGS-117 and LGS-710 pre-trained models (7.4 and 8.4% relative, respectively). When models are trained end-to-end, we observe that the pre-training setup is less critical in fine-tuning the full Clothing1M-clean training dataset. However, for limited-data experiments, filtering out under-represented classes (LGS-117) in pre-training helps with the downstream fine-tuning results (2% absolute) compared to both ImageNet and LGS-710 datasets.

In Appendix A.3 in the supplementary materials, we use GradCam [14, 52] to visualize the representations learned by the classification models, demonstrating that the LGS models look for much more localized patterns that are relevant to e-commerce classification.

Table 8. IC model performance on the image-captioning task, evaluated on different combinations of training and evaluation datasets.

Training Set | Test Set | METEOR (↑)
LGS-title | LGS-title | 0.184
LGS-description | LGS-title | 0.161
LGS-taxonomy | LGS-taxonomy | 0.584
COCO | LGS-title | 0.069

4.2. Caption Generation

In this section, we illustrate that the distinct distribution of LGS benefits vision-language bi-modal tasks. Specifically, we study the efficacy of image-captioning (IC) models trained on traditional datasets in predicting LGS-type descriptions. We also evaluate the performance of LGS-trained models in generating attribute-rich image captions that would otherwise not be possible for models trained on more traditional datasets.

In this experiment, we utilize a bi-modal modeling framework based on OFA [61], a recently proposed encoder-decoder architecture that has achieved state-of-the-art performances in many language-vision tasks. For each LGS image, the corresponding caption can be constructed by concatenating the "product description" strings in various orders. Specifically, we create three types of captions (a construction sketch is given below):

1. LGS-title: title and brand name;
2. LGS-taxonomy: product taxonomy;
3. LGS-description: concatenated bullet strings.

The OFA IC model was trained on the three types of LGS inputs as well as on the traditional COCO dataset. The IC model performance in terms of its ability to predict the appropriate target string is tabulated in Table 8.
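A sketch of how the three caption targets above can be assembled from product metadata follows. The field names (title, brand, taxonomy, bullets) are assumptions about the record layout rather than the released schema.

```python
# Sketch of assembling the three caption targets used in Section 4.2 from product
# metadata. The field names below are assumptions, not the released schema.
def build_captions(product: dict) -> dict:
    title = product.get("title", "").strip()
    brand = product.get("brand", "").strip()
    taxonomy = product.get("taxonomy", [])   # e.g. ["Clothing", "Tops and T-Shirts"]
    bullets = product.get("bullets", [])     # e.g. ["Nylon fabric", "Classic collar"]
    return {
        # 1. LGS-title: title and brand name
        "lgs_title": " ".join(filter(None, [brand, title])),
        # 2. LGS-taxonomy: the product taxonomy path
        "lgs_taxonomy": " > ".join(taxonomy),
        # 3. LGS-description: concatenated bullet strings
        "lgs_description": ". ".join(bullets),
    }

example = {
    "brand": "Terez",
    "title": "Navy Camo Stripe Hi-Shine Bump Squad Leggings",
    "taxonomy": ["Clothing", "Leggings"],
    "bullets": ["Nylon fabric", "Classic collar", "Front zipper fastening"],
}
print(build_captions(example))
```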

Figure 6. Qualitative comparisons of the generations of the Vanilla and the LGS-117-fine-tuned SD models in the general setting; the fine-tuned model generates more visually appealing images. Panels: (a) Input Prompt, (b) Vanilla SD, (c) LGS-117. Example LGS prompts: "a photo of terez leggings navy camo stripe hi-shine bump squad leggings, e-commerce"; "a photo of ana silver co. rings apatite ring size 8.25 (925 sterling silver) ring81016, e-commerce". Example DeepFashion InShop prompts: "a photo of Dresses Ivory-navy Forever 21 Contemporary - Show the perfect amount of skin in this sleek, sophisticated surplice dress, e-commerce"; "a photo of Pants Black-grey This jogger's easy, slouchy silhouette gets a little grit courtesy of its eye-popping print of photorealistic roses, e-commerce".

Table 9. Comparing the Vanilla SD and the LGS-117 fine-tuned model on LGS and DeepFashion datasets.

Model | Test Set | FID (↓)
Vanilla | LGS Val | 25.3498
Vanilla + LGS-117 | LGS Val | 24.1952
Vanilla | DeepFashion | 62.9269
Vanilla + LGS-117 | DeepFashion | 74.0185

4.3. Text-to-Image Generation

Because of its high-quality e-commerce-focused images and bimodal nature, LGS is an ideal option for training text-to-image models in the e-commerce sector, serving as a bridge between general visual features and domain-specific applications. In this section, we use LGS to adapt the Stable Diffusion (SD) text-to-image generation method to two e-commerce scenarios: general and fine-grained. For both scenarios, we fine-tune based on the sd-v1-4 (referred to as Vanilla) SD checkpoint.

For the general setting, we add a domain identifier to all training prompts associated with LGS images and guide the SD model to adapt to the e-commerce image style when this identifier is provided.

The choice of the domain identifier is crucial, as the paper [50] shows that a domain identifier with a strong prior should be avoided. For example, the word retail has a strong prior, and the pre-trained "Vanilla" SD model confidently associates it with (physical) retail stores. This behavior is undesirable for the goal of e-commerce style transfer. By analyzing the effects of various domain identifiers on the generations of the pre-trained SD model, we determine that the word "e-commerce" gives a weak prior and is a suitable identifier. We then construct the ground-truth training prompts for the LGS images in the format of a photo of <brand> <end_leaf> <title>, e-commerce, where the <end_leaf> refers to the end leaf of the taxonomy tree introduced in Section 3.4. The "Vanilla" SD checkpoint is fine-tuned on one million LGS image-prompt pairs for 100k steps with a batch size of 24. Table 9 displays the quantitative results on an unseen validation set (5K image-prompt pairs) from LGS and a subset of the DeepFashion InShop dataset. The fine-tuning process enhances the performance of SD on LGS as expected. While the FID scores on DeepFashion degrade after fine-tuning, the generations of the LGS-117 fine-tuned model are aesthetically more appealing. At present, there are no quantitative metrics that directly measure aesthetic superiority. Thus, we present Figure 6 and the additional examples in Appendix A.6 in the supplementary materials (Figures Supp-9 and Supp-8) to demonstrate the aesthetic improvement qualitatively. The degraded FID scores on DeepFashion may instead indicate a distribution shift between LGS and DeepFashion images.
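A minimal sketch of the general-setting prompt construction described above follows; the metadata field names are assumptions, while the template and the "e-commerce" identifier come from the text.

```python
# Sketch of the general-setting training prompts from Section 4.3:
# "a photo of <brand> <end_leaf> <title>, e-commerce", where "e-commerce" is the
# weak-prior domain identifier. Field names are assumptions about the metadata layout.
DOMAIN_IDENTIFIER = "e-commerce"

def build_sd_prompt(brand: str, end_leaf: str, title: str,
                    identifier: str = DOMAIN_IDENTIFIER) -> str:
    core = " ".join(part.strip() for part in (brand, end_leaf, title) if part.strip())
    return f"a photo of {core}, {identifier}"

print(build_sd_prompt("new balance", "athletic shoes", "men white running shoes"))
# -> "a photo of new balance athletic shoes men white running shoes, e-commerce"
```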
For the fine-grained setting, we use data belonging to only a particular end leaf, using the same prompt without the additional identifier. The checkpoint is fine-tuned with 10k image-prompt pairs for 25k steps with a batch size of 6. We use the "athletic shoes" end leaf as an example and compare the generations before and after LGS fine-tuning under the fine-grained setting in Figure 7. As with the general-setting results, the fine-grained examples also indicate that LGS helps adapt text-to-image models to e-commerce scenarios and improves image quality and aesthetics.

Figure 7. The LGS-117-fine-tuned SD model also generates more visually appealing images in the fine-grained setting. The prompts are from LGS, e.g., "a photo of new balance men white running shoes" and "a photo of on running athletic shoes on running women's cloudswift road shoe 41.99578 in lake sky".

5. Conclusion

The Let's Go Shopping (LGS) dataset consists of 15 million pairs of publicly accessible, diverse images and descriptive captions from e-commerce websites. Our efficient semi-automated gathering and annotation pipeline ensures scalable data collection. We then use LGS to show that while the categories associated with e-commerce data may not align with the general-domain pre-training datasets, visual feature extractors can be shared. Finally, we show that the distinct distribution offered by LGS and LGS's bi-modal nature can be beneficial for applications including image classification, image reconstruction, bi-modal representation learning, and text-to-image generation.

6. Acknowledgment

We would like to thank Matteo Bruno, James Caulkins, Xiaowen Dong, Peter Grindrod, Jack Hessel, Sean Holloway, Dorothy Nicholas, Janet Pierrehumbert, and Julian Winkler for their valuable help and fruitful discussions.

We also thank Baillie Gifford, the Institute for New Economic Thinking at the Oxford Martin School, and the UK Engineering and Physical Science Research Council for funding our work at the University of Oxford. This work was supported in part through the NYU IT High-Performance Computing resources, services, and staff expertise.

References

[1] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In ICCV, 2015.
[2] Yatong Bai, Trung Dang, Dung Tran, Kazuhito Koishida, and Somayeh Sojoudi. Accelerating diffusion-based text-to-audio generation with consistency distillation. arXiv preprint arXiv:2309.10740, 2023.
[3] Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. In Advances in Neural Information Processing Systems, 2019.
[4] Mert Bulent Sariyildiz, Julien Perez, and Diane Larlus. Learning visual representations with caption annotations. In ECCV, 2020.
[5] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, 2021.
[6] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
[7] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: Learning universal image-text representations. arXiv preprint arXiv:1909.11740, 2019.
[8] Francois Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[10] Karan Desai and Justin Johnson. VirTex: Learning visual representations from textual annotations. In CVPR, 2020.
[11] Karan Desai, Gaurav Kaul, Zubin Aysola, and Justin Johnson. RedCaps: Web-curated image-text data created by the people, for the people. arXiv preprint arXiv:2111.11431, 2021.
[12] Xiao Dong, Xunlin Zhan, Yangxin Wu, Yunchao Wei, Michael C Kampffmeyer, Xiaoyong Wei, Minlong Lu, Yaowei Wang, and Xiaodan Liang. M5Product: Self-harmonized contrastive learning for e-commercial multi-modal pretraining. In CVPR, 2022.
[13] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John M. Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 2009.
[14] Jacob Gildenblat and contributors. PyTorch library for CAM methods. https://github.com/jacobgil/pytorch-grad-cam, 2021.
[15] Agrim Gupta, Piotr Dollar, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. In CVPR, 2019.
[16] Xintong Han, Zuxuan Wu, Phoenix X Huang, Xiao Zhang, Menglong Zhu, Yuan Li, Yang Zhao, and Larry S Davis. Automatic spatially-aware fashion concept discovery. In ICCV, 2017.
[17] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[19] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. NeurIPS Deep Learning and Representation Learning Workshop, 2015.
[20] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020.
[21] Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. spaCy: Industrial-strength natural language processing in Python. 2020.
[22] Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849, 2020.
[23] Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019.
[24] Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems, 2019.
[25] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918, 2021.
[26] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, 2014.
[27] Wonjae Kim, Bokyung Son, and Ildoo Kim. ViLT: Vision-and-language transformer without convolution or region supervision. In ICML, 2021.
[28] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, Michael S Bernstein, and Li Fei-Fei. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017.
[29] Alex Krizhevsky. Learning multiple layers of features from tiny images, 2012.
[30] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The Open Images dataset v4. International Journal of Computer Vision, 128(7):1956–1981, 2020.
[31] Ya Le and Xuan S. Yang. Tiny ImageNet visual recognition challenge. 2015.
[32] Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. ATT Labs, 2, 2010.
[33] Ang Li, Allan Jabri, Armand Joulin, and Laurens van der Maaten. Learning visual n-grams from web data. In ICCV, 2017.
[34] Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. AAAI, 2020.
[35] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
[36] Wen Li, Limin Wang, Wei Li, Eirikur Agustsson, and Luc Van Gool. WebVision database: Visual learning and understanding from web data. arXiv preprint arXiv:1708.02862, 2017.
[37] Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV, 2020.
[38] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[39] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In CVPR, 2016.
[40] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, 2019.
[41] Dhruv Kumar Mahajan, Ross B. Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In ECCV, 2018.
[42] Leland McInnes, John Healy, Nathaniel Saul, and Lukas Großberger. UMAP: Uniform manifold approximation and projection. Journal of Open Source Software, 3(29):861, 2018.
[43] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
[44] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[45] Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. Im2Text: Describing images using 1 million captioned photographs. In NeurIPS, 2011.
[46] Jordi Pont-Tuset, Jasper Uijlings, Soravit Changpinyo, Radu Soricut, and Vittorio Ferrari. Connecting vision and language with localized narratives. In ECCV, 2020.
[47] Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, and Arun Sacheti. ImageBERT: Cross-modal pre-training with large-scale weak-supervised image-text data. arXiv preprint arXiv:2001.07966, 2020.
[48] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
[49] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
[50] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. 2022.
[51] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 2015.
[52] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
[53] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.
[54] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, 2015.
[55] Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Bendersky, and Marc Najork. WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning. arXiv preprint arXiv:2103.01913, 2021.
[56] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT: Pre-training of generic visual-linguistic representations. In ICLR, 2020.
[57] Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus for reasoning about natural language grounded in photographs. In ACL, 2019.
[58] Hao Tan and Mohit Bansal. LXMERT: Learning cross-modality encoder representations from transformers. In EMNLP, 2019.
[59] Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 2016.
[60] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The iNaturalist species classification and detection dataset. In CVPR, 2018.
[61] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. arXiv preprint arXiv:2202.03052, 2022.
[62] Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. The Fashion IQ dataset: Retrieving images by combining side information and relative natural language feedback. CVPR, 2021.
[63] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms, 2017.
[64] Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive noisy labeled data for image classification. In CVPR, 2015.
[65] Chenfeng Xu, Shijia Yang, Tomer Galanti, Bichen Wu, Xiangyu Yue, Bohan Zhai, Wei Zhan, Peter Vajda, Kurt Keutzer, and Masayoshi Tomizuka. Image2Point: 3D point-cloud understanding with 2D image pretrained models. In ECCV, 2022.
[66] Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In CVPR, 2019.
[67] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. In NeurIPS, 2014.
[68] Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J Corso, and Jianfeng Gao. Unified vision-language pre-training for image captioning and VQA. AAAI, 2020.
[69] Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7W: Grounded question answering in images. In CVPR, 2016.
Figure Supp-1. The instance counts of the 80 most popular LGS end leaves (y-axis: instance count, up to 600,000; x-axis: end leaves such as Tops and T-Shirts, Dresses, Rings, T-Shirts, Sweatshirts and Hoodies, Pants, and others).

Figure Supp-2. UMAP visualization of the ImageNet and LGS features extracted on a ResNet50 model trained on ImageNet and LGS (panels: ResNet50 block 0 through block 4 features).

A. Additional Analyses
A.1. LGS End Leaf Histogram
The instance counts of the LGS end leaves are displayed in Figure Supp-1. The top 80 most popular end leaves
encompass 83.28% of the total instances, with the most popular Tops and T-shirts containing 16.23% of the
instances.

A.2. How Features learned on ImageNet and LGS Differ


To understand how vision models interpret the ImageNet and LGS instances, we use a ResNet50 model
sequentially trained on ImageNet and LGS-117 as the feature extractor, and use UMAP [42] to visualize the
high-dimensional ImageNet and LGS features in 2D figures. As shown in Figure Supp-2, the ImageNet features
form a cluster, while the LGS features form a less concentrated cluster. The separation of the two clusters is
especially prominent at the first two layers.
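A sketch of this UMAP visualization is shown below, assuming the block features have already been extracted and pooled into (N, D) NumPy arrays; plotting details differ from Figure Supp-2.

```python
# Sketch of a 2D UMAP projection of LGS vs. ImageNet features, in the spirit of
# Figure Supp-2. `lgs_feats` and `imagenet_feats` are assumed (N, D) feature arrays.
import numpy as np
import umap
import matplotlib.pyplot as plt

def plot_feature_clusters(lgs_feats: np.ndarray, imagenet_feats: np.ndarray) -> None:
    reducer = umap.UMAP(n_components=2, random_state=0)
    embedded = reducer.fit_transform(np.concatenate([lgs_feats, imagenet_feats]))
    n = len(lgs_feats)
    plt.scatter(embedded[:n, 0], embedded[:n, 1], s=2, label="LGS")
    plt.scatter(embedded[n:, 0], embedded[n:, 1], s=2, label="ImageNet")
    plt.legend()
    plt.savefig("umap_features.png", dpi=200)
```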
As discussed in the main portion of the paper, many LGS product thumbnails consist of isolated foreground
objects and clear backgrounds, while ImageNet instances are mostly natural images where the foreground blends
into the background. Thus, we question whether the feature clustering is a consequence of this difference. To
this end, we learn a binary classification linear head that predicts between LGS and ImageNet images based
on the features extracted by the ResNet-50 model. We then visualize the saliency map of this binary model in
Figure Supp-3. While the background is the most prominent difference between ImageNet and LGS to human
eyes, the saliency maps demonstrate that the deep models look for more sophisticated patterns, which can vary
across different images. Specifically, the foreground is emphasized in the first LGS example, while the background is more important in the second LGS instance. This observation aligns with the findings of [24], which states that deep neural networks are not always understandable by humans.

Figure Supp-3. The saliency map of the LGS-ImageNet binary classifier: (a) ImageNet examples; (b) LGS examples.

A.3. LGS Classification Models Look for Localized Patterns

In Figure Supp-4, we use GradCam [14, 52], a framework that visualizes gradient activations of the input images, to demonstrate that the models trained on LGS look for much more localized patterns. Here, we draw examples from the "sweatshirt" synset in the LGS-Overlap dataset, and feed them into the three ResNet-50 models learned on ImageNet, LGS-117, and LGS-710, respectively. The gradient activation of the ImageNet model spreads across the entire image, while the LGS models return more concentrated gradient maps. Note that
the gradient spikes produced by the LGS models mostly locate around the sleeves and the waist portion of the
clothes. This makes sense because the LGS models are trained to differentiate various kinds of clothing-dominated
e-commercial products. The portions highlighted by the LGS model gradient maps precisely correspond to the
places where various types of clothes differ. For example, checking the sleeve length may be one of the easiest ways
of distinguishing T-shirts from sweatshirts. Since the LGS-710 model was trained to classify more fine-grained
types of products, it looks for even more localized patterns compared with the LGS-117 model.
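A sketch of producing such GradCam maps with the pytorch-grad-cam library [14] is given below; the target layer and class index are illustrative assumptions, and the input here is a random placeholder image.

```python
# Sketch of a GradCam visualization along the lines of Figure Supp-4, using the
# pytorch-grad-cam library cited as [14]. Target layer and class index are assumptions.
import numpy as np
import torch
from torchvision.models import resnet50, ResNet50_Weights
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from pytorch_grad_cam.utils.image import show_cam_on_image

model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1).eval()
cam = GradCAM(model=model, target_layers=[model.layer4[-1]])  # last ResNet block

# `rgb` is a float image in [0, 1] with shape (224, 224, 3); random placeholder here.
rgb = np.random.rand(224, 224, 3).astype(np.float32)
input_tensor = torch.from_numpy(rgb).permute(2, 0, 1).unsqueeze(0)

# 841 is used as the index of the class to explain (assumed "sweatshirt"); adjust as needed.
heatmap = cam(input_tensor=input_tensor, targets=[ClassifierOutputTarget(841)])
overlay = show_cam_on_image(rgb, heatmap[0], use_rgb=True)  # uint8 overlay image
```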

A.4. Linear Probing Details


In this section of the appendix, we discuss the implementation details for the linear probing experiments in
Section 4.1.3. In the existing literature, when ResNets (designed for 224 × 224 inputs) are adopted for tasks
that use smaller input sizes, the first 7 × 7 convolution layer is often replaced with a 3 × 3 layer. We adopt
this replacement for CIFAR and Fashion MNIST. During linear probing, we thus allow this modified, randomly
reinitialized first layer to be optimized along with the output layer.
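A minimal sketch of this probing setup is given below, assuming a ResNet-50 checkpoint path and standard PyTorch training; the optimizer and learning rate are assumptions, not the paper's hyperparameters.

```python
# Sketch of the linear-probing setup described here: replace the 7x7 stem with a 3x3
# convolution for small inputs, freeze the pre-trained backbone, and optimize only the
# re-initialized first layer and the output layer.
import torch
import torch.nn as nn
from torchvision.models import resnet50

def build_linear_probe(num_classes: int, checkpoint_path: str = "") -> nn.Module:
    model = resnet50()
    if checkpoint_path:  # e.g. a hypothetical ImageNet->LGS-710 pre-trained checkpoint
        model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"), strict=False)
    for p in model.parameters():
        p.requires_grad = False            # freeze the pre-trained backbone
    # Fresh 3x3 stem and output layer; these are the only trainable parameters.
    model.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

model = build_linear_probe(num_classes=100)   # e.g. CIFAR-100
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.1, momentum=0.9)
```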
In Section 4.1.3, we presented the improved linear probing results on CIFAR-100 and Fashion MNIST. We
would like to highlight that linear probing is a practical training method, because when the batch normalization

(BN) layers are jointly optimized alongside the first and the last layer, this modified "linear" probing scheme can achieve a performance that is comparable to end-to-end training [65]. Specifically, with learnable BN, a ResNet-50 model pre-trained on ImageNet→LGS-710→ImageNet achieves an accuracy of 71.41% on CIFAR-100, compared with 69.47% for an ImageNet-only model.

Figure Supp-4. GradCam visualizations show that LGS classification models look for much more localized patterns (columns: Input, ImageNet RN50, LGS-117 RN50, LGS-710 RN50).

A.5. n-gram and POS Analysis of LGS Captions


Table Supp-1 presents the comparisons of the uni-grams, bi-grams, and tri-grams of LGS. This comparison
indicates that LGS is more linguistically diverse. The uni-grams and bi-grams of the two datasets are similar.
However, we notice greater conceptual diversity for LGS within its tri-grams.

Figure Supp-5. Top 20 most common words per POS for LGS.

Figure Supp-6. Top 20 most common words per POS for COCO.

Table Supp-1. Comparing the n-gram statistics of LGS with that of COCO.

n-gram | Count with occurrence ≥ 10 (LGS) | Count with occurrence ≥ 10 (COCO) | Most frequent n-grams (LGS) | Most frequent n-grams (COCO)
uni-grams | 364,802 | 17,009 | and, the, a, to, with | a, of, on, the, i
bi-grams | 4,054,418 | 184,882 | with a, in the, of the, is a, for a | on a, in a, a man, of a, with a
tri-grams | 8,900,084 | 462,653 | true to size, made to order, this item is, machine wash cold | this is a, a group of, group of people, in front of, next to a, on top of

Specifically, COCO's five most frequent tri-grams describe a group of objects and the relative position of the objects, whereas the LGS tri-grams
encompass inherent properties of the commodities, including the size and the nature of each item.
In addition to the part-of-speech (POS) results presented in Section 3.3, we use Figures Supp-5 and Supp-6 to
present the most common words per POS for LGS and COCO, respectively.
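A sketch of computing these caption statistics with spaCy [21] is shown below; it assumes the small English model is installed and counts n-grams over lower-cased, punctuation-stripped tokens, which may differ slightly from the authors' tokenization.

```python
# Sketch of the caption statistics in this appendix: spaCy POS counts (as in Figures
# Supp-5/6) and n-gram counts with occurrence >= 10 (as in Table Supp-1).
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def caption_statistics(captions, min_count=10):
    pos_words = {"NOUN": Counter(), "PROPN": Counter(), "ADJ": Counter(), "VERB": Counter()}
    ngrams = {1: Counter(), 2: Counter(), 3: Counter()}
    for doc in nlp.pipe(captions):
        tokens = [t.text.lower() for t in doc if not t.is_punct]
        for t in doc:
            if t.pos_ in pos_words:
                pos_words[t.pos_][t.text.lower()] += 1
        for n in ngrams:
            for i in range(len(tokens) - n + 1):
                ngrams[n][" ".join(tokens[i:i + n])] += 1
    frequent = {n: sum(1 for c in counts.values() if c >= min_count)
                for n, counts in ngrams.items()}
    return pos_words, frequent

pos, frequent = caption_statistics(["Nylon fabric with a classic collar and front zipper."])
print(pos["NOUN"].most_common(5), frequent)
```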

A.6. Determining the Prompts for Text-to-Image Generation


Ensuring the quality of the input prompts is paramount for text-to-image models to generate realistic images.
Our goal is to choose a prompt which generates images faithful to the metadata, performs relatively well in terms
of Frechet Inception Distance (FID) score, and generalizes across datasets.
To that end, we randomly selected 5,000 examples each from the LGS and DeepFashion InShop datasets. It is
important to note that, for prompt engineering, the ground-truth images used for FID calculation are upscaled
from 256×256, and the denoising diffusion implicit model steps (ddim_steps) were lowered to 50 for inference.
This resulted in lower scores than the experiment results (Table 9). However, the numbers are still indicative of
relative performance.
Quantitatively, Prompts 3 and 4 perform significantly better on LGS, perform comparably on DeepFashion, and
generalize well (Table Supp-2). Prompt 3 had better FID scores using the Vanilla model and performed slightly
better on LGS. Qualitatively, however, Prompt 4 generations are consistently better and more faithful to the
metadata (Figure Supp-7). Therefore, we choose Prompt 4 for our experiments. This also reaffirms that these
metrics are not strong indicators of aesthetic quality in this particular case, and should only be used as a loose
relative measure. Figures Supp-9 and Supp-8 show additional examples from the two datasets generated with

Table Supp-2. The FID scores across prompts using a subset (n = 5000) of the LGS and DeepFashion InShop datasets.

Prompt ID | Model | FID (↓) LGS | FID (↓) DeepFashion
1 | Vanilla | 40.4437 | 61.8519
1 | Vanilla + LGS-117 | 42.7328 | 74.4327
2 | Vanilla | 42.1081 | 63.2344
2 | Vanilla + LGS-117 | 42.0529 | 77.7190
3 | Vanilla | 36.7157 | 58.2189
3 | Vanilla + LGS-117 | 36.1946 | 79.3607
4 | Vanilla | 38.4101 | 62.9269
4 | Vanilla + LGS-117 | 38.4100 | 74.0185

Table Supp-3. Prompts evaluated for the text-to-image generation experiment. Prompt structures varied slightly due to available metadata across datasets.

Prompt ID | Dataset | Prompt Structure
1 | LGS | {brand} {title} in the style of e-commerce
1 | DeepFashion | {first sentence of description} in the style of e-commerce
2 | LGS | {end_leaf} advertisement for a {title} from {brand}
2 | DeepFashion | {end_leaf} advertisement for a {first sentence of description}
3 | LGS | {brand} {end_leaf} {title} {description}
3 | DeepFashion | {end_leaf} {description} {gender_category} {color}
4 | LGS | a photo of {brand} {end_leaf} {title}, e-commerce
4 | DeepFashion | a photo of {end_leaf} {color} {first sentence of description}, e-commerce
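For reference, generating an image from a Prompt-4-style prompt with the sd-v1-4 checkpoint could look like the sketch below. The diffusers library is an assumption (the paper does not name its inference framework), and the prompt is one of the LGS examples from Figure Supp-9; 50 denoising steps mirror the setting above.

```python
# Sketch of sd-v1-4 inference with a Prompt-4-style prompt; assumes a CUDA GPU
# (drop torch_dtype and .to("cuda") to run on CPU).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16).to("cuda")

prompt = ("a photo of ana silver co. earrings rainbow moonstone earrings 3/4\" "
          "(925 sterling silver) earr415021, e-commerce")
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("lgs_style_earrings.png")
```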

Figure Supp-7. Generated images with the Vanilla SD model to determine the prompt. Panels: (a) Metadata, (b) Prompt 3, (c) Prompt 4. Example metadata:

end_leaf: Jackets Vests; gender_category: Men; color: Khaki; first sentence of description: Made in a cotton-nylon blend with a modified collar and partial mesh lining, this baseball jacket is the slickest iteration of the style yet

end_leaf: Pants; gender_category: Men; color: Black-grey; first sentence of description: This jogger's easy, slouchy silhouette gets a little grit courtesy of its eye-popping print of photorealistic roses

end_leaf: Shirts Polos; gender_category: Men; color: Coral; first sentence of description: Constructed from cotton for a classic fit, this lightweight shirt features buttoned chest pockets

end_leaf: Shorts; gender_category: Men; color: Grey; first sentence of description: Crafted from speckled French terry, this sharper-than-average pair of sweatshorts is outfitted with a mock fly and three shiny zip pockets (two in front, one in back), ideal for lounging around or winning triathalons (just kidding)

end_leaf: Blouses Shirts; gender_category: Women; color: Rust; first sentence of description: Effortlessly ethereal and romantic, this cutout-shoulder top is what dream closets are made of

Figure Supp-8. Additional qualitative examples of the Vanilla SD vs. LGS-117 fine-tuned SD model on the DeepFashion InShop dataset. Panels: (a) Input Prompt, (b) Vanilla SD, (c) LGS-117. Input prompts:

a photo of Blouses Shirts Tomato Love 21 - A woven cami featuring a pleated front and crossback strap detail in the back, e-commerce

a photo of Jackets Vests Khaki Made in a cotton-nylon blend with a modified collar and partial mesh lining, this baseball jacket is the slickest iteration of the style yet, e-commerce

a photo of Shirts Polos Coral Constructed from cotton for a classic fit, this lightweight shirt features buttoned chest pockets, e-commerce

a photo of Shorts Grey Crafted from speckled French terry, this sharper-than-average pair of sweatshorts is outfitted with a mock fly and three shiny zip pockets (two in front, one in back), ideal for lounging around or winning triathalons (just kidding), e-commerce

Figure Supp-9. Additional qualitative examples of the Vanilla SD vs. LGS-117 fine-tuned SD model on the LGS dataset. Panels: (a) Input Prompt, (b) Vanilla SD, (c) LGS-117. Input prompts:

a photo of ana silver co. earrings rainbow moonstone earrings 3/4" (925 sterling silver) earr415021

a photo of vans vault vans vault old skool lx - croc skin/flame

a photo of myconquering conquering unisex black joggers

a photo of chopard cat eye unisex sunglasses

a photo of invicta bracelets elements men's bracelet

a photo of wristwatchstraps.co smart watch accessories bumper cover+glass for apple watch - lilac 21 - 38mm

