Fashion Modeling via Visual Patterns

ABSTRACT

We propose a method to model fashionable dresses. We first discover common visual patterns that appear in dress images using a human-in-the-loop, active clustering approach. A fashionable dress is expected to contain certain visual patterns that make it fashionable. We propose an approach to jointly identify fashionable visual patterns and learn a discriminative fashion classifier. The results show that interesting fashionable patterns can be discovered on a newly collected dress dataset. Furthermore, our model also achieves high accuracy in distinguishing fashionable from unfashionable dresses.

Index Terms— Fashion Classification, Visual Pattern Discovery, Active Clustering

[Fig. 1. We aim to discover common visual patterns which make dresses fashionable. The first row shows 5 fashionable dresses with a shared fashionable visual pattern (labeled in magenta), and the second row shows non-fashionable ones.]
(2) There are dependencies between some pairs of visual patterns. A dress is more likely to be fashionable or unfashionable when a certain pair of visual patterns co-occurs. Based on this intuition, we add the second term to characterize the compatibility between the class label y and visual pattern pairs. We build an undirected graph G = (V, E) with a tree structure, where each node v ∈ V represents a visual pattern, and an edge (j, k) ∈ E means the j-th and the k-th visual patterns have dependencies. For each visual pattern pair, we count the co-occurrence frequencies in the training data. We then run the maximum spanning tree algorithm to generate the edges E of the graph. For each pair (j, k) ∈ E, we define:

\varphi(h_j, h_k, y) =
\begin{cases}
y, & \text{if } h_j = 1 \text{ and } h_k = 1 \\
0, & \text{otherwise}
\end{cases}
\qquad (4)
Intuitively, if a visual pattern pair (j, k) co-occurs more often in fashionable dress images, then it is more compatible with y = 1, and the corresponding model parameter ω_{j,k} is more likely to be positive.
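As a concrete illustration of the tree construction, here is a minimal sketch assuming pattern activations are given as a binary image-by-pattern matrix (the input format, variable names, and use of networkx are our own assumptions, not details from the paper):

```python
import networkx as nx
import numpy as np

def build_pattern_tree(H):
    """H: (num_images, num_patterns) binary matrix, H[i, k] = 1 if the
    k-th visual pattern appears in image i. Returns the tree edges E."""
    cooc = H.T @ H                        # pairwise co-occurrence counts
    K = H.shape[1]
    G = nx.Graph()
    for j in range(K):
        for k in range(j + 1, K):
            G.add_edge(j, k, weight=int(cooc[j, k]))
    # The maximum spanning tree keeps the most frequently co-occurring
    # pairs while enforcing the tree structure needed for exact inference.
    return list(nx.maximum_spanning_tree(G).edges())
```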
This model learns the discriminative classifier by finding hidden rationales: a dress is fashionable because it has certain fashionable visual patterns and visual pattern pairs. Hence, it can help identify which visual patterns and visual pattern pairs make a dress fashionable.

3.3.2. Model Inference and Training

The section above shows how we compute the compatibility score when h is specified. In the inference procedure, however, h is a latent variable. For an input x, we must find the class label y^* and assignment h^* that produce the largest confidence score:

\langle y^*, h^* \rangle = \arg\max_{(y,h)} f_\omega(x, y, h) \qquad (5)

In practice, we use dynamic programming to solve this inference problem, exploiting the tree structure of G.
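The paper does not spell out the recursion, but a standard max-sum dynamic program on the tree solves the maximization over h for a fixed class label. A minimal sketch, assuming the unary scores ψ and pairwise weights ω_{j,k} are supplied as arrays and y ∈ {+1, −1} (both assumptions of ours):

```python
def infer_h(y, unary, edges, pair_w):
    """Best h in {0,1}^K for a fixed label y by max-sum DP on the tree.

    unary[k][b]   : score of setting h_k = b (from psi; assumed given)
    edges         : tree edges (j, k) from the maximum spanning tree
    pair_w[(j,k)] : weight omega_{j,k}; per Eq. (4) the pairwise term
                    contributes pair_w * y only when h_j = h_k = 1
    """
    K = len(unary)
    adj = {k: [] for k in range(K)}
    for j, k in edges:
        adj[j].append(k)
        adj[k].append(j)

    # Root the tree at node 0; 'order' lists parents before children.
    order, parent = [0], {0: None}
    for v in order:
        for u in adj[v]:
            if u != parent[v]:
                parent[u] = v
                order.append(u)

    m = [[unary[k][0], unary[k][1]] for k in range(K)]   # upward messages
    choice = {}                                          # backpointers
    for v in reversed(order[1:]):                        # leaves first
        p = parent[v]
        w = pair_w.get((p, v), pair_w.get((v, p), 0.0))
        for hp in (0, 1):
            s0 = m[v][0]
            s1 = m[v][1] + (w * y if hp == 1 else 0.0)
            choice[(v, hp)] = int(s1 > s0)
            m[p][hp] += max(s0, s1)

    h = [0] * K                                          # decode top-down
    h[0] = int(m[0][1] > m[0][0])
    for v in order[1:]:
        h[v] = choice[(v, h[parent[v]])]
    return max(m[0]), h

# Eq. (5) then reduces to running infer_h once per class label (with that
# label's unary scores) and keeping the higher-scoring pair (y, h).
```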
Similar to [17, 14], we adopt an EM-like algorithm to learn the model parameters, as the model contains latent variables. We first initialize the model parameters; the learning algorithm then alternates between inferring the latent variables and updating the model parameters. Assume we have T training examples {(x_i, y_i) | i = 1, 2, ..., T}. With the inferred latent variables h_i^*, learning becomes a standard latent structural SVM problem, formulated as follows:

\min_{\omega} \; \frac{1}{2}\|\omega\|^2 + C \sum_{i=1}^{T} \xi_i
\text{s.t.} \;\; \omega \cdot \Phi(x_i, y_i, h_i^*) - \max_{(\hat{y}_i, \hat{h}_i)} \omega \cdot \Phi(x_i, \hat{y}_i, \hat{h}_i) \ge 1 - \xi_i, \quad \xi_i \ge 0, \;\; \forall \hat{y}_i \ne y_i \qquad (6)

Following [21], we adopt the stochastic gradient descent method to learn the model parameter ω: at each iteration it picks an example x_i, infers its (ŷ_i, ĥ_i), and updates the model parameters with a gradient descent step. Interested readers may refer to [21] for a more detailed introduction.
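A minimal sketch of one such stochastic update on the objective in Eq. (6), assuming a joint feature map `phi` and an inference helper `best_competing` are available (both names, and the step-size handling, are our own illustrative assumptions):

```python
import numpy as np

def sgd_step(w, x_i, y_i, h_star, phi, best_competing, lr=1e-3, C=1.0, T=1785):
    """One stochastic subgradient step for the latent structural SVM.

    phi(x, y, h)            : joint feature vector Phi(x, y, h)
    best_competing(w, x, y) : highest-scoring (y_hat, h_hat) with y_hat != y
    """
    y_hat, h_hat = best_competing(w, x_i, y_i)
    delta = phi(x_i, y_i, h_star) - phi(x_i, y_hat, h_hat)
    grad = w / T                  # per-example share of the L2 regularizer
    if w @ delta < 1.0:           # hinge constraint of Eq. (6) is violated
        grad = grad - C * delta
    return w - lr * grad
```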
3.3.3. Model 2: A model with conventional image classifiers

Our current model is built on the discovered common visual patterns. Some discriminative information that is useful for classification might be lost when creating this intermediate representation. We therefore propose a new model that fully exploits the discriminative power of the original image features. We define the potential function as:

\omega \cdot \Phi(x, y, h) = \omega_r\,\phi(x, y) + \sum_{k} \omega_k\,\psi(x, h_k, y) + \sum_{(j,k) \in E} \omega_{j,k}\,\varphi(h_j, h_k, y) \qquad (7)

The first term represents a conventional linear image classifier that does not consider the visual pattern information. Again, we do not keep φ(x, y) as a high-dimensional feature vector. Instead, we define it as φ(x, y) = y · (f(x) − 0.5), where f(x) is the "probabilistic" output of a conventional image classifier trained offline on the positive and negative training data. Intuitively, the class label y must be compatible with the classification score f(x). This model effectively combines a conventional image classifier with our visual pattern based classifier for more accurate classification.
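To make the role of the first term explicit, a tiny sketch, assuming y ∈ {+1, −1} and a probability-calibrated base classifier (our reading of the text, not a detail it states):

```python
def phi_classifier_term(f_x, y):
    """phi(x, y) = y * (f(x) - 0.5): positive when the base classifier's
    probabilistic output f(x) in [0, 1] agrees with the label y in {+1, -1}."""
    return y * (f_x - 0.5)

def model2_score(w_r, f_x, y, pattern_terms):
    # Eq. (7): base-classifier term plus the visual-pattern unary and
    # pairwise terms, here summarized as a single 'pattern_terms' value.
    return w_r * phi_classifier_term(f_x, y) + pattern_terms
```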
4. EXPERIMENTS

We perform experiments on our collected dress dataset, which contains 1011 positive and 1637 negative images. We discover the common visual patterns and learn the latent structural SVM models on 1785 training images (676 positive and 1109 negative); the remaining 863 images are used for testing. In the common visual pattern discovery process, we combine three features to train a binary SVM classifier for each visual pattern: color words, bag of SIFT words, and bag of texton words. For the color features, we encode the RGB value of each pixel as an integer between 0 and 511, and represent each image as a 512-dimensional histogram. SIFT is extracted densely and quantized into 1000 visual words. To extract texture information, each image is convolved with the Leung-Malik filter bank [22], and the filter responses are quantized into 1000 textons. Images are then represented as "bag-of-words" histograms for the SIFT and texton features, respectively. For each type of feature, we construct a histogram intersection kernel, and we combine these three kernels with equal weights to train an SVM classifier. With the visual pattern classifiers available, training our latent structural SVM model usually takes several hours, and inference on a test image takes less than one second.
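A minimal sketch of the 512-bin color encoding, assuming 3 bits per RGB channel (512 = 8^3; the paper does not state the exact quantization, so this is one plausible scheme), together with the histogram intersection kernel:

```python
import numpy as np

def color_histogram(image):
    """image: (H, W, 3) uint8 RGB array -> L1-normalized 512-bin histogram."""
    r = image[..., 0].astype(np.int32) >> 5   # 3 bits per channel: 0..7
    g = image[..., 1].astype(np.int32) >> 5
    b = image[..., 2].astype(np.int32) >> 5
    codes = (r << 6) | (g << 3) | b           # integer code in [0, 511]
    hist = np.bincount(codes.ravel(), minlength=512).astype(np.float64)
    return hist / hist.sum()

def hik(h1, h2):
    """Histogram intersection kernel between two normalized histograms."""
    return np.minimum(h1, h2).sum()
```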
4.1. Classification Performance

We first evaluate fashion classification. Besides our method, we also evaluate a set of binary SVM classifiers trained on multiple features and their combinations, with the histogram intersection kernel (HIK) and the χ2 kernel, respectively. The classification results are presented in Table 1. Among all the baselines, the spatial pyramid SVM trained on all the features with the χ2 kernel (SP(All)+χ2) produces the best score. It is interesting that computer vision algorithms can achieve such high accuracy on this classification task, which suggests that a fashion classifier could be applied in real visual shopping applications. Our Model 2 is trained based on SP(All)+χ2, and both Model 1 and Model 2 achieve promising accuracy. Table 2 shows the precision, recall, and F1 scores for Model 1, Model 2, and the baseline SP(All)+χ2. Model 2 achieves the highest recall and F1 scores among these three methods.
Table 1. Comparison on the classification accuracy. "C", "S", and "T" denote the color, SIFT, and texton features; "All" is the combination of "C", "S", and "T". "SP" means a 3-layer spatial pyramid [23] is used. "Linear" denotes a linear SVM; "χ2" and "HIK" denote the χ2 kernel and the histogram intersection kernel. "Model 1" and "Model 2" are the two models proposed in this paper.

          C       S       T       All
Linear    0.73    0.78    0.74    0.77
HIK       0.74    0.86    0.79    0.88
χ2        0.82    0.85    0.83    0.88

          SP(C)   SP(S)   SP(T)   SP(All)
Linear    0.64    0.83    0.75    0.71
HIK       0.77    0.89    0.87    0.90
χ2        0.79    0.89    0.88    0.90

Model 1   Model 2
0.85      0.93

Table 2. Comparison on the precision, recall, and F1 measure. "SP(All)+χ2" means the spatial pyramid SVM trained on all the features with the χ2 kernel.

Method      Model 1   Model 2   SP(All)+χ2
Precision   0.80      0.88      0.89
Recall      0.83      0.94      0.90
F1          0.81      0.91      0.90
4.2. Qualitative Results of Discovered Fashionable Visual Patterns

Our models can identify fashionable visual patterns by jointly learning the classifier. As formulated in Eq. 3, a learned model parameter ω_k contains two dimensions: the first measures how well the k-th visual pattern is compatible with its appearance feature, and the second measures how well it is compatible with its class label. Hence we can use ω_k to identify which patterns are more likely to be fashionable. For the k-th visual pattern, we measure its fashionability F_k as:

F_k =
\begin{cases}
\omega_k(1) \cdot \omega_k(2), & \text{if } \omega_k(1) > 0 \\
\text{NULL}, & \text{if } \omega_k(1) \le 0
\end{cases}
\qquad (8)

where "NULL" means we neglect patterns that are not compatible with their appearance features. A large, positive F_k indicates that the pattern is fashionable and stable. For our Model 1, 86 fashionable patterns are discovered across 5 parts. In Fig. 2, we show the top-ranked fashionable and unfashionable visual patterns discovered by Model 1. For each part, the first 4 images share the same fashionable visual pattern, labeled in magenta. We find the discovered visual patterns reasonable. Such discovered visual patterns can help people understand what makes a dress fashionable or unfashionable (at this specific point in time). They can also help designers capture fashion trends when creating their own designs.
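A small sketch of ranking patterns by Eq. (8), assuming the learned ω_k vectors are stacked into a (K, 2) array (our own illustration of the scoring rule):

```python
import numpy as np

def rank_fashionable_patterns(omega):
    """omega: (K, 2) array; omega[k, 0] = appearance compatibility,
    omega[k, 1] = class-label compatibility (the two dimensions in Eq. 8)."""
    F = np.where(omega[:, 0] > 0, omega[:, 0] * omega[:, 1], -np.inf)  # NULL -> -inf
    order = np.argsort(-F)
    return [(int(k), float(F[k])) for k in order if np.isfinite(F[k])]
```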
5. CONCLUSIONS

In this paper, we present a method to discover fashionable visual patterns and learn fashion classifiers. The generated results can be useful for design and for visual shopping. In the future, we are interested in investigating how to perform relevance feedback for fashionable visual pattern based retrieval. We are also interested in applying this method to other products such as handbags and shoes.
6. ACKNOWLEDGMENT

This research is supported by the Singapore National Research Foundation under its International Research Centre @ Singapore Funding Initiative and administered by the IDM Programme Office.
7. REFERENCES

[1] N. Snavely, S.M. Seitz, and R. Szeliski, "Photo tourism: exploring photo collections in 3D," TOG, 2006.
[2] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, "Real-time human pose recognition in parts from single depth images," in CVPR, 2011.
[3] W. Hu, T. Tan, L. Wang, and S. Maybank, "A survey on visual surveillance of object motion and behaviors," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 2004.
[4] J.J. Tsay, C.H. Lin, C.H. Tseng, and K.C. Chang, "On visual clothing search," in TAAI, 2011.
[5] H. Wang, J. Du, and Q. Guo, "The application of content based image retrieval technology in clothing retrieval system," CTA, 2009.
[6] S. Liu, Z. Song, G. Liu, C. Xu, H. Lu, and S. Yan, "Street-to-shop: Cross-scenario clothing retrieval via parts alignment and auxiliary set," in CVPR, 2012.
[7] K. Yamaguchi, M.H. Kiapour, L.E. Ortiz, and T.L. Berg, "Parsing clothing in fashion photographs," in CVPR, 2012.
[8] S. Liu, J. Feng, Z. Song, T. Zhang, H. Lu, C. Xu, and S. Yan, "Hi, magic closet, tell me what to wear!," in ACM Multimedia, 2012.
[9] Y.J. Lee and K. Grauman, "Object-graphs for context-aware category discovery," in CVPR, 2010.
[10] B.C. Russell, W.T. Freeman, A.A. Efros, J. Sivic, and A. Zisserman, "Using multiple segmentations to discover objects and their extent in image collections," in CVPR, 2006.
[11] G. Kim, C. Faloutsos, and M. Hebert, "Unsupervised modeling of object categories using link analysis techniques," in CVPR, 2008.
[12] J. Yuan and Y. Wu, "Spatial random partition for common visual pattern discovery," in ICCV, 2007.
[13] D. Parikh and K. Grauman, "Interactively building a discriminative vocabulary of nameable attributes," in CVPR, 2011.
[14] L.L. Zhu, Y. Chen, A. Yuille, and W. Freeman, "Latent hierarchical structural learning for object detection," in CVPR, 2010.
[15] P.F. Felzenszwalb, R.B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," TPAMI, 2009.
[16] Y. Wang and G. Mori, "A discriminative latent model of object classes and attributes," in ECCV, 2010.
[17] C.N.J. Yu and T. Joachims, "Learning structural SVMs with latent variables," in ICML, 2009.
[18] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth, "Describing objects by their attributes," in CVPR, 2009.
[19] B.J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, 2007.
[20] S. Tong and D. Koller, "Support vector machine active learning with applications to text classification," Journal of Machine Learning Research, 2002.
[21] S. Branson, P. Perona, and S. Belongie, "Strong supervision from weak annotation: Interactive training of deformable part models," in ICCV, 2011.
[22] T. Leung and J. Malik, "Representing and recognizing the visual appearance of materials using three-dimensional textons," IJCV, 2001.
[23] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in CVPR, 2006.