Graph-based Semi-supervised Multi-label Learning Method
Abstract—The problem of multi-label classification has attracted great interest in the last decade. However, most multi-label learning methods focus only on supervised settings and cannot effectively make use of the relatively inexpensive and easily obtained large numbers of unlabeled samples. To solve this problem, we put forward a novel graph-based semi-supervised multi-label learning method, called GSMM. GSMM characterizes the inherent correlations among multiple labels by the Hilbert-Schmidt independence criterion. It is expected to derive the optimal assignment of class membership to unlabeled samples by maximizing the consistency with the class label correlations while simultaneously keeping the labels as smooth as possible on the sample feature graph. Experiments comparing GSMM with state-of-the-art multi-label learning approaches on several real-world datasets show that GSMM can effectively learn from both labeled and unlabeled samples. Especially when labeled samples are relatively rare, it improves performance greatly.

Keywords—multi-label learning; graph-based semi-supervised learning; Hilbert-Schmidt independence criterion
I. INTRODUCTION

Traditional learning methods, including the k-nearest neighbor method, decision trees, neural networks and support vector machines, cannot be directly used for multi-label learning problems. Motivated by this challenge, many multi-label learning methods have been proposed in the last decade. According to the learning criterion, most of these methods can be divided into two learning frameworks, namely problem transformation methods (PRM) and algorithm adaptation methods (AAM). PRM tackles the multi-label learning problem by transforming it into other well-established learning scenarios. For example, Binary Relevance [1] and Classifier Chains [2] build a classifier for each class, decomposing the multi-label learning task into a combination of independent binary classifications. Label Powerset [3] and Random k-Labelsets [4] take the label combinations of samples as new class labels, transforming multi-label problems into multi-class learning problems. Unlike PRM, AAM adapts traditional single-label learning approaches to deal with multi-label data. For example, Multi-Label k-Nearest Neighbor (MLKNN) [5] and Rank-SVM [6] adapt the k-nearest neighbor method and the support vector machine, respectively, for multi-label learning.
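To make the problem-transformation idea concrete, the following minimal sketch illustrates Binary Relevance; the base learner (logistic regression from scikit-learn) and the helper name are illustrative choices, not part of the cited methods.

import numpy as np
from sklearn.linear_model import LogisticRegression

def binary_relevance(X_train, Y_train, X_test):
    # Binary Relevance: one independent binary classifier per label.
    # Y_train is an n x q binary indicator matrix; labels whose column is
    # constant would need special handling in practice.
    n_test, q = X_test.shape[0], Y_train.shape[1]
    Y_pred = np.zeros((n_test, q), dtype=int)
    for c in range(q):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X_train, Y_train[:, c])          # binary problem for label c
        Y_pred[:, c] = clf.predict(X_test)
    return Y_pred

Classifier Chains differ only in that the prediction for one label is appended to the features used for the next, while Label Powerset instead treats each distinct label combination as a single class.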
At present, research in both multi-label learning frameworks mainly focuses on supervised settings, which assume the existence of large amounts of labeled training data. However, with the increasing number of class labels, especially in multi-label learning problems, the training data become relatively sparse and may not be enough to obtain good generalization capacity on unlabeled data. One way to solve this issue is to develop semi-supervised learning methods. For single-label learning, it has been demonstrated that semi-supervised learning methods can effectively make use of unlabeled data to improve generalization performance. But for multi-label learning problems, as far as we know, the related work is still scarce. References [7, 8] adapt the graph-based semi-supervised learning method for multi-label learning by embedding the label correlations. Reference [9] proposed a normalized inner product between a kernel matrix and a matrix constructed from the labels for labeling the unlabeled data. Reference [10] proposes a semi-supervised method, called TRAM, which is similar to a random walk method and is based on the manifold assumption.

In this paper, we put forward a novel graph-based semi-supervised multi-label learning method, called GSMM. GSMM exploits the consistency on both the category level and the instance level simultaneously. On the instance level, a graph is defined over both labeled and unlabeled instances, where each node represents one instance and each edge weight reflects the similarity between the corresponding pair of instances. On the category level, GSMM quantifies the inherent correlations among multiple labels by the Hilbert-Schmidt independence criterion (HSIC) [11]. The two aspects are combined into a quadratic energy function. Then, by minimizing this energy function, GSMM is expected to derive the optimal assignment of class membership to unlabeled instances, such that the labels are smooth on the instance graph and consistent with the category correlations.
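As a concrete illustration of the instance-level graph, the sketch below builds a k-nearest-neighbor graph over all labeled and unlabeled instances. The Gaussian (heat-kernel) edge weighting and the function name are assumptions made for illustration; this excerpt does not reproduce the exact similarity function used by GSMM.

import numpy as np

def knn_instance_graph(X, k=10, sigma=1.0):
    # Pairwise squared Euclidean distances between all instances (rows of X).
    sq = np.sum(X ** 2, axis=1)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    np.fill_diagonal(dist2, np.inf)                      # no self-loops

    # Connect each instance to its k nearest neighbors, N_k(i),
    # with Gaussian weights; then symmetrize.
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(dist2[i])[:k]
        W[i, nbrs] = np.exp(-dist2[i, nbrs] / (2.0 * sigma ** 2))
    W = np.maximum(W, W.T)

    D = np.diag(W.sum(axis=1))                           # degree matrix
    return W, D - W                                      # similarities and graph Laplacian

The returned Laplacian D - W is the quantity that typically appears in the smoothness term of graph-based semi-supervised objectives.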
Compared with other methods, our work is most closely related to that of Zha et al. [7, 8], as all three methods construct graphs on the instance level. On the category level, however, we measure the category correlations by HSIC, whereas Zha et al. do so by constructing another graph, in which each node is a category and each edge weight represents the similarity between categories.
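For reference, the empirical HSIC estimator of Gretton et al. can be written as tr(KHLH)/(n-1)^2, where K and L are kernel matrices computed on the two variables and H is the centering matrix. The sketch below applies it to a pair of label columns with linear kernels; how GSMM aggregates such pairwise values into its energy function is not shown in this excerpt, so the second helper is only an illustrative assumption.

import numpy as np

def empirical_hsic(K, L):
    # Biased empirical HSIC: tr(K H L H) / (n - 1)^2,
    # with H = I - (1/n) * 1 1^T the centering matrix.
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def category_correlation(Y, a, b):
    # Illustrative: HSIC between label columns a and b using linear kernels.
    ya = Y[:, [a]].astype(float)              # n x 1 indicator vector of label a
    yb = Y[:, [b]].astype(float)
    return empirical_hsic(ya @ ya.T, yb @ yb.T)

With linear kernels on 0/1 indicator vectors this reduces to a squared empirical covariance, so it is large whenever two labels are strongly dependent, whether frequently co-occurring or mutually exclusive.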
Empirical studies comparing GSMM with state-of-the-art multi-label learning approaches on several real-world multi-label classification tasks demonstrate that our GSMM can effectively learn from both labeled and unlabeled samples.
This work was supported by the College Scientific Research Program of the Education Department of Hainan Province under Grant No. Hjkj2012-01.
where N_k(i) is the set of the k nearest neighbors of x_i.

Rewriting the optimization goal of GSMM, we have
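The objective itself is not reproduced in this excerpt. As a generic stand-in, the sketch below minimizes only the graph-smoothness part, tr(F^T L F), with the labeled rows of F clamped to their given labels; this is the classical harmonic-function solution of graph-based semi-supervised learning, whereas GSMM's actual energy additionally enforces consistency with the HSIC-based category correlations.

import numpy as np

def harmonic_label_scores(L, Y_l, labeled_idx, unlabeled_idx):
    # Minimize E(F) = tr(F^T L F) subject to F_l = Y_l (labeled rows clamped).
    # Setting the gradient over the unlabeled block to zero gives
    #   L_uu F_u + L_ul Y_l = 0  =>  F_u = -L_uu^{-1} L_ul Y_l.
    L_uu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    L_ul = L[np.ix_(unlabeled_idx, labeled_idx)]
    F_u = np.linalg.solve(L_uu, -L_ul @ Y_l)   # soft scores for unlabeled instances
    return F_u                                  # threshold (e.g. at 0.5) to assign labels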
TABLE I. THE DATASETS FOR MULTI-LABEL LEARNING
[7] Zha Z J, Mei T, Wang J, Wang Z, Hua X S. Graph-based semi-supervised learning with multiple labels. Journal of Visual Communication and Image Representation, 2009, 20(2): 97~103.
[8] Zha Z J, Mei T, Wang J, Wang Z, Hua X S. Graph-based semi-supervised learning with multiple labels. Journal of Visual Communication and Image Representation, 2009, 20(2): 97~103.
[9] Chen G, Song Y, Wang F, Zhang C S. Semi-supervised multi-label learning by solving a Sylvester equation. In: SIAM International Conference on Data Mining. Atlanta, USA: Curran Associates, 2008. 410~419.
[10] Liu Y, Jin R, Yang L. Semi-supervised multi-label learning by constrained nonnegative matrix factorization. In: Proceedings of the 21st AAAI Conference on Artificial Intelligence. Boston, USA: AAAI Press, 2006. 666~671.
[11] Kong X N, Ng M K, Zhou Z H. Transductive multi-label learning via label set propagation. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(5): 788~799.
[12] Gretton A, Bousquet O, Smola A, Schölkopf B. Measuring statistical dependence with Hilbert-Schmidt norms. In: 16th International Conference on Algorithmic Learning Theory. Singapore, Berlin: Springer Verlag, 2005. 63~77.
[13] Song L, Smola A, Gretton A, Borgwardt K M. A dependence maximization view of clustering. In: Proceedings of the 24th International Conference on Machine Learning. Corvallis, USA: ACM, 2007. 815~822.
[Figure 1 plots Hamming Loss, Ranking Loss, One Error, and Average Precision against Label Rate for GSMM, MLKNN, BR, and ML-LGC on each dataset.]
a) Scene b) Emotion c) Yeast
Fig. 1. Results on the a) Scene, b) Emotion, and c) Yeast classification tasks with different label rates. Except for Average Precision, the lower the value, the better the performance. Along with the curves, we also plot the mean ± std at each point over different random dataset partitions.