
2013 International Conference on Mechatronic Sciences, Electric Engineering and Computer (MEC)

Dec 20-22, 2013, Shenyang, China

Graph-based Semi-supervised Multi-label Learning Method

Zhang Chen-Guang, Zhang Xia-Huan


College of Information and Technology
Hainan University
Haikou, China
e-mail: [email protected], [email protected]

Abstract—The problem of multi-label classification has attracted great interest in the last decade. However, most multi-label learning methods focus only on supervised settings and cannot effectively make use of the relatively inexpensive and easily obtained large numbers of unlabeled samples. To solve this problem, we put forward a novel graph-based semi-supervised multi-label learning method, called GSMM. GSMM characterizes the inherent correlations among multiple labels by the Hilbert-Schmidt independence criterion. It is expected to derive the optimal assignment of class memberships to unlabeled samples by maximizing the consistency with the class label correlations while simultaneously remaining as smooth as possible on the sample feature graph. Experiments comparing GSMM to state-of-the-art multi-label learning approaches on several real-world datasets show that GSMM can effectively learn from both labeled and unlabeled samples. Especially when labeled data is relatively rare, it improves the performance greatly.

Keywords—multi-label learning; graph-based semi-supervised learning; Hilbert-Schmidt independence criterion

I. INTRODUCTION

Traditional learning methods, including the k-nearest neighbor method, decision trees, neural networks and support vector machines, cannot be directly used for multi-label learning problems. Motivated by this challenge, many multi-label learning methods have been proposed in the last decade. According to the learning criterion, most of these methods fall into two learning frameworks, namely problem transformation methods (PRM) and algorithm adaptation methods (AAM). PRM tackle the multi-label learning problem by transforming it into other well-established learning scenarios. For example, Binary Relevance [1] and Classifier Chains [2] learn a classifier for each class, decomposing the multi-label learning task into a combination of independent binary classifications. Label Powerset [3] and Random k-Labelsets [4] take the label combinations of samples as new class labels, transforming multi-label problems into multi-class learning problems. Unlike PRM, AAM adapt traditional single-label learning approaches to deal with multi-label data. For example, Multi-Label k-Nearest Neighbor (MLKNN) [5] and Rank-SVM [6] adapt the k-nearest neighbor method and the support vector machine, respectively, to multi-label learning.

At present, research in both multi-label learning frameworks has mainly focused on supervised settings, which assume the existence of large amounts of labeled training data. However, with the increasing number of class labels, especially in multi-label learning problems, the training data is relatively sparse and may not be enough to obtain good generalization capacity on unlabeled data. One way to address this issue is to develop semi-supervised learning methods. In the single-label setting, it has been verified that semi-supervised learning methods can effectively make use of unlabeled data to improve generalization performance. But for multi-label learning problems, as far as we know, the related work is still scarce. References [7, 8] adapt the graph-based semi-supervised learning method to multi-label learning by embedding the label correlations. Reference [9] proposed a normalized inner product between a kernel matrix and a matrix constructed from the labels for labeling the unlabeled instances. Reference [10] proposes a semi-supervised method, called Tram, which is similar to a random-walk method and is based on the manifold assumption.

In this paper, we put forward a novel graph-based semi-supervised multi-label learning method, called GSMM. GSMM exploits the consistency on both the category level and the instance level simultaneously. On the instance level, a graph is defined over both labeled and unlabeled instances, where each node represents one instance and each edge weight reflects the similarity between the corresponding pair of instances. On the category level, GSMM quantifies the inherent correlations among multiple labels by the Hilbert-Schmidt independence criterion (HSIC) [11]. The two aspects of GSMM are combined into a quadratic energy function. By minimizing this energy function, GSMM is expected to derive the optimal assignment of class memberships to unlabeled instances, such that labels are smooth on the instance graph and consistent with the category correlations.

Compared to other methods, our work is most closely related to that of Zha et al. [7, 8], as all three methods construct graphs on the instance level. On the category level, however, we measure the category correlations by HSIC, whereas Zha et al. construct another graph, where each node is a category and each edge weight represents the similarity between categories.

Empirical studies, comparing GSMM to state-of-the-art multi-label learning approaches on several real-world multi-label classification tasks, demonstrate that GSMM can effectively boost the performance of multi-label classification by using both labeled and unlabeled data. Especially when the labeled data is relatively rare, it can improve the performance greatly.
This work was supported by the College Scientific Research Program of the Education Department of Hainan Province under Grant No. Hjkj2012-01.



II. HILBERT-SCHMIDT INDEPENDENCE CRITERION

HSIC, which has been used for cluster analysis, taxonomy inference [12] and multi-label dimensionality reduction, is a kernel-based independence measure. We adopt it to measure the inherent correlations among multiple labels, considering its simplicity and neat theoretical properties such as exponential convergence.

Let ℱ be a reproducing kernel Hilbert space (RKHS) of functions from 𝒳 to ℝ, where 𝒳 is a separable metric space. Each point x ∈ 𝒳 is associated with an element φ(x) ∈ ℱ (we call φ the feature map), and the corresponding kernel is

k(x, x') = ⟨φ(x), φ(x')⟩,  ∀ x, x' ∈ 𝒳    (1)

where ⟨·, ·⟩ is the inner product on the space ℱ. Similarly, we can define a second RKHS 𝒢 with respect to the separable metric space 𝒴, with feature map ψ: 𝒴 → 𝒢 and kernel

l(y, y') = ⟨ψ(y), ψ(y')⟩,  ∀ y, y' ∈ 𝒴    (2)

Let Pr_{𝒳×𝒴} be the joint distribution on (𝒳 × 𝒴, Γ × Λ), where Γ and Λ are the Borel sets of 𝒳 and 𝒴, with associated marginals Pr_𝒳 and Pr_𝒴. Then the cross-covariance operator C_xy: 𝒢 → ℱ is defined as

C_xy = E_{x,y}[φ(x) ⊗ ψ(y)] - μ_x ⊗ μ_y    (3)

where μ_x and μ_y are the expectations of φ(x) and ψ(y), and for f ∈ ℱ and g ∈ 𝒢 the tensor product operator f ⊗ g: 𝒢 → ℱ is defined as

(f ⊗ g)h = f ⟨g, h⟩,  ∀ h ∈ 𝒢    (4)

A measure of the dependence of 𝒳 and 𝒴 is then the Hilbert-Schmidt norm of this operator (the sum of its squared singular values), ||C_xy||^2_HS. Given the observed sample Z = {(x_1, y_1), ..., (x_n, y_n)}, HSIC is defined to be the empirical estimate of ||C_xy||^2_HS:

HSIC(ℱ, 𝒢, Z) = (n - 1)^{-2} Tr[HKHL]    (5)

where H = I - (1/n)ee^T, I is the identity matrix, e is the n × 1 vector of ones, and K and L are the Gram matrices of the kernels k(x, x') and l(y, y') with respect to the observed sample Z.
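As a quick illustration of Eq. (5), the following sketch (a minimal example assuming NumPy; the function and variable names are ours, not the authors') computes the empirical HSIC between a feature kernel K and a label kernel L.

```python
import numpy as np

def empirical_hsic(K, L):
    """Empirical HSIC of Eq. (5): (n - 1)^{-2} Tr[H K H L].

    K, L : (n, n) Gram matrices over the same sample, e.g. an RBF kernel
    on the feature vectors and a linear kernel on the label vectors.
    """
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix H = I - (1/n) e e^T
    return np.trace(H @ K @ H @ L) / (n - 1) ** 2

# Toy usage: RBF kernel on features, linear kernel on multi-label indicators.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))                        # 50 instances, 10 features
Y = (rng.random((50, 4)) > 0.7).astype(float)        # 50 instances, 4 labels
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * np.median(sq_dists)))    # feature Gram matrix
L = Y @ Y.T                                          # label Gram matrix
print(empirical_hsic(K, L))
```

Larger values indicate a stronger statistical dependence between the two views of the sample.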
III. GRAPH-BASED SEMI-SUPERVISED MULTI-LABEL LEARNING METHOD

A. Notations

We first introduce some notations that will be used throughout the paper. Let V = {(x_i, y_i) ∈ 𝒳 × 𝒴 | i = 1, ..., v} be the labeled instance set and U = {(x_j, y_j) ∈ 𝒳 × 𝒴 | j = v+1, ..., v+u} be the unlabeled instance set with y_j (j = v+1, ..., v+u) unknown, where 𝒳 and 𝒴 are the spaces of feature vectors and label vectors. Also, for convenience, let

X = [x_1, x_2, ..., x_{v+u}],  Y = [y_1, y_2, ..., y_{v+u}]    (6)

Each y_i = (y_{i1}, ..., y_{ik})^T is a k-dimensional real vector, and y_{ij} (1 ≤ j ≤ k) represents the confidence that x_i is associated with label j. For a labeled instance, its label vector y_i (1 ≤ i ≤ v) is predefined as:

y_{ij} = 1 if x_i is associated with label j;  y_{ij} = 0 otherwise    (7)
Gram matrix L with respect to Y as[12] Differentiating E(Y) with respect to Y and let be zero, we
have
L Y T 3Y (11)
wE (Y )
2LUYUT +2LUV YVT -P 2AUYUT +2AUV YVT 3 0 (21)
Where 3 is the category structure, characterized as below: wYU
3 YV YVT (12) which can be transformed into
Also given the Gram matrix K with respect to X . Then the LU YUT -P AU YUT 3 = P AUV YVT 3  LUV YVT (22)
dependence of X and Y can be estimated by [17]: If AU is invertible, then
HSIC( X , Y ) ( n  1) 2 Tr[HKHL] P 1 AU1LU YUT -YUT 3 =AU1 AUV YVT 3  P 1 LUV YVT (23)
2
( n  1) Tr[HKHY 3Y ] T
(13) Let
( n  1) 2 Tr[YHKHY T 3 ] B AU1LU C AU1 AUV YVT 3  LUV YVT (24)
Where Tr[.] is the trace function. As mentioned in [12], We have the simpler form of above equation
maximizing the above dependence can derive an optimal Y ,
which is consistent with the predefined class label structure 3 . BYUT -YUT 3 =C (25)
So we can define Ec (Y ) as the HSIC dependence of X and Y, This is a Sylvester Equation which often occurs in the control
domain. In order to solve the Sylvester Equation an iterative
with given category structure 3 : Krylov-subspace method is adopted by us.
Then E_s(Y) is defined as the label smoothness on the graph G:

E_s(Y) = (1/2) Σ_k Σ_{i,j} w_ij (Y_ik - Y_jk)^2    (10)

For the second term E_c(Y), we address it by HSIC. The Gram matrix L with respect to Y is given as [12]

L = Y^T Π Y    (11)

where Π is the category structure, characterized as:

Π = Y_V Y_V^T    (12)

Let K be the Gram matrix with respect to X. Then the dependence of X and Y can be estimated by [12]:

HSIC(X, Y) = (n - 1)^{-2} Tr[HKHL] = (n - 1)^{-2} Tr[HKH Y^T Π Y] = (n - 1)^{-2} Tr[YHKHY^T Π]    (13)

where Tr[·] is the trace function. As mentioned in [12], maximizing this dependence yields an optimal Y that is consistent with the predefined class label structure Π. So we define E_c(Y) as the HSIC dependence of X and Y, given the category structure Π:

E_c(Y) = Tr[YHKHY^T Π]    (14)

Incorporating the two regularization terms into Eq. (8), we obtain the optimization objective of GSMM:

min_Y  (1/2) Σ_k Σ_{i,j} w_ij (Y_ik - Y_jk)^2 - μ Tr[YHKHY^T Π],   s.t.  Y_V ≡ [y_1, ..., y_v]    (15)

where Y = [Y_V, Y_U], and Y_V and Y_U are the blocks of Y associated with the labeled and unlabeled data, respectively. The constraint guarantees that the derived Y keeps the given labels on the labeled data.

D. Solution

First, we simplify E_s(Y) into matrix form:

E_s(Y) = (1/2) Σ_k Σ_{i,j} w_ij (Y_ik - Y_jk)^2 = Tr[YLY^T]    (16)

where

L = D - W    (17)

is the combinatorial graph Laplacian, W is the adjacency matrix of the graph G, and D is a diagonal matrix whose (i, i)-th element equals the sum of the i-th row of W. Let

A = HKH    (18)

Rewriting the optimization goal of GSMM, we have

min_Y  E(Y) = Tr[YLY^T] - μ Tr[YAY^T Π],   s.t.  Y_V ≡ [y_1, ..., y_v]    (19)

In order to solve this optimization problem, we split the matrices L and A into four blocks according to whether or not the rows and columns correspond to labeled samples:

L = [ L_V  L_VU ; L_UV  L_U ],   A = [ A_V  A_VU ; A_UV  A_U ]    (20)

Differentiating E(Y) with respect to Y_U and setting the derivative to zero, we have

∂E(Y)/∂Y_U = 2 L_U Y_U^T + 2 L_UV Y_V^T - μ (2 A_U Y_U^T + 2 A_UV Y_V^T) Π = 0    (21)

which can be transformed into

L_U Y_U^T - μ A_U Y_U^T Π = μ A_UV Y_V^T Π - L_UV Y_V^T    (22)

If A_U is invertible, then

μ^{-1} A_U^{-1} L_U Y_U^T - Y_U^T Π = A_U^{-1} A_UV Y_V^T Π - μ^{-1} A_U^{-1} L_UV Y_V^T    (23)

Let

B = μ^{-1} A_U^{-1} L_U,   C = A_U^{-1} A_UV Y_V^T Π - μ^{-1} A_U^{-1} L_UV Y_V^T    (24)

We then have the simpler form

B Y_U^T - Y_U^T Π = C    (25)

This is a Sylvester equation, which often occurs in the control domain. To solve the Sylvester equation we adopt an iterative Krylov-subspace method.
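The paper solves Eq. (25) with an iterative Krylov-subspace method; for small or moderately sized problems a dense Bartels-Stewart solver also works. The sketch below (our own names and block handling, assuming SciPy) forms B and C as in Eq. (24) and solves for the unlabeled block Y_U.

```python
import numpy as np
from scipy.linalg import solve_sylvester

def solve_gsmm_labels(L_graph, A, Y_V, Pi, mu=0.1, n_labeled=None):
    """Solve B Y_U^T - Y_U^T Pi = C (Eq. (25)) for the unlabeled block Y_U.

    L_graph : (n, n) graph Laplacian of Eq. (17); A = H K H of Eq. (18);
    Y_V : (k, v) given labels; Pi = Y_V Y_V^T. Blocks follow Eq. (20).
    """
    v = n_labeled
    L_U, L_UV = L_graph[v:, v:], L_graph[v:, :v]
    A_U, A_UV = A[v:, v:], A[v:, :v]
    A_U_inv = np.linalg.inv(A_U)                     # assumes A_U is invertible
    B = A_U_inv @ L_U / mu                           # B = mu^{-1} A_U^{-1} L_U, Eq. (24)
    C = A_U_inv @ (A_UV @ Y_V.T @ Pi - L_UV @ Y_V.T / mu)
    # solve_sylvester(B, -Pi, C) solves B X + X (-Pi) = C, i.e. B X - X Pi = C,
    # with X = Y_U^T of shape (u, k).
    Y_U_T = solve_sylvester(B, -Pi, C)
    return Y_U_T.T                                   # (k, u) label confidences for U
```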
IV. EXPERIMENTS

In this section, we show the performance of GSMM on several real-world multi-label classification tasks. We compare GSMM with several general-purpose multi-label classification algorithms, including MLKNN [5] and Binary Relevance [1], which are applicable to a wide range of multi-label problems and represent state-of-the-art techniques in multi-label classification. We also compare it with ML-LGC [7], which is likewise a semi-supervised multi-label learning method.

A. Parameters Setting

The methods and their parameter settings are detailed below:
1. GSMM: the algorithm proposed in this paper. The number of nearest neighbors is set to 15, and the trade-off parameter μ is set to 0.1. To reduce computing time, the graph kernel is used as the Gram matrix of X, namely K = L.
2. MLKNN: MLKNN adapts the k-nearest neighbor method to the multi-label case. It also needs to build a nearest neighbor graph. The number of nearest neighbors is set to the same value as for GSMM, and the smoothing parameter is set to 1.
3. Binary Relevance: BR decomposes the multi-label learning task into a combination of independent binary classifications. We adopt the Naive Bayes method as the base classifier.
4. ML-LGC: ML-LGC is the method most closely related to our work. The number of nearest neighbors in ML-LGC is set to the same value as for the above methods, and its two trade-off parameters are set to 0.1 and 1.
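Putting the pieces together, the sketch below shows how the configuration above could drive one GSMM run, reusing the hypothetical helpers from the earlier snippets (knn_graph_weights, build_label_matrix, solve_gsmm_labels). Reading item 1 as using the instance-graph weight matrix as the kernel K is our interpretation, and the positive-confidence labeling rule is the one described later in the results analysis.

```python
import numpy as np

def run_gsmm(X_feat, labeled_label_sets, n_classes, k_nn=15, mu=0.1):
    """End-to-end GSMM sketch: graph, A = HKH, Pi, Sylvester solve, thresholding."""
    n = X_feat.shape[0]
    v = len(labeled_label_sets)                      # the first v rows are the labeled instances
    W = knn_graph_weights(X_feat, n_neighbors=k_nn)  # Eq. (9)
    L_graph = np.diag(W.sum(axis=1)) - W             # Eq. (17)
    K = W                                            # "K = L": graph kernel as the Gram matrix of X
    H = np.eye(n) - np.ones((n, n)) / n
    A = H @ K @ H                                    # Eq. (18)
    Y = build_label_matrix(labeled_label_sets, n - v, n_classes)
    Y_V = Y[:, :v]
    Pi = Y_V @ Y_V.T                                 # Eq. (12)
    Y_U = solve_gsmm_labels(L_graph, A, Y_V, Pi, mu=mu, n_labeled=v)
    predicted = (Y_U > 0).astype(int)                # assign a label when its confidence exceeds zero
    return Y_U, predicted
```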
TABLE I. THE DATASETS FOR MULTI-LABEL LEARNING

Name       Domain       Instances   Attributes   Classes   Cardinality   Density
Emotions   Music        593         72           6         1.869         0.311
Yeast      Biology      2417        103          14        4.237         0.303
Scene      Multimedia   2407        294          6         1.074         0.179
B. Datasets

The datasets used in the experiments are benchmark, widely used multi-label datasets. Details of the datasets are shown in Table I, where the cardinality is the average number of labels assigned to a sample; the density is the ratio of the average number of sample labels to the total number of labels of the dataset; and the classes and attributes denote the number of distinct class labels and the dimensionality of the instances, respectively. The datasets can be downloaded from the home page of the open source project Mulan: http://mlkd.csd.auth.gr/multilabel.html.

C. Evaluation Metrics

Traditional single-label performance evaluation metrics, including accuracy, precision, recall and F-measure, are not suitable for multi-label learning; the evaluation of multi-label learning is much more complicated. The following multi-label evaluation metrics are used in this paper:
(1) hamming loss: evaluates how many times an instance-label pair is misclassified, i.e. a label not belonging to the instance is predicted or a label belonging to the instance is not predicted.
(2) one-error: evaluates how many times the top-ranked label is not in the set of proper labels of the instance.
(3) ranking loss: evaluates the average fraction of label pairs that are reversely ordered for an instance.
(4) average precision: evaluates the average fraction of labels ranked above a proper label which are also proper.
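For the first two metrics, a minimal sketch (our own function names, assuming NumPy, 0/1 indicator matrices and real-valued confidence scores) could look as follows.

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Metric (1): fraction of misclassified instance-label pairs.
    Y_true, Y_pred : (n, k) binary indicator matrices."""
    return float(np.mean(Y_true != Y_pred))

def one_error(Y_true, scores):
    """Metric (2): fraction of instances whose top-ranked label is not a proper label.
    scores : (n, k) real-valued confidences, e.g. the Y learned by GSMM."""
    top = np.argmax(scores, axis=1)
    return float(np.mean(Y_true[np.arange(len(top)), top] == 0))

# Toy usage with three instances and four labels.
Y_true = np.array([[1, 0, 1, 0], [0, 1, 0, 0], [0, 0, 1, 1]])
scores = np.array([[0.9, 0.1, 0.4, 0.2], [0.2, 0.3, 0.8, 0.1], [0.1, 0.2, 0.7, 0.6]])
print(hamming_loss(Y_true, (scores > 0.5).astype(int)), one_error(Y_true, scores))
```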
D. Experiments Results and Analysis

In order to test the effect of unlabeled data, we randomly draw a few instances from every dataset according to a certain ratio as the labeled set and use the rest as the unlabeled set. The sample ratio ranges from 0.01 to 0.09 for the Emotion and Scene datasets, and from 0.1 to 0.19 for the Yeast dataset. All the results under the different label ratios are summarized into curve diagrams with error bars, where the horizontal axis is the label ratio and the vertical axis is the evaluation metric.

We first test the natural scene classification task on the Scene dataset. The performance evaluation results of GSMM, MLKNN, BR and ML-LGC are reported in Fig. 1(a). We can see that on every evaluation metric except hamming loss, GSMM and ML-LGC achieve better and more stable performance than MLKNN and BR. Moreover, GSMM is slightly better than ML-LGC on these metrics, especially as the label ratio decreases. One reasonable explanation is that GSMM and ML-LGC are both semi-supervised learning methods and can make use of both labeled and unlabeled data to improve performance. However, on hamming loss GSMM is not as good as the others; a possible reason is that the labeling threshold of GSMM is not proper, since we give an instance a label whenever the confidence of the class is greater than zero.

The second task is music emotion analysis. The whole dataset has 593 instances and 6 possible class labels, and all the results are shown in Fig. 1(b). It can be seen from the figure that GSMM outperforms the other three methods on almost every evaluation metric, except on hamming loss when the label rate is larger than 0.06, where MLKNN is better than GSMM.

The last multi-label task studied in this paper is yeast gene function analysis, the results of which are shown in Fig. 1(c). GSMM again achieves better performance than the others on almost every evaluation metric except hamming loss. On hamming loss, MLKNN is better than all the others, and the remaining methods give close results.

V. CONCLUSIONS

We have proposed a novel graph-based semi-supervised learning framework to address multi-label problems, which simultaneously takes the consistency on both the category level and the instance level into account. Empirical studies on several real-world tasks demonstrate that our GSMM can effectively boost the performance of multi-label classification by using unlabeled data in addition to labeled data.

ACKNOWLEDGMENT

The authors wish to thank the editor and the anonymous reviewers for their helpful comments and suggestions.

REFERENCES

[1] Zhang M L, Zhou Z H. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, to be published.
[2] Read J, Pfahringer B, Holmes G, Frank E. Classifier chains for multi-label classification. Machine Learning, 2011, 85(3): 333-359.
[3] Tsoumakas G, Vlahavas I. Random k-labelsets for multi-label classification. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(7): 1079-1089.
[4] Tsoumakas G, Vlahavas I. Random k-labelsets: an ensemble method for multilabel classification. In: Proceedings of the 18th European Conference on Machine Learning (ECML 2007), Warsaw, Poland, September 2007: 406-417.
[5] Zhang M L, Zhou Z H. A k-nearest neighbor based algorithm for multi-label classification. In: Proceedings of the 1st IEEE International Conference on Granular Computing (GrC 2005), Beijing, China, 2005: 718-721.
[6] Elisseeff A, Weston J. A kernel method for multi-labelled classification. In: Advances in Neural Information Processing Systems. MIT Press, 2001: 681-687.
[7] Zha Z J, Mei T, Wang J, Wang Z, Hua X S. Graph-based semi-supervised learning with multiple labels. Journal of Visual Communication and Image Representation, 2009, 20(2): 97-103.
[8] Zha Z J, Mei T, Wang J, Wang Z, Hua X S. Graph-based semi-supervised learning with multiple labels. Journal of Visual Communication and Image Representation, 2009, 20(2): 97-103.
[9] Chen G, Song Y, Wang F, Zhang C S. Semi-supervised multi-label learning by solving a Sylvester equation. In: SIAM International Conference on Data Mining. Atlanta, USA: Curran Associates, 2008: 410-419.
[10] Liu Y, Jin R, Yang L. Semi-supervised multi-label learning by constrained nonnegative matrix factorization. In: Proceedings of the 21st AAAI Conference on Artificial Intelligence. Boston, USA: AAAI Press, 2006: 666-671.
[11] Kong X N, Ng M K, Zhou Z H. Transductive multi-label learning via label set propagation. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(5): 788-799.
[12] Gretton A, Bousquet O, Smola A, Schölkopf B. Measuring statistical dependence with Hilbert-Schmidt norms. In: Proceedings of the 16th International Conference on Algorithmic Learning Theory (ALT 2005). Berlin: Springer, 2005: 63-77.
[13] Song L, Smola A, Gretton A, Borgwardt K M. A dependence maximization view of clustering. In: Proceedings of the 24th International Conference on Machine Learning. Corvallis, USA: ACM, 2007: 815-822.
[Figure 1 omitted: four rows of line plots showing Hamming Loss, Ranking Loss, One-Error and Average Precision versus label rate for GSMM, MLKNN, BR and ML-LGC on the Scene, Emotion and Yeast datasets.]

a) Scene   b) Emotion   c) Yeast
Fig. 1. Results on a) Scene; b) Emotion; c) Yeast classification tasks with different label rates. Except for Average Precision, the lower the value, the better the performance. Along with the curves, we also plot the mean ± std at each point for different random dataset partitions.

