
May 12, 2010 17:39 RPS : Trim Size: 8.50in x 11.00in (IEEE) icfcc2010-lineup˙vol-3: F1649

A New Semi-Supervised Support Vector Machine Learning Algorithm Based on Active Learning

Li Cunhe
College of Computer and Communication Engineering
China University of Petroleum
Dongying, China
[email protected]

Wu Chenggang
College of Computer and Communication Engineering
China University of Petroleum
Dongying, China
[email protected]

Abstract—The semi-supervised support vector machine is an extension of the standard support vector machine to machine learning problems arising in real life. However, existing semi-supervised support vector machine algorithms have drawbacks such as slow training speed and low accuracy. This paper presents a semi-supervised support vector machine learning algorithm based on active learning, which trains an early learner with a small amount of labeled data, selects the best training samples for learning through active learning, and reduces the learning cost by deleting non-support vectors. Simulation experiments show that the algorithm achieves a good learning effect at a lower learning cost.

Keywords—Semi-Supervised Support Vector Machine; Active Learning; Chinese Webpage Classification

I. INTRODUCTION

Currently, the support vector machine (SVM) is a very popular research direction in machine learning: an effective learning machine whose value has been confirmed in experiments and which has been applied successfully in many areas. However, the traditional SVM mainly deals with supervised learning problems, that is, it needs labeled sample data to train classifiers, while large amounts of real-life data are unlabeled and labeling them is time-consuming. This has pushed the study of machine learning into a new phase: semi-supervised learning, which combines labeled and unlabeled data, is becoming the new hot spot. The broad application prospects that semi-supervised learning opens up in pattern recognition and artificial intelligence systems are the main reason this research is heating up.

This paper studies the characteristics of the support vector machine and explores a practical way to improve classifier performance using a small number of labeled samples and a large number of unlabeled samples. The paper is organized as follows. Section II describes the related work, including the mathematical model of SVM, the concept of semi-supervised learning and its typical algorithms, and the concept and significance of active learning. Section III discusses the status quo of the semi-supervised support vector machine (S3VM) and the features and implementation steps of the new algorithm. Section IV analyzes the experimental results of the algorithm, followed by the conclusions of this paper in Section V.

II. RELATED WORK

A. Description of SVM

SVM is a method in which a non-linear problem in a low-dimensional space is mapped into a high-dimensional space so that simple linear classification techniques can handle it, and it is well suited to small-sample learning [1]. SVM effectively overcomes the over-fitting problems of traditional methods and the local minima commonly seen in neural network learning, so it has strong generalization ability.

By introducing a kernel function and the associated dot-product operation, SVM ensures that mapping the non-linear problem into a high-dimensional space does not increase the computational complexity, and it also effectively overcomes the curse of dimensionality. Training an SVM can be summed up as solving a quadratic programming problem. For a given labeled training data set

D = {(x_1, y_1), ..., (x_l, y_l)}, x_i ∈ R^n, y_i ∈ {+1, -1}    (1)

consider the general nonlinear situation: defining ξ_i as slack variables, the mathematical expression of the problem is

min Φ(ω, ξ) = (1/2)||ω||^2 + C Σ_{i=1}^{l} ξ_i
s.t. y_i(ω · x_i + b) ≥ 1 - ξ_i, ξ_i ≥ 0, i = 1, 2, ..., l    (2)

The SVM model is a powerful analytical tool for solving a variety of pattern recognition, function simulation, and forecasting problems. However, SVM can only analyze problems with labeled sample data, while large amounts of real-life data are unlabeled, so the scope of application of SVM is limited. At present, an effective way to solve this problem is to use semi-supervised learning.

B. Description of S3VM

Semi-supervised learning is an integrated way of learning which uses a small amount of labeled data to obtain an initial classifier, and then uses a large number of unlabeled data to further improve the performance of the initial classifier until the desired precision of learning is eventually achieved.
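The quadratic program in Eq. (2) is normally solved by a dedicated QP or SMO solver; purely as an illustration of the soft-margin objective, the sketch below minimizes the equivalent hinge-loss form by subgradient descent. The toy data set, learning rate, and epoch count are our own assumptions, not part of the paper.

```python
# Minimal soft-margin linear SVM trained by subgradient descent on the
# hinge-loss form of Eq. (2):
#   min (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))
# Toy data and hyperparameters are illustrative assumptions.

def train_linear_svm(data, C=1.0, lr=0.01, epochs=200):
    dim = len(data[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:
                # Subgradient of the regularizer plus the active hinge term.
                w = [wj - lr * (wj - C * y * xj) for wj, xj in zip(w, x)]
                b += lr * C * y
            else:
                # Only the regularizer contributes here.
                w = [wj - lr * wj for wj in w]
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Linearly separable toy set: class +1 around (2, 2), class -1 around (-2, -2).
data = [([2.0, 2.0], 1), ([2.5, 1.5], 1), ([1.5, 2.5], 1),
        ([-2.0, -2.0], -1), ([-2.5, -1.5], -1), ([-1.5, -2.5], -1)]
w, b = train_linear_svm(data)
```

On this separable toy set the learned hyperplane classifies all six points correctly; a real S3VM implementation would instead hand Eq. (2) to a QP solver such as the LIBSVM package used later in the paper.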

978-1-4244-5824-0/$26.00 © 2010 IEEE V3-638

Although semi-supervised learning is a new field of research, it has already obtained some initial results. The typical semi-supervised learning method applied to SVM is the transductive support vector machine (TSVM) learning method [2]. The principle of the TSVM algorithm can be expressed as follows: given a group of independent and identically distributed labeled data

D = {(x_1, y_1), ..., (x_l, y_l)}, x_i ∈ R^n, y_i ∈ {+1, -1}    (3)

and an unlabeled data set

x'_1, x'_2, ..., x'_k    (4)

the mathematical formulation of TSVM can in general be described as

min over (y'_1, ..., y'_k, ω, b, ξ_1, ..., ξ_l, ξ'_1, ..., ξ'_k):
    (1/2)||ω||^2 + C Σ_{i=1}^{l} ξ_i + D Σ_{j=1}^{k} ξ'_j
s.t. y_i(ω · x_i + b) ≥ 1 - ξ_i, ξ_i ≥ 0, i = 1, 2, ..., l
     y'_j(ω · x'_j + b) ≥ 1 - ξ'_j, ξ'_j ≥ 0, j = 1, 2, ..., k    (5)

where C and D are parameters, D being the impact factor of the unlabeled samples.

C. Active Learning

Active learning is a machine learning training process that dynamically selects samples from the candidates according to a strategy during training [3]. We know from the previous section that unlabeled training samples are used in TSVM, but we cannot conclude that using more samples for training necessarily gives better learning performance. In practical applications, the unlabeled samples used by semi-supervised learning methods may come from different environments; their distribution characteristics are complex and unknown, and they most likely contain noise. Active learning can make use of existing knowledge and actively select the samples most likely to help solve the problem, effectively reducing the number of samples that must be assessed.

III. THE IMPROVED ALGORITHM BASED ON ACTIVE LEARNING

A classifier trained with the TSVM algorithm improves more than a single classifier, but the main drawback of TSVM is that it needs an advance estimate of the ratio of positive to negative training samples and of the number of positive samples; an incorrect estimate results in poor performance. For this issue, Chen Yisong et al. proposed an improved algorithm, PTSVM [4]. Although its pairwise annotation and label resetting improve the performance of TSVM, the algorithm is only suited to a small number of unlabeled samples: when there are more samples, the frequent labeling and label replacement causes a rapid increase in the complexity of the algorithm, making PTSVM far more expensive than the general TSVM algorithm. However, in most practical situations the unlabeled samples far outnumber the labeled samples, so the algorithm needs to be improved. Against the problems that TSVM needs an accurate prior estimate of the distribution of positive and negative sample rates and that PTSVM needs repeated calculation and training, which raises the computational cost, Zhong Qingliu et al. proposed a semi-supervised learning algorithm based on SVM and a gradual approach; however, this algorithm is not combined with active learning, so it cannot choose the best samples for study [5]. Chen Yaodong et al. combined semi-supervised learning and active learning (referred to in this paper as CSSA), but all newly labeled samples are added to the training set, so the training set keeps growing, and the more training samples there are, the slower the training speed. Therefore, inspired by these two papers and their shortcomings, this paper presents a new semi-supervised support vector machine learning algorithm based on active learning (AS3VM).

Figure 1. The sketch map of the edge sample set: the positive class, the negative class, the separating hyperplane, and the distance of samples to the hyperplane.

In semi-supervised learning, since the initial labeled samples are relatively few, the initial classifier trained on them may classify poorly and mislabel some unlabeled samples. If we use these wrongly annotated samples for training, the classifier training is clearly ineffective. In addition, because we use an SVM classifier, not all samples matter during training: only a small number of training samples, called support vectors, do, and these support vectors are located geometrically near the hyperplane [6] (see Figure 1). To reduce the number of training samples and improve learning efficiency, we should try to select for learning those samples that may become support vectors. This is the active learning strategy of this paper for choosing samples: we select the samples located geometrically near the hyperplane as training samples. In S3VM learning, the final goal is to obtain a better classifier and to annotate the unlabeled samples with the correct category. Thus, to judge whether the S3VM training and classification result is good or bad, we can examine the final labels of the samples and the correctness of their categories. However, in two-class classification, the vast majority of samples can belong to only one of the two categories.
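The sample-selection strategy described above (keeping the unlabeled samples that lie nearest the current hyperplane, since those are the likeliest future support vectors) can be sketched as follows. The linear decision function, the candidate pool, and the selection size are illustrative assumptions, not the paper's actual implementation.

```python
# Select the unlabeled samples nearest the current hyperplane w.x + b = 0,
# i.e. those with the smallest |w.x + b| -- the likeliest candidates to
# become support vectors. The model and pool below are toy assumptions.

def decision_value(w, b, x):
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def select_candidates(unlabeled, w, b, n):
    # Rank by the margin proxy |w.x + b| and keep the n closest samples.
    ranked = sorted(unlabeled, key=lambda x: abs(decision_value(w, b, x)))
    return ranked[:n]

# Hypothetical current classifier and unlabeled pool.
w, b = [1.0, 1.0], 0.0
pool = [[3.0, 3.0], [0.2, -0.1], [-4.0, -4.0], [-0.3, 0.2], [2.0, -1.5]]
chosen = select_candidates(pool, w, b, 2)
```

Here the two points hugging the hyperplane are kept while the confidently classified points far from it are discarded, which is exactly what shrinks the training set in the algorithm below.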

[Volume 3] 2010 2nd International Conference on Future Computer and Communication V3-639

Because the boundaries between some categories, such as the news and sports categories, differ only slightly, a sample may sometimes belong to two categories, so we can define a threshold T over the labels of the last two rounds and all samples that need labeling to decide when to end training:

T = A / B × 100%    (6)

where A is the number of samples whose labels differ between the last two rounds, and B is the total number of samples that should be labeled.

The value of T reflects the stability of the most recently assigned sample categories. The smaller the value of T, the greater the consistency and the more credible the classification results of the classifier, and we can end the training. Conversely, the greater the value of T, the more the last two classification results vary and the less credible the classification results, so we should continue training. Therefore, the value of T can be used as the condition that determines when to end training. Of course, from the above formula we can see that the classifier is best when T is 0; but since the boundaries between some categories differ only slightly, a sample may sometimes belong to two categories, and we should also reduce the training time, so we can end training once T reaches a recognized value.

In summary, the main steps of the training algorithm in this paper are:

Step 1: Select an appropriate kernel function type and related parameters, then train on the initially labeled samples in the initial S3VM learning to obtain the initial classifier.

Step 2: Label all unlabeled samples with the obtained classifier, use the active learning strategy mentioned earlier to select from them the samples that may become support vectors, and combine them with the initial labeled training set to form a new training set.

Step 3: Train on the new training set to obtain a new classifier, then delete the non-support vectors from the new training set, yielding a reduced training set.

Step 4: Label all the initially unlabeled samples with the obtained classifier and calculate the value of T. If the value of T meets the scheduled requirement, go to Step 5; otherwise, use the active learning strategy mentioned earlier to select from all samples those that may become support vectors, combine them with the last training set to form a new training set, and repeat Steps 3 and 4.

Step 5: End the training and output the results.

IV. EXPERIMENT AND ARGUMENTATION

To test the performance of the AS3VM algorithm, we apply it to Chinese webpage classification and compare its performance with the TSVM, PTSVM, and CSSA algorithms.

In the experiment, the Chinese training-text webpages come from the Sina web portal. In the first experiment, there are 500 subject webpages belonging to the health and science-and-technology categories, 250 pages each. We select 50 pages (25 from each category) to label as the initial labeled samples, select 350 pages (175 from each) of the remaining pages as the unlabeled samples, and label the rest of the pages as the test set. In the second experiment, there are 1000 subject webpages, again belonging to health and science-and-technology with 500 pages each. We select 50 pages (25 from each) to label as the initial labeled samples, select 850 pages (425 from each) of the remaining pages as the unlabeled samples, and then label the rest as the test set. In the third experiment, there are also 1000 subject webpages, belonging to health, science-and-technology, and others; health and science-and-technology have 490 pages each, and others has 20 pages. We select 50 pages (25 each from health and science-and-technology) to label as the initial labeled samples, select 850 pages (425 each from health and science-and-technology plus the 20 in others) of the remaining pages as the unlabeled samples, and then label the rest as the test set. Of course, we must preprocess the web pages and transform them into vector form, then train on the samples with each of the above-mentioned four algorithms. All of the algorithms are run in the Libsvm-mat-2.89-3 package [7]. The experimental platform is a Pentium 4 with 1.0 GB RAM running the Windows XP operating system. In the experiment, the kernel is the RBF function, C = 1, and T = 5%. The results of the experiment are shown in Table I.

TABLE I. RESULTS OF THE FOUR KINDS OF TRAINING ALGORITHMS

Training     Training Time (s)    Precision of the Test (%)
Algorithms    I      II     III     I      II     III
TSVM         183    917    1219    67     76     54
PTSVM         78    1734   1954    70     79     59
CSSA         127    1103   1038    75     82     65
AS3VM        103    847     983    72     89     78

Here,

Precision of the Test = A / B × 100%    (7)

where A is the number of documents classified correctly and B is the number of documents in the category.

Figure 2. Training time of the four kinds of training algorithms.
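The stopping criterion of Eq. (6), with T = 5% in these experiments, can be sketched as follows; the example label vectors are illustrative assumptions.

```python
# Compute the stopping threshold T of Eq. (6): the fraction of samples whose
# labels differ between the last two labeling rounds, as a percentage.
# The example label vectors are illustrative assumptions.

def stability_t(prev_labels, curr_labels):
    assert len(prev_labels) == len(curr_labels)
    changed = sum(1 for p, c in zip(prev_labels, curr_labels) if p != c)
    return changed / len(curr_labels) * 100.0

def should_stop(prev_labels, curr_labels, threshold=5.0):
    # Training ends once the labeling has stabilized below the threshold.
    return stability_t(prev_labels, curr_labels) <= threshold

prev = [1, 1, -1, -1, 1, -1, 1, 1, -1, 1] * 10   # 100 samples
curr = list(prev)
curr[0] = -curr[0]                               # one flipped label -> T = 1%
```

With one label changed out of 100, T = 1% ≤ 5% and training would end; had the two rounds disagreed widely, the loop of Steps 3 and 4 would continue.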


Figure 3. Precision of the test of the four kinds of training algorithms.

From Figure 2, we can see that the four training algorithms take almost the same time when the training set is small; otherwise, the PTSVM algorithm is the most time-consuming, and the AS3VM algorithm is somewhat better than the rest of the training algorithms. From Figure 3, the performance of all four training algorithms in the second experiment is better than in the first, and also better than in the third; we can also see that the performance of the AS3VM algorithm is the best. So, overall, according to the experimental results in the above table and figures, the performance of the AS3VM algorithm is better than that of the other three algorithms; specifically, the more training samples and the less noise among them, the better the classification results. This is because the AS3VM algorithm shrinks the training set by preselecting the support vectors, which form a small fraction of the full training set, and thus effectively reduces the training time; in addition, choosing the best samples, re-training each round, and ending training with the threshold T reduce the accumulated error and improve the classification accuracy rate.

V. CONCLUSION

This paper presents a new semi-supervised support vector machine learning algorithm based on active learning, which effectively overcomes the slow training speed and low training accuracy of the traditional TSVM algorithm, adapts better to training samples with different distributions, and has practical significance worth promoting. Future research will focus on how to choose better active learning strategies, how to make the algorithm adapt better to different distributions of training samples, and how to further improve TSVM classification performance, especially how to train on noisy samples to meet the higher requirements of the classification problem.

ACKNOWLEDGEMENT

This work is supported by the Fundamental Research Funds for the Central Universities (No. 09CX04031A). The authors are grateful to the anonymous reviewers who made constructive comments.

REFERENCES

[1] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 2000.
[2] Zhao Yinggang, Chen Qi, He Qinming, A Transductive Learning Algorithm Based on Support Vector Machine, Journal of Southern Yangtze University (Natural Science Edition), 2006, 5(4): 441-444.
[3] Chen Yaodong, Wang Ting, Chen Huowang, Combining Semi-Supervised Learning and Active Learning for Shallow Semantic Parsing, Journal of Chinese Information Processing, 2008, 22(2): 70-75.
[4] Chen Yisong, Wang Guoping, Dong Shihang, A Progressive Transductive Inference Algorithm Based on Support Vector Machine, Journal of Software, 2003, 14(3): 451-460.
[5] Zhong Qingliu, Cai Zixing, Computer Engineering and Applications, 2006, 25: 19-22.
[6] Li Cunhe, Liu Kangwei, Wang Hongxia, The Incremental Learning Algorithm with Support Vector Machine Based on Hyperplane-Distance, Applied Intelligence, Spring 2009.
[7] Chang Chihchung, Lin Chihjen, LIBSVM: a library for support vector machines, 2009, http://www.csie.ntu.edu.tw/~cjlin/libsvm/matlab/oldfiles/

