
Proceedings of the Fifth International Conference on Machine Learning and Cybernetics, Dalian, 13-16 August 2006

A CLASS-INCREMENTAL LEARNING METHOD FOR MULTI-CLASS SUPPORT VECTOR MACHINES IN TEXT CLASSIFICATION

BO-FENG ZHANG 1, JIN-SHU SU 1, XIN XU 2
1 School of Computer, National University of Defense Technology, Changsha 410073, China
2 School of Mechatronics Engineering and Automation, National University of Defense Technology, Changsha 410073, China
E-MAIL: [email protected], [email protected], [email protected]

Abstract:

To solve multi-class problems of support vector machines (SVM) more efficiently, this paper proposes a novel framework called class-incremental learning (CIL). CIL consists of two phases, incremental feature selection and incremental training, for updating the knowledge of existing SVM classifiers in text classification when new classes are added to the system. CIL reuses the old models of the classifier and learns only one binary sub-classifier, together with an additional phase of feature selection, whenever a new class arrives. In the testing phase, the current classifier is applied to the vectors' projections onto the relevant sub-spaces. CIL can serve as a flexible approach for any binary classification algorithm in text classification. Our experiments show that CIL-based SVM was not only substantially faster in training time than popular batch SVM learning methods such as 1-against-rest, 1-against-1 and divide-by-2, but also nearly matched the best of them in effectiveness.

Keywords:

Machine learning; class-incremental learning; support vector machines; text classification; feature selection; Internet information filtering

1. Introduction

With the wide spread of Internet applications, automated classification and filtering of network information based on machine learning techniques has become an important research topic in recent years, since the availability of digital text documents is increasing dramatically. The applications of Internet information classification and filtering technology range from personal information service agents to spam mail filtering. Automated text classification is defined as the task of assigning predefined class labels to text documents by learning from a set of training samples to construct a classifier or filtering model. The advantages of this approach include accuracy comparable to that achieved by human experts, and a considerable saving in expert labor, since no intervention from either knowledge engineers or domain experts is needed for the construction of the classifier or for its porting to a different set of categories.

Until now, much work has been done on applying machine-learning methods to automated text classification, including various supervised learning algorithms such as kNN, decision trees, Naïve Bayes, Rocchio, neural networks and support vector machines (SVM). Among these methods, SVM has obtained several state-of-the-art results in the practice of text classification [1].

Being a learning approach implementing the principle of Structural Risk Minimization (SRM), SVM has become one of the most powerful tools and main techniques in the area of machine learning. Vapnik originally proposed SVM for binary classification, where a hyper-plane that maximizes the margin between two classes is constructed [2]. However, many classification problems, especially text classification, involve many more classes, so multi-class classifiers constructed by combining several binary classifiers are often applied. The three most popular approaches to multi-class SVM are introduced below [3]; a code sketch of how such a decomposition is assembled follows the list.

1) 1-against-rest. For N different classes, taking each one in turn as positive and the rest as negative, we get N binary classifiers in all. Each class is supported by the classifier that takes its instances as positive.
2) 1-against-1. We construct N(N-1)/2 classifiers using all binary pair-wise choices of the N classes, and ensemble them by a voting method.
3) Divide-by-2. Starting from the whole label set, we hierarchically build a binary SVM classifier to divide the current label set into two subsets, until every subset consists of only one class label. These N-1 classifiers then form a decision tree.
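As an illustration of the first decomposition, the sketch below assembles a 1-against-rest multi-class classifier from independent binary SVMs. This is a minimal sketch, not the paper's code; it assumes scikit-learn's LinearSVC as the binary learner, a NumPy matrix X of document vectors, and a label array y.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_rest(X, y, classes):
    """Train one binary SVM per class: its class as positive, all others negative."""
    models = {}
    for c in classes:
        binary_labels = np.where(y == c, 1, -1)  # class c vs. rest
        models[c] = LinearSVC().fit(X, binary_labels)
    return models

def predict_one_vs_rest(models, X):
    """Assign each sample to the class whose binary SVM gives the largest margin."""
    scores = np.column_stack([m.decision_function(X) for m in models.values()])
    class_list = list(models.keys())
    return [class_list[i] for i in np.argmax(scores, axis=1)]
```

The 1-against-1 and divide-by-2 schemes differ only in how the binary problems are enumerated and how their votes or tree decisions are combined.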


Nevertheless, batch training of the whole SVM model is both memory- and time-consuming. When new learning samples or classes are added, the whole model of the classifier must be relearned. In such situations, it is necessary to update an existing classifier in an incremental fashion to accommodate new data without compromising classification performance on previous data. On the other hand, many applications involving massive data sets are emerging. In such settings SVM, which suffers from large memory requirements and long CPU times when trained in batch mode on large data sets, cannot deal with them effectively. Incremental learning techniques are one possible solution in both settings.

In this paper, we present a new strategy for updating SVM classifiers in an incremental manner when new classes are added. When a new class arrives, the proposed method takes all the original classes together as one new negative super class set against it, and applies the basic binary training method to them to obtain a higher-level classifier. The feature selection step is also adjusted accordingly; that is, we use a different subset of the features at each level. The method offers a flexible choice for the originally trained classes and effectively reduces the training time when new classes are added, without significantly decreasing classification accuracy.

The rest of the paper is organized as follows. In Section 2, we describe how SVM classifiers are trained incrementally when new samples are added and used in text classification. Section 3 presents the principles and algorithms of our class-incremental training framework. Experimental results on real web text data are given in Section 4. Finally, some conclusions are drawn in Section 5.

2. Related work

In text classification, a feature selection phase is needed before a text $d$ is represented by a vector [1]:

$$\mathbf{x} = [w_{d1}, w_{d2}, \ldots, w_{dn}],$$

where $w_{di}$ ($i = 1, 2, \ldots, n$) are the weights of the document features in text $d$ and $n$ is the number of document features. The Lagrangian dual of soft-margin support vector learning for $M$ examples can be formulated as

$$\max_{\alpha} \; \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$$
$$\text{s.t.} \quad 0 \le \alpha_i \le C \;\; (i = 1, \ldots, M), \qquad \sum_{i=1}^{M} \alpha_i y_i = 0,$$

where $K$ is the kernel function and $y_i = \pm 1$ are the labels of the positive and negative samples respectively [2]. The decision function of the binary SVM is

$$f(\mathbf{x}) = \operatorname{sgn}\left( \sum_{i=1}^{M} \alpha_i y_i K(\mathbf{x}, \mathbf{x}_i) + b \right).$$
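As a concrete illustration of the decision function above, the following sketch evaluates a trained binary SVM in pure NumPy. It is a minimal example assuming an RBF kernel for illustration (the experiments in Section 4 use a linear kernel); the arrays alphas, labels and support_vectors and the bias b are hypothetical outputs of any dual solver.

```python
import numpy as np

def rbf_kernel(x, xi, gamma=0.5):
    """K(x, x_i) = exp(-gamma * ||x - x_i||^2)."""
    return np.exp(-gamma * np.sum((x - xi) ** 2))

def svm_decision(x, alphas, labels, support_vectors, b, kernel=rbf_kernel):
    """f(x) = sgn(sum_i alpha_i * y_i * K(x, x_i) + b)."""
    s = sum(a * y * kernel(x, xi)
            for a, y, xi in zip(alphas, labels, support_vectors))
    return np.sign(s + b)
```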
Because SVM learning poses a global optimization problem, which implies re-learning the whole machine, only a few works have tackled the issue of incremental training of SVM when samples are gradually added. Incremental techniques have been developed to facilitate batch SVM learning over the very large data sets and stream data sets in widespread use in the SVM community. A procedure for exact "adiabatic" incremental learning of SVM classifiers, in a single pass through the data, was introduced in [4] and extended to larger-set increments in [5]. Incremental learning with support vector machines, proposed in [6], consists in learning new data after discarding all past examples except the support vectors. The proposed framework thus relies on the property that the support vectors summarize the data well, so it may give an approximate solution. A way to incrementally solve the global optimization problem in order to find the exact solution was proposed in [4]; furthermore, it allows "decremental" unlearning and efficient computation of leave-one-out estimations. Local incremental learning of SVM, or LISVM, proposed in [7], exploits the "locality" of Radial Basis Function kernels to update the current machine by considering only a subset of support candidates in the neighborhood of the input, and allows determining a relevant neighborhood size at each learning step.

However, little attention has been paid to updating multi-class SVM classifiers when the number of classes increases. When a new class is added to the classification system, the above methods cannot fully accommodate the situation and the old models become useless. As a result, retraining of the whole model is unavoidable. In this case, class-incrementability, as well as the reuse of old models, becomes an important issue.

3. Class-Incremental Learning for Multi-Class SVM

For an N-class label set denoted by {1, 2, ..., N}, a set containing N-1 binary SVM classifiers C_N = {S_2, ..., S_{N-1}, S_N} can be constructed accordingly, in which S_i is the binary classifier telling apart the instances belonging to any class of the set {i-1, ..., 1} from those of class i. When a newly arrived class N+1, unknown to the current classification system, is considered, we need not train a classifier for all N+1 classes from the beginning; we only construct a new binary classifier that takes all the samples from the known classes as negative and those of the new class as positive. In this way, which we call class-incremental learning (CIL), a new class or label is added to the old classification system incrementally. Throughout this paper, where no confusion arises, we use the class label i to denote either the class label itself or all the samples belonging to the class with label i, which we also call class i. A minimal code sketch of this construction follows.
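The sketch below shows the core CIL update step just described: when a new class arrives, a single binary SVM is trained with the new class's samples as positives and all previously seen samples as one negative super class, while the old sub-classifiers are kept untouched. It is an illustrative sketch, not the authors' implementation; LinearSVC again stands in for any binary trainer.

```python
import numpy as np
from sklearn.svm import LinearSVC

def add_class(old_X, new_X, sub_classifiers):
    """CIL update: train one binary sub-classifier S_{N+1} separating the
    new class (positive) from the super class of all old samples (negative)."""
    X = np.vstack([new_X, old_X])
    y = np.concatenate([np.ones(len(new_X)), -np.ones(len(old_X))])
    s_new = LinearSVC().fit(X, y)
    sub_classifiers.append(s_new)  # the old sub-classifiers are reused as-is
    return sub_classifiers
```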


3.1. Incremental Updating Method of Classifiers

The basic idea of CIL is to reuse the old models as much as possible. When a new class and its samples are added to the system, the ability of the old classifier to distinguish between the instances of the old classes is not affected at all. All that needs to be done is to tell apart the instances of the new class from those of the old classes.

Assume that we now have N-1 classes whose classification system is called classifier C_{N-1}. When we take a new N-th class into account, we only need to train a binary sub-classifier S_N to separate the instances of class N from the instances of all the other N-1 classes (referred to as the super class SC_{N-1}). Consequently, a positive decision on an instance in the testing phase indicates that it belongs to class N. Rejected instances are passed down to the classifier C_{N-1}. The classifier C_{N-1} and the sub-classifier S_N are then combined and referred to as classifier C_N. When class N+1 is considered, the same process is repeated. Figure 1 illustrates the relations between the data sets and sub-classifiers mentioned above in the training process.

[Figure 1. Relations between data sets and sub-classifiers in the process of incremental updating of the classifier: S_N separates class N from the nested super classes SC_{N-1} ⊃ SC_{N-2} ⊃ ...]

3.2. Incremental Feature Selection

Feature sets must be updated when new classes are added to the classification system. But if a feature selection phase unrelated to the old feature set is applied, the above updating method cannot work, because the vector spaces differ. We therefore apply a correspondingly adjusted incremental feature selection process before training.

We assume that the feature set of the classifier C_{N-1} is F_{N-1}. Before the training process, we apply a local feature selection on class N and the super class SC_{N-1} [1]. The union of the selected features and the set F_{N-1} is defined as F_N. Thus F_{N-1} ⊂ F_N, and the vector space of classifier C_{N-1}, called V_{N-1}, is a sub-space of that of C_N, called V_N. We train the binary SVM sub-classifier in the space V_N. In the testing phase, when a vector x in V_N is passed down to the classifier C_{N-1}, we use its projection onto V_{N-1}, denoted by x|V_{N-1}, in classifier C_{N-1}.

We can now summarize the whole learning process as follows:
Step 0: Apply an appropriate feature selection phase to get the feature set F_m, and train a multi-class SVM classifier C_m in vector space V_m for the m original classes by any batch method.
For each new class N (N > m):
Step 1: Apply the incremental feature selection phase to get the feature set F_N;
Step 2: Train a binary classifier for the super class SC_{N-1} and class N;
Step 3: Combine the binary classifier and the classifier C_{N-1} to get classifier C_N.

3.3. Testing

The process of the testing phase is shown in Figure 2. According to the above discussion, classifier C_N consists of a binary sub-classifier S_N and a lower-level classifier C_{N-1}. Every instance x is projected onto the vector space V_N, and at each level k the projection x|V_k is taken as a vector in space V_k. A positive decision after applying S_N to x implies that x belongs to class N; otherwise x is passed further down to the lower-level classifier, iteratively, until a label is returned. We summarize the testing phase as follows:

For classifier C_N and instance x, starting with k = N, until k = m:
Step 1: Apply classifier C_k to x|V_k;
Step 2: If x belongs to a class, return the label of that class; else set k = k-1, pass x|V_k down to C_k, and go to Step 1.

[Figure 2. Testing phase of the classifier by CIL: x|V_N is tested by S_N for class N, then x|V_{N-1} by S_{N-1} for class N-1, and so on down the cascade.]

A compact sketch tying together the incremental feature selection and the testing cascade follows.
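The sketch below is one way to realize Sections 3.1-3.3 under simplifying assumptions: documents are represented as feature-to-weight dictionaries, LinearSVC stands in for the binary SVM trainer, and the locally selected features for each new class are passed in rather than computed. It illustrates the feature-set union F_N = F_{N-1} ∪ (local features), the projection x|V_k, and the top-down testing cascade; it is not the authors' implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC

class CILClassifier:
    """Sketch of a CIL cascade: one binary sub-classifier per added class,
    each trained in its own (growing) feature space V_k.
    Documents are dicts mapping feature -> weight (e.g. TF-IDF)."""

    def __init__(self, base_features, base_classifier):
        self.base_features = list(base_features)  # F_m of the batch classifier C_m
        self.base = base_classifier               # batch classifier for the m original classes
        self.levels = []                          # [(label, F_k, S_k)] in arrival order

    def _project(self, docs, features):
        """Project documents onto the sub-space spanned by `features` (x|V_k)."""
        return np.array([[d.get(f, 0.0) for f in features] for d in docs])

    def add_class(self, label, pos_docs, neg_docs, local_features):
        """Steps 1-3: F_N = F_{N-1} union the locally selected features, then
        train S_N with the new class as positive and SC_{N-1} as negative."""
        prev = self.levels[-1][1] if self.levels else self.base_features
        feats = prev + [f for f in local_features if f not in prev]
        X = self._project(pos_docs + neg_docs, feats)
        y = np.concatenate([np.ones(len(pos_docs)), -np.ones(len(neg_docs))])
        self.levels.append((label, feats, LinearSVC().fit(X, y)))

    def predict(self, doc):
        """Testing cascade: try the newest class first, passing projections down."""
        for label, feats, s_k in reversed(self.levels):
            if s_k.predict(self._project([doc], feats))[0] > 0:
                return label
        return self.base.predict(self._project([doc], self.base_features))[0]
```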


4. Experimental Results

4.1. Data Set and Experiment Setup

The Reuters-21578 [8] test collection for text classification, drawn from the Reuters newswire in 1987, contains 135 categories. We use a subset of 20 categories and 2,706 documents in total; 1,600 documents are used as training samples and the remaining 1,106 as testing instances.

We use a linear kernel for SVM and apply Chi-Square feature selection to choose 100 features for each class considered in the binary classifiers each time.

We ran the experiments on a computer with a single Intel P4 CPU at 2.4 GHz and 512 MB of DDR266 RAM. The implementation of the SVM algorithm is from LIBSVM [9], which is widely used in the literature.

4.2. Performance Measure

We use the F1 measure introduced in [1] as the performance measure. This measure combines recall and precision in the following way:

$$\text{Recall} = \frac{\text{number of correct positive predictions}}{\text{number of positive examples}}$$

$$\text{Precision} = \frac{\text{number of correct positive predictions}}{\text{number of positive predictions}}$$

$$F_1 = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Precision} + \text{Recall}}$$

There are micro- and macro-averages of the F1 scores to evaluate the performance of a classifier on the whole data set:

Micro-F1 = F1 computed over all categories and documents
Macro-F1 = average of the within-category F1 values

In the experimental results we choose Micro-F1 as our measure because it emphasizes the performance of the system on common categories. A short worked computation of Micro-F1 is given after this section.

4.3. Results and Analysis

In Figure 3, increasing the number of classes from 4 to 20, we compare the learning time of incremental updating with that of batch updating. Our CIL method is clearly more efficient than the batch learning methods. Consider the arrival of a new class N: batch learning needs N-1 trainings of a binary classifier for 1-against-rest and divide-by-2, and N(N-1)/2 trainings for the 1-against-1 method, whereas CIL, owing to its reuse of the old models, needs only one. All the methods need one feature selection phase. The CPU time in Figure 3 is in seconds.

[Figure 3. Comparison of the learning time]

In Figure 4, we compare the Micro-F1 value of our CIL method with the other batch learning methods for various numbers of classes. The effectiveness of all the methods decreases slightly as the number of classes grows. Our method shows only a small disadvantage against the 1-against-1 method on the whole, and performs better than the 1-against-rest and divide-by-2 methods most of the time.

[Figure 4. Comparison of the Micro-F1]
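For clarity, the sketch below computes the micro-averaged F1 used above by pooling the per-category counts before taking precision and recall; it is a minimal illustration of the measure, not the authors' evaluation code, and the counts in the example are made up.

```python
def micro_f1(per_category_counts):
    """per_category_counts: iterable of (true_positives, false_positives,
    false_negatives) tuples, one per category; counts are pooled first."""
    tp = sum(c[0] for c in per_category_counts)
    fp = sum(c[1] for c in per_category_counts)
    fn = sum(c[2] for c in per_category_counts)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example: two categories -> micro-F1 pools the counts before averaging.
print(micro_f1([(80, 10, 20), (30, 5, 10)]))  # ~0.83
```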


5. Conclusions and Future Work

In this paper, we have proposed a novel method called CIL as a solution for updating multi-class support vector machine classifiers when new classes join the classification system. CIL consists of two phases in the learning process: incremental feature selection and incremental training in the current vector space. Because it is designed to reuse old models, CIL can serve as an incremental approach not only for SVM but also as an effective framework for any other binary classification algorithm. Our experiments have shown that the training time of CIL was substantially faster than that of the three batch learning methods, while its effectiveness nearly matched the best of them.

Since CIL trains a binary classifier for class N against the super class SC_{N-1}, there will sometimes be far more negative samples than positive samples, especially when N is large. This poses the problem of class imbalance [10]. In our method, the positives are prone to being submerged by the negatives. This is one possible reason for the decrease in the effectiveness of CIL and the other methods as N increases. Our future work is to take the class imbalance problem into account and build a classifier adaptive to the class distributions.

Another direction is to study the influence of different feature selection algorithms, and of feature set sizes, on the performance of the CIL method, because we have only adopted Chi-Square selection of a given number of features in this paper. Some other feature selection metrics may be more suitable for CIL.

Acknowledgements

This paper is supported by the National Research Foundation for the Doctoral Program of Higher Education of China under Grant No. 20049998027, and the Natural Science Foundation of China under Grant No. 90604006.

References

[1] Sebastiani F., "Machine learning in automated text categorization", ACM Computing Surveys, Vol. 34, No. 1, pp. 1-47, 2002.
[2] Joachims T., "Text categorization with support vector machines: learning with many relevant features", Proceedings of ECML-98, Chemnitz, Germany, pp. 137-142, 1998.
[3] Vural V., Dy J. G., "A hierarchical method for multi-class support vector machines", Proceedings of ICML-04, Banff, pp. 105-112, 2004.
[4] Cauwenberghs G., Poggio T., "Incremental and decremental support vector machine learning", NIPS 2000, pp. 409-415, MIT Press, 2000.
[5] Fine S., Scheinberg K., "Incremental learning and selective sampling via parametric optimization framework for SVM", Advances in Neural Information Processing Systems 14, pp. 705-711, MIT Press, 2002.
[6] Syed N. A., Liu H., Sung K., "Incremental learning with support vector machines", Proceedings of IJCAI-99, Stockholm, Sweden, 1999.
[7] Ralaivola L., d'Alché-Buc F., "Incremental support vector machine learning: a local approach", in Dorffner G., Bischof H., Hornik K. (Eds.), Lecture Notes in Computer Science 2130, pp. 322-330, Springer, Vienna, Austria, 2001.
[8] Lewis D. D., "Reuters-21578 text categorization test collection, Distribution 1.0", available at http://www.daviddlewis.com/resources/testcollections/reuters21578/readme.txt
[9] Chang C., Lin C., "LIBSVM: a library for support vector machines", software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
[10] Castillo M. D., Serrano J. I., "A multi-strategy approach for digital text categorization from imbalanced documents", ACM SIGKDD Explorations Newsletter, Vol. 6, No. 1, pp. 70-79, 2004.
