Machine Learning: A Review of Classification and Combining Techniques

DOI 10.1007/s10462-007-9052-3

© Springer Science+Business Media B.V. 2007
Abstract Supervised classification is one of the tasks most frequently carried out by so-called Intelligent Systems. Thus, a large number of techniques have been developed based on Artificial Intelligence (logic-based techniques, perceptron-based techniques) and Statistics (Bayesian networks, instance-based techniques). The goal of supervised learning is to build a concise model of the distribution of class labels in terms of predictor features. The resulting classifier is then used to assign class labels to the testing instances where the values of the predictor features are known, but the value of the class label is unknown. This paper describes various classification algorithms and a recent attempt to improve classification accuracy: ensembles of classifiers.
1 Introduction
There are several applications for Machine Learning (ML), the most significant of which is predictive data mining. Every instance in any dataset used by machine learning algorithms is represented using the same set of features. The features may be continuous, categorical or binary. If instances are given with known labels (the corresponding correct outputs) then the learning is called supervised, in contrast to unsupervised learning, where instances are unlabeled (Jain et al. 1999).
Numerous ML applications involve tasks that can be set up as supervised. In the present paper, we have concentrated on the techniques necessary to do this. In particular, this work is concerned with classification problems in which the output of instances admits only discrete, unordered values. We have limited our references to recent refereed journals, published books and conferences. A brief review of what ML includes can be found in (Dutton and Conroy 1996). De Mantaras and Armengol (1998) also presented a historical survey of logic and instance-based learning classifiers.
After a better understanding of the strengths and limitations of each method, the possibility of integrating two or more algorithms together to solve a problem should be investigated. The objective is to utilize the strengths of one method to complement the weaknesses of another. If we are only interested in the best possible classification accuracy, it might be difficult or impossible to find a single classifier that performs as well as a good ensemble of classifiers.

Our next section covers wide-ranging issues of supervised machine learning such as data pre-processing and feature selection. Logic-based learning techniques are described in Sect. 3, whereas perceptron-based techniques are analyzed in Sect. 4. Statistical techniques for ML are covered in Sect. 5. Section 6 deals with the newest supervised ML technique, Support Vector Machines (SVMs). In Sect. 7, a representative algorithm for each learning technique is compared on a number of datasets in order for researchers to have a baseline accuracy for new algorithms on these well-known datasets. Section 8 presents a recent attempt to improve classification accuracy, ensembles of classifiers. Finally, the last section concludes this work.
The first step is collecting the dataset. If a requisite expert is available, then s/he could suggest which fields (attributes, features) are the most informative. If not, then the simplest method is that of "brute-force," which means measuring everything available in the hope that the right (informative, relevant) features can be isolated. However, a dataset collected by the "brute-force" method is not directly suitable for induction. It contains in most cases noise and missing feature values, and therefore requires significant pre-processing (Zhang et al. 2002).
What can be wrong with the data? There is a hierarchy of problems that are often encountered in data preparation and pre-processing.
Table 1 Examples of how metadata can help in detecting data quality problems
• Max, min: values should not be outside the permissible range
• Variance, deviation: variance and deviation of statistical values should not be higher than a threshold
• Misspellings (feature values): sorting on values often brings misspelled values next to correct values
Variable-by-variable data cleansing is a filter approach for unlikely values (values that are suspicious because of their relationship to a specific probability distribution; for example, a value of 10 is suspicious under a normal distribution with a mean of 5 and a standard deviation of 3). Table 1 shows examples of how such metadata can help in detecting a number of possible data quality problems.
Hodge and Austin (2004) have recently introduced a survey of contemporary techniques for outlier (noise) detection. These researchers have identified the techniques' advantages and disadvantages. While the focus above is on analytical methods, visualization can also often be a powerful tool. Visualization is particularly good at picking out bad values that occur in a regular pattern. However, care is needed in distinguishing between natural variability and the presence of bad values, since data is often more dispersed than we think.
Instance selection is not only used to handle noise but to cope with the infeasibility of learning from very large datasets. Instance selection in these datasets is an optimization problem that attempts to maintain the mining quality while minimizing the sample size (Liu and Motoda 2001). It reduces the data and enables a data-mining algorithm to function effectively with very large datasets. There are a variety of procedures for sampling instances from a large dataset; the most well-known are described in (Reinartz 2002).
Incomplete data is an unavoidable problem in dealing with most real-world data sources. Generally, there are some important factors to be taken into account when processing unknown feature values. One of the most important ones is the source of "unknown-ness": (i) a value is missing because it was forgotten or lost; (ii) a certain feature is not applicable for a given instance (e.g., it does not exist for a given instance); (iii) for a given observation, the designer of a training set does not care about the value of a certain feature (so-called "don't-care values").

Depending on the circumstances, researchers have a number of methods to choose from to handle missing data (Batista and Monard 2003); a brief illustration of two of these strategies follows the list:
• Method of ignoring instances with unknown feature values: This is the simplest method: ignore any instance that has at least one unknown feature value.
• Most common feature value: The value of the feature that occurs most often is selected to be the value for all the unknown values of the feature.
• Most common feature value in class: This time the value of the feature which occurs most commonly within the same class is selected to be the value for all the unknown values of the feature.
• Mean substitution: Researchers substitute a feature's mean value (computed from available cases) to fill in missing data values in the remaining cases. A more sophisticated solution than using the "general" feature mean is to use the feature mean of all samples belonging to the same class to fill in the missing value.
• Regression or classification methods: A regression or classification model based on the complete case data for a given feature is developed. This model treats the feature as the outcome and uses the other features as predictors.
• Hot deck imputation: The most similar case to the case with a missing value is identified, and that case's value is substituted for the missing value.
• Method of treating missing feature values as special values: "Unknown" itself is treated as a new value for the features that contain missing values.
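As an illustration, the following minimal sketch applies two of the strategies above (most common feature value and mean substitution) using pandas; the dataset and column names are hypothetical.

import pandas as pd

# Hypothetical training data with missing values.
data = pd.DataFrame({
    "colour": ["red", "red", None, "blue"],   # categorical feature
    "weight": [4.1, None, 3.8, 5.0],          # numerical feature
    "label":  ["pos", "pos", "neg", "neg"],
})

# Most common feature value: fill categorical gaps with the mode.
data["colour"] = data["colour"].fillna(data["colour"].mode()[0])

# Mean substitution: fill numerical gaps with the feature mean
# (a class-conditional mean could be used instead, as described above).
data["weight"] = data["weight"].fillna(data["weight"].mean())

print(data)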
Feature subset selection is the process of identifying and removing as many irrelevant and redundant features as possible (Yu and Liu 2004). This reduces the dimensionality of the data and enables data mining algorithms to operate faster and more effectively. Generally, features are characterized as:

• Relevant: These are features that have an influence on the output. Their role cannot be assumed by the rest.
• Irrelevant: Irrelevant features are defined as those features not having any influence on the output. Their values could be generated at random without influencing the output.
• Redundant: A redundancy exists whenever a feature can take the role of another (perhaps the simplest way to incur model redundancy).
Feature selection algorithms in general have two components: a selection algorithm that generates proposed subsets of features and attempts to find an optimal subset, and an evaluation algorithm that determines how "good" a proposed feature subset is. However, without a suitable stopping criterion, the feature selection process may run repeatedly through the space of subsets, taking up valuable computational time. Stopping criteria might be: (i) whether addition (or deletion) of any feature does not produce a better subset; and (ii) whether an optimal subset according to some evaluation function is obtained.
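A minimal sketch of such a wrapper-style selection loop with stopping criterion (i) is given below; the evaluate function (for example, the cross-validated accuracy of some classifier on the candidate subset) is assumed to be supplied by the user.

def forward_selection(all_features, evaluate):
    """Greedy forward selection: add the feature that most improves the
    evaluation score; stop when no addition produces a better subset."""
    selected, best_score = [], float("-inf")
    while True:
        best_candidate = None
        for f in all_features:
            if f in selected:
                continue
            score = evaluate(selected + [f])   # e.g. cross-validated accuracy
            if score > best_score:
                best_score, best_candidate = score, f
        if best_candidate is None:             # stopping criterion (i)
            return selected
        selected.append(best_candidate)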
The fact that many features depend on one another often unduly influences the accuracy of supervised ML classification models. This problem can be addressed by constructing new features from the basic feature set (Markovitch and Rosenstein 2002). This technique is called feature construction/transformation. These newly generated features may lead to the creation of more concise and accurate classifiers. In addition, the discovery of meaningful features contributes to better comprehensibility of the produced classifier, and a better understanding of the learned concept.
The choice of which specific learning algorithm to use is a critical step. The classifier's evaluation is most often based on prediction accuracy (the percentage of correct predictions out of the total number of predictions). There are at least three techniques which are used to calculate a classifier's accuracy. One technique is to split the training set by using two-thirds for training and the other third for estimating performance. In another technique, known as cross-validation, the training set is divided into mutually exclusive and equal-sized subsets and, for each subset, the classifier is trained on the union of all the other subsets.
The average of the error rate of each subset is therefore an estimate of the error rate of the classifier. Leave-one-out validation is a special case of cross-validation in which every test subset consists of a single instance. This type of validation is, of course, more expensive computationally, but useful when the most accurate estimate of a classifier's error rate is required.
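A minimal sketch of the cross-validation estimate described above; the train and test functions (returning a fitted model and its error rate on held-out data, respectively) are assumed to be supplied by the user.

import random

def cross_validation_error(instances, labels, train, test, k=10, seed=0):
    """Split the data into k mutually exclusive folds; for each fold, train on
    the union of the other folds and test on the held-out fold; return the
    average error rate over the k folds."""
    indices = list(range(len(instances)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    errors = []
    for fold in folds:
        fold_set = set(fold)
        train_idx = [i for i in indices if i not in fold_set]
        model = train([instances[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        errors.append(test(model,
                           [instances[i] for i in fold],
                           [labels[i] for i in fold]))
    return sum(errors) / len(errors)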
If the error rate evaluation is unsatisfactory, a variety of factors must be examined: perhaps relevant features for the problem are not being used, a larger training set is needed, the dimensionality of the problem is too high, the selected algorithm is inappropriate or parameter tuning is needed. A common method for comparing supervised ML algorithms is to perform statistical comparisons of the accuracies of trained classifiers on specific datasets (Bouckaert 2003). Several heuristic versions of the t-test have been developed to handle this issue.
In this section we will concentrate on two groups of logic (symbolic) learning methods: decision trees and rule-based classifiers.
3.1 Decision trees
Murthy (1998) provided an overview of work in decision trees and a sample of their usefulness to newcomers as well as practitioners in the field of machine learning. Decision trees are trees that classify instances by sorting them based on feature values. Each node in a decision tree represents a feature in an instance to be classified, and each branch represents a value that the node can assume. Instances are classified starting at the root node and sorted based on their feature values.
The problem of constructing optimal binary decision trees is NP-complete, and thus theoreticians have searched for efficient heuristics for constructing near-optimal decision trees. The feature that best divides the training data becomes the root node of the tree. There are numerous methods for finding the feature that best divides the training data, but a majority of studies have concluded that there is no single best method (Murthy 1998). Comparison of individual methods may still be important when deciding which metric should be used for a particular dataset. The same procedure is then repeated on each partition of the divided data, creating sub-trees, until the training data is divided into subsets of the same class. Figure 1 presents general pseudo-code for building decision trees.
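A minimal ID3-style sketch of the recursive procedure described above, assuming categorical features and using information gain (one of many possible splitting metrics, as noted) to choose the feature that best divides the data.

from collections import Counter
from math import log2

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def build_tree(rows, labels, features):
    """rows: list of dicts mapping feature -> value; returns a nested dict
    (internal node) or a class label (leaf)."""
    if len(set(labels)) == 1 or not features:       # pure node or no features left
        return Counter(labels).most_common(1)[0][0]

    def split_entropy(f):
        # Weighted entropy of the partition induced by feature f
        # (minimizing it is equivalent to maximizing information gain).
        total = 0.0
        for v in set(r[f] for r in rows):
            subset = [l for r, l in zip(rows, labels) if r[f] == v]
            total += len(subset) / len(labels) * entropy(subset)
        return total

    best = min(features, key=split_entropy)
    tree = {"feature": best, "branches": {}}
    for v in set(r[best] for r in rows):
        sub_rows = [r for r in rows if r[best] == v]
        sub_labels = [l for r, l in zip(rows, labels) if r[best] == v]
        rest = [f for f in features if f != best]
        tree["branches"][v] = build_tree(sub_rows, sub_labels, rest)
    return tree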
A decision tree, or any learned hypothesis h, is said to overfit the training data if another hypothesis h′ exists that has a larger error than h when tested on the training data, but a smaller error than h when tested on the entire dataset. If two trees employ the same kind of tests and have the same prediction accuracy, the one with fewer leaves is usually preferred. Breslow and Aha (1997) survey methods of tree simplification that improve comprehensibility. The most straightforward way of tackling overfitting is to pre-prune the decision tree by not allowing it to grow to its full size. A comparative study of well-known pruning methods is presented in (Elomaa 1999). However, Elomaa (1999) concluded that there is no single best pruning method.
Even though the divide-and-conquer algorithm is quick, efficiency can become important in tasks with hundreds of thousands of instances. The most time-consuming aspect is sorting the instances on a numeric feature to find the best threshold t. This can be expedited if possible thresholds for a numeric feature are determined just once, effectively converting the feature to discrete intervals, or if the threshold is determined from a subset of the instances.

Fig. 1 Pseudo-code for building a decision tree
Fig. 2 Pseudocode for rule learners
… to a specific class. Since a decision tree constitutes a hierarchy of tests, an unknown feature value during classification is usually dealt with by passing the example down all branches of the node where the unknown feature value was detected, with each branch outputting a class distribution. The output is a combination of the different class distributions, which sum to 1. The assumption made by decision trees is that instances belonging to different classes have different values in at least one of their features.
Decision trees can be translated into a set of rules by creating a separate rule for each path from the root to a leaf in the tree (Quinlan 1993). However, rules can also be induced directly from training data using a variety of rule-based algorithms. Furnkranz (1999) provided an excellent overview of existing work in rule-based methods. The goal is to construct the smallest rule set that is consistent with the training data. A large number of learned rules is usually a sign that the learning algorithm is attempting to "remember" the training set instead of generalizing from it.
Overfitting of the training data can be avoided, for instance, by stopping the specialization process with the use of a quality measure or by generalizing overly specialized rules in a separate pruning phase (Furnkranz 1997). It is therefore important for a rule induction system to generate decision rules that have high predictability or reliability. These properties are commonly measured by a function called rule quality. An and Cercone (2000) surveyed a number of statistical and empirical rule quality measures. When using unordered rule sets, conflicts can arise between the rules, i.e., two or more rules cover the same example but predict different classes. Lindgren (2004) has recently given a survey of methods used to solve this type of conflict.
RIPPER is a well-known rule-based algorithm (Cohen 1995). It forms rules through a process of repeated growing and pruning. During the growing phase the rules are made more restrictive in order to fit the training data as closely as possible. During the pruning phase, the rules are made less restrictive in order to avoid overfitting, which can cause poor performance on unseen instances. RIPPER handles multiple classes by ordering them from least to most prevalent and then treating each in order as a distinct two-class problem. Bonarini (2000) gave an overview of fuzzy rule-based classifiers. Fuzzy logic tries to improve classification and decision support systems by allowing the use of overlapping class definitions.
Furnkranz (2001) investigated the use of round robin binarization (or pairwise classification) as a technique for handling multi-class problems with separate-and-conquer rule learning algorithms. His experimental results show that, in comparison to conventional, ordered or unordered binarization, the round robin approach may yield significant gains in accuracy without risking poor performance.
There are numerous other rule-based learning algorithms; Furnkranz (1999) referred to most of them. The PART algorithm infers rules by repeatedly generating partial decision trees, thus combining the two major paradigms for rule generation: creating rules from decision trees and the separate-and-conquer rule-learning technique. Once a partial tree has been built, a single rule is extracted from it, and for this reason the PART algorithm avoids post-processing (Frank and Witten 1998).
Genetic algorithms (GAs) have also been used for learning sets of rules. Fidelis et al. (2000) used a genetic algorithm to learn binary concepts represented by a disjunctive set of propositional rules, and it was found to be comparable in generalization accuracy to other learning algorithms. Assuming two binary features X1, X2 and a binary target value c, a rule-to-chromosome representation is:

IF X1 = True ∧ X2 = False THEN c = True   encoded as   10 01 1
IF X1 = False ∧ X2 = True THEN c = False  encoded as   01 10 0
Note that there is a fixed-length bit-string representation for each rule. The task of the genetic algorithm is to find good chromosomes. The goodness of a chromosome is represented in the GA by a function called the fitness function (Reeves and Rowe 2003). For the classification task, the fitness function typically scores the classification accuracy of the rule over a set of provided training instances. At the heart of the algorithm are operations which take the population of the present generation and produce the population of the next generation in such a way that the overall fitness of the population is increased. These operations repeat until some stopping criterion is met, such as a given number of chromosomes being processed or a chromosome of given quality produced. Three operations take the population of generation t and produce the new population of generation t + 1: selection, crossover, and mutation (Reeves and Rowe 2003).
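A minimal sketch of this generational loop, evolving fixed-length bit-string rules of the form shown earlier over binary feature vectors; the don't-care convention, tournament selection and the parameter values are illustrative assumptions rather than part of any particular published system.

import random

rng = random.Random(0)

def decode(chrom, n_features):
    """2 bits per feature: 10 -> require True, 01 -> require False,
    otherwise don't care (an illustrative convention); final bit is the class."""
    conditions = []
    for i in range(n_features):
        pair = chrom[2 * i:2 * i + 2]
        if pair == [1, 0]:
            conditions.append((i, True))
        elif pair == [0, 1]:
            conditions.append((i, False))
    return conditions, bool(chrom[-1])

def fitness(chrom, X, y, n_features):
    """Training-set accuracy of the decoded rule: predict the rule's class when
    all conditions hold, the opposite class otherwise."""
    conds, pred = decode(chrom, n_features)
    correct = 0
    for xi, yi in zip(X, y):
        fires = all(xi[i] == v for i, v in conds)
        correct += int((pred if fires else not pred) == yi)
    return correct / len(y)

def evolve(X, y, n_features, pop_size=20, generations=30, p_mut=0.05):
    length = 2 * n_features + 1
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        def pick():  # selection: binary tournament
            a, b = rng.sample(pop, 2)
            return a if fitness(a, X, y, n_features) >= fitness(b, X, y, n_features) else b
        new_pop = []
        while len(new_pop) < pop_size:
            p1, p2 = pick(), pick()
            cut = rng.randrange(1, length)                                   # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [(1 - g) if rng.random() < p_mut else g for g in child]  # mutation
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=lambda c: fitness(c, X, y, n_features))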
For the task of learning binary problems, rules are more comprehensible than decision trees because typical rule-based approaches learn a set of rules for only the positive class. On the other hand, if definitions for multiple classes are to be learned, the rule-based learner must be run separately for each class. For each individual class a separate rule set is obtained, and these sets may be inconsistent (a particular instance might be assigned multiple classes) or incomplete (no class might be assigned to a particular instance). These problems can be solved with decision lists (the rules in a rule set are ordered, and a rule is only applicable when none of the preceding rules are applicable), but with the decision tree approach they simply do not occur. Moreover, the divide-and-conquer approach (used by decision trees) is usually more efficient than the separate-and-conquer approach (used by rule-based algorithms). Separate-and-conquer algorithms look at one class at a time and try to produce rules that uniquely identify the class. They do this independently of all the other classes in the training set. For this reason, for small datasets, it may be better to use a divide-and-conquer algorithm that considers the entire set at once.
To sum up, the most useful characteristic of rule-based classifiers is their comprehensibility. In addition, even though some rule-based classifiers can deal with numerical features, some experts propose that these features should be discretized before induction, so as to reduce training time and increase classification accuracy (An and Cercone 1999). Classification accuracy of rule-learning algorithms can be improved by combining features (as in decision trees) using the background knowledge of the user (Flach and Lavrac 2000) or automatic feature construction algorithms (Markovitch and Rosenstein 2002).
The main limitation of propositional learners is their limited capacity to take into account available background knowledge. Learners that induce hypotheses in the form of logic programs (Horn clauses) are called inductive logic programming systems. De Mantaras and Armengol (1998) presented a survey of inductive logic programming.

The previously discussed logic-based methods learn hypotheses that can be expressed in terms of propositional logic. Inductive Logic Programming (ILP) classifiers use the framework of first-order predicate logic. As with any learning system, this one can be quite complex and intractably difficult unless it is constrained by biases of some sort. One might restrict the program to Horn clauses, not allow recursion, and so on (De Raedt 1996). First, most algorithms favor those clauses which cover as many instances as possible; then they turn to those clauses having a smaller number of features; and finally they accept those clauses which have the least total number of values in the internal disjunctions.
ILP systems typically adopt the covering approach of rule induction systems. In a main loop, they construct a clause explaining some of the positive examples, add this clause to the hypothesis, remove the positive examples explained and repeat this until all positive examples are explained (the hypothesis is complete). In an inner loop, individual clauses are constructed by (heuristically) searching the space of possible clauses, which is structured by a specialization or generalization operator. Typically, a search starts with a very general rule (a clause with no conditions in the body), then proceeds to add literals (conditions) to this clause until it covers (explains) only positive examples (the clause is consistent). This search can be bounded from below by using so-called "bottom clauses," constructed by least general generalization or inverse resolution/entailment.
There are three major patterns in which a logic program might be generalized: (i) replace some terms in a program clause by variables, (ii) remove literals from the body of a clause, (iii) add a clause to the program. Correspondingly, there are three patterns in which a logic program might be specialized: (i) replace some variables in a program clause by terms (a substitution), (ii) add literals to the body of a clause and (iii) remove a clause from the program. FOIL (Quinlan 1995) is an algorithm designed to learn a set of first-order rules to predict when a target predicate is true. It differs from classifiers such as C4.5 in that it learns relations among features that are described with variables. PROGOL (Muggleton 1995) is another well-known first-order rule learner.
A significant part of ILP research now goes under the heading of Relational Data Mining and is concerned with finding decision trees from multi-table relational databases (Dzeroski and Lavrac 2001). These ILP algorithms can be understood as a kind of decision tree induction where each node of the decision tree is itself a sub-decision tree, and each sub-decision tree consists of nodes that make binary splits on several features using the background relations available. However, since ILP methods search a much larger space of possible rules in a more expressive language, they are computationally more demanding. Blockeel and De Raedt (1998) studied scaling up first-order logical decision trees with the TILDE algorithm. A dataset is presented to TILDE in the form of a set of interpretations. Each interpretation consists of a number of Prolog facts, surrounded by a begin and end line. The background knowledge is simply a Prolog program.

Fig. 3 First order rule versus zero order rule
ILP differs from most other forms of machine learning both by its use of an expressive representation language and by its ability to make use of logically encoded background knowledge. This has allowed successful applications of ILP in areas such as molecular biology and natural language processing, which both have rich sources of background knowledge and both benefit from the use of conceptually expressive languages (Muggleton 1999).

When comparing ILP (first-order predicate logic rules) and feature-value learning techniques (propositional logic rules), there is a trade-off between expressive power and efficiency (Dantsin et al. 2001). ILP techniques are typically more expressive (see Fig. 3) but also less efficient. If the available data has a standard tabular form, with rows being individual records (training instances) and columns being properties (features) used to describe the data, a data analyst has no reason to become interested in ILP if there is no expert (background) knowledge. ILP can be used for data mining in multi-relational data mining tasks with data stored in relational databases and in tasks with abundant expert knowledge of a relational or structural nature. Typically, each predicate will correspond to one relation in the relational database. Each fact in an interpretation is a tuple (instance) in the database, and an interpretation corresponds to a part of the database (a set of tuples). Background knowledge can be expressed by means of views as well as extensional tables.
Other well-known algorithms are based on the notion of the perceptron. The perceptron can be briefly described as follows: if x1 through xn are input feature values and w1 through wn are connection weights (a prediction vector, typically real numbers in the interval [−1, 1]), then the perceptron computes the sum of weighted inputs, Σi xi wi, and the output passes through an adjustable threshold: if the sum is above the threshold, the output is 1; otherwise it is 0. The most common way the perceptron algorithm is used for learning from a batch of training instances is to run the algorithm repeatedly through the training set until it finds a prediction vector which is correct on all of the training set. This prediction rule is then used for predicting the labels on the test set. WINNOW (Littlestone and Warmuth 1994) is based on the perceptron idea. It was experimentally shown (Blum 1997) that WINNOW can adapt rapidly to changes in the target function (concept drift). A target function (such as user preferences) is not static in time. In order to enable, for example, a decision tree algorithm to respond to changes, it is necessary to decide which old training instances could be deleted. A number of algorithms similar to WINNOW have been developed, such as those by Auer and Warmuth (1998).
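A minimal sketch of this batch use of the perceptron: repeated passes over the training set, updating the prediction vector only on mistakes, until an error-free pass is achieved (an epoch cap is added here as a safeguard for non-separable data).

def train_perceptron(X, y, max_epochs=100):
    """X: list of feature vectors, y: labels in {0, 1}.
    Returns a weight vector w and bias b defining the threshold unit."""
    n = len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            s = sum(wj * xj for wj, xj in zip(w, xi)) + b
            pred = 1 if s > 0 else 0
            if pred != yi:                       # update only on mistakes
                delta = yi - pred                # +1 or -1
                w = [wj + delta * xj for wj, xj in zip(w, xi)]
                b += delta
                mistakes += 1
        if mistakes == 0:                        # correct on the whole training set
            break
    return w, b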
Freund and Schapire (1999) created a newer algorithm, called the voted-perceptron, which stores more information during training and then uses this elaborate information to generate better predictions on the test data. The information it maintains during training is the list of all prediction vectors that were generated after each and every mistake. For each such vector, it counts the number of iterations it "survives" until the next mistake is made; Freund and Schapire refer to this count as the "weight" of the prediction vector. To calculate a prediction the algorithm computes the binary prediction of each one of the prediction vectors and combines all these predictions by means of a weighted majority vote. The weights used are the survival times described above.
To sum up, we have discussed perceptron-like linear algorithms with emphasis on their superior time complexity when dealing with irrelevant features. This can be a considerable advantage when there are many features, but only a few relevant ones. Generally, all perceptron-like linear algorithms are anytime online algorithms that can produce a useful answer regardless of how long they run (Kivinen 2002); the longer they run, the better the result they produce. Most perceptron-like algorithms cannot deal with numerical features, so numerical features should be discretized before induction. Finally, perceptron-like methods are binary, and therefore in the case of a multi-class problem one must reduce the problem to a set of multiple binary classification problems.
The units of an ANN are usually segregated into three classes: input units, which receive information to be processed; output units, where the results of the processing are found; and units in between known as hidden units. Feed-forward ANNs (Fig. 4) allow signals to travel one way only, from input to output.

Generally, properly determining the size of the hidden layer is a problem, because an underestimate of the number of neurons can lead to poor approximation and generalization capabilities, while excessive nodes can result in overfitting and eventually make the search for the global optimum more difficult. An excellent argument regarding this topic can be found in (Camargo and Yoneyama 2001). Kon and Plaskota (2000) also studied the minimum number of neurons and the number of instances necessary to program a given task into feed-forward neural networks.
An ANN depends upon three fundamental aspects: the input and activation functions of the units, the network architecture, and the weight of each input connection. Given that the first two aspects are fixed, the behavior of the ANN is defined by the current values of the weights. The weights of the net to be trained are initially set to random values, and then instances of the training set are repeatedly exposed to the net. The values for the input of an instance are placed on the input units, and the output of the net is compared with the desired output for this instance. Then, all the weights in the net are adjusted slightly in the direction that would bring the output values of the net closer to the values for the desired output. There are several algorithms with which a network can be trained (Neocleous and Schizas 2002). However, the most well-known and widely used learning algorithm for estimating the values of the weights is the Back Propagation (BP) algorithm.

Fig. 4 Feed-forward ANN

The general rule for updating weights is ΔWji = η δj Oi, where:
• η is a positive number (called the learning rate), which determines the step size in the gradient descent search. A large value enables back propagation to move faster to the target weight configuration, but it also increases the chance of never reaching this target.
• Oi is the output computed by neuron i.
• δj = Oj (1 − Oj)(Tj − Oj) for the output neurons, where Tj is the desired output for neuron j.
• δj = Oj (1 − Oj) Σk δk Wkj for the internal (hidden) neurons.
During classification the signal at the input units propagates all the way through the net to determine the activation values at all the output units. Each input unit has an activation value that represents some feature external to the net. Then, every input unit sends its activation value to each of the hidden units to which it is connected. Each of these hidden units calculates its own activation value, and these signals are then passed on to output units. The activation value for each receiving unit is calculated according to a simple activation function. The function sums together the contributions of all sending units, where the contribution of a unit is defined as the weight of the connection between the sending and receiving units multiplied by the sending unit's activation value. This sum is usually then further modified, for example, by adjusting the activation sum to a value between 0 and 1 and/or by setting the activation value to zero unless a threshold level for that sum is reached.
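A minimal numpy sketch of one training step of this procedure for a single-hidden-layer network with sigmoid units, using the update rule ΔWji = η δj Oi given above; the layer sizes and learning rate are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative network: 3 inputs, 4 hidden units, 1 output unit.
W_ih = rng.normal(scale=0.1, size=(3, 4))   # input  -> hidden weights
W_ho = rng.normal(scale=0.1, size=(4, 1))   # hidden -> output weights
eta = 0.5                                   # learning rate

def backprop_step(x, t):
    """x: input vector of shape (3,), t: target output of shape (1,).
    Updates the weight matrices in place and returns the network output."""
    # Forward pass
    o_h = sigmoid(x @ W_ih)                 # hidden activations
    o_o = sigmoid(o_h @ W_ho)               # output activations
    # Backward pass: delta_j = O_j (1 - O_j)(T_j - O_j) for output units,
    #                delta_j = O_j (1 - O_j) * sum_k delta_k W_kj for hidden units.
    delta_o = o_o * (1 - o_o) * (t - o_o)
    delta_h = o_h * (1 - o_h) * (delta_o @ W_ho.T)
    # Weight updates: Delta W_ji = eta * delta_j * O_i
    W_ho += eta * np.outer(o_h, delta_o)
    W_ih += eta * np.outer(x, delta_h)
    return o_o

# Example usage: one step towards target 1.0 for a single instance.
backprop_step(np.array([0.0, 1.0, 1.0]), np.array([1.0]))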
Feed-forward neural networks are usually trained by the original back propagation algorithm or by some variant. Their greatest problem is that they are too slow for most applications. One of the approaches to speed up the training rate is to estimate optimal initial weights (Yam and Chow 2001). Genetic algorithms have been used to train the weights of neural networks (Siddique and Tokhi 2001) and to find the architecture of neural networks (Yen and Lu 2000). There are also Bayesian methods in existence which attempt to train neural networks; Vivarelli and Williams (2001) compare two of them. A number of other techniques have emerged recently which attempt to improve ANN training algorithms by changing the architecture of the network as training proceeds. These techniques include pruning useless nodes or weights (Castellano et al. 1997) and constructive algorithms, where extra nodes are added as required (Parekh et al. 2000).
ANN learning can be achieved, among others, through (i) synaptic weight modification, (ii) network structure modifications (creating or deleting neurons or synaptic connections), (iii) use of suitable attractors or other suitable stable state points, and (iv) appropriate choice of ...

A Radial Basis Function (RBF) network is a three-layer feed-forward network, in which each hidden unit implements a radial activation function and each output unit implements a weighted sum of hidden unit outputs. Its training procedure is usually divided into two stages. First, the centers and widths of the hidden layer are determined by clustering algorithms. Second, the weights connecting the hidden layer with the output layer are determined by Singular Value Decomposition (SVD) or Least Mean Squares (LMS) algorithms. The problem of selecting the appropriate number of basis functions remains a critical issue for RBF networks. The number of basis functions controls the complexity and the generalization ability of RBF networks. RBF networks with too few basis functions cannot fit the training data adequately due to limited flexibility. On the other hand, those with too many basis functions yield poor generalization ability since they are too flexible and erroneously fit the noise in the training data.
To sum up, ANNs have been applied to many real-world problems, but their most striking disadvantage is still their lack of ability to reason about their output in a way that can be effectively communicated. For this reason many researchers have tried to address the issue of improving the comprehensibility of neural networks, where the most attractive solution is to extract symbolic rules from trained neural networks. Setiono and Loew (2000) divided the activation values of relevant hidden units into two subintervals and then found the set of relevant connections of those relevant units to construct rules. More references can be found in (Zhou 2004), an excellent survey. However, it is also worth mentioning that Roy (2000) identified a conflict between the idea of rule extraction and traditional connectionism. In detail, the idea of rule extraction from a neural network involves certain procedures, specifically the reading of parameters from a network, which is not allowed by the traditional connectionist framework on which these neural networks are based. Neural networks are usually more able to easily provide incremental learning than decision trees (Saad 1998), even though there are some algorithms for incremental learning of decision trees such as (Utgoff et al. 1997) and (McSherry 1999).
In contrast to ANNs, statistical approaches are characterized by having an explicit underlying probability model, which provides a probability that an instance belongs to each class, rather than simply a classification. Under this category of classification algorithms one can find Bayesian networks and instance-based methods. A comprehensive book on Bayesian networks is Jensen's (1996).
A Bayesian Network (BN) is a graphical model for probability relationships among a set of variables (features). The Bayesian network structure S is a directed acyclic graph (DAG) and the nodes in S are in one-to-one correspondence with the features X. The arcs represent causal influences among the features, while the lack of possible arcs in S encodes conditional independencies. Moreover, a feature (node) is conditionally independent of its non-descendants given its parents.
Typically, the task of learning a Bayesian network can be divided into two subtasks: initially, the learning of the DAG structure of the network, and then the determination of its parameters. Probabilistic parameters are encoded into a set of tables, one for each variable, in the form of local conditional distributions of a variable given its parents. Given the independencies encoded in the network, the joint distribution can be reconstructed by simply multiplying these tables. Within the general framework of inducing Bayesian networks, there are two scenarios: known structure and unknown structure.
In the first scenario, the structure of the network is given (e.g. by an expert) and assumed to be correct. Once the network structure is fixed, learning the parameters in the Conditional Probability Tables (CPTs) is usually solved by estimating a locally exponential number of parameters from the data provided (Jensen 1996). Each node in the network has an associated CPT that describes the conditional probability distribution of that node given the different values of its parents.
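To make the factorization concrete, the following hand-coded sketch recovers joint probabilities by multiplying each node's CPT entry given its parents; the two-node network (Rain → WetGrass) and the probabilities are purely illustrative.

# Illustrative two-node network: Rain -> WetGrass.
# Each node stores P(node | parents) as a table.
P_rain = {True: 0.2, False: 0.8}
P_wet_given_rain = {
    True:  {True: 0.9, False: 0.1},   # P(WetGrass | Rain=True)
    False: {True: 0.2, False: 0.8},   # P(WetGrass | Rain=False)
}

def joint(rain, wet):
    """Joint probability as the product of the local conditional distributions."""
    return P_rain[rain] * P_wet_given_rain[rain][wet]

# P(Rain=True, WetGrass=True) = 0.2 * 0.9 = 0.18
print(joint(True, True))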
In spite of the remarkable power of Bayesian networks, they have an inherent limitation: the computational difficulty of exploring a previously unknown network. Given a problem described by n features, the number of possible structure hypotheses is more than exponential in n. If the structure is unknown, one approach is to introduce a scoring function (or score) that evaluates the "fitness" of networks with respect to the training data, and then to search for the best network according to this score. Several researchers have shown experimentally that the selection of a single good hypothesis using greedy search often yields accurate predictions (Heckerman et al. 1999; Chickering 2002).
Within the score & search paradigm, another approach uses local search methods in the space of directed acyclic graphs, where the usual choices for defining the elementary modifications (local changes) that can be applied are arc addition, arc deletion, and arc reversal. Acid and de Campos (2003) proposed a new local search method, restricted acyclic partially directed graphs, which uses a different search space and takes account of the concept of equivalence between network structures. In this way, the number of different configurations of the search space is reduced, thus improving efficiency.
A BN structure can also be found by learning the conditional independence relationships among the features of a dataset. Using a few statistical tests (such as the chi-squared and mutual information tests), one can find the conditional independence relationships among the features and use these relationships as constraints to construct a BN. These algorithms are called CI-based or constraint-based algorithms. Cowell (2001) has shown that for any structure search procedure based on CI tests, an equivalent procedure based on maximizing a score can be specified. Figure 5 gives pseudo-code for training BNs.
Fig. 5 Pseudo-code for training BN

A comparison of scoring-based methods and CI-based methods is presented in (Heckerman et al. 1999). Both of these approaches have their advantages and disadvantages. Generally speaking, the dependency analysis approach is more efficient than the search & scoring approach for sparse networks (networks that are not densely connected). It can also deduce the correct structure when the probability distribution of the data satisfies certain assumptions. However, many of these algorithms require an exponential number of CI tests and many high-order CI tests (CI tests with large condition sets). Yet although the search & scoring approach may not find the best structure due to its heuristic nature, it works with a wider range of probabilistic models than the dependency analysis approach. Madden (2003) compared the performance of a number of Bayesian network classifiers. His experiments demonstrated that very similar classification performance can be achieved by classifiers constructed using the different approaches described above.
The most generic learning scenario is when the structure of the network is unknown and there is missing data. Friedman and Koller (2003) proposed a new approach for this task and showed how to efficiently compute a sum over the exponential number of networks that are consistent with a fixed ordering of the variables.
Using a suitable version of any of the model types mentioned in this review, one can induce a Bayesian network from a given training set. A classifier based on the network and on the given set of features X1, X2, ..., Xn returns the label c which maximizes the posterior probability p(c | X1, X2, ..., Xn).

Bayesian multi-nets allow different probabilistic dependencies for different values of the class node (Jordan 1998). This suggests that a simple BN classifier should work better when there is a single underlying model of the dataset, and a multi-net classifier should work better when the underlying relationships among the features are very different for different classes.

… that is, declare that a node appears earlier than another node in the ordering.
A problem of BN classifiers is that they are not suitable for datasets with many features (Cheng et al. 2002). The reason for this is that trying to construct a very large network is simply not feasible in terms of time and space. A final problem is that before the induction, the numerical features need to be discretized in most cases.
Naive Bayesian networks (NB) are very simple Bayesian networks which are composed of DAGs with only one parent (representing the unobserved node) and several children (corresponding to observed nodes), with a strong assumption of independence among child nodes in the context of their parent. Thus, the independence model (naive Bayes) is based on estimating the ratio:
R = P(i|X) / P(j|X) = [P(i) P(X|i)] / [P(j) P(X|j)] = [P(i) ∏r P(Xr|i)] / [P(j) ∏r P(Xr|j)]
Comparing these two probabilities, the larger indicates the class label value that is more likely to be the actual label (if R > 1, predict i; else predict j). Since the Bayes classification algorithm uses a product operation to compute the probabilities P(X|i), it is especially prone to being unduly impacted by probabilities of 0. This can be avoided by using the Laplace estimator, adding one to all numerators and adding the number of added ones to the denominator.
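A minimal sketch of this computation with the Laplace estimator for categorical features; returning the class with the larger smoothed product is equivalent to testing whether R > 1.

from collections import Counter, defaultdict

def train_nb(X, y):
    """X: list of feature-value tuples, y: class labels.
    Returns class priors, per-(class, feature) value counts, and the number of
    distinct values observed for each feature."""
    priors = Counter(y)
    counts = defaultdict(Counter)
    values_seen = defaultdict(set)
    for xi, yi in zip(X, y):
        for r, v in enumerate(xi):
            counts[(yi, r)][v] += 1
            values_seen[r].add(v)
    return priors, counts, {r: len(s) for r, s in values_seen.items()}

def predict_nb(x, priors, counts, n_values):
    total = sum(priors.values())
    best, best_score = None, -1.0
    for c, n_c in priors.items():
        score = n_c / total                              # P(c)
        for r, v in enumerate(x):
            # Laplace estimator: add one to the count of (value, class) and the
            # number of observed values of the feature to the denominator, so an
            # unseen value never contributes a zero probability.
            score *= (counts[(c, r)][v] + 1) / (n_c + n_values.get(r, 1))
        if score > best_score:
            best, best_score = c, score
    return best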
The assumption of independence among child nodes is clearly almost always wrong, and for this reason naive Bayes classifiers are usually less accurate than other, more sophisticated learning algorithms (such as ANNs). However, Domingos and Pazzani (1997) performed a large-scale comparison of the naive Bayes classifier with state-of-the-art algorithms for decision tree induction, instance-based learning, and rule induction on standard benchmark datasets, and found it to be sometimes superior to the other learning schemes, even on datasets with substantial feature dependencies.
The basic independent Bayes model has been modified in various ways in attempts to improve its performance. Attempts to overcome the independence assumption are mainly based on adding extra edges to include some of the dependencies between the features, for example the tree-augmented naive Bayes of Friedman et al. (1997). In this case, the network has the limitation that each feature can be related to only one other feature. An alternative generalisation, called "selective Bayesian classifiers", has been explored by Ratanamahatana and Gunopulos (2003). They built independence Bayes models but included a feature selection stage, so that their final model did not need to include all of the features. The idea of this algorithm was to improve the classification performance by removing irrelevant features or by removing one of two totally correlated features.
The major advantage of the naive Bayes classifier is its short computational time for training. In addition, since the model has the form of a product, it can be converted into a sum through the use of logarithms, with significant consequent computational advantages. If a feature is numerical, the usual procedure is to discretize it during data pre-processing (Yang and Webb 2003), although a researcher can instead use the normal distribution to calculate probabilities (Bouckaert 2004).
Instance-based learning algorithms are lazy-learning algorithms (Mitchell 1997), as they delay the induction or generalization process until classification is performed. Lazy-learning algorithms require less computation time during the training phase than eager-learning algorithms (such as decision trees, neural and Bayes nets) but more computation time during the classification process. One of the most straightforward instance-based learning algorithms is the nearest neighbour algorithm. Aha (1997) and De Mantaras and Armengol (1998) presented reviews of instance-based learning classifiers.

k-Nearest Neighbour (kNN) is based on the principle that the instances within a dataset will generally exist in close proximity to other instances that have similar properties. If the instances are tagged with a classification label, then the value of the label of an unclassified instance can be determined by observing the class of its nearest neighbours. kNN locates the k nearest instances to the query instance and determines its class by identifying the single most frequent class label among them. For more accurate results, several algorithms use weighting schemes that alter the distance measurements and voting influence of each instance. A survey of weighting schemes is given by (Wettschereck et al. 1997).
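A minimal sketch of kNN with a simple inverse-distance weighting scheme; Euclidean distance and the 1/(d + ε) weights are illustrative choices among the many schemes surveyed by Wettschereck et al. (1997).

import math
from collections import defaultdict

def knn_predict(query, X, y, k=3, eps=1e-9):
    """Return the class whose k nearest training instances carry the most
    inverse-distance weight (plain majority voting is the unweighted case)."""
    neighbours = sorted(
        ((math.dist(query, xi), yi) for xi, yi in zip(X, y)),
        key=lambda pair: pair[0],
    )[:k]
    votes = defaultdict(float)
    for d, label in neighbours:
        votes[label] += 1.0 / (d + eps)          # closer neighbours weigh more
    return max(votes, key=votes.get)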
The power of kNN has been demonstrated in a number of real domains, but there are some reservations about its usefulness, such as: (i) it has large storage requirements; (ii) it is sensitive to the choice of the similarity function used to compare instances; (iii) it lacks a principled way to choose k, except through cross-validation or a similar computationally expensive technique (Guo et al. 2003).

The choice of k affects the performance of the kNN algorithm. Consider the following reasons why a k-nearest neighbour classifier might incorrectly classify a query instance:

• When noise is present in the locality of the query instance, the noisy instance(s) win the majority vote, resulting in the incorrect class being predicted. A larger k could solve this problem.
• When the region defining the class, or a fragment of the class, is so small that instances belonging to the class that surrounds the fragment win the majority vote. A smaller k could solve this problem.
Wettschereck et al. (1997) investigated the behavior of kNN in the presence of noisy instances. The experiments showed that the performance of kNN was not sensitive to the exact choice of k when k was large. They found that for small values of k, the kNN algorithm was more robust than the single nearest neighbour algorithm (1NN) for the majority of large datasets tested. However, the performance of kNN was inferior to that achieved by 1NN on small datasets (<100 instances).
Okamoto and Yugami (2003) represented the expected classification accuracy of kNN as a function of domain characteristics including the number of training instances, the number of relevant and irrelevant attributes, the probability of each attribute, the noise rate for each type of noise, and k. They also explored the behavioral implications of the analyses by presenting the effects of domain characteristics on the expected accuracy of kNN and on the optimal value of k.

One training set reduction approach is to modify the instances using a new representation such as prototypes (Sanchez et al. 2002).
Breiman (1996) reported that the stability of nearest neighbour classifiers distinguishes them from decision trees and some kinds of neural networks. A learning method is termed "unstable" if small changes in the training-test set split can result in large changes in the resulting classifier. As we have already mentioned, the major disadvantage of instance-based classifiers is their large computational time for classification. A key issue in many applications is to determine which of the available input features should be used in modeling via feature selection (Yu and Liu 2004), because this could improve the classification accuracy and scale down the required classification time. Furthermore, choosing a more suitable distance metric for the specific dataset can improve the accuracy of instance-based classifiers.
Support Vector Machines (SVMs) are the newest supervised machine learning technique. An excellent survey of SVMs can be found in (Burges 1998), and a more recent book is (Cristianini and Shawe-Taylor 2000). SVMs revolve around the notion of a "margin" on either side of a hyperplane that separates two data classes. Maximizing the margin, and thereby creating the largest possible distance between the separating hyperplane and the instances on either side of it, has been proven to reduce an upper bound on the expected generalisation error.
In the case of linearly separable data, once the optimum separating hyperplane is found, data points that lie on its margin are known as support vector points and the solution is represented as a linear combination of only these points; other data points are ignored. Therefore, the model complexity of an SVM is unaffected by the number of features encountered in the training data (the number of support vectors selected by the SVM learning algorithm is usually small). For this reason, SVMs are well suited to learning tasks where the number of features is large with respect to the number of training instances.

Even though the maximum margin allows the SVM to select among multiple candidate hyperplanes, for many datasets the SVM may not be able to find any separating hyperplane at all, because the data contains misclassified instances. This problem can be addressed by using a soft margin that accepts some misclassifications of the training instances (Veropoulos et al. 1999).
Nevertheless, most real-world problems involve non-separable data for which no hyperplane exists that successfully separates the positive from the negative instances in the training set. One solution to the inseparability problem is to map the data onto a higher-dimensional space and define a separating hyperplane there. This higher-dimensional space is called the feature space, as opposed to the input space occupied by the training instances.

With an appropriately chosen feature space of sufficient dimensionality, any consistent training set can be made separable. A linear separation in feature space corresponds to a non-linear separation in the original input space. The data are mapped to some other (possibly infinite-dimensional) Hilbert space H via Φ: R^d → H. The training algorithm then depends on the data only through dot products in H, i.e., on functions of the form Φ(xi) · Φ(xj). If there were a "kernel function" K such that K(xi, xj) = Φ(xi) · Φ(xj), we would only need to use K in the training algorithm, and would never need to explicitly determine Φ. Thus, kernels are a special class of functions that allow inner products to be calculated directly in feature space, without performing the mapping described above (Scholkopf et al. 1999). Once a hyperplane has been created, the kernel function is used to map new points into the feature space for classification.
The selection of an appropriate kernel function is important, since the kernel function defines the feature space in which the training set instances will be classified. Genton (2001) described several classes of kernels; however, he did not address the question of which class is best suited to a given problem. It is common practice to estimate a range of potential settings and use cross-validation over the training set to find the best one; for this reason, a limitation of SVMs is the low speed of training. Selecting kernel settings can be regarded in a similar way to choosing the number of hidden nodes in a neural network. As long as the kernel function is legitimate, an SVM will operate correctly even if the designer does not know exactly what features of the training data are being used in the kernel-induced feature space.
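As an illustration of this practice, the sketch below (Python/scikit-learn assumed purely for illustration; the kernels and parameter ranges are arbitrary) cross-validates a small grid of candidate kernel settings over the training set and keeps the apparently best one.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Candidate kernels and settings; cross-validation over the training set picks the best.
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
    {"kernel": ["poly"], "C": [1], "degree": [2, 3]},
]
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```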
Training an SVM requires solving a quadratic programming (QP) problem with N variables, where N is the number of samples in the training dataset. Solving this problem with standard QP methods involves large matrix operations as well as time-consuming numerical computations, and is mostly very slow and impractical for large problems. Sequential Minimal Optimization (SMO) is a simple algorithm that can, relatively quickly, solve the SVM QP problem without any extra matrix storage and without using numerical QP optimization steps at all (Platt 1999). SMO decomposes the overall QP problem into a series of small QP sub-problems. Keerthi and Gilbert (2002) suggested two modified versions of SMO that are significantly faster than the original SMO. Discrete data presents another problem for SVMs, although with suitable rescaling good results can be obtained.
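The general effect of rescaling inputs before SVM training can be illustrated with the following hedged sketch (Python/scikit-learn and the dataset are illustrative assumptions only): the same RBF SVM is evaluated with and without standardization of the features.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Without rescaling, features with large ranges dominate the kernel computation.
raw_svm = SVC(kernel="rbf")
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

print("raw   :", cross_val_score(raw_svm, X, y, cv=10).mean())
print("scaled:", cross_val_score(scaled_svm, X, y, cv=10).mean())
```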
Supervised machine learning techniques are applicable in numerous domains (Saitta and Neri 1998). For the purpose of the present study, we used 45 well-known datasets, mainly from the UCI repository (Blake and Merz 1998). The aim of this comparison is not to rank the machine learning techniques, but to provide, through a representative algorithm for each technique, baseline accuracies against which new algorithms based on each technique can be compared. These datasets come from: pattern recognition (anneal, iris, mushroom, vote, zoo), image recognition (ionosphere, segment, sonar, vehicle), medical diagnosis (breast-cancer, breast-w, colic, diabetes, haberman, heart-c, heart-h, heart-statlog, hepatitis, hypothyroid, lymphotherapy, primary-tumor, sick), commodity trading (autos, credit-a, credit-g, labor), music composition (waveform), computer games (badges, kr-vs-kp, monk1, monk2, monk3) and various control applications (balance). Table 2 gives a brief description of these datasets.
The C4.5 algorithm (Quinlan 1993) was the representative of decision trees in our study. The Back Propagation (BP) algorithm (Mitchell 1997), the most widely used algorithm for estimating the weights of a neural network, represented the perceptron-based algorithms. The Sequential Minimal Optimization (SMO) algorithm was the representative of Support Vector Machines (Platt 1999). In our study we also used the 3-NN algorithm, which combines robustness to noise with a lower classification time than kNN with a larger k (Wettschereck et al. 1997). Finally, the widely used Naive Bayes algorithm (Domingos and Pazzani 1997) represented BNs in our study.
We used the freely available implementations of these algorithms by Witten and Frank (2005) for our experiments. Apart from Witten and Frank (2005), who provide in their book a wide range of ML algorithms for free use, MLC++ is another free library of C++ classes for supervised machine learning (http://www.sgi.com/tech/mlc/).
The SMO algorithm is a binary classifier; we therefore made use of error-correcting output coding (ECOC), or, in short, the output coding approach, to reduce each multi-class problem to a set of binary classification problems (Crammer and Singer 2002).
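A minimal sketch of this reduction, assuming Python/scikit-learn rather than the implementation used in the study, wraps a binary linear SVM in an error-correcting output code for a ten-class problem; the code size is an arbitrary illustrative value.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OutputCodeClassifier
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)

# Each class receives a binary code word; one binary SVM is trained per code bit,
# and a new instance is assigned to the class with the closest code word.
ecoc = OutputCodeClassifier(LinearSVC(max_iter=5000), code_size=2, random_state=0)
print(cross_val_score(ecoc, X, y, cv=5).mean())
```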
In order to estimate the classifiers' accuracy, the whole training set was divided into ten mutually exclusive and equal-sized subsets, and for each subset the classifier was trained on the union of all the other subsets. This cross-validation was then run 10 times for each algorithm, and the mean value of the ten 10-fold cross-validations was calculated. The results for each algorithm are summarized in Table 3.
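The evaluation protocol described above (ten repetitions of 10-fold cross-validation, averaged) can be sketched as follows; Python/scikit-learn and the chosen dataset loader are assumptions made only for illustration, standing in for the UCI datasets actually used.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

# Ten repetitions of stratified 10-fold cross-validation: 100 accuracy estimates in total.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(GaussianNB(), X, y, cv=cv)
print(round(scores.mean(), 4), "+/-", round(scores.std(), 4))
```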
Table 2 Brief description of the datasets (number of instances, categorical features, numeric features, presence of missing values, number of classes)

Dataset         Instances  Categorical  Numeric  Missing values  Classes
Audiology       226        69           0        yes             24
Autos           205        10           15       yes             7
Badges          294        4            7        no              2
Balance         625        0            4        no              3
Breast-cancer   286        9            0        yes             2
Breast-w        699        0            9        yes             2
Colic           368        15           7        yes             2
Credit-a        690        9            6        yes             2
Credit-g        1,000      13           7        no              2
Diabetes        768        0            8        no              2
Glass           214        0            9        no              5
Haberman        306        0            3        no              2
Heart-c         303        7            6        yes             5
Heart-h         294        7            6        yes             5
Heart-statlog   270        0            13       no              2
Hepatitis       155        13           6        yes             2
Hypothyroid     3,772      22           7        yes             4
Ionosphere      351        34           0        no              2
Iris            150        0            4        no              3
kr-vs-kp        3,196      35           0        no              2
Labor           57         8            8        yes             2
Letter          20,000     0            16       no              26
Lymphotherapy   148        15           3        no              4
Mesocyc/radar   1,131      0            13       no              2
Monk1           124        6            0        no              2
Monk2           169        6            0        no              2
Monk3           122        6            0        no              2
Mushroom        8,124      22           0        yes             2
Nist-digits     1,000      0            784      no              10
Nursery         12,960     8            0        no              5
Zoo             101        16           1        no              7
Below, we present our conclusions about the use of each technique. Discussion of all the pros and cons of each individual algorithm and empirical comparisons of various bias options are beyond the scope of this paper, as the choice of algorithm always depends on the task at hand. However, we hope that the following remarks will help practitioners avoid selecting a wholly inappropriate algorithm for their problem.
Generally, SVMs and neural networks tend to perform much better when dealing with multi-dimensional and continuous features. In contrast, logic-based systems (e.g., decision trees and rule learners) tend to perform better when dealing with discrete/categorical features.
Table 3 (continued) Mean accuracy (± standard deviation) of the representative algorithms on each dataset, over ten runs of 10-fold cross-validation

Heart-h 83.95(±6.27) 80.22(±7.95) 82.33(±6.23) 79.26(±7.20) 84.16(±6.12) 83.26(±6.46)
Heart-statlog 83.59(±5.98) 78.15(±7.42) 79.11(±6.77) 78.70(±6.62) 83.30(±6.20) 83.81(±5.59)
Hepatitis 83.81(±9.70) 79.22(±9.57) 80.85(±9.20) 77.43(±9.14) 84.29(±8.85) 85.03(±7.53)
Hypothyroid 95.30(±0.73) 99.54(±0.36) 93.21(±0.81) 99.39(±0.39) 93.44(±0.45) 93.49(±0.40)
Ionosphere 82.17(±6.14) 89.74(±4.38) 86.02(±4.31) 89.30(±4.63) 87.07(±5.52) 87.93(±4.69)
Iris 95.53(±5.02) 94.73(±5.30) 95.20(±5.11) 93.93(±6.57) 84.80(±7.10) 84.87(±7.63)
kr-vs-kp 87.79(±1.91) 99.44(±0.37) 96.56(±1.00) 99.13(±0.44) 98.92(±0.60) 95.78(±1.20)
Labor 93.57(±10.27) 78.60(±16.58) 92.83(±9.86) 83.37(±15.27) 88.93(±11.86) 94.00(±10.37)
Letter 64.12(±0.94) 87.95(±0.61) 94.54(±0.23) 87.10(±0.45) 81.93(±0.62) 81.57(±0.19)
Lymphotherapy 83.13(±8.89) 75.84(±11.05) 81.74(±8.45) 77.36(±11.11) 82.26(±8.05) 85.94(±8.75)
Mesocyc/radar 65.77(±3.52) 74.80(±3.22) 73.62(±2.98) 76.19(±2.63) 75.23(±3.14) 74.52(±0.86)
Monk1 73.38(±12.49) 80.61(±11.34) 78.97(±11.89) 83.87(±16.34) 97.69(±5.99) 79.58(±11.99)
Monk2 56.83(±8.71) 57.75(±11.18) 54.74(±9.20) 56.21(±8.89) 100.00(±0.00) 58.70(±5.80)
Monk3 93.45(±6.57) 92.95(±6.68) 86.72(±9.99) 84.8(±9.27) 88.32(±8.49) 93.45(±6.57)
Mushroom 95.76(±0.73) 100.00(±0.00) 100.00(±0.00) 100.00(±0.03) 99.97(±0.11) 100.00(±0.00)
Relation 77.85(±2.40) 78.55(±2.10) 78.9(±1.80) 78.01(±2.04) 78.44(±2.04) 77.6(±2.39)
Segment 80.16(±2.10) 96.79(±1.29) 96.12(±1.19) 95.53(±1.16) 91.35(±1.90) 89.72(±1.72)
Sick 92.75(±1.36) 98.72(±0.55) 96.21(±0.82) 98.30(±0.70) 94.87(±0.76) 93.88(±0.10)
Sonar 67.71(±8.66) 73.61(±9.34) 83.76(±8.51) 75.45(±10.01) 78.67(±9.21) 77.88(±8.45)
Soybean 92.94(±2.92) 91.78(±3.19) 91.2(±3.18) 91.85(±2.77) 93.35(±2.68) 93.10(±2.76)
Spambase 79.56(±1.56) 92.68(±1.08) 89.80(±1.35) 92.67(±1.21) 91.22(±2.52) 90.48(±1.22)
Vehicle 44.68(±4.59) 72.28(±4.32) 70.21(±3.93) 68.32(±4.59) 81.11(±3.84) 74.08(±3.82)
Vote 90.02(±3.91) 96.57(±2.56) 93.08(±3.70) 95.70(±2.86) 96.32(±2.87) 96.23(±2.77)
Vowel 62.9(±4.38) 80.2(±4.36) 96.99(±2.13) 71.17(±5.14) 92.73(±3.14) 70.61(±3.86)
Waveform 80.01(±1.45) 75.25(±1.90) 77.67(±1.79) 79.14(±1.72) 86.56(±1.56) 86.94(±1.48)
Wine 97.46(±3.70) 93.2(±5.90) 95.85(±4.19) 93.14(±6.94) 98.02(±3.26) 98.76(±2.73)
For neural network models and SVMs, a large sample size is required in order to achieve their maximum prediction accuracy, whereas NB may need only a relatively small dataset.
There is general agreement that kNN is very sensitive to irrelevant features; this characteristic can be explained by the way the algorithm works. In addition, the presence of irrelevant features can make neural network training very inefficient, even impractical.
A number of recent studies have shown that the decomposition of a classifier's error into bias and variance terms can provide considerable insight into the prediction performance of the classifier (Bauer and Kohavi 1999). Bias measures the contribution to error of the central tendency of the classifier when trained on different data. Variance is a measure of the contribution to error of deviations from the central tendency. Bias and variance are evaluated with respect to a distribution of training sets, such as a distribution containing all possible training sets of a specified size for a specified domain.
Learning algorithms with a high-bias profile usually generate simple, highly constrained models which are quite insensitive to data fluctuations, so that variance is low. Naive Bayes is considered to have high bias, because it assumes that the dataset under consideration can be summarized by a single probability distribution and that this model is sufficient to discriminate between classes. On the contrary, algorithms with a high-variance profile can generate arbitrarily complex models which fit data variations more readily. Examples of high-variance algorithms are decision trees, neural networks and SVMs. The obvious pitfall of high-variance model classes is overfitting.
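A rough empirical sketch of such a decomposition for 0/1 loss is given below; it is not the exact procedure of Bauer and Kohavi (1999), and Python/scikit-learn, the bootstrap resampling and the chosen dataset are all illustrative assumptions. Bias is measured as the error of the majority-vote ("central") prediction over many resampled training sets, and variance as the average disagreement with that central prediction.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def bias_variance(model, X_tr, y_tr, X_te, y_te, n_rounds=50, seed=0):
    """Rough 0/1-loss bias/variance estimate over bootstrap training sets."""
    rng = np.random.RandomState(seed)
    preds = np.empty((n_rounds, len(y_te)), dtype=int)
    for i in range(n_rounds):
        idx = rng.randint(0, len(y_tr), len(y_tr))          # bootstrap training set
        preds[i] = model.fit(X_tr[idx], y_tr[idx]).predict(X_te)
    # Central tendency = most frequent prediction per test instance.
    main = np.apply_along_axis(lambda p: np.bincount(p).argmax(), 0, preds)
    bias = np.mean(main != y_te)          # error of the central prediction
    variance = np.mean(preds != main)     # average deviation from it
    return bias, variance

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for name, clf in [("naive Bayes (high bias)", GaussianNB()),
                  ("decision tree (high variance)", DecisionTreeClassifier(random_state=0))]:
    print(name, bias_variance(clf, X_tr, y_tr, X_te, y_te))
```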
Most decision tree algorithms cannot perform well on problems that require diagonal partitioning: each split divides the instance space orthogonally to the axis of one variable and parallel to all other axes, so the resulting regions after partitioning are all hyper-rectangles. ANNs and SVMs perform well when multicollinearity is present and a nonlinear relationship exists between the input and output features.
Although training time varies according to the nature of the application task and dataset, specialists generally agree on a partial ordering of the major classes of learning algorithms. For instance, lazy learning methods require zero training time because the training instances are simply stored. Naive Bayes methods also train very quickly, since they require only a single pass over the data, either to count frequencies (for discrete variables) or to compute the normal probability density function (for continuous variables under normality assumptions). Univariate decision trees are also reputed to be quite fast, at any rate several orders of magnitude faster than neural networks and SVMs.
Naive Bayes requires little storage space during both the training and classification stages: the strict minimum is the memory needed to store the prior and conditional probabilities. The basic kNN algorithm uses a great deal of storage space for the training phase, and its execution space is at least as big as its training space. On the contrary, for all non-lazy learners, execution space is usually much smaller than training space, since the resulting classifier is usually a highly condensed summary of the data.
Naive Bayes is naturally robust to missing values, since these are simply ignored in computing probabilities and hence have no impact on the final decision. On the contrary, kNN and neural networks require complete records to do their work: all missing values must have been eliminated before the analysis begins, either by deleting the records or features with missing values, or by imputing the missing values.
Moreover, kNN is generally considered intolerant of noise; its similarity measures can be easily distorted by errors in attribute values, leading it to misclassify a new instance on the basis of the wrong nearest neighbors.
In contrast to kNN, rule learners and most decision trees are considered resistant to noise, because their pruning strategies avoid overfitting the data in general and noisy data in particular.
Furthermore, the number of model or runtime parameters to be tuned by the user is an indicator of an algorithm's ease of use. It can help in prior model selection based on the user's priorities and preferences: for a non-specialist in data mining, an algorithm with few user-tuned parameters will certainly be more appealing, while a more advanced user might find a large parameter set an opportunity to control the data mining process more closely. As expected, neural networks and SVMs have more parameters than the remaining techniques.
Logic-based algorithms are all considered very easy to interpret, whereas neural networks and SVMs have notoriously poor interpretability. kNN is also considered to have very poor interpretability, because an unstructured collection of training instances is far from readable, especially if there are many of them.
While interpretability concerns the typical classifier generated by a learning algorithm, transparency refers to whether the principle of the method is easily understood. A particularly eloquent case is that of kNN: while the resulting classifier is not quite interpretable, the method itself is very transparent because it appeals to the intuition of human users, who spontaneously reason in a similar manner. Similarly, Naive Bayes is very transparent, as it is easily grasped by users like physicians, who find that probabilistic explanations replicate their way of diagnosing. Moreover, decision trees and rules are credited with high transparency.
Finally, decision trees and NB generally have different operational profiles: when one is very accurate the other is not, and vice versa. On the contrary, decision trees and rule classifiers have a similar operational profile according to our experiments, and SVMs and ANNs also have similar operational profiles. No single learning algorithm can uniformly outperform other algorithms over all datasets. When faced with the decision "Which algorithm will be most accurate on our classification problem?", the simplest approach is to estimate the accuracy of the candidate algorithms on the problem and select the one that appears to be most accurate.
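In code, this simple selection strategy amounts to a loop over candidate learners; the sketch below (Python/scikit-learn assumed, with candidates and dataset chosen only for illustration) keeps the algorithm with the highest cross-validated accuracy.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "3-NN": KNeighborsClassifier(n_neighbors=3),
    "naive Bayes": GaussianNB(),
    "SVM": SVC(),
}
# Estimate the accuracy of each candidate and keep the apparently best one.
scores = {name: cross_val_score(clf, X, y, cv=10).mean()
          for name, clf in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "-> selected:", best)
```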
The concept of combining classifiers has been proposed as a new direction for improving the performance of individual classifiers: the goal of classification result integration algorithms is to generate more certain, precise and accurate system results.
Numerous methods have been suggested for the creation of ensembles of classifiers (Dietterich 2000). Although, or perhaps because, many methods of ensemble creation have been proposed, there is as yet no clear picture of which method is best (Vilalta and Drissi 2002). Thus, an active area of research in supervised learning is the study of methods for the construction of good ensembles of classifiers. Mechanisms that are used to build ensembles of classifiers include: (i) using different subsets of training data with a single learning method, (ii) using different training parameters with a single training method (e.g., using different initial weights for each neural network in an ensemble), and (iii) using different learning methods.

8.1 Different subsets of training data with a single learning method
Bagging is a method for building ensembles that uses different subsets of the training data with a single learning method (Breiman 1996). Given a training set of size t, bagging draws t random instances from the dataset with replacement (i.e., using a uniform distribution). These t instances are learned, and this process is repeated several times. Since the draw is with replacement, the instances drawn will usually contain some duplicates and some omissions compared to the original training set. Each cycle through the process results in one classifier, and after the construction of several classifiers the final prediction is obtained by taking a vote of their predictions. When there is too little data, the gains achieved via a bagged ensemble cannot compensate for the decrease in accuracy of individual models, each of which now considers an even smaller training set. At the other end, if the dataset is extremely large and computation time is not an issue, even a single flexible classifier can be quite adequate.
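A hedged sketch of bagging, assuming Python/scikit-learn rather than any implementation referenced above, compares a single decision tree with a bagged committee of 25 trees built from bootstrap replicates and combined by voting; the dataset and committee size are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# t instances drawn with replacement per round; predictions combined by voting.
single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                                 n_estimators=25, random_state=0)
print("single :", cross_val_score(single_tree, X, y, cv=10).mean())
print("bagging:", cross_val_score(bagged_trees, X, y, cv=10).mean())
```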
Another method that uses different subsets of training data with a single learning method is the boosting approach (Freund and Schapire 1996). Boosting is similar in overall structure to bagging, except that it keeps track of the performance of the learning algorithm and concentrates on instances that have not been correctly learned. Instead of choosing the t training instances randomly using a uniform distribution, it chooses the training instances in such a manner as to favor the instances that have not been accurately learned. After several cycles, the prediction is performed by taking a weighted vote of the predictions of each classifier, with the weights being proportional to each classifier's accuracy on its training set.
AdaBoost is a practical version of the boosting approach (Freund and Schapire 1996). AdaBoost requires less instability than bagging, because it can make much larger changes in the training set. A number of studies that compare AdaBoost and bagging suggest that the two have quite different operational profiles (Bauer and Kohavi 1999; Quinlan 1996). In general, it appears that bagging is more consistent, increasing the error of the base learner less frequently than AdaBoost does. However, AdaBoost appears to have a greater average effect, leading to substantially larger error reductions than bagging on average.
Generally, bagging tends to decrease variance without unduly affecting bias (Breiman 1996; Schapire et al. 1998; Bauer and Kohavi 1999). On the contrary, in empirical studies AdaBoost appears to reduce both bias and variance (Breiman 1996; Schapire et al. 1998; Bauer and Kohavi 1999). Thus, AdaBoost is more effective than bagging at reducing bias, but bagging is more effective than AdaBoost at reducing variance.
The decision on limiting the number of sub-classifiers is important for practical applications: to be competitive, it is important that the algorithms run in reasonable time. Quinlan (1996) used only 10 replications, while Bauer and Kohavi (1999) used 25, Breiman (1996) used 50 and Freund and Schapire (1996) used 100. For both bagging and boosting, much of the reduction in error appears to occur after ten to fifteen classifiers. However, AdaBoost continues to measurably improve test-set error until around 25 classifiers for decision trees (Opitz and Maclin 1999).
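The following sketch, again assuming Python/scikit-learn and illustrative parameter choices, boosts 25 shallow decision trees; the instance reweighting performed by AdaBoost concentrates later committee members on the instances the earlier ones misclassified.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 25 boosting rounds of depth-1 trees; misclassified instances gain weight each round.
boosted = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1, random_state=0),
                             n_estimators=25, random_state=0)
print(cross_val_score(boosted, X, y, cv=10).mean())
```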
As mentioned in Bauer and Kohavi (1999), the main problem with boosting seems to be its lack of robustness to noise. This is expected, because noisy instances tend to be misclassified, and the weights of these instances will therefore keep increasing. They presented several cases where the performance of boosted algorithms degraded compared to the original algorithms. On the contrary, they pointed out that bagging improved accuracy on all datasets used in their experimental evaluation.
MultiBoosting (Webb 2000) is another method of the same category. It can be conceptualized as wagging committees formed by AdaBoost. Wagging is a variant of bagging: bagging uses resampling to get the datasets for training and producing a weak hypothesis, whereas wagging uses reweighting of each training instance, pursuing the effect of bagging in a different way. In a number of experiments, Webb (2000) showed that MultiBoost achieved greater mean error reductions than either AdaBoost or bagging of decision trees at both committee sizes investigated (10 and 100).
Another meta-learner, DECORATE (Diverse Ensemble Creation by Oppositional Relabeling of Artificial Training Examples), was presented by Melville and Mooney (2003). This method uses a learner (one that provides high accuracy on the training data) to build a diverse committee of classifiers.
There are also methods for creating ensembles that produce classifiers which disagree on their predictions. Generally, these methods focus on altering the training process in the hope that the resulting classifiers will produce different predictions. For example, neural network techniques that have been employed include methods for training with different topologies, different initial weights and different parameters (Maclin and Shavlik 1995).
Another effective approach for generating a set of base classifiers is ensemble feature selection, i.e., finding a set of feature subsets with which to train the base classifiers of an ensemble using a single learning algorithm. Ho (1998) has shown that simple random selection of feature subsets may be an effective technique for ensemble feature selection. This technique is called the random subspace method (RSM). In the RSM, one randomly selects N* < N features from the N-dimensional dataset, obtaining an N*-dimensional random subspace of the original N-dimensional feature space. This is repeated S times so as to get S feature subsets for constructing the base classifiers. One then constructs a classifier in each random subspace and aggregates them in the final integration procedure. An experiment with a systematic partition of the feature space, using nine different combination schemes, was performed by Kuncheva and Whitaker (2001), showing that there are no "best" combinations for all situations and that there is no assurance that a classifier team will always outperform the single best individual classifier.
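The random subspace method can be sketched with scikit-learn's bagging machinery (an illustrative assumption, not Ho's original implementation): each base tree is trained on all instances but only on a random half of the features.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# S = 25 base trees, each seeing a random N* = N/2 subset of the features;
# bootstrap=False keeps all training instances for every base classifier.
rsm = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                        n_estimators=25,
                        max_features=0.5, bootstrap=False,
                        bootstrap_features=False, random_state=0)
print(cross_val_score(rsm, X, y, cv=10).mean())
```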
Voting denotes the simplest method of combining the predictions of multiple classifiers (Roli et al. 2001). In its simplest form, called plurality or majority voting, each classification model contributes a single vote (Hall et al. 2000), and the collective prediction is decided by the majority of the votes, i.e., the class with the most votes is the final prediction. In weighted voting, on the other hand, the classifiers have varying degrees of influence on the collective prediction, relative to their predictive accuracy: each classifier is associated with a specific weight determined by its performance (e.g., accuracy, cost model) on a validation set, and the final prediction is decided by summing up all weighted votes and choosing the class with the highest aggregate. Kotsiantis and Pintelas (2004) combined the advantages of classifier fusion and dynamic selection: the algorithms that are initially used to build the ensemble are tested on a small subset of the training set and, if they have statistically worse accuracy than the most accurate algorithm, they do not participate in the final voting.
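Both plurality voting and a weighted variant can be sketched as follows (Python/scikit-learn assumed; the weights are fixed, illustrative values rather than weights learned from a validation set as described above).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
base = [("tree", DecisionTreeClassifier(random_state=0)),
        ("3nn", KNeighborsClassifier(n_neighbors=3)),
        ("nb", GaussianNB())]

# Plurality voting: each model contributes one vote per instance.
majority = VotingClassifier(base, voting="hard")
# Weighted voting: class probabilities averaged with (illustrative) weights.
weighted = VotingClassifier(base, voting="soft", weights=[2, 1, 1])

for name, clf in [("majority", majority), ("weighted", weighted)]:
    print(name, cross_val_score(clf, X, y, cv=10).mean())
```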
Besides voting, stacking (Ting and Witten 1999) aims to improve efficiency and scalability by executing a number of learning processes and combining their collective results. The main difference between voting and stacking is that the latter combines the base classifiers in a non-linear fashion. The combining task, called a meta-learner, integrates the independently computed base classifiers into a higher-level classifier, a meta-classifier, by relearning on the meta-level training set. This meta-level training set is created by using the base classifiers' predictions on the validation set as attribute values and the true class as the target. Ting and Witten (1999) have shown that successful stacked generalization requires the use of output class distributions rather than class predictions. In their experiments, only the MLR algorithm (a linear discriminant) was suitable for use as a level-1 classifier.
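A minimal stacking sketch in the same illustrative Python/scikit-learn setting uses the base classifiers' class-probability outputs to build the meta-level training set, with a linear model (logistic regression, standing in for an MLR-like level-1 learner) on top.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
base = [("tree", DecisionTreeClassifier(random_state=0)),
        ("3nn", KNeighborsClassifier(n_neighbors=3)),
        ("nb", GaussianNB())]

# Meta-level attributes are the base models' class distributions, obtained via
# internal cross-validation; a linear level-1 classifier relearns from them.
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression(max_iter=1000),
                           stack_method="predict_proba", cv=5)
print(cross_val_score(stack, X, y, cv=10).mean())
```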
Cascade Generalization (Gama and Brazdil 2000) is another algorithm that belongs to the family of stacking algorithms. Cascade Generalization uses the set of classifiers sequentially, at each step extending the original data by the insertion of new attributes. The new attributes are derived from the probability class distribution given by a base classifier. This constructive step extends the representational language for the higher-level classifiers, reducing their bias.
Todorovski and Dzeroski (2003) introduced meta-decision trees (MDTs). Instead of giving a prediction, the leaves of an MDT specify which classifier should be used to obtain a prediction. Each leaf of the MDT represents a part of the dataset, which is a relative area of expertise of the base-level classifier in that leaf. MDTs can exploit the diversity of the base-level classifiers better than voting, thus outperforming voting schemes in terms of accuracy, especially in domains with a high diversity of errors made by the base-level classifiers.
Another attempt to improve classification accuracy is the use of hybrid techniques. Lazkano and Sierra (2003) presented a hybrid classifier that combines a Bayesian network algorithm with the distance-based Nearest Neighbor algorithm: the Bayesian network structure is obtained from the data, and the Nearest Neighbor algorithm is used in combination with the Bayesian network in the deduction phase. LiMin et al. (2004) presented Flexible NBTree, a decision tree learning algorithm in which internal nodes contain univariate splits, as in regular decision trees, but the leaf nodes contain General Naive Bayes, a variant of the standard Naive Bayes classifier. Zhou and Chen (2002) generated a binary hybrid decision tree according to the binary information gain ratio criterion: if attributes cannot further distinguish training examples falling into a leaf node whose diversity is beyond the diversity threshold, the node is marked as a dummy node and a feed-forward neural network named FANNC is trained in the instance space defined by the attributes used. Zheng and Webb (2000) proposed the application of lazy learning techniques to Bayesian induction and presented the resulting lazy Bayesian rule learning algorithm, called LBR. This algorithm can be justified by a variant of the Bayes model which supports a weaker conditional attribute independence assumption than is required by Naive Bayes. For each test example, it builds the most appropriate rule with a local Naive Bayes classifier as its consequent. Xie et al. (2002) proposed a similar lazy learning algorithm, Selective Neighborhood based Naive Bayes (SNNB): it computes different distance neighborhoods of the new input object, lazily learns multiple Naive Bayes classifiers, and uses the classifier with the highest estimated accuracy to make the final decision. Domeniconi and Gunopulos (2001) combined local learning with SVMs; in this approach an SVM is used to determine the weights of the local neighborhood instances.
9 Conclusions

This paper has described the best-known supervised classification techniques in relative detail. The key question when dealing with ML classification is not whether a learning algorithm is superior to others, but under which conditions a particular method can significantly outperform others on a given application problem. Meta-learning is moving in this direction, trying to find functions that map datasets to algorithm performance (Brazdil et al. 2003; Kalousis and Gama 2004). If we are only interested in the best possible classification accuracy, it might be difficult or impossible to find a single classifier that performs as well as a good ensemble of classifiers. Despite their obvious advantages, ensemble methods have at least three weaknesses. The first weakness is increased storage, as a direct consequence of the requirement that all component classifiers, instead of a single classifier, need to be stored after training; the total storage depends on the size of each component classifier and the size of the ensemble (the number of classifiers in the ensemble). The second weakness is increased computation: in order to classify an input query, all component classifiers (instead of a single classifier) must be processed. The last weakness is decreased comprehensibility: with the involvement of multiple classifiers in decision-making, it is more difficult for non-expert users to perceive the underlying reasoning process leading to a decision. A first attempt at extracting meaningful rules from ensembles was presented in Wall et al. (2003).
For all these reasons, the application of ensemble methods is suggested only when the best possible classification accuracy is the primary goal. Another, albeit time-consuming, approach that tries to increase classification accuracy without decreasing comprehensibility is the wrapper feature selection procedure (Guyon and Elisseeff 2003). Theoretically, having more features should result in more discriminating power. However, practical experience with machine learning algorithms has shown that this is not always the case. Wrapper methods wrap the feature selection around the induction algorithm to be used, using cross-validation to predict the benefits of adding or removing a feature from the feature subset used.
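A hedged sketch of such a wrapper, assuming Python/scikit-learn (the greedy forward search and the target of five features are illustrative choices, not the procedure of Guyon and Elisseeff 2003), wraps the induction algorithm itself inside a cross-validated forward feature selection.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)

# Forward selection wrapped around the induction algorithm itself: a candidate
# feature is added only if it improves cross-validated accuracy of the learner.
wrapper = make_pipeline(
    SequentialFeatureSelector(GaussianNB(), n_features_to_select=5, cv=5),
    GaussianNB())
print(cross_val_score(wrapper, X, y, cv=10).mean())
```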
The database community deals with gigabyte databases. Of course, it is unlikely that all the data in a data warehouse would be mined simultaneously. Most of the current learning algorithms are computationally expensive and require all data to be resident in main memory, which is clearly untenable for many realistic problems and databases. Distributed machine learning involves breaking the dataset up into subsets, learning from these subsets concurrently and combining the results (Basak and Kothari 2004). Distributed agent systems can be used for this parallel execution of machine learning processes (Klusch et al. 2003).
References
Acid S, de Campos LM (2003) Searching for Bayesian network structures in the space of restricted acyclic partially directed graphs. J Artif Intell Res 18:445–490
1031 Aha D (1997) Lazy learning. Kluwer Academic Publishers, Dordrecht
1032 An A, Cercone N (1999) Discretization of continuous attributes for learning classification rules. Third Pacific-
1033 Asia conference on methodologies for knowledge discovery & data mining, 509–514
1034 An A, Cercone N (2000) Rule quality measures improve the accuracy of rule induction: an experimental
1035 approach. Lecture notes in computer science, vol. 1932, pp 119–129
1036 Auer P, Warmuth M (1998) Tracking the best disjunction. Machine Learning 32:127–150
1037 Baik S, Bala J (2004) A decision tree algorithm for distributed data mining: towards network intrusion detec-
1038 tion. Lecture notes in computer science, vol. 3046, pp 206–212
1039 Batista G, Monard MC (2003) An analysis of four missing data treatment methods for supervised learning.
1040 Appl Artif Intell 17:519–533
1041 Basak J, Kothari R (2004) A classification paradigm for distributed vertically partitioned data. Neural Comput
1042 16(7):1525–1544
1043 Blake CL, Merz CJ (1998) UCI repository of machine learning databases. University of California, Irvine.
1044 (http://www.ics.uci.edu/~mlearn/MLRepository.html)
Blockeel H, De Raedt L (1998) Top-down induction of first order logical decision trees. Artif Intell 101(1–2):285–297
1047 Blum A (1997) Empirical support for winnow and weighted-majority algorithms: results on a calendar sched-
1048 uling domain. Mach Learn 26(1):5–23
1049 Bonarini A (2000) An introduction to learning fuzzy classifier systems. Lect Notes Comput Sci 1813:83–92
1050 Bouckaert R (2003) Choosing between two learning algorithms based on calibrated tests. In: Proceedings of
1051 20th International Conference on Machine Learning. Morgan Kaufmann, pp 51–58
Bouckaert R (2004) Naive Bayes classifiers that perform well with continuous variables. Lect Notes Comput Sci 3339:1089–1094
1054 Brazdil P, Soares C, Da Costa J (2003) Ranking learning algorithms: using IBL and meta-learning on accuracy
1055 and time results. Mach Learn 50:251–277
1056 Breiman L (1996) Bagging predictors. Mach Learn 24:123–140
1057 Breslow LA, Aha DW (1997) Simplifying decision trees: a survey. Knowl Eng Rev 12:1–40
1058 Brighton H, Mellish C (2002) Advances in instance selection for instance-based learning algorithms. Data
1059 Min Knowl Disc 6:153–172
1060 Burges C (1998) A tutorial on support vector machines for pattern recognition. Data Min Knowl Disc 2(2):1–
1061 47
1062 Camargo LS, Yoneyama T (2001) Specification of training sets and the number of hidden neurons for multi-
1063 layer perceptrons. Neural Comput 13:2673–2680
Castellano G, Fanelli A, Pelillo M (1997) An iterative pruning algorithm for feedforward neural networks. IEEE Trans Neural Netw 8:519–531
1066 Cheng J, Greiner R (2001) Learning Bayesian belief network classifiers: algorithms and system. In: Stroulia
1067 E, Matwin S (eds) AI 2001, LNAI 2056, pp 141–151
1068 Cheng J, Greiner R, Kelly J, Bell D, Liu W (2002) Learning Bayesian networks from data: an information-
1069 theory based approach. Artif Intell 137:43–90
1070 Chickering DM (2002) Optimal structure identification with greedy search. J Mach Learn Res 3:507–554
1071 Cohen W (1995) Fast effective rule induction. In: Proceedings of ICML-95, pp 115–123
1072 Cowell RG (2001) Conditions under which conditional independence and scoring methods lead to identical
1073 selection of Bayesian network models. In: Proceedings of 17th International Conference on Uncertainty
1074 in Artificial Intelligence
Crammer K, Singer Y (2002) On the learnability and design of output codes for multiclass problems. Mach Learn 47:201–233
1077 Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines and other Kernel-based
1078 learning methods. Cambridge University Press, Cambridge
1079 Dantsin E, Eiter T, Gottlob G, Voronkov A (2001) Complexity and expressive power of logic programming.
1080 ACM Comput Surveys 33:374–425
1081 De Raedt L (1996) Advances in inductive logic programming. IOS Press
1082 De Mantaras RL, Armengol E (1998) Machine learning from examples: inductive and Lazy methods. Data
1083 Knowl Eng 25:99–123
1084 Dietterich TG (1998) Approximate statistical tests for comparing supervised classification learning algorithms.
1085 Neural Comput 10(7):1895–1924
1086 Dietterich TG (2000) An experimental comparison of three methods for constructing ensembles of decision
1087 trees: bagging, boosting, and randomization. Mach Learn 40:139–157
unc
1088 Domeniconi C, Gunopulos D (2001) Adaptive nearest neighbor classification using support vector machines.
1089 Adv Neural Inf Process Syst 14:665–672
1090 Domingos P, Pazzani M (1997) On the optimality of the simple Bayesian classifier under zero-one loss. Mach
1091 Learn 29:103–130
1092 Dutton D, Conroy G (1996) A review of machine learning. Knowl Eng Rev 12:341–367
1093 Dzeroski S, Lavrac N (2001) Relational data mining. Springer, Berlin
1094 Elomaa T, Rousu J (1999) General and efficient multisplitting of numerical attributes. Mach Learn 36:201–244
1095 Elomaa T (1999) The biases of decision tree pruning strategies. Lect Notes Comput Sci 1642:63–74, Springer
1096 Fidelis MV, Lopes HS, Freitas AA (2000) Discovering comprehensible classification rules using a genetic
1097 algorithm. In: Proceedings of CEC-2000, conference on evolutionary computation La Jolla, USA, v. 1,
1098 pp 805–811
1099 Flach PA, Lavrac N (2000) The role of feature construction in inductive rule learning. In: De Raedt L, Kramer
1100 S (eds) Proceedings of the ICML2000 workshop on attribute-value learning and relational learning:
1101 bridging the Gap. Stanford University
Frank E, Witten I (1998) Generating accurate rule sets without global optimization. In: Shavlik J (ed) Machine learning: proceedings of the fifteenth international conference. Morgan Kaufmann Publishers, San Francisco
1105 Freund Y, Schapire R (1997) A decision-theoretic generalization of on-line learning and an application to
1106 boosting. JCSS 55(1):119–139
1107 Freund Y, Schapire R (1999) Large margin classification using the perceptron algorithm. Mach Learn 37:277–
1108 296
1109 Friedman N, Geiger D, Goldszmidt M (1997) Bayesian network classifiers. Mach Learn 29:131–163
Friedman N, Koller D (2003) Being Bayesian about network structure: a Bayesian approach to structure discovery in Bayesian networks. Mach Learn 50(1):95–125
1112 Furnkranz J (1997) Pruning algorithms for rule learning. Mach Learn 27:139–171
1113 Furnkranz J (1999) Separate-and-conquer rule learning. Artif Intell Rev 13:3–54
1114 Furnkranz J (2001) Round robin rule learning. In: Proceedings of the 18th international conference on machine
1115 learning (ICML-01), pp 146–153
1116 Gama J, Brazdil P (1999) Linear tree. Intelligent Data Anal 3:1–22
1117 Gama J, Brazdil P (2000) Cascade generalization. Mach Learn 41:315–343
1118 Gehrke J, Ramakrishnan R, Ganti V (2000) RainForest—a framework for fast decision tree construction of
1119 large datasets. Data Min Knowl Disc 4(2–3):127–162
1120 Genton M (2001) Classes of Kernels for machine learning: a statistics perspective. J Mach Learn Res 2:299–
1121 312
Guo G, Wang H, Bell D, Bi Y, Greer K (2003) KNN model-based approach in classification. Lect Notes Comput Sci 2888:986–996
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
1125 Heckerman D, Meek C, Cooper G (1999) A Bayesian approach to causal discovery. In: Glymour C, Cooper
1126 G (eds) Computation, causation, and discovery, MIT Press, pp 141–165
1127 Ho TK (1998) The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach
1128 Intell 20:832–844
1129 Hodge V, Austin J (2004) A survey of outlier detection methodologies. Artif Intell Rev 22(2):85–126
1130 Japkowicz N, Stephen S (2002) The class imbalance problem: a systematic study. Intell Data Anal 6(5):429–
1131 450
1132 Jain AK, Murty MN, Flynn P (1999) Data clustering: a review. ACM Comput Surveys 31(3):264–323
1133 Jensen F (1996) An introduction to Bayesian networks. Springer
Jordan MI (1998) Learning in graphical models. MIT Press
1135 Kalousis A, Gama G (2004) On data and algorithms: understanding inductive performance. Mach Learn
1136 54:275–312
1137 Keerthi S, Gilbert E (2002) Convergence of a generalized SMO algorithm for SVM classifier design. Mach
1138 Learn 46:351–360
1139 Kivinen J (2002) Online learning of linear classifiers. In: Advanced Lectures on Machine Learning: Machine
1140 Learning Summer School 2002, Australia, February 11–22, pp 235–257, ISSN: 0302–9743
1141 Klusch M, Lodi S, Moro G (2003) Agent-based distributed data mining: the KDEC scheme. In: Intelligent
1142 information agents: the agentlink perspective, LNAI 2586, Springer, pp 104–122
1143 Kon M, Plaskota L (2000) Information complexity of neural networks. Neural Netw 13:365–375
1144 Kotsiantis S, Pintelas P (2004) Selective voting. In: Proceedings of the 4th International Conference on Intel-
1145 ligent Systems Design and Applications (ISDA 2004), August 26–28, Budapest, pp 397–402
1146 Kuncheva L, Whitaker C (2001) Feature subsets for classifier combination: an enumerative experiment. Lect
1147 Notes Comput Sci 2096:228–237
1148 Lazkano E, Sierra B (2003) BAYES-NEAREST: a new hybrid classifier combining Bayesian network and
1149 distance based algorithms. Lect Notes Comput Sci 2902:171–183
1150 LiMin W, SenMiao Y, Ling L, HaiJun L (2004) Improving the performance of decision tree: a hybrid approach.
1151 Lect Notes Comput Sci 3288:327–335
1152 Lindgren T (2004) Methods for rule conflict resolution. Lect Notes Comput Sci 3201:262–273
1153 Littlestone N, Warmuth M (1994) The weighted majority algorithm. Informa Comput 108(2):212–261
1154 Liu H, Metoda H (2001) Instance selection and constructive data mining. Kluwer,
1155 Maclin R, Shavlik J (1995) Combining the prediction of multiple classifiers: using competitive learning to
1156 initialize ANNs. In: Proceedings of the 14th International joint conference on AI, pp 524–530
1157 Madden M (2003) The performance of Bayesian network classifiers constructed using different techniques.
1158 In: Proceedings of European conference on machine learning, workshop on probabilistic graphical mod-
1159 els for classification, pp 59–70
1160 Markovitch S, Rosenstein D (2002) Feature generation using general construction functions. Mach Learn
49:59–98
McSherry D (1999) Strategic induction of decision trees. Knowl Based Syst 12(5–6):269–275
1163 Melville P, Mooney R (2003) Constructing diverse classifier ensembles using artificial training examples. In:
1164 Proceedings of the IJCAI-2003, Acapulco, Mexico, pp 505–510
1165 Mitchell T (1997) Machine learning. McGraw Hill
1166 Muggleton S (1995) Inverse entailment and Progol. New Generat Comput Special issue on Inductive Logic
1167 Programming 13(3–4):245–286
Muggleton S (1999) Inductive logic programming: issues, results and the challenge of learning language in
1169 logic. Artif Intell 114:283–296
1170 Murthy SK (1998) Automatic construction of decision trees from data: a multi-disciplinary survey. Data Min
1171 Knowl Disc 2:345–389
1172 Nadeau C, Bengio Y (2003) Inference for the generalization error. Mach Learn 52:239–281
1173 Neocleous C, Schizas C (2002) Artificial neural network learning: a comparative review, LNAI 2308, Springer-
1174 Verlag, pp 300–313
1175 Okamoto S, Yugami N (2003) Effects of domain characteristics on instance-based learning algorithms. Theoret
1176 Comput Sci 298:207–233
1177 Parekh R, Yang J, Honavar V (2000) Constructive neural network learning algorithms for pattern classification.
1178 IEEE Trans Neural Netw 11(2):436–451
1179 Platt J (1999) Using sparseness and analytic QP to speed training of support vector machines. In: Kearns M,
Solla S, Cohn D (eds) Advances in neural information processing systems. MIT Press
Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann
1182 Quinlan JR (1995) Induction of logic programs: FOIL and related systems. New Generat Comput 13:287–312
1183 Ratanamahatana C, Gunopulos D (2003) Feature selection for the naive Bayesian classifier using decision
1184 trees. Appl Artif Intell 17(5–6):475–487
1185 Reinartz T (2002) A unifying view on instance selection, data mining and knowledge discovery, vol. 6. Kluwer
1186 Academic Publishers, pp 191–210
1187 Reeves CR, Rowe JE (2003) Genetic algorithms—principles and perspectives: a guide to GA theory. Kluwer
1188 Academic
1189 Roli F, Giacinto G, Vernazza G (2001) Methods for designing multiple classifier systems. Lect Notes Comput
1190 Sci 2096:78–87
Howlett RJ, Jain LC (2001) Radial basis function networks 2: new advances in design
Roy A (2000) On connectionism, rule extraction, and brain-like learning. IEEE Trans Fuzzy Syst 8(2):222–227
1194 Saad D (1998) Online learning in neural networks. Cambridge University Press,
1195 Sanchez J, Barandela R, Ferri F (2002) On filtering the training prototypes in nearest neighbour classification.
1196 Lect Notes Comput Sci 2504:239–248
1197 Scholkopf C, Burges JC, Smola AJ (1999) Advances in Kernel Methods. MIT Press
1198 Setiono R, Loew WK (2000) FERNN: an algorithm for fast extraction of rules from neural networks. Appl
1199 Intell 12:15–25
1200 Siddique MNH, Tokhi MO (2001) Training neural networks: backpropagation vs. genetic algorithms. IEEE
1201 Int Joint Conf Neural Netw 4:2673–2678
1202 Srivastava A, Han E, Kumar V, Singh V (1999) Parallel formulations of decision tree classification algorithms.
1203 Data Min Knowl Disc 3:237–261
1204 Ting K, Witten I (1999) Issues in stacked generalization. Artif Intell Res 10:271–289
1205 Tjen-Sien L, Wei-Yin L, Yu-Shan S (2000) A comparison of prediction accuracy, complexity, and training
1206 time of thirty-three old and new classification algorithms. Mach Learn 40:203–228
1207 Todorovski L, Dzeroski S (2003) Combining classifiers with meta decision trees. Mach Learn 50:223–249
1208 Utgoff P, Berkman N, Clouse J (1997) Decision tree induction based on efficient tree restructuring. Mach
1209 Learn 29(1):5–44
1210 Veropoulos K, Campbell C, Cristianini N (1999) Controlling the sensitivity of support vector machines.
1211 In: Proceedings of the international joint conference on artificial intelligence (IJCAI99)
Vilalta R, Drissi Y (2002) A perspective view and survey of meta-learning. Artif Intell Rev 18:77–95
1213 Vivarelli F, Williams C (2001) Comparing Bayesian neural network algorithms for classifying segmented
1214 outdoor images. Neural Netw 14:427–437
1215 Wall R, Cunningham P, Walsh P, Byrne S (2003) Explaining the output of ensembles in medical decision
1216 support on a case by case basis. Artif Intell Med 28(2):191–206
1217 Wettschereck D, Aha DW, Mohri T (1997) A review and empirical evaluation of feature weighting methods
1218 for a class of lazy learning algorithms. Artif Intell Rev 10:1–37
1219 Wilson DR, Martinez T (2000) Reduction techniques for instance-based learning algorithms. Mach Learn
1220 38:257–286
Witten I, Frank E (2005) Data mining: practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann
1223 Yam J, Chow W (2001) Feedforward networks training speed enhancement by optimal initialization of the
1224 synaptic coefficients. IEEE Trans Neural Netw 12:430–434
1225 Yang Y, Webb G (2003) On why discretization works for Naive-Bayes classifiers. Lect Notes Comput Sci
1226 2903:440–452
1227 Yen GG, Lu H (2000) Hierarchical genetic algorithm based neural network design. In: IEEE symposium on
combinations of evolutionary computation and neural networks, pp 168–175
pro
1228
1229 Yu L, Liu H (2004) Efficient feature selection via analysis of relevance and redundancy. JMLR 5:1205–1224
1230 Zhang G (2000) Neural networks for classification: a survey. IEEE Trans Syst Man Cy C 30(4):451–462
1231 Zhang S, Zhang C, Yang Q (2002) Data preparation for data mining. Appl Artif Intell 17:375–381
1232 Zheng Z (1998) Constructing conjunctions using systematic search on decision trees. Knowl Based Syst J
1233 10:421–430
1234 Zheng Z (2000) Constructing X-of-N attributes for decision tree learning. Mach Learn 40:35–75
1235 Zheng Z, Webb G (2000) Lazy learning of Bayesian rules. Mach Learn 41(1):53–84
1236 Xie Z, Hsu W, Liu Z, Lee ML (2002) SNNB: a selective neighborhood based naive Bayes for lazy learning.
1237 Lect Notes Comput Sci 2336:104–115
1238 Zhou Z, Chen Z (2002) Hybrid decision tree. Knowl Based Syst 15(8):515–528
1239 Zhou Z (2004) Rule extraction: using neural networks or for neural networks?. J Comput Sci Technol
1240 19(2):249–253