Machine Learning An Algorithmic Perspective (2nd Ed) - 40-42



FIGURE 2.4 The volume of the unit hypersphere for different numbers of dimensions.

make predictions on data inputs. In the next section we consider how to evaluate how well
an algorithm actually achieves this.

2.2 KNOWING WHAT YOU KNOW: TESTING MACHINE LEARNING ALGORITHMS
The purpose of learning is to get better at predicting the outputs, be they class labels or
continuous regression values. The only real way to know how successfully the algorithm has
learnt is to compare the predictions with known target labels, which is how the training is
done for supervised learning. This suggests that one thing you can do is just to look at the
error that the algorithm makes on the training set.
However, we want the algorithms to generalise to examples that were not seen in the
training set, and we obviously can’t test this by using the training set. So we need some
different data, a test set, to test it on as well. We use this test set of (input, target) pairs by
feeding them into the network and comparing the predicted output with the target, but we
don’t modify the weights or other parameters for them: we use them to decide how well the
algorithm has learnt. The only problem with this is that it reduces the amount of data that
we have available for training, but that is something that we will just have to live with.
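
As a concrete illustration, a minimal NumPy sketch of this measurement might look as follows. The predict function and the array names are placeholders for whatever trained model and held-out data you have; this is not code from the book.

import numpy as np

def test_error(predict, test_inputs, test_targets):
    # Fraction of test points the trained model gets wrong. The model's
    # parameters are not modified here: we only predict and compare.
    predictions = predict(test_inputs)
    return np.mean(predictions != test_targets)

# Hypothetical usage, where trained_model.predict is whatever prediction
# function the learning algorithm produced from the training set:
# error = test_error(trained_model.predict, test_inputs, test_targets)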

2.2.1 Overfitting
Unfortunately, things are a little bit more complicated than that, since we might also want
to know how well the algorithm is generalising as it learns: we need to make sure that we
do enough training that the algorithm generalises well. In fact, there is at least as much
danger in over-training as there is in under-training. The number of degrees of variability in
most machine learning algorithms is huge — for a neural network there are lots of weights,
and each of them can vary. This is undoubtedly more variation than there is in the function
we are learning, so we need to be careful: if we train for too long, then we will overfit the
data, which means that we have learnt about the noise and inaccuracies in the data as well
as the actual function. Therefore, the model that we learn will be much too complicated,
and won’t be able to generalise.

FIGURE 2.5 The effect of overfitting is that rather than finding the generating function
(as shown on the left), the neural network matches the inputs perfectly, including the
noise in them (on the right). This reduces the generalisation capabilities of the network.

Figure 2.5 shows this by plotting the predictions of some algorithm (as the curve) at
two different points in the learning process. On the left of the figure the curve fits the
overall trend of the data well (it has generalised to the underlying general function), but
the training error would still not be that close to zero since it passes near, but not through,
the training data. As the network continues to learn, it will eventually produce a much
more complex model that has a lower training error (close to zero), meaning that it has
memorised the training examples, including any noise component of them, so that it has
overfitted the training data.
We want to stop the learning process before the algorithm overfits, which means that
we need to know how well it is generalising at each timestep. We can’t use the training data
for this, because we wouldn’t detect overfitting, but we can’t use the testing data either,
because we’re saving that for the final tests. So we need a third set of data to use for this
purpose, which is called the validation set because we’re using it to validate the learning so
far. This is known as cross-validation in statistics. It is part of model selection: choosing the
right parameters for the model so that it generalises as well as possible.
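
One common way of using the validation set for this is a simple form of early stopping: keep training while the validation error improves, and stop once it has stopped improving. The sketch below only illustrates that idea; train_one_epoch and validation_error are assumed helper functions standing in for whichever algorithm is being trained, not functions defined in this book.

def train_with_early_stopping(train_one_epoch, validation_error,
                              max_epochs=1000, patience=10):
    # Keep training while the validation error improves; stop once it has
    # failed to improve for `patience` epochs in a row, since that suggests
    # the model is starting to overfit the training data.
    best = float('inf')
    epochs_since_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()              # one pass of training, updating the model in place
        error = validation_error()     # current error on the validation set
        if error < best:
            best = error
            epochs_since_improvement = 0
        else:
            epochs_since_improvement += 1
            if epochs_since_improvement >= patience:
                break
    return best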

2.2.2 Training, Testing, and Validation Sets


We now need three sets of data: the training set to actually train the algorithm, the validation
set to keep track of how well it is doing as it learns, and the test set to produce the final
results. This is becoming expensive in data, especially since for supervised learning it all has
to have target values attached (and even for unsupervised learning, the validation and test
sets need targets so that you have something to compare to), and it is not always easy to
get accurate labels (which may well be why you want to learn about the data). The area of
semi-supervised learning attempts to deal with this need for large amounts of labelled data;
see the Further Reading section for some references.
Clearly, each algorithm is going to need some reasonable amount of data to learn from
(precise needs vary, but the more data the algorithm sees, the more likely it is to have seen
examples of each possible type of input, although more data also increases the computational
time to learn). However, the same argument can be used to argue that the validation and
test sets should also be reasonably large. Generally, the exact proportion of training to
testing to validation data is up to you, but it is typical to do something like 50:25:25 if you
have plenty of data, and 60:20:20 if you don’t. How you do the splitting can also matter.
Many datasets are presented with the first set of datapoints being in class 1, the next in
class 2, and so on. If you pick the first few points to be the training set, the next the test
set, etc., then the results are going to be pretty bad, since the training did not see all the
classes. This can be dealt with by randomly reordering the data first, or by assigning each
datapoint randomly to one of the sets, as is shown in Figure 2.6.

FIGURE 2.6 The dataset is split into different sets, some for training, some for validation,
and some for testing.
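
A minimal NumPy sketch of this kind of random splitting might look as follows. The function name, the 60:20:20 default proportions, and the assumption that the inputs and targets are NumPy arrays are all illustrative choices, not the book's code.

import numpy as np

def split_data(inputs, targets, proportions=(0.6, 0.2, 0.2), seed=None):
    # Randomly reorder the datapoints, then cut the ordering into
    # training, validation, and test sets in the given proportions.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(inputs))
    n_train = int(proportions[0] * len(inputs))
    n_valid = int(proportions[1] * len(inputs))
    train = order[:n_train]
    valid = order[n_train:n_train + n_valid]
    test = order[n_train + n_valid:]
    return ((inputs[train], targets[train]),
            (inputs[valid], targets[valid]),
            (inputs[test], targets[test]))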
If you are really short of training data, so that keeping a separate validation set raises
the worry that the algorithm won’t be sufficiently trained, then it is possible to perform
leave-some-out, multi-fold cross-validation. The idea is shown in Figure 2.7. The dataset is
randomly partitioned into K subsets, and one subset is used as a validation set, while the
algorithm is trained on all of the others. A different subset is then left out and a new model
is trained on the remaining data, repeating the same process for all of the different subsets. Finally,
the model that produced the lowest validation error is tested and used. We’ve traded off
data for computation time, since we’ve had to train K different models instead of just one.
In the most extreme case of this there is leave-one-out cross-validation, where the algorithm
is validated on just one piece of data, training on all of the rest.
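
A sketch of the multi-fold procedure, again assuming hypothetical train and validation_error helpers standing in for whichever algorithm is being cross-validated:

import numpy as np

def k_fold_cross_validation(inputs, targets, train, validation_error, K=5, seed=None):
    # Randomly partition the data into K subsets; each subset takes a turn
    # as the validation set while a model is trained on all of the others.
    # The model with the lowest validation error is returned.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(inputs)), K)
    best_model, best_error = None, float('inf')
    for i in range(K):
        valid_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != i])
        model = train(inputs[train_idx], targets[train_idx])
        error = validation_error(model, inputs[valid_idx], targets[valid_idx])
        if error < best_error:
            best_model, best_error = model, error
    return best_model, best_error

Setting K equal to the number of datapoints gives the leave-one-out case described above.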

2.2.3 The Confusion Matrix


Regardless of how much data we use to test the trained algorithm, we still need to work
out whether or not the result is good. We will look here at a method that is suitable for
classification problems that is known as the confusion matrix. For regression problems things
are more complicated because the results are continuous, and so the most common thing
to use is the sum-of-squares error that we will use to drive the training in the following
chapters. We will see these methods being used as we look at examples.
The confusion matrix is a nice simple idea: make a square matrix that contains all the
possible classes in both the horizontal and vertical directions and list the classes along the
top of a table as the predicted outputs, and then down the left-hand side as the targets. So
for example, the element of the matrix at (i, j) tells us how many input patterns were put
into class j by the algorithm when their target was class i.
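
A minimal sketch of building such a matrix with NumPy, with the targets indexing the rows and the predictions indexing the columns. The function and its arguments are illustrative, and the classes are assumed to be numbered 0 to n_classes - 1.

import numpy as np

def confusion_matrix(targets, predictions, n_classes):
    # Rows are the target classes, columns are the predicted classes.
    # Entry (i, j) counts how many patterns of target class i were
    # predicted as class j.
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for target, prediction in zip(targets, predictions):
        matrix[target, prediction] += 1
    return matrix

# Example with three classes and six test points:
# confusion_matrix([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], 3)
# array([[1, 1, 0],
#        [0, 2, 0],
#        [1, 0, 1]])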
