FIGURE 2.4 The volume of the unit hypersphere for different numbers of dimensions.
make predictions on data inputs. In the next section we consider how to evaluate how well
an algorithm actually achieves this.
2.2.1 Overfitting
Unfortunately, things are a little bit more complicated than that, since we might also want
to know how well the algorithm is generalising as it learns: we need to make sure that we
do enough training that the algorithm generalises well. In fact, there is at least as much
danger in over-training as there is in under-training. The number of degrees of variability in
most machine learning algorithms is huge — for a neural network there are lots of weights,
and each of them can vary. This is undoubtedly more variation than there is in the function
we are learning, so we need to be careful: if we train for too long, then we will overfit the
data, which means that we have learnt about the noise and inaccuracies in the data as well
as the actual function. Therefore, the model that we learn will be much too complicated,
and won’t be able to generalise.
FIGURE 2.5 The effect of overfitting is that rather than finding the generating function
(as shown on the left), the neural network matches the inputs perfectly, including the
noise in them (on the right). This reduces the generalisation capabilities of the network.
Figure 2.5 shows this by plotting the predictions of some algorithm (as the curve) at
two different points in the learning process. On the left of the figure the curve fits the
overall trend of the data well (it has generalised to the underlying general function), but
the training error would still not be that close to zero since it passes near, but not through,
the training data. As the network continues to learn, it will eventually produce a much
more complex model that has a lower training error (close to zero), meaning that it has
memorised the training examples, including any noise component of them, so that it has
overfitted the training data.
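As a rough illustration of this effect, the following sketch (my own example, not the book's; the sinusoidal generating function, the noise level, and the polynomial degrees are all assumed for illustration) fits noisy samples with a simple and a very flexible polynomial model. The flexible model drives the training error towards zero but does worse on fresh data drawn from the same generating function, which is the overfitting shown on the right of Figure 2.5.

import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Noisy samples of an assumed generating function y = sin(2*pi*x)
x = np.sort(rng.uniform(0, 1, 20))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.15, x.shape)

for degree in (3, 12):
    # Fit a polynomial of the given degree; the high degree has enough
    # freedom to chase the noise in the training points
    model = Polynomial.fit(x, y, deg=degree)
    train_error = np.mean((model(x) - y) ** 2)

    # Fresh points from the same generating function measure generalisation
    x_new = rng.uniform(0, 1, 200)
    y_new = np.sin(2 * np.pi * x_new) + rng.normal(0, 0.15, x_new.shape)
    test_error = np.mean((model(x_new) - y_new) ** 2)

    print(f"degree {degree}: training error {train_error:.4f}, "
          f"error on new data {test_error:.4f}")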
We want to stop the learning process before the algorithm overfits, which means that
we need to know how well it is generalising at each timestep. We can’t use the training data
for this, because we wouldn’t detect overfitting, but we can’t use the testing data either,
because we’re saving that for the final tests. So we need a third set of data to use for this
purpose, which is called the validation set because we’re using it to validate the learning so
far. This is known as cross-validation in statistics. It is part of model selection: choosing the
right parameters for the model so that it generalises as well as possible.
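A minimal sketch of this idea follows (my own example with an assumed linear model, learning rate, and dataset sizes, not code from the book): at each timestep the model is updated on the training set, its error is monitored on the validation set, and the weights that gave the lowest validation error so far are kept, so that training can be stopped before overfitting sets in.

import numpy as np

rng = np.random.default_rng(1)

# A small regression problem with noisy targets (assumed data for illustration)
X_train, X_valid = rng.normal(size=(40, 5)), rng.normal(size=(20, 5))
true_w = rng.normal(size=5)
y_train = X_train @ true_w + rng.normal(0, 0.5, 40)
y_valid = X_valid @ true_w + rng.normal(0, 0.5, 20)

w = np.zeros(5)
best_w, best_valid_error = w.copy(), np.inf
eta = 0.01  # learning rate (an assumed value)

for step in range(500):
    # One gradient-descent step on the training error
    grad = 2 * X_train.T @ (X_train @ w - y_train) / len(y_train)
    w -= eta * grad

    # Check generalisation on the validation set at each timestep and
    # remember the weights with the lowest validation error
    valid_error = np.mean((X_valid @ w - y_valid) ** 2)
    if valid_error < best_valid_error:
        best_valid_error, best_w = valid_error, w.copy()

print("lowest validation error:", best_valid_error)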
FIGURE 2.6 The dataset is split into different sets, some for training, some for validation,
and some for testing.
The validation and test sets should also be reasonably large. Generally, the exact proportion of training to
testing to validation data is up to you, but it is typical to do something like 50:25:25 if you
have plenty of data, and 60:20:20 if you don’t. How you do the splitting can also matter.
Many datasets are presented with the first set of datapoints being in class 1, the next in
class 2, and so on. If you pick the first few points to be the training set, the next the test
set, etc., then the results are going to be pretty bad, since the training set will not have seen all the
classes. This can be dealt with by randomly reordering the data first, or by assigning each
datapoint randomly to one of the sets, as is shown in Figure 2.6.
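The sketch below (an assumed helper name split_data and an assumed 50:25:25 default, not code from the book) shows one way to do this with NumPy: the datapoints are randomly reordered before being cut into training, validation, and test sets, so that each set can contain examples of every class.

import numpy as np

def split_data(X, y, fractions=(0.5, 0.25, 0.25), seed=None):
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))      # random reordering of the datapoints
    X, y = X[order], y[order]

    n_train = int(fractions[0] * len(X))
    n_valid = int(fractions[1] * len(X))
    train = (X[:n_train], y[:n_train])
    valid = (X[n_train:n_train + n_valid], y[n_train:n_train + n_valid])
    test = (X[n_train + n_valid:], y[n_train + n_valid:])
    return train, valid, test

# Example: 8 datapoints whose classes arrive in order, as described above
X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
train, valid, test = split_data(X, y, seed=3)
print(train[1], valid[1], test[1])   # the class labels are now randomly spread across the sets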
If you are really short of training data, so that having a separate validation set raises the worry
that the algorithm won’t be sufficiently trained, then it is possible to perform
leave-some-out, multi-fold cross-validation. The idea is shown in Figure 2.7. The dataset is
randomly partitioned into K subsets, and one subset is used as a validation set, while the
algorithm is trained on all of the others. A different subset is then left out as the validation set
and a new model is trained on the rest, repeating the same process for all of the different subsets. Finally,
the model that produced the lowest validation error is tested and used. We’ve traded off
data for computation time, since we’ve had to train K different models instead of just one.
In the most extreme case of this there is leave-one-out cross-validation, where the algorithm
is validated on just one piece of data, training on all of the rest.
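The sketch below (my own illustration with assumed names train_model and error for the learning algorithm and its error measure; a least-squares fit stands in for the learner) follows this recipe: the data are randomly partitioned into K subsets, each subset is held out in turn as the validation set while a model is trained on all of the others, and the model with the lowest validation error is returned. Setting K equal to the number of datapoints gives leave-one-out cross-validation.

import numpy as np

def kfold_cross_validation(X, y, K, train_model, error, seed=None):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), K)   # random partition into K subsets

    best_model, best_error = None, np.inf
    for k in range(K):
        valid_idx = folds[k]                             # the subset left out this time
        train_idx = np.hstack([folds[j] for j in range(K) if j != k])

        model = train_model(X[train_idx], y[train_idx])  # train on all of the other subsets
        valid_error = error(model, X[valid_idx], y[valid_idx])
        if valid_error < best_error:
            best_model, best_error = model, valid_error

    return best_model, best_error

# Usage with a least-squares fit as a stand-in learner
rng = np.random.default_rng(4)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 30)
fit = lambda Xt, yt: np.linalg.lstsq(Xt, yt, rcond=None)[0]
mse = lambda w, Xv, yv: np.mean((Xv @ w - yv) ** 2)
print(kfold_cross_validation(X, y, K=5, train_model=fit, error=mse, seed=6))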