# Hyperparameters and Model Validation
In the previous section, we saw the basic recipe for applying a supervised machine learning model:

1. Choose a class of model.
2. Choose model hyperparameters.
3. Fit the model to the training data.
4. Use the model to predict labels for new data.
The first two pieces of this—the choice of model and choice of hyperparameters
—are perhaps the most important part of using these tools and techniques
effectively. In order to make an informed choice, we need a way to validate that
our model and our hyperparameters are a good fit to the data. While this may
sound simple, there are some pitfalls that you must avoid to do this effectively.
The following sections first show a naive approach to model validation and why
it fails, before exploring the use of holdout sets and cross-validation for more
robust model evaluation.
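We begin with a naive approach: training a model and evaluating it on the very same data. The cells that load the data and choose the model are not shown in this extract; a minimal sketch consistent with the surrounding discussion (the 150-sample Iris dataset and a one-nearest-neighbor classifier are both inferred from the text below, not confirmed by it) is:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# load the 150-sample Iris dataset as a feature matrix X and target array y
iris = load_iris()
X, y = iris.data, iris.target

# choose a model class and hyperparameters: a one-nearest-neighbor classifier
model = KNeighborsClassifier(n_neighbors=1)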
Then we train the model, and use it to predict labels for data we already know:
In [3]: model.fit(X, y)
y_model = model.predict(X)
In [4]: from sklearn.metrics import accuracy_score
        accuracy_score(y, y_model)

Out[4]: 1.0
We see an accuracy score of 1.0, which indicates that 100% of points were
correctly labeled by our model! But is this truly measuring the expected
accuracy? Have we really come upon a model that we expect to be correct 100%
of the time?
As you may have gathered, the answer is no. In fact, this approach contains a
fundamental flaw: it trains and evaluates the model on the same data.
Furthermore, the nearest neighbor model is an instance-based estimator that
simply stores the training data, and predicts labels by comparing new data to
these stored points: except in contrived cases, it will get 100% accuracy every
time!
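A better sense of a model's performance can be found using a holdout set: that is, we hold back some subset of the data from the training of the model, and then use this holdout set to check the model's performance. The cell producing the output below is not included in this extract; a minimal sketch using Scikit-Learn's train_test_split (the 50/50 split, the random seed, and the X1, X2, y1, y2 names are assumptions) is:

from sklearn.model_selection import train_test_split

# hold back half of the data as a validation set
X1, X2, y1, y2 = train_test_split(X, y, train_size=0.5, random_state=0)

# fit the model on one half, then score it on the unseen half
y2_model = model.fit(X1, y1).predict(X2)
accuracy_score(y2, y2_model)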
Out[5]: 0.90666666666666662
We see here a more reasonable result: the nearest-neighbor classifier is about 90% accurate on data it did not see during training. One disadvantage of using a holdout set, however, is that we have lost a portion of our data to the model training; in the case above, half the dataset does not contribute to the model fit at all. One way to address this is cross-validation: a sequence of fits in which each subset of the data is used both as a training set and as a validation set. In the simplest version, we do two validation trials, alternately using each half of the data as a holdout set. Using the split data from before, we could implement it like this:
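A sketch of these two trials, reusing the X1, X2, y1, y2 names from the holdout sketch above:

# each half serves once as the training set and once as the validation set
y2_model = model.fit(X1, y1).predict(X2)
y1_model = model.fit(X2, y2).predict(X1)
accuracy_score(y1, y1_model), accuracy_score(y2, y2_model)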
What comes out are two accuracy scores, which we could combine (by, say,
taking the mean) to get a better measure of the global model performance. This
particular form of cross-validation is a two-fold cross-validation—that is, one in
which we have split the data into two sets and used each in turn as a validation
set.
We could expand on this idea to use even more trials, and more folds in the
data—for example, here is a visual depiction of five-fold cross-validation:
Here we split the data into five groups, and use each of them in turn to evaluate
the model fit on the other 4/5 of the data. This would be rather tedious to do by
hand, and so we can use Scikit-Learn's cross_val_score convenience routine
to do it succinctly:
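The cell itself is not shown in this extract; a sketch of the call for the classifier and data used above would be:

from sklearn.model_selection import cross_val_score

# five-fold cross-validation: returns one accuracy score per fold
cross_val_score(model, X, y, cv=5)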
Repeating the validation across different subsets of the data gives us an even
better idea of the performance of the algorithm.
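Taken to its extreme, we can hold out a single point in each trial; this is known as leave-one-out cross-validation, and the output below comes from such a run. A sketch (the scores name is reused later in the text):

from sklearn.model_selection import LeaveOneOut

# one trial per sample: train on all points but one, test on the held-out point
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
scores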
Out[8]: array([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 0., 1., 0., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1.])
Because we have 150 samples, leave-one-out cross-validation yields scores for 150 trials, and each score indicates either a successful (1.0) or an unsuccessful (0.0) prediction. Taking the mean of these gives an estimate of the model's accuracy:
In [9]: scores.mean()
Out[9]: 0.95999999999999996
Other cross-validation schemes can be used similarly; for a description of what is available in Scikit-Learn, see the cross-validation documentation (http://scikit-learn.org/stable/modules/cross_validation.html).
# Selecting the Best Model

Fundamental to choosing a model and its hyperparameters is the bias-variance trade-off. Consider two regression fits to the same simple dataset: a straight line and a high-degree polynomial. It is clear that neither of these models is a particularly good fit to the data, but they fail in different ways.
The model on the left attempts to find a straight-line fit through the data.
Because the data are intrinsically more complicated than a straight line, the
straight-line model will never be able to describe this dataset well. Such a
model is said to underfit the data: that is, it does not have enough model
flexibility to suitably account for all the features in the data; another way of
saying this is that the model has high bias.
The model on the right attempts to fit a high-order polynomial through the
data. Here the model fit has enough flexibility to nearly perfectly account for
the fine features in the data, but even though it very accurately describes the
training data, its precise form seems to be more reflective of the particular
noise properties of the data rather than the intrinsic properties of whatever
process generated that data. Such a model is said to overfit the data: that is, it
has so much model flexibility that the model ends up accounting for random
errors as well as the underlying data distribution; another way of saying this is
that the model has high variance.
To look at this in another light, consider what happens if we use these two
models to predict the y-value for some new data. In the following diagrams, the
red/lighter points indicate data that is omitted from the training set:
From how each of these models performs on the training points versus the omitted points, we can make an observation that holds more generally:

- For high-bias models, the performance of the model on the validation set is similar to the performance on the training set.
- For high-variance models, the performance of the model on the validation set is far worse than the performance on the training set.
A plot of the training score and validation score as a function of model complexity is often called a validation curve, and it has the following essential features:
- The training score is everywhere higher than the validation score. This is generally the case: the model will be a better fit to data it has seen than to data it has not seen.
- For very low model complexity (a high-bias model), the training data is under-fit, which means that the model is a poor predictor both for the training data and for any previously unseen data.
- For very high model complexity (a high-variance model), the training data is over-fit, which means that the model predicts the training data very well, but fails for any previously unseen data.
- For some intermediate value, the validation curve has a maximum. This level of complexity indicates a suitable trade-off between bias and variance.
The means of tuning the model complexity varies from model to model; when
we discuss individual models in depth in later sections, we will see how each
model allows for such tuning.
To make this concrete, in the following we will use a polynomial regression model, in which the degree of the polynomial is a tunable hyperparameter. A degree-1 polynomial fits a straight line to the data; for model parameters $a$ and $b$:

$$y = ax + b$$

A degree-3 polynomial fits a cubic curve to the data; for model parameters $a, b, c, d$:

$$y = ax^3 + bx^2 + cx + d$$
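We can generalize this to a polynomial of any degree. The cell defining the PolynomialRegression model used below is not included in this extract; a minimal sketch, chaining Scikit-Learn's PolynomialFeatures preprocessor with a LinearRegression in a pipeline (the default degree here is an assumption), is:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

def PolynomialRegression(degree=2, **kwargs):
    # polynomial feature expansion followed by an ordinary linear regression
    return make_pipeline(PolynomialFeatures(degree),
                         LinearRegression(**kwargs))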
Now let's create some data to which we will fit our model:
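The make_data helper called next is not defined in this extract; a sketch of a noisy nonlinear data generator consistent with the later plots (the functional form, noise level, and random seed are assumptions) is:

import numpy as np

def make_data(N, err=1.0, rseed=1):
    # sample N points in [0, 1) and apply a nonlinear function plus noise
    rng = np.random.RandomState(rseed)
    X = rng.rand(N, 1) ** 2
    y = 10 - 1. / (X.ravel() + 0.1)
    if err > 0:
        y += err * rng.randn(N)
    return X, y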
X, y = make_data(40)
We can now visualize our data, along with polynomial fits of several degrees:
import matplotlib.pyplot as plt
import numpy as np

# grid of x values over which to plot the model predictions
# (the exact grid used in the original notebook is an assumption here)
X_test = np.linspace(-0.1, 1.1, 500)[:, None]

plt.scatter(X.ravel(), y, color='black')
axis = plt.axis()
for degree in [1, 3, 5]:
    y_test = PolynomialRegression(degree).fit(X, y).predict(X_test)
    plt.plot(X_test.ravel(), y_test, label='degree={0}'.format(degree))
plt.xlim(-0.1, 1.0)
plt.ylim(-2, 12)
plt.legend(loc='best');
The knob controlling model complexity in this case is the degree of the
polynomial, which can be any non-negative integer. A useful question to answer
is this: what degree of polynomial provides a suitable trade-off between bias
(under-fitting) and variance (over-fitting)?
We can make progress in this by visualizing the validation curve for this
particular data and model; this can be done straightforwardly using the
validation_curve convenience routine provided by Scikit-Learn. Given a
model, data, parameter name, and a range to explore, this function will
automatically compute both the training score and validation score across the
range:
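The cell producing the validation-curve plot is not shown here; a sketch for the PolynomialRegression pipeline above (the degree range, the number of folds, and the use of the median score across folds are assumptions) is:

from sklearn.model_selection import validation_curve

degree = np.arange(0, 21)
train_score, val_score = validation_curve(
    PolynomialRegression(), X, y,
    param_name='polynomialfeatures__degree',
    param_range=degree, cv=7)

# plot the median training and validation score at each polynomial degree
plt.plot(degree, np.median(train_score, 1), color='blue', label='training score')
plt.plot(degree, np.median(val_score, 1), color='red', label='validation score')
plt.legend(loc='best')
plt.ylim(0, 1)
plt.xlabel('degree')
plt.ylabel('score');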
This shows precisely the qualitative behavior we expect: the training score is
everywhere higher than the validation score; the training score is monotonically
improving with increased model complexity; and the validation score reaches a
maximum before dropping off as the model becomes over-fit.
From the validation curve, we can read off that the optimal trade-off between bias and variance is found for a third-order polynomial; we can compute and display this fit over the original data as follows:
In [14]: plt.scatter(X.ravel(), y)
lim = plt.axis()
y_test = PolynomialRegression(3).fit(X, y).predict(X_test)
plt.plot(X_test.ravel(), y_test);
plt.axis(lim);
Notice that finding this optimal model did not actually require us to compute
the training score, but examining the relationship between the training score
and validation score can give us useful insight into the performance of the
model.
# Learning Curves
One important aspect of model complexity is that the optimal model will
generally depend on the size of your training data. For example, let's generate a
new dataset with a factor of five more points:
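A sketch, reusing the make_data helper assumed above (the X2, y2 names are likewise assumptions):

# 200 points: five times the size of the original 40-sample dataset
X2, y2 = make_data(200)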
We will duplicate the preceding code to plot the validation curve for this larger
dataset; for reference let's over-plot the previous results as well:
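A sketch of this comparison, reusing the names from the validation-curve sketch above; the fainter dashed styling for the smaller dataset's curves matches the description below:

degree = np.arange(21)
train_score2, val_score2 = validation_curve(
    PolynomialRegression(), X2, y2,
    param_name='polynomialfeatures__degree',
    param_range=degree, cv=7)

# new results as solid lines, previous (smaller-dataset) results as faint dashed lines
plt.plot(degree, np.median(train_score2, 1), color='blue', label='training score')
plt.plot(degree, np.median(val_score2, 1), color='red', label='validation score')
plt.plot(degree, np.median(train_score, 1), color='blue', alpha=0.3, linestyle='dashed')
plt.plot(degree, np.median(val_score, 1), color='red', alpha=0.3, linestyle='dashed')
plt.legend(loc='lower center')
plt.ylim(0, 1)
plt.xlabel('degree')
plt.ylabel('score');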
The solid lines show the new results, while the fainter dashed lines show the
results of the previous smaller dataset. It is clear from the validation curve that
the larger dataset can support a much more complicated model: the peak here
is probably around a degree of 6, but even a degree-20 model is not seriously
over-fitting the data—the validation and training scores remain very close.
Thus we see that the behavior of the validation curve has not one but two
important inputs: the model complexity and the number of training points. It is
often useful to explore the behavior of the model as a function of the number
of training points, which we can do by using increasingly larger subsets of the
data to fit our model. A plot of the training/validation score with respect to the
size of the training set is known as a learning curve.
The general behavior we would expect from a learning curve is this:

- A model of a given complexity will overfit a small dataset: this means the training score will be relatively high, while the validation score will be relatively low.
- A model of a given complexity will underfit a large dataset: this means the training score will decrease, but the validation score will increase.
We can use Scikit-Learn's learning_curve utility to compute and plot learning curves for a low-degree and a high-degree polynomial model on our data; a sketch (the particular degrees and the cross-validation setting are assumptions):

from sklearn.model_selection import learning_curve

fig, ax = plt.subplots(1, 2, figsize=(16, 6))
for i, degree in enumerate([2, 9]):
    # training/validation scores as a function of training-set size
    N, train_lc, val_lc = learning_curve(PolynomialRegression(degree), X, y, cv=7)
    ax[i].plot(N, np.mean(train_lc, 1), label='training score')
    ax[i].plot(N, np.mean(val_lc, 1), label='validation score')
    # dashed line: the score to which the two curves converge
    ax[i].hlines(np.mean([train_lc[-1], val_lc[-1]]), N[0], N[-1],
                 color='gray', linestyle='dashed')
    ax[i].set_ylim(0, 1)
    ax[i].set_xlim(N[0], N[-1])
    ax[i].set_xlabel('training size')
    ax[i].set_ylabel('score')
    ax[i].set_title('degree = {0}'.format(degree), size=14)
    ax[i].legend(loc='best')
The notable feature of the learning curve is that it converges to a particular score as the number of training samples grows; once the curves have converged, adding more training data will not meaningfully improve the fit. The only way to increase the converged score is to use a different (usually more
complicated) model. We see this in the right panel: by moving to a much more
complicated model, we increase the score of convergence (indicated by the
dashed line), but at the expense of higher model variance (indicated by the
difference between the training and validation scores). If we were to add even
more data points, the learning curve for the more complicated model would
eventually converge.
Plotting a learning curve for your particular choice of model and dataset can
help you to make this type of decision about how to move forward in improving
your analysis.
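# Validation in Practice: Grid Search

In practice, models generally have more than one knob to turn, so exploring validation and learning curves by hand quickly becomes tedious. Scikit-Learn automates this search with its GridSearchCV meta-estimator, which evaluates a model over a grid of hyperparameter values using cross-validation. The cell constructing the grid-search object is not included in this extract; a minimal sketch over the polynomial degree and the regression's intercept option (this particular grid and the cv value are assumptions) is:

from sklearn.model_selection import GridSearchCV

param_grid = {'polynomialfeatures__degree': np.arange(21),
              'linearregression__fit_intercept': [True, False]}

grid = GridSearchCV(PolynomialRegression(), param_grid, cv=7)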
Notice that like a normal estimator, this has not yet been applied to any data.
Calling the fit() method will fit the model at each grid point, keeping track of
the scores along the way:
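Continuing the sketch above:

grid.fit(X, y)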
Now that this is fit, we can ask for the best parameters as follows:
In [20]: grid.best_params_
Finally, if we wish, we can use the best model and show the fit to our data using
code from before:
# use the best estimator found by the grid search
model = grid.best_estimator_

plt.scatter(X.ravel(), y)
lim = plt.axis()
y_test = model.fit(X, y).predict(X_test)
plt.plot(X_test.ravel(), y_test);
plt.axis(lim);
The grid search provides many more options, including the ability to specify a
custom scoring function, to parallelize the computations, to do randomized
searches, and more. For information, see the examples in In-Depth: Kernel
Density Estimation (05.13-kernel-density-estimation.html) and Feature
Engineering: Working with Images (05.14-image-features.html), or refer to
Scikit-Learn's grid search documentation (http://Scikit-
Learn.org/stable/modules/grid_search.html).
# Summary
In this section, we have begun to explore the concept of model validation and
hyperparameter optimization, focusing on intuitive aspects of the bias–
variance trade-off and how it comes into play when fitting models to data. In
particular, we found that the use of a validation set or cross-validation
approach is vital when tuning parameters in order to avoid over-fitting for more
complex/flexible models.
In later sections, we will discuss the details of particularly useful models, and
throughout will talk about what tuning is available for these models and how
these free parameters affect model complexity. Keep the lessons of this section
in mind as you read on and learn about these machine learning approaches!