Data Science Concepts: Overfitting and Underfitting
Aitor Larrinoa
January 2025
Contents
1 Introduction
2 What are overfitting and underfitting?
3 Example
1 Introduction
Our main goal when we train a ML model is to get good results. Thus, the better the model's
metric, the better its performance. However, is this entirely true? We have to be cautious,
because our main goal should instead be good generalization.
A model is said to generalize well when it can handle new, unseen input data effectively.
However, finding a balance between fitting the training data and performing well on new data
is not straightforward and can lead to two common problems: overfitting and underfitting.
In this post we will dive into the concepts of overfitting and underfitting, understand why
they happen, work through a practical example, and look at what strategies we can use to
avoid them.
2 What are overfitting and underfitting?
Two of the biggest problems when dealing with ML models are overfitting and underfitting.
Just like a human being, a learning machine must be able to generalize concepts. Suppose
that we see a Labrador Retriever for the first time in our lives, and someone tells us, “That
is a dog.” Later, we are shown a Poodle and asked, “Is that a dog?” We might say, “No,” as
it looks nothing like what we previously learned. Now imagine someone shows us a book with
pictures of 10 different dog breeds. When we see a breed we are unfamiliar with, we will be able
to recognize it as a dog because of the characteristics observed in the various dogs depicted in
the photos.
The goal is to ensure that the model can generalize a concept so that, when presented with
new, unfamiliar data, it can still provide a reliable result. Before diving into overfitting and
underfitting, the following concepts must be understood:
Definition 2.1. Bias is the difference between the model’s prediction and the correct value it
aims to predict.
Definition 2.2. Variance is the variability of the model’s prediction for a given data point; it
tells us how spread out our predictions are.
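To make these two definitions concrete, here is a minimal sketch (assuming NumPy and scikit-learn, and a toy quadratic ground truth chosen only for illustration) that estimates the bias and variance of a linear model at a single point by refitting it on many resampled training sets:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical ground-truth relationship, used only for this illustration
    return x ** 2

x_new = np.array([[1.5]])   # single point at which we study bias and variance
predictions = []

# Refit the same (too simple) model on many resampled training sets
for _ in range(200):
    x_train = rng.uniform(-2, 2, size=(30, 1))
    y_train = true_f(x_train).ravel() + rng.normal(0, 0.1, size=30)
    model = LinearRegression().fit(x_train, y_train)
    predictions.append(model.predict(x_new)[0])

predictions = np.array(predictions)
bias = predictions.mean() - true_f(x_new)[0, 0]   # Definition 2.1
variance = predictions.var()                      # Definition 2.2
print(f"bias ~ {bias:.3f}, variance ~ {variance:.3f}")
```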
So now, what is overfitting? And what is underfitting?
• Overfitting: The model adjusts only to the specific cases it is taught (the training set)
and is unable to recognize new input data (the test set).
• Underfitting: The model is too simple to learn even the cases it is taught, so it performs
poorly on both the training set and the test set.
In other words, underfitting occurs when the model is too simple, resulting in high bias and
an inability to capture the true patterns in the data, whereas overfitting happens when the
model is too complex, leading to high variance and poor generalization to unseen data.
Next, we show some visual examples of overfitting and underfitting for classification and
regression tasks:
Figure 1: Overfitting and underfitting
3 Example
We will create an example in order to see the relevance of overfitting and underfitting more
clearly.
Let’s suppose we have a dataset in front of us where y is a function of x and their relationship
is given by the following equation:
y = x²
If we consider a linear regression model, for example y = β₀ + x · β₁, the error will be high
because a straight line cannot capture the curvature of the relationship above. This is
underfitting, as shown in the following plot:
Clearly, the line does not fit our data points well. In fact, the metric confirms the poor
performance of the model. Thus, underfitting clearly appears.
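As a rough sketch of this underfitting scenario (assuming scikit-learn, and a small synthetic sample of the y = x² relationship with a bit of noise), a straight line leaves a large error even on the data it was trained on:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(100, 1))
y = x.ravel() ** 2 + rng.normal(0, 0.1, size=100)   # y = x^2 plus a little noise

# A straight line cannot bend to follow the parabola: high bias, underfitting
linear = LinearRegression().fit(x, y)
print("train MSE:", round(mean_squared_error(y, linear.predict(x)), 3))
print("train R^2:", round(linear.score(x, y), 3))   # low even on the training data
```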
However, if we consider a polynomial regression with a high degree, say 22, we will obtain a
model that fits the training data extremely well but will not be capable of generalizing its
predictions.
As said before, the model performs extremely well on the training data. As a result, if we now
consider a new data point, even one only slightly different from the training data, the error
will be quite high due to overfitting. Thus, our main goal when training a model should be
generalization.
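Continuing the same toy setting, here is a hedged sketch of the overfitting side (again assuming scikit-learn; the exact numbers depend on the random seed): a degree-22 polynomial has enough flexibility to chase the noise in a small training set and then fail on held-out points:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=(40, 1))
y = x.ravel() ** 2 + rng.normal(0, 0.1, size=40)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.25, random_state=0)

# Degree 22 gives the model enough flexibility to memorise the training noise
overfit = make_pipeline(PolynomialFeatures(degree=22), LinearRegression()).fit(x_tr, y_tr)
print("train MSE:", round(mean_squared_error(y_tr, overfit.predict(x_tr)), 4))  # typically tiny
print("test  MSE:", round(mean_squared_error(y_te, overfit.predict(x_te)), 4))  # typically much larger
```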
As seen, overfitting and underfitting can cause serious problems when creating a machine
learning model. Thus, we need to control them. Let's go through some generic measures we
can take in order to keep underfitting and overfitting under control:
• More data. Training with few data points can cause overfitting.
• Reduce model complexity. Less is more: a very complex model can lead to
overfitting.
• Feature engineering. Poor feature engineering often leads to underfitting, which is why
this is one of the most important considerations in a data science project.
• Cross-validation. Techniques like k-fold cross-validation can help evaluate the model’s
performance on unseen data, reducing the risk of overfitting or underfitting during
training (see the sketch below).
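As a small illustration of the cross-validation point above (a sketch assuming scikit-learn and its bundled diabetes dataset), k-fold scores on held-out folds give a more honest view of generalization than the training score alone:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Each fold is trained on 4/5 of the data and scored on the remaining, unseen 1/5
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
print("fold R^2 scores:", scores.round(3))
print("mean R^2:", round(scores.mean(), 3))
```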
In fact, depending on the model we are dealing with, the path to avoiding overfitting and
underfitting is never the same. Let's dive into different types of models and see how we can
deal with these problems in each case:
Parametric models, such as linear regression and logistic regression, assume a fixed functional
form with a finite number of parameters. Here are some approaches to controlling
underfitting and overfitting in these models:
• Regularization: Techniques like ridge or lasso regression add constraints to the model
coefficients, reducing overfitting (see the sketch after this list).
• Feature selection: Choose only the most relevant features for the model. Reducing
irrelevant or highly correlated features can improve generalization.
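A minimal sketch of the regularization idea (assuming scikit-learn and, again, its diabetes dataset; the alpha values are arbitrary illustration choices):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# A larger alpha shrinks the coefficients more strongly, trading variance for bias
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=0.1)   # lasso can also drive some coefficients exactly to zero

for name, model in [("ridge", ridge), ("lasso", lasso)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(name, "mean R^2:", round(scores.mean(), 3))
```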
Decision trees, random forests, XGBoost, ... are examples of tree-based algorithms. These
types of algorithms tend to overfit. Let's see what can be done for these types of models:
• Ensemble methods: Models like random forests and gradient boosting combine multiple
trees to improve generalization. Use techniques like bagging (random forests) or
boosting to balance bias and variance.
• Feature importance: Selecting the features that contribute most to the model’s predictions
can improve generalization and help avoid overfitting (see the sketch below).
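To illustrate both points for tree-based models, here is a sketch (assuming scikit-learn and its breast cancer dataset; the depth and number of trees are arbitrary choices) that limits tree depth to curb variance and then inspects feature importances:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data = load_breast_cancer()
X, y = data.data, data.target

# Averaging many shallow trees keeps the variance of the ensemble under control
forest = RandomForestClassifier(n_estimators=200, max_depth=4, random_state=0)
print("CV accuracy:", round(cross_val_score(forest, X, y, cv=5).mean(), 3))

# Features with low importance are natural candidates to drop
forest.fit(X, y)
ranked = sorted(zip(data.feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked[:5]:
    print(f"{name}: {importance:.3f}")
```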
Neural networks are among the most complex models in machine learning and AI, and they
are prone to overfitting when the architecture is too complex or the dataset is small.
Consider the following tips:
• Early stopping: Stop training once the validation loss stops improving, preventing
the network from overfitting (see the sketch after this list).
• Data augmentation: Normally used when dealing with images. The idea is to artificially
increase the size of the training dataset by applying transformations such as rotations,
flips, or noise to the input data. Thus, we will get more data for free.
• Architecture tuning: Reduce the number of layers or neurons if the network is too large
for the dataset. This is the main approach when we have an overfitted neural
network.
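As a closing sketch of the early stopping idea (assuming TensorFlow/Keras is installed; the toy data, layer sizes, and patience value are arbitrary illustration choices):

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)).astype("float32")
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype("float32")   # toy binary target

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stop once the validation loss has not improved for 5 epochs and keep the best weights
stopper = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                           restore_best_weights=True)
model.fit(X, y, validation_split=0.2, epochs=100, callbacks=[stopper], verbose=0)
```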