An Overview of Overfitting and Its Solutions
To cite this article: Xue Ying 2019 J. Phys.: Conf. Ser. 1168 022022
Xue Ying
Building 1, Huizhong Tower, No. 1 Shangdi Seven Street, Haidian District, Beijing 100085, China
Email: [email protected]
1. Introduction
In supervised machine learning there is an unavoidable issue: a model that does not generalize well from observed data to unseen data is said to be overfitted [1]. An over-fitted model performs very well on the training set while fitting poorly on the test set, because it has difficulty coping with pieces of information in the test set that differ from those in the training set. In other words, over-fitted models tend to memorize all the data, including the unavoidable noise in the training set, instead of learning the regularities hidden behind the data.
The causes of this phenomenon can be complicated. Generally, they fall into three kinds: 1) noise learning on the training set: when the training set is too small, insufficiently representative, or too noisy, the noise has a good chance of being learned and of later acting as a basis for predictions, so a well-functioning algorithm should be able to distinguish representative data from noise [2]; 2) hypothesis complexity: the trade-off in complexity, a key concept in statistics and machine learning, is a compromise between variance and bias, i.e. a balance between accuracy and consistency. When an algorithm entertains too many hypotheses (too many inputs), the model becomes more accurate on average but less consistent [2], meaning that the learned models can differ drastically across datasets; and 3) multiple comparison procedures, which are ubiquitous in induction algorithms as well as in other Artificial Intelligence (AI) algorithms [3]. In these procedures we repeatedly compare candidate items by the score of an evaluation function and select the item with the maximum score; this process is likely to choose items that do not improve, or even reduce, classification accuracy.
In order to reduce the effect of overfitting, multiple solutions based on different strategies have been proposed to inhibit these different triggers. Nevertheless, most of them perform poorly on real-world problems because of the great number of hypotheses involved, and no single hypothesis set can cover all application fields. The contribution of this paper is to give some general guidelines for choosing solutions according to the application field, from four perspectives.
Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI. Published under licence by IOP Publishing Ltd.
CISAT 2018
IOP Conf. Series: Journal of Physics: Conf. Series 1168 (2019) 022022 doi:10.1088/1742-6596/1168/2/022022
2. Early-stopping
This strategy is used to combat the phenomenon of "learning speed slow-down", in which the accuracy of an algorithm stops improving after some point, or even gets worse, because of noise learning. The idea has a fairly long history, dating back to the 1970s in the context of the Landweber iteration [4]. It has also been widely used in iterative algorithms, especially in neural networks, since the 1990s.
As shown in Figure 1, where the horizontal axis is the epoch and the vertical axis is the error, the blue line shows the training error and the red line shows the validation error. If the model continues learning after the point where the validation error is lowest, the validation error will increase while the training error continues to decrease.
Here the error can be measured by a cost function, for instance the quadratic cost
C = (1/2n) Σₓ (σ(z) − y)²    (2)
where n is the number of training examples, σ(z) is the model output, and y is the labeled value.
Practically, the obvious way to find the point at which to stop learning is to keep track of accuracy on the test data as the network trains: we compute the accuracy at the end of each epoch and stop training when the accuracy on the test data stops improving.
More generally, we can track the accuracy on a validation set instead of the test set to determine when to stop training. In other words, we use the validation set to figure out a good set of values for the hyper-parameters and later use the test set for the final evaluation of accuracy. Compared with using the test data directly to tune the hyper-parameters, this guarantees a higher level of generalization. The strategy exploits the fact that each step of an iterative algorithm reduces bias but increases variance, so stopping early keeps the variance of the estimator from becoming too high; it also has low computational complexity. However, there are some problems: 1) strictly speaking, a rising validation error is not a necessary sign of overfitting, since the accuracy on both the test data and the training data may stop improving at the same time; and 2) we cannot detect the stopping point immediately, so in practice we stop somewhat later to make sure the exact stopping point has been passed.
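The procedure above can be sketched as a generic loop. This is a sketch under assumptions: `train_one_epoch` and `validation_accuracy` are hypothetical callbacks supplied by the caller, and a `patience` counter implements the "stop somewhat later" rule.

```python
def early_stopping(train_one_epoch, validation_accuracy,
                   max_epochs=100, patience=5):
    """Stop training once validation accuracy has not improved for
    `patience` consecutive epochs; return the best epoch and accuracy."""
    best_acc, best_epoch, wait = float('-inf'), 0, 0
    for epoch in range(max_epochs):
        train_one_epoch()
        acc = validation_accuracy()
        if acc > best_acc:
            best_acc, best_epoch, wait = acc, epoch, 0
        else:
            wait += 1
            if wait >= patience:   # tolerate a short plateau, then stop
                break
    return best_epoch, best_acc
```

In practice one would also snapshot the model weights at each new best epoch and restore them after stopping, so that the returned model is the one from the best point rather than the last one trained.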
There are some interesting usages of early stopping. For instance, Rich Caruana, Steve Lawrence, and Lee Giles [6] combined early stopping with back-propagation; with the algorithm they proposed, large nets can be trained effectively without significant overfitting. Hao Wu and Jonathan L. Shapiro [7] used early stopping to prevent overfitting and improve performance in Estimation of Distribution Algorithms (EDAs, a class of evolutionary algorithms that use machine learning techniques to solve optimization problems).
3. Network-reduction
As we know, noise learning is one important cause of overfitting. So logically, noise reduction
becomes one researching direction for overfitting inhibition. Based on this thinking, pruning is
proposed to reduce the size of finial classifiers in relational learning, especially in decision tree
learning. Pruning is a significant theory used to reduce classification complexity by eliminating less
meaningful, or irrelevant data, and finally to prevent overfitting and to improve the classification
accuracy.
Pre-pruning and post-pruning are two standard approaches to dealing with noise:
Pre-pruning algorithms operate during the learning process. Commonly, they use stopping criteria to determine when to stop adding conditions to a rule or rules to a model description, such as an encoding length restriction based on the evaluation of encoding cost, significance testing based on significant differences between the distributions of positive and negative examples, or a cutoff stopping criterion based on a predefined threshold [8].
Post-pruning splits the training set into two subsets: a growing set and a pruning set. In contrast to pre-pruning, post-pruning algorithms ignore overfitting during learning on the growing set; instead, they prevent overfitting by deleting conditions and rules from the model generated during learning. This approach is more accurate, but less efficient [8].
Many previous works have proven the effectiveness of this strategy in decision tree induction, such as CN2 [9], FOIL [10], Fossil [11], Reduced Error Pruning (REP) [12], GROW [13], and J-pruning based on the information-theoretic J-measure [14]. Furthermore, knowing that pre-pruning is more efficient and post-pruning more accurate, researchers have proposed variants that combine or integrate the two to improve both the accuracy and the efficiency of classification, such as Top-Down Pruning (TDP) and Incremental REP (I-REP) [15].
5. Regularization
Generally, the output of a model can be affected by multiple features, and as the number of features increases, the model becomes more complicated. An overfitting model tends to take all the features into consideration, even those with very limited effect on the final output, or, even worse, noise features that are meaningless to the output.
In order to limit these cases, we have two kinds of solutions:
1. Select only the useful features and remove the useless features from our model
2. Minimize the weights of the features which have little influence on the final classification
In other words, we need to limit the effect of these useless features. However, we do not always know which features are useless, so we limit them all by minimizing the cost function of our model. To do this, we add a "penalty term", called the regularizer, to the cost function, as shown in the following formulas:
J̃(ω; X, y) = J(ω; X, y) + αΩ(ω)    (3)
J̃(ω; X, y) = (1/2m) ‖Xω − y‖² + αΩ(ω)    (4)
where J(ω; X, y) is the original cost function, J̃(ω; X, y) is the regularized cost function, ω is the weight vector, X is the training set, y is the labeled value (true value), m is the size of the training set, α is the regularization coefficient, and αΩ(ω) is the penalty term.
Here we can use the gradient-descent method to find the set of weights:
ω(t+1) = ω(t) − (α/m) Σᵢ (p(x(i)) − y(i)) x(i) − (αλ/m) ω(t)    (5)
ω(t+1) = ω(t) (1 − αλ/m) − (α/m) Σᵢ (p(x(i)) − y(i)) x(i)    (6)
where p(x(i)) is the prediction for the i-th example and λ is the weight-decay coefficient. As shown in formula (6), the bigger m is, the smaller the shrink factor αλ/m will be; in other words, the bigger the training set, the smaller both the risk of overfitting and the regularization effect.
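A minimal sketch of one such weight-decay update, labeled (6) above, is given below. It assumes a logistic model p = σ(Xω); the function names and the tiny example are illustrative, not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weight_decay_step(w, X, y, lr=0.1, lam=0.1):
    """One gradient step with L2 weight decay: the weights are first
    shrunk by the factor (1 - lr*lam/m) and then moved against the
    averaged data gradient."""
    m = len(y)
    p = sigmoid(X @ w)            # model predictions
    grad = X.T @ (p - y) / m      # average data gradient
    return w * (1.0 - lr * lam / m) - lr * grad
```

Note how, for a fixed λ, a larger training set size m weakens the shrink factor, matching the remark above.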
In the following sections, we will discuss three general regularization methods in detail: 1) L1 regularization; 2) L2 regularization; and 3) dropout.
5.1 L1 regularization
To find the minimum of the cost function, L1 regularization uses Lasso regression, a linear regression technique. In this approach we use the so-called taxicab distance, the sum of the absolute values of all the weights, as the penalty term:
Ω(ω) = ‖ω‖₁ = Σᵢ |ωᵢ|    (7)
Minimizing this cost function drives the weights of some features to exactly zero. In other words, we remove some features from our model and keep only the more valuable ones. In this way we get a simpler model that is easier to interpret; at the same time, however, we lose some useful features that have a lower influence on the final output.
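The mechanism by which the L1 penalty zeroes weights can be illustrated with its proximal operator, soft-thresholding. This is a sketch for illustration: the operator below is the inner update used by coordinate-descent Lasso solvers, not a method described in this paper.

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of t*||w||_1: shrink each weight toward zero
    by t, and set weights with magnitude below t exactly to zero."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

w = np.array([0.8, -0.05, 0.3, 0.01])
print(soft_threshold(w, 0.1))  # the two small weights become exactly 0
```

The two weights with magnitude below the threshold 0.1 are removed from the model entirely, while the larger weights survive (slightly shrunk), which is exactly the feature-selection behavior described above.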
5.3 Dropout
Dropout is a popular and effective technique against overfitting in neural networks. The initial idea of dropout is to randomly drop units and their connections from the neural network during training, which prevents units from co-adapting too much. The general process is as follows:
1) Randomly drop half of the hidden neurons to construct a new, simpler network
2) Train the thinned network using stochastic gradient descent; the gradients for each parameter are averaged over the training cases in each mini-batch, and any training case which does not use a parameter contributes a gradient of zero for that parameter [19]
3) Restore the removed neurons
4) Randomly remove half of the hidden neurons from the restored network to form a new thinned network
5) Repeat the process above until an ideal set of parameters is obtained
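The steps above amount to sampling a random mask on each training pass. Below is a minimal sketch of the standard "inverted dropout" forward pass; the names and the drop probability are illustrative assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5, train=True):
    """Zero each hidden activation with probability p during training and
    scale the survivors by 1/(1-p), so the expected activation is unchanged
    and no rescaling is needed at test time."""
    if not train:
        return h                                   # test time: use full net
    mask = (rng.random(h.shape) >= p) / (1.0 - p)  # 0 or 1/(1-p) per unit
    return h * mask
```

The 1/(1-p) scaling is what lets the full network be used unchanged at test time, approximating the average of the many thinned networks sampled during training.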
By temporarily removing some units from the neural net, dropout approximates the effect of averaging the predictions of all these thinned networks. In this way overfitting is inhibited to some extent, while the predictions of many different large neural nets are effectively combined at test time. Besides, dropout significantly reduces the amount of computation, which makes it an effective choice for big or complicated networks that need lots of training data.
David Warde-Farley, Ian J. Goodfellow, Aaron Courville, and Yoshua Bengio [20] investigated the efficacy of dropout, focusing on the specific case of the popular rectified linear nonlinearity for hidden units. They showed that dropout is an extremely effective ensemble learning method, paired with a clever approximate inference scheme that is remarkably accurate in the case of rectified linear networks.
6. Conclusion
Overfitting is a general issue in supervised machine learning that cannot be completely avoided. It happens either because of the limits of the training data, which can be small in size or include plenty of noise, or because of the constraints of the algorithms, which may be too complicated and require too many parameters.
Responding to these causes, a variety of algorithms have been introduced to reduce the effect of overfitting. On the one hand, to deal with noise in the training set, algorithms based on the early-stopping strategy help us stop training before the noise is learned, and algorithms based on the network-reduction strategy provide an approach to reducing noise in the training set. On the other hand, the data-expansion strategy is proposed for complicated models that require plentiful data to fine-tune their hyper-parameters, and regularization-based algorithms help us distinguish meaningful features from meaningless ones and from noise, and assign different weights to them.
For real-world problems, the majority of models are complicated, because the final output can generally be affected by dozens or even hundreds of factors. A well-generalized model will tend to take all the potential features into consideration rather than arbitrarily discarding seemingly useless ones. The increase in parameters demands a great amount of training data to tune the hyper-parameter set, such as the weights. Data thereby becomes a key point in machine learning, especially in supervised machine learning: in most cases, the more data we use for training, the more accurate the final model is. However, the acquisition of data is still a painful topic in machine learning. In many fields data can be really expensive, because some data is difficult to obtain and some requires human labor for labeling. What is more, a perfect training set is not only great in size but also includes a limited proportion of noise. Thus, data acquisition and data cleaning can be two interesting research directions for future work.
References
[1] https://elitedatascience.com/overfitting-in-machine-learning
[2] Paris G., Robilliard D., Fonlupt C. (2004) Exploring Overfitting in Genetic Programming. In: Artificial Evolution, International Conference, Evolution Artificielle, EA 2003, Marseilles, France, October 2003. DBLP, pp. 267-277.
[3] Jensen D. D., Cohen P. R. (2000) Multiple Comparisons in Induction Algorithms. Machine Learning, 38(3): 309-338.