Bias and Variance
Minati Rath
Example: Linear regression (housing prices)
[Figure: price vs. size of house, fitted with a linear function, a quadratic function, and a higher-order function]
Bias vs. variance in linear regression
[Figure: price vs. size under three fits: high bias (underfitting), "just right", and high variance (overfitting)]
Overfitting
If we have too many features, the learned hypothesis may fit the
training set very well
but fail to generalize to new examples.
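As a quick illustration (a minimal sketch on synthetic data; all values here are hypothetical), a high-order polynomial can drive the training error to nearly zero while generalizing poorly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): a quadratic target plus noise
x_train = np.linspace(-1, 1, 10)
y_train = x_train**2 + rng.normal(0, 0.05, size=10)
x_test = np.linspace(-0.9, 0.9, 10)
y_test = x_test**2 + rng.normal(0, 0.05, size=10)

def errors(degree):
    # Fit a polynomial of the given degree on the training set
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

tr2, te2 = errors(2)  # quadratic: matches the target's complexity
tr9, te9 = errors(9)  # degree 9: interpolates all 10 training points
# tr9 is (near) zero, yet te9 is typically much worse than te2
```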
Bias vs. variance in logistic regression
Example: Logistic regression
Sources of noise and error
While learning a target function using a training set
Two sources of noise
Some training points may not come exactly from the target
function: stochastic noise
The target function may be too complex to capture using the
chosen hypothesis set: deterministic noise
Generalization error: Model tries to fit the noise in the training data,
which gets extrapolated to the test set
Ways to handle noise
Validation
Check performance on data other than training data, and tune model
accordingly
Regularization
Constrain the model so that the noise cannot be learned too well
Validation
Divide given data into train set and test set
E.g., 80% train and 20% test
Better to select randomly
Learn parameters using training set
Check performance (validate the model) on test set, using
measures such as accuracy, misclassification rate, etc.
Trade-off: more data for training vs. validation
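The split itself can be sketched as follows (synthetic data; the 80/20 proportions follow the example above, everything else is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical dataset: 100 examples with 3 features, binary labels
X = rng.normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Select the split randomly: shuffle indices, take 80% train / 20% test
idx = rng.permutation(len(X))
n_train = int(0.8 * len(X))
train_idx, test_idx = idx[:n_train], idx[n_train:]

X_train, y_train = X[train_idx], y[train_idx]   # learn parameters here
X_test, y_test = X[test_idx], y[test_idx]       # validate the model here
```

Accuracy or misclassification rate would then be measured on `(X_test, y_test)` only.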
An example: model selection
• Which order of polynomial will best fit the given data? Polynomials
available: h1, h2, …, h10
• It is as if an extra parameter, the degree of the polynomial, is to be
learned
• Approach 1
– Divide into train and test set
– Train each hypothesis on train set, measure error on test set
– Select the hypothesis with minimum test set error
• Problem with the previous approach
– The test set error we computed is not a true estimate of
generalization error
– Since our extra parameter (order of polynomial) is fit to the test
set
Approach 2
– Divide data into train set (60%), validation set
(20%) and test set (20%)
– Select that hypothesis which gives lowest error on
validation set
– Use test set to estimate generalization error
Note: Test set not at all seen during training
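A sketch of Approach 2 on synthetic data (the 60/20/20 split and the candidate degrees h1, …, h10 come from the text; the data itself is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: a cubic target with noise
x = rng.uniform(-1, 1, 200)
y = x**3 - x + rng.normal(0, 0.1, 200)

# 60% train / 20% validation / 20% test
idx = rng.permutation(200)
tr, va, te = idx[:120], idx[120:160], idx[160:]

def validation_mse(degree):
    # Each candidate hypothesis is trained on the train set only
    coeffs = np.polyfit(x[tr], y[tr], degree)
    return np.mean((np.polyval(coeffs, x[va]) - y[va]) ** 2)

# Select the hypothesis (degree) with the lowest validation error
best_degree = min(range(1, 11), key=validation_mse)

# The untouched test set gives the estimate of generalization error
coeffs = np.polyfit(x[tr], y[tr], best_degree)
test_mse = np.mean((np.polyval(coeffs, x[te]) - y[te]) ** 2)
```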
Popular methods of evaluating a classifier
• Holdout method
– Split data into train and test set (usually 2/3 for train and 1/3 for
test). Learn model using train set and measure performance
over test set
– Usually used when the data is sufficiently large, since both the
train and the test set must each get a sizeable share of it
• Repeated Holdout method
– Repeat the Holdout method multiple times with different
subsets used for train/test
– In each iteration, a certain portion of data is randomly selected
for training, rest for testing
– The error rates on the different iterations are averaged to yield
an overall error rate
– More reliable than simple Holdout
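A minimal sketch of repeated holdout (synthetic data; the "classifier" here is a deliberately trivial majority-class predictor, used only as a placeholder):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical dataset
X = rng.normal(size=(150, 2))
y = (X[:, 0] > 0).astype(int)

def one_holdout(rng):
    # One iteration: randomly select 2/3 for training, the rest for testing
    idx = rng.permutation(len(X))
    n_train = len(X) * 2 // 3
    train, test = idx[:n_train], idx[n_train:]
    # Placeholder "classifier": always predict the training majority class
    majority = int(y[train].mean() >= 0.5)
    return float(np.mean(y[test] != majority))

# Average the error rates over the iterations for an overall error rate
error_rates = [one_holdout(rng) for _ in range(10)]
overall_error = float(np.mean(error_rates))
```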
• k-fold cross-validation
– First step: data is split into k subsets of equal size;
– Second step: each subset in turn is used for testing and the
remainder for training
– Performance measures averaged over all folds
Popular choices for k: 10 or 5
Advantage: every available data point is used both to train and to
test the model
k-fold cross-validation (shown for k = 3): the data is split into three
parts, and the classifier is trained and tested three times:
Fold 1: train | train | test
Fold 2: train | test | train
Fold 3: test | train | train
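The two steps above can be sketched as follows (synthetic data; the nearest-centroid "classifier" is a hypothetical stand-in):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical dataset: 90 examples, 2 features
X = rng.normal(size=(90, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# First step: split the data into k subsets of equal size
k = 3
folds = np.array_split(rng.permutation(len(X)), k)

# Second step: each subset in turn is used for testing
error_rates = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    # Stand-in classifier: assign each test point to the nearest class centroid
    c0 = X[train_idx][y[train_idx] == 0].mean(axis=0)
    c1 = X[train_idx][y[train_idx] == 1].mean(axis=0)
    pred = (np.linalg.norm(X[test_idx] - c1, axis=1)
            < np.linalg.norm(X[test_idx] - c0, axis=1)).astype(int)
    error_rates.append(float(np.mean(pred != y[test_idx])))

# Performance averaged over all folds
cv_error = float(np.mean(error_rates))
```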
Regularization
Addressing overfitting: Two ways
1. Reduce number of features
— Manually select which features to keep
— Problem: loss of some information (discarded features)
2. Regularization
— Keep all the features, but reduce magnitude/values of parameters
— Works well when we have a lot of features, each of which contributes a
little to the prediction
Intuition of regularization
[Figure: price vs. size of house, fitted with a quadratic function and
with a higher-order function]
Suppose we penalize the higher-order parameters (e.g., θ3 and θ4) and
make them really small: the higher-order fit then behaves almost like
the quadratic one
Combatting Overfitting
➢ Problem of overfitting can be overcome by increasing the input
training data points
➢ A rule of thumb: the number of training data points should be at
least 10 times the number of parameters or features
➢ But what if we have fewer data points?
➢ Put a bound on the regression coefficients by using regularization
Regularization for linear regression
In regularized linear regression, we choose θ to minimize

J(θ) = (1/2m) [ Σ_{i=1..m} (h_θ(x^(i)) − y^(i))² + λ Σ_{j=1..n} θ_j² ]

λ: Regularization parameter
By convention, regularization is not applied to θ0 (this makes little
difference to the solution)
Smaller values of the parameters lead to more generalizable models and
less overfitting
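A sketch of the closed-form ridge solution on synthetic data (the data and λ values are hypothetical; note the identity-matrix entry for θ0 is zeroed so the intercept is not regularized, per the convention above):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical data: 30 examples, 8 features, only 2 informative
X = rng.normal(size=(30, 8))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(0, 0.1, 30)

def ridge_fit(X, y, lam):
    # Closed form: theta = (X'X + lam*I)^(-1) X'y,
    # with I[0, 0] = 0 so theta_0 (the intercept) is not penalized
    Xb = np.hstack([np.ones((len(X), 1)), X])  # prepend intercept column
    I = np.eye(Xb.shape[1])
    I[0, 0] = 0.0
    return np.linalg.solve(Xb.T @ Xb + lam * I, Xb.T @ y)

theta_ols = ridge_fit(X, y, 0.0)     # lam = 0 recovers ordinary least squares
theta_ridge = ridge_fit(X, y, 10.0)  # larger lam shrinks the parameters
# The regularized parameters have a smaller norm than the OLS ones
```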
L1, L2 and Elastic net Regularization
What we have been discussing is called L2 or "ridge" regularization – it
adds the squared magnitude of the parameters as the penalty term
Look up L1 or "Lasso" regularization
– it adds the absolute value of the parameters as the penalty term
Elastic Net (Combination of L1 and L2 Regularization)
Effect: Combines the benefits of both Ridge and Lasso. It allows
for some coefficients to be set to zero (like Lasso) while shrinking
others (like Ridge). It is useful when there is multicollinearity, and
some feature selection is needed
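To see Lasso's feature-selection effect concretely, here is a minimal coordinate-descent sketch of L1-regularized least squares on synthetic data (the data, λ, and iteration count are all hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 10 features, but only feature 0 is informative
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] + rng.normal(0, 0.1, 100)

def lasso_cd(X, y, lam, n_iter=200):
    # Coordinate descent for: minimize (1/2)||y - Xw||^2 + lam * ||w||_1
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(d):
            # Residual with feature j's contribution removed
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r
            # Soft-thresholding: weak correlations become exactly zero
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return w

w = lasso_cd(X, y, lam=30.0)
n_zero = int(np.sum(w == 0))  # most irrelevant coefficients are zeroed
```

Ridge, by contrast, shrinks these coefficients but almost never makes them exactly zero; Elastic Net adds an extra squared penalty on top of the L1 term, combining both behaviors.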