Generalization
• How well a model trained on the training set predicts the right output for
new instances is called generalization.
• The goal of a good machine learning model is to generalize well from the
training data to any data from the problem domain. This allows us to
make predictions in the future on data the model has never seen.
• Overfitting and underfitting are the two biggest causes for poor
performance of machine learning algorithms.
• Underfitting is the production of a machine learning model that is not complex
enough to accurately capture the relationships between a dataset's features and a
target variable.
• Overfitting is the production of an analysis which corresponds too closely or
exactly to a particular set of data, and may therefore fail to fit additional data or
predict future observations reliably.
• A high bias model results from not learning the data well enough; hence future
predictions will be unrelated to the data and thus incorrect.
• A high variance model results from learning the data too well; hence it is
impossible to predict the next point accurately.
Underfitting (high bias):
• Both training and test errors are large.
• Adding more training examples won't help, as the model itself is too simple.
• Use a more complex model or increase the number of features.

Overfitting (high variance):
• Small training error and large test error.
• Adding more training examples can possibly help here, because we may have a
complex model but not enough training data to learn it well.
• Regularize the model to prevent overfitting, use fewer features, or switch to a
simpler model.

Bias-variance trade-off: balance the bias and the variance and keep both of them low.
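This diagnostic can be reproduced with a small experiment: fit a too-simple and a too-complex model to the same noisy data and compare training and test errors. A minimal sketch, assuming scikit-learn and synthetic data; the polynomial degrees, noise level, and split sizes are illustrative choices, not values from the slides:

# Hedged sketch: compare train/test error for an underfit and an overfit model
# on synthetic data (all numbers here are illustrative assumptions).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 60)   # noisy sine curve

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 15):   # degree 1 ~ underfit (high bias), degree 15 ~ overfit (high variance)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
# Expected pattern: degree 1 -> both errors large; degree 15 -> small train error, larger test error.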
Addressing overfitting
• Reduce the number of features
• Regularization: keep all features, but reduce the magnitude of the parameters θ
$$J(\theta) = \sum_{i=1}^{N} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{m} |\theta_j|$$

$$J(\theta) = \sum_{i=1}^{N} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{m} (\theta_j)^2 + \lambda \sum_{j=1}^{m} |\theta_j|$$
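A minimal sketch of these penalties using scikit-learn's Lasso (absolute-value penalty), Ridge (squared penalty), and ElasticNet (both combined); the data and penalty strengths are illustrative assumptions, and scikit-learn scales the squared-error term slightly differently from the formulas above, but the penalty terms play the same role:

# Hedged sketch: L1, L2, and combined penalties on the parameters theta.
# Data and penalty strengths are illustrative assumptions, not values from the slides.
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 10))
true_theta = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])   # only 2 informative features
y = X @ true_theta + rng.normal(0, 0.5, size=100)

l1 = Lasso(alpha=0.1).fit(X, y)                        # |theta_j| penalty: drives some coefficients to 0
l2 = Ridge(alpha=1.0).fit(X, y)                        # theta_j^2 penalty: shrinks all coefficients
l1l2 = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # mix of both penalties

print("Lasso coefficients:     ", np.round(l1.coef_, 2))
print("Ridge coefficients:     ", np.round(l2.coef_, 2))
print("ElasticNet coefficients:", np.round(l1l2.coef_, 2))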
Boosting
• Use boosting for combining weak learners with high bias.
• Produce a model with a lower bias than that of the individual models.
• Boosting involves sequentially training weak learners. Here, each subsequent
learner improves the errors of previous learners in the sequence.
• A sample of data is first drawn from the initial dataset and used to train the
first model; the model's predictions may be correct or incorrect.
• The samples that are wrongly predicted are reused for training the next model.
• In this way, subsequent models can improve on the errors of previous models.
• Unlike bagging, which aggregates prediction results at the end, boosting
aggregates the results at each step. They are aggregated using max voting or
weighted averaging.
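A simplified sketch of this sequential idea, using scikit-learn decision stumps on synthetic data; the up-weighting factor of 2 and the dataset are assumptions for illustration, not the weight update of a real boosting algorithm:

# Hedged sketch of boosting: train weak learners sequentially, re-emphasising the
# samples the previous learner got wrong, then aggregate by max voting.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
weights = np.full(len(y), 1.0 / len(y))      # start with uniform sample weights
learners = []

for step in range(3):                        # 3 weak learners for illustration
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=weights)   # weak learner focuses on heavily weighted samples
    wrong = stump.predict(X) != y
    weights[wrong] *= 2.0                    # crude up-weighting of misclassified samples
    weights /= weights.sum()                 # renormalize
    learners.append(stump)

# Aggregate the weak learners by max voting
votes = np.array([m.predict(X) for m in learners])
ensemble_pred = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print("ensemble training accuracy:", (ensemble_pred == y).mean())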
Adaboost (Adaptive Boosting)
• The most common weak learner is a decision tree with one level, i.e. a single
split (called a decision stump).
• Wherever the model performs badly, it needs improvement and more focus
compared to where it performs well.
• Example:
X1: 1 2 3 4 5 6 7
X2: 7 6 5 4 3 2 1
Y:  R B R R B R B
FIRST WEAK LEARNER: let's say we split the decision stump at X2 >= 4
Decision stump 1: if X2 >= 4 → predict RED, else → predict BLUE

Actual Y | Predicted Y
R        | R
B        | R
R        | R
R        | R
B        | B
R        | B
B        | B

[Scatter plot of the seven points in the X1–X2 plane, colored by class]
Actual Y | Predicted Y | Weight                | Normalized weight | Sampling interval
R        | R           | (1/7)×(0.6325)=0.0903 | 0.10              | [0.00 – 0.10]
B        | R           | (1/7)×(1.5811)=0.2259 | 0.25              | [0.10 – 0.35]
R        | R           | (1/7)×(0.6325)=0.0903 | 0.10              | [0.35 – 0.45]
R        | R           | (1/7)×(0.6325)=0.0903 | 0.10              | [0.45 – 0.55]
B        | B           | (1/7)×(0.6325)=0.0903 | 0.10              | [0.55 – 0.65]
R        | B           | (1/7)×(1.5811)=0.2259 | 0.25              | [0.65 – 0.90]
B        | B           | (1/7)×(0.6325)=0.0903 | 0.10              | [0.90 – 1.00]
Rand #   | X1 | X2 | Y
0.612092 | 5  | 3  | B
0.932126 | 7  | 1  | B
0.32812  | 2  | 6  | B
0.127205 | 2  | 6  | B
0.172399 | 2  | 6  | B
0.473324 | 4  | 4  | R
0.238094 | 2  | 6  | B

[Scatter plot of the resampled points in the X1–X2 plane]
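A small check of the resampling step: each uniform random number selects the row whose cumulative normalized-weight interval contains it. The numbers are taken from the tables above; only the code structure is an assumption:

# Hedged sketch of the weighted resampling step above.
import numpy as np

rows = [(1, 7, 'R'), (2, 6, 'B'), (3, 5, 'R'), (4, 4, 'R'),
        (5, 3, 'B'), (6, 2, 'R'), (7, 1, 'B')]               # (X1, X2, Y) from the example
norm_weights = [0.10, 0.25, 0.10, 0.10, 0.10, 0.25, 0.10]     # normalized weights from the table
cum = np.cumsum(norm_weights)                                 # 0.10, 0.35, 0.45, 0.55, 0.65, 0.90, 1.00

rand_numbers = [0.612092, 0.932126, 0.32812, 0.127205, 0.172399, 0.473324, 0.238094]
for r in rand_numbers:
    idx = int(np.searchsorted(cum, r))                        # first interval whose upper bound exceeds r
    print(f"{r:.6f} -> row {idx + 1}: X1={rows[idx][0]}, X2={rows[idx][1]}, Y={rows[idx][2]}")
# Reproduces the resampled set above: mostly (2, 6, B) and (5, 3, B), plus (7, 1, B) and (4, 4, R).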
SECOND WEAK LEARNER: let's say we split the decision stump at X1 < 3

Decision stump 2: if X1 < 3 → predict BLUE, else → predict RED (2 misclassifications on the resampled data)
• $e^{\lambda k}$: λ is a learning rate (here the value is chosen between 0 and 1)
• $k = \frac{1}{2}\ln\frac{1 - \text{error}}{\text{error}}$; error = 2/7 = 0.286, then $k = \frac{1}{2}\ln\frac{1 - 0.286}{0.286} = 0.4581$
• Let λ = 1: $e^{\lambda k} = e^{0.4581} = 1.5811$ and $e^{-\lambda k} = 0.6325$
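A small check of this arithmetic (assuming the natural logarithm and λ = 1, as in the slide); the reweighting code reproduces the Normalized weight column of the first weight table above:

# Hedged check of the stump-weight arithmetic (natural log, lambda = 1).
import math

error = 2 / 7                                    # 2 of 7 points misclassified by the stump
k = 0.5 * math.log((1 - error) / error)          # amount of say of this weak learner
lam = 1.0
up = math.exp(lam * k)                           # factor applied to misclassified samples
down = math.exp(-lam * k)                        # factor applied to correctly classified samples
print(f"k = {k:.4f}, e^(lambda*k) = {up:.4f}, e^(-lambda*k) = {down:.4f}")
# -> k = 0.4581, e^(lambda*k) = 1.5811, e^(-lambda*k) = 0.6325

# New weight of each sample: old_weight * up if misclassified, else old_weight * down,
# then divide by the sum so the weights again add to 1.
old = [1 / 7] * 7
misclassified = [False, True, False, False, False, True, False]   # rows 2 and 6 of the first table
new = [w * (up if m else down) for w, m in zip(old, misclassified)]
total = sum(new)
print([round(w / total, 2) for w in new])        # -> [0.1, 0.25, 0.1, 0.1, 0.1, 0.25, 0.1]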
Actual Y | Predicted Y | Weight              | Normalized weight | Sampling interval
B        | R           | 0.10×1.5811 =0.1581 | 0.15625           | [0.00 – 0.16]
B        | R           | 0.10×1.5811 =0.1581 | 0.15625           | [0.16 – 0.31]
B        | B           | 0.25×0.6325 =0.1581 | 0.15625           | [0.31 – 0.47]
B        | B           | 0.25×0.6325 =0.1581 | 0.15625           | [0.47 – 0.63]
B        | B           | 0.25×0.6325 =0.1581 | 0.15625           | [0.63 – 0.78]
R        | R           | 0.10×0.6325 =0.0632 | 0.0625            | [0.78 – 0.84]
B        | B           | 0.25×0.6325 =0.1581 | 0.15625           | [0.84 – 1.00]
Rand # | X1 | X2 | Y
0.1465 | 5  | 3  | B
0.2749 | 7  | 1  | B
0.6467 | 2  | 6  | B
0.0360 | 5  | 3  | B
0.0472 | 5  | 3  | B
0.8494 | 2  | 6  | B
0.0953 | 5  | 3  | B

[Scatter plot of the second resampled set in the X1–X2 plane]
THIRD WEAK LEARNER: split the decision stump at X1 + X2 <= 8

Decision stump 3: if X1 + X2 <= 8 → predict BLUE, else → predict RED

Max voting over the three stumps (X2 >= 4, X1 < 3, X1 + X2 <= 8):
• Say X1 = 1, X2 = 7: RED, BLUE, BLUE → Class: BLUE
• Say X1 = 3, X2 = 3: BLUE, RED, BLUE → Class: BLUE
• Say X1 = 5, X2 = 5: RED, RED, RED → Class: RED
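A small sketch of this max-voting step; the stump rules come from the example above, while the code itself is only an illustration:

# Hedged sketch of max voting over the three decision stumps from the example.
from collections import Counter

def stump1(x1, x2):              # first weak learner
    return 'RED' if x2 >= 4 else 'BLUE'

def stump2(x1, x2):              # second weak learner
    return 'BLUE' if x1 < 3 else 'RED'

def stump3(x1, x2):              # third weak learner
    return 'BLUE' if x1 + x2 <= 8 else 'RED'

def predict(x1, x2):
    votes = [stump1(x1, x2), stump2(x1, x2), stump3(x1, x2)]
    return votes, Counter(votes).most_common(1)[0][0]    # majority class

for x1, x2 in [(1, 7), (3, 3), (5, 5)]:
    votes, cls = predict(x1, x2)
    print(f"X1={x1}, X2={x2}: {votes} -> {cls}")
# -> RED/BLUE/BLUE -> BLUE;  BLUE/RED/BLUE -> BLUE;  RED/RED/RED -> RED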
Gradient Boosting
• Gradient tree boosting is a very useful algorithm for both regression and
classification problems.
• It handles mixed data types, and it is also quite robust to outliers.
• Additionally, it often has better predictive power than many other algorithms.
• However, its sequential architecture makes it unsuitable for parallel
techniques, and therefore, it does not scale well to large data sets. For
datasets with a large number of classes, it is recommended to use
RandomForestClassifier instead.
• Gradient boosting typically uses decision trees to build a prediction model
based on an ensemble of weak learners, applying an optimization algorithm
on the cost function.
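A minimal sketch using scikit-learn's GradientBoostingRegressor; the dataset and hyperparameter values are illustrative assumptions, not values from the slides:

# Hedged sketch: gradient tree boosting with scikit-learn on synthetic regression data.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbr = GradientBoostingRegressor(
    n_estimators=100,      # number of sequential trees (weak learners)
    learning_rate=0.1,     # shrinkage applied to each tree's contribution
    max_depth=3,           # shallow trees keep each learner weak
    random_state=0,
)
gbr.fit(X_train, y_train)
print("R^2 on held-out data:", round(gbr.score(X_test, y_test), 3))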
MODEL 1

IQ  | CGPA | Salary | Predicted | Pseudo-residual 1
90  | 8    | 3      | 4.8       | -1.8
100 | 7    | 4      | 4.8       | -0.8
110 | 6    | 8      | 4.8       | 3.2
120 | 9    | 6      | 4.8       | 1.2
80  | 5    | 3      | 4.8       | -1.8

Mean Salary = 4.8

Tree for Model 1: root split on IQ < 105; the YES branch splits further on IQ < 95, the NO branch on CGPA < 7.5.
• Prediction = Mean + α × (prediction from Model 1), where α is the learning rate (say 0.1)
• Prediction after Model 2 = Mean + 0.1 × (prediction from Model 1) + 0.1 × (prediction
from Model 2)
• AdaBoost: uses decision stumps; prioritizes models based on their weights.
• Gradient Boosting: uses decision trees with 8 to 32 leaf nodes; uses a learning rate.
• For classification problems, use log(odds) instead of the mean to compute the
pseudo-residuals (see the sketch below).
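A minimal sketch of these additive updates on the salary example above, assuming scikit-learn's DecisionTreeRegressor in place of the hand-drawn tree; the tree depth is an illustrative choice, and only the data and the 0.1 learning rate come from the slides:

# Hedged sketch of two gradient-boosting rounds on the salary example.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[90, 8], [100, 7], [110, 6], [120, 9], [80, 5]])   # IQ, CGPA
y = np.array([3, 4, 8, 6, 3], dtype=float)                       # Salary

base = y.mean()                                   # initial prediction: 4.8
residuals = y - base                              # pseudo-residuals: -1.8, -0.8, 3.2, 1.2, -1.8

tree1 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
pred_after_1 = base + 0.1 * tree1.predict(X)      # learning rate alpha = 0.1

residuals2 = y - pred_after_1                     # residuals for the next tree
tree2 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals2)
pred_after_2 = base + 0.1 * tree1.predict(X) + 0.1 * tree2.predict(X)

print("mean salary:        ", base)
print("pseudo-residuals:   ", residuals)
print("prediction, 2 trees:", np.round(pred_after_2, 2))
# For classification, the initial prediction would be the log(odds) of the positive class
# instead of the mean, and the pseudo-residuals would be (actual label - predicted probability).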