ML - Module 5
• The aim is to find a vector x0 such that the sum of the squared distances between x0 and the various xk is as
small as possible, i.e. to minimize J0(x0) = Σk ||x0 − xk||².
• The optimal choice is x0 = m, the sample mean (see the expansion below).
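A short justification, following the standard least-squares expansion (it is not reproduced on the slides), of why the sample mean minimizes J0:

```latex
J_0(\mathbf{x}_0)
  = \sum_{k=1}^{n} \lVert \mathbf{x}_0 - \mathbf{x}_k \rVert^2
  = \sum_{k=1}^{n} \lVert (\mathbf{x}_0 - \mathbf{m}) - (\mathbf{x}_k - \mathbf{m}) \rVert^2
  = n \lVert \mathbf{x}_0 - \mathbf{m} \rVert^2
    + \sum_{k=1}^{n} \lVert \mathbf{x}_k - \mathbf{m} \rVert^2
```

The cross term vanishes because the deviations xk − m sum to zero, and the second term does not depend on x0, so J0 is minimized by x0 = m.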
• The sample mean is a zero-dimensional representation of the data set.
• We create a one-dimensional representation by projecting the data onto a line running through the
sample mean.
• The points on this line can be written as x = m + a e, where e is a unit vector in the direction of the line and
the scalar a corresponds to the distance of a point x on the line from the mean m.
• If we represent xk by m + ak e, we can find an optimal set of coefficients ak by minimizing the squared-
error criterion function
J1(a1, …, an, e) = Σk ||(m + ak e) − xk||² = Σk ak²||e||² − 2 Σk ak eT(xk − m) + Σk ||xk − m||²   (1)
• Since ||e|| = 1, partially differentiating with respect to ak and setting the derivative to zero, we obtain
ak = eT(xk − m)   (2)
• i.e., we obtain a least-squares solution by projecting the vector xk onto the line in the direction of e that passes through the sample
mean.
Finding the best direction e for the line
• Substituting (2) into (1) gives J1(e) = −eTSe + Σk ||xk − m||², where S = Σk (xk − m)(xk − m)T is the scatter matrix.
• The vector e that minimizes J1 also maximizes eTSe.
• We use the method of Lagrange multipliers to maximize eTSe subject to the constraint that
||e|| = 1; setting the derivative to zero gives Se = λe, and hence eTSe = λ.
• Thus, to maximize eTSe, we select the eigenvector corresponding to the largest eigenvalue of the scatter matrix.
• ie, to find the best one-dimensional projection of the data (best in the least sum-of-
squared-error sense), we project the data onto a line through the sample mean in the
direction of the eigenvector of the scatter matrix having the largest eigenvalue.
• This result can be readily extended from a one-dimensional projection to a d′-dimensional projection.
• The criterion Jd′ = Σk ||(m + Σi aki ei) − xk||² is minimized when the vectors e1, …, ed′ are the d′ eigenvectors of the scatter matrix having the largest
eigenvalues.
• Because the scatter matrix is real and symmetric, these eigenvectors are orthogonal. They form a
natural set of basis vectors for representing any feature vector x. The coefficients ai are the
components of x in that basis, and are called the principal components.
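As a minimal illustration of the result above, the following sketch (using NumPy; the function name and the random data are purely illustrative) projects the data onto the d′ eigenvectors of the scatter matrix with the largest eigenvalues:

```python
import numpy as np

def pca_project(X, d_prime):
    """Project the rows of X (n samples x d features) onto the d' leading
    eigenvectors of the scatter matrix, as described above."""
    m = X.mean(axis=0)                    # sample mean (zero-dimensional representation)
    centered = X - m
    S = centered.T @ centered             # scatter matrix S = sum_k (x_k - m)(x_k - m)^T
    eigvals, eigvecs = np.linalg.eigh(S)  # eigh: S is real and symmetric
    order = np.argsort(eigvals)[::-1]     # sort eigenvalues in decreasing order
    E = eigvecs[:, order[:d_prime]]       # d' eigenvectors with the largest eigenvalues
    A = centered @ E                      # principal components a_ki = e_i^T (x_k - m)
    return A, E, m

# Usage: reconstruct an approximation of X from its principal components.
X = np.random.rand(100, 5)
A, E, m = pca_project(X, d_prime=2)
X_approx = m + A @ E.T
```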
Evaluation and Model Selection:
• We now turn to evaluating the model and improving it based on what we learn from that
evaluation.
• It's important to use new (held-out) data when evaluating our model, so that the evaluation is not
overly optimistic due to overfitting to the training set.
• However, it is sometimes useful to evaluate our model as we're building it, in order to find the
best parameters of a model.
• But we can't use the test set for this evaluation, or else we'll end up selecting the
parameters that perform best on the test data but perhaps not the parameters that
generalize best.
• To evaluate the model while still building and tuning the model, we create a third
subset of the data known as the validation set.
Evaluation and Model Selection:
• In machine learning, model validation refers to the process in which a trained model is
evaluated with a testing data set.
• The testing data set is a separate portion of the same data set from which the training set is
derived.
• The main purpose of using the testing data set is to test the generalization ability of a trained
model
• Validating your machine learning model outcomes is all about making sure you’re getting the
right data and that the data is accurate.
• Validation catches problems before they become big problems and is a critical step in the
machine learning pipeline.
• Training Dataset: The sample of data used to fit the model. The actual dataset that we use
to train the model. The model sees and learns from this data.
• Test Dataset: The sample of data used to provide an unbiased evaluation of a final model
fit on the training dataset.
• The training set is typically the largest subset created out of the original dataset and is the
one used to fit the models.
• The validation set is then used to evaluate the models in order to perform model selection.
• The test set is used to evaluate whether the final model (the one selected in the previous step)
can generalize well to new, unseen data; a sketch of such a three-way split is given below.
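A minimal sketch of such a three-way split, assuming scikit-learn is available; the 60/20/20 proportions and the random data are illustrative choices, not prescribed by the slides:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 10)           # illustrative feature matrix
y = np.random.randint(0, 2, 1000)      # illustrative binary labels

# First carve off the test set, then split the remainder into training
# and validation sets (roughly 60/20/20 overall).
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)  # 0.25 of 80% = 20%
```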
• While data preparation and training a machine learning model are key steps in the machine
learning pipeline, it is equally important to measure the performance of the trained model.
• Security - Training data and machine learning model data are all valuable, especially if that data is private or
sensitive.
• Avoid Bias - Knowing how to look for bias and how to fix the bias in your machine learning model is an important
aspect of model validation
• Prevent Concept Drift - Concept drift is the situation where a machine learning model has been allowed to degrade over time.
• Concept drift happens, but how the model drifts is unpredictable. Drift is harmful to the machine learning model as
the output data becomes less useful. While initial machine learning model validation won’t catch concept drift,
proper maintenance and regular testing will. Concept drift happens over time, but it’s completely preventable with
routine maintenance.
Cross Validation
• A portion of the data is removed (held out) before training; when training is done, the data that was removed can be used to test the performance
of the learned model on ``new'' data.
• This is the basic idea for a whole class of model evaluation methods called cross
validation.
• To test the performance of a classifier, we need to have a number of training/validation set
pairs from a dataset X.
• To get them, if the sample X is large enough, we can randomly divide it into two
and use one half for training and the other half for validation.
• Unfortunately, datasets are rarely large enough to do this. So, we reuse the same data, split in
different ways; this is called cross-validation.
• The data set is separated into two sets, called the training set and the testing set. The model is
fit using the training set only.
• The fitted model is then used to predict the output values for the data in the testing set (it has
never seen these output values before).
• However, this evaluation can have a high variance, i.e. the evaluation may depend heavily on
which data points end up in the training set and which end up in the test set, so the
evaluation may be significantly different depending on how the division is made.
• In problems where we have a sparse dataset, we may not be able to afford the “luxury” of
setting aside a portion of the dataset for testing.
K-fold cross-validation
• The data set is divided into k subsets, and the holdout method is repeated k times.
• Each time, one of the k subsets is used as the test set and the other k-1 subsets are put
together to form a training set.
• The advantage of this method is that it matters less how the data gets divided.
• Every data point gets to be in a test set exactly once, and gets to be in a training set k-1 times.
• All the examples in the dataset are eventually used for both training and testing
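A minimal k-fold cross-validation sketch, assuming scikit-learn; the choice of LogisticRegression, k = 5, and the random data are illustrative only:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

X = np.random.rand(200, 5)             # illustrative data
y = np.random.randint(0, 2, 200)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    # The model is retrained from scratch on each of the k training folds.
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

print("mean accuracy over the 5 folds:", np.mean(fold_scores))
```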
Disadvantage
• The training algorithm has to be rerun from scratch k times, which means it takes k times as
much computation to make an evaluation
• To keep the training set large, we allow validation sets that are small.
• The training sets overlap considerably: any two training sets share K − 2 parts.
• As K increases, each training set gets larger and we get more robust estimators, but the validation set becomes smaller.
Leave-one-out cross-validation
• Given a dataset of N data points, only one data point is left out as the validation set and
training uses the remaining N − 1 data points.
• We then get N separate pairs by leaving out a different instance at each iteration.
• This is typically used in applications such as medical diagnosis, where labeled data is hard
to find.
• The average error over the N runs is computed and used to evaluate the model.
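A corresponding leave-one-out sketch, again assuming scikit-learn; the classifier and the random data are illustrative:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

X = np.random.rand(50, 4)              # small illustrative dataset
y = np.random.randint(0, 2, 50)

# One model is trained per data point, each time leaving that point out.
scores = cross_val_score(LogisticRegression(), X, y, cv=LeaveOneOut())
print("leave-one-out accuracy estimate:", scores.mean())
```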
• With a large number of folds
• + The bias of the true error rate estimator will be small (the estimator will be very accurate)
• - The variance of the estimator will be large and the computation time will be long (many runs, each on a large training set)
• With a small number of folds
• + The computation time is reduced and the variance of the estimator is small
• - The bias of the estimator will be large (conservative, i.e. larger than the true error rate)
• In practice, the choice of the number of folds depends on the size of the dataset
• For large datasets, even 3-Fold Cross Validation will be quite accurate
• For very sparse datasets, we may have to use leave-one-out in order to train on as many
examples as possible
Bootstrap
• From a dataset with N examples, randomly select (with replacement) N examples and use
this set for training.
• The remaining examples that were not selected for training are used for testing.
• The true error is estimated as the average error rate on the test examples.
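A minimal sketch of this bootstrap error estimate, assuming scikit-learn and NumPy; the classifier, the 100 bootstrap rounds, and the random data are illustrative choices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.rand(200, 5)             # illustrative data
y = np.random.randint(0, 2, 200)

rng = np.random.default_rng(0)
n = len(X)
errors = []
for _ in range(100):                   # number of bootstrap rounds (illustrative)
    train_idx = rng.integers(0, n, size=n)             # N examples drawn with replacement
    test_idx = np.setdiff1d(np.arange(n), train_idx)   # examples never selected -> test set
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    errors.append(1.0 - model.score(X[test_idx], y[test_idx]))

print("bootstrap error estimate:", np.mean(errors))
```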
Common metrics used to evaluate models.
• True positives are when you predict an observation belongs to a class and it actually does belong to
that class.
• True negatives are when you predict an observation does not belong to a class and it actually does
not belong to that class.
• False positives occur when you predict an observation belongs to a class when in reality it does not.
• False negatives occur when you predict an observation does not belong to a class when in fact it
does.
Confusion matrix
• A confusion matrix is a table that categorizes predictions according to whether they match
the actual value.
Two-class datasets
• A confusion matrix is a table with two rows and two columns that reports the number of
false positives, false negatives, true positives, and true negatives.
Multiclass datasets
                           Actual Values
                      Dog     Cat     Rabbit
Predicted   Dog        3       3        2
Values      Cat        2       5        0
            Rabbit     2       0       11
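A small sketch of building such a multiclass confusion matrix with scikit-learn; the toy labels below are illustrative and are not meant to reproduce the counts in the table above:

```python
from sklearn.metrics import confusion_matrix

# Toy labels for a three-class problem.
actual    = ["dog", "dog", "cat", "cat", "rabbit", "rabbit", "dog", "cat"]
predicted = ["dog", "cat", "cat", "cat", "rabbit", "dog",    "dog", "rabbit"]

# Note: scikit-learn puts actual classes on the rows and predicted classes on
# the columns, i.e. transposed relative to the table above.
cm = confusion_matrix(actual, predicted, labels=["dog", "cat", "rabbit"])
print(cm)
```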
Precision and recall
• Precision and recall are two measures used to assess the quality of results produced by a
binary classifier.
• Precision refers to the proportion of the observations predicted to belong to the positive class
that are actually positive: Precision = TP / (TP + FP).
• Sensitivity / Recall tells us what proportion of the positive class got correctly classified, i.e.
of the observations that truly belong to the positive class, how many were predicted to belong to it: Recall = TP / (TP + FN).
• It indicates the model's ability to correctly identify the observations that belong to
the positive class.
Accuracy
• Accuracy is defined as the percentage of correct predictions for the test data. It can be
calculated easily by dividing the number of correct predictions by the number of total
predictions.
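A small sketch computing accuracy, precision, and recall from hypothetical confusion-matrix counts (the numbers are made up for illustration):

```python
# Counts from a hypothetical binary confusion matrix.
TP, FP, FN, TN = 40, 10, 5, 45

accuracy  = (TP + TN) / (TP + TN + FP + FN)   # correct predictions / all predictions
precision = TP / (TP + FP)                    # of the predicted positives, how many are truly positive
recall    = TP / (TP + FN)                    # of the actual positives, how many were found

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```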
Other measures of performance
Receiver Operating Characteristic (ROC)
• ROC, or the Receiver Operating Characteristic curve, is one of the most frequently used tools for
evaluating binary or multi-class classification models. Like the Precision-Recall curve, and unlike
most other metrics, it is calculated from prediction scores rather than from hard class predictions.
• The ROC curve is a plot of the True Positive Rate (TPR) against the False Positive
Rate (FPR) at various threshold values. It helps to visualize how the threshold affects classifier
performance.
ROC space
• We plot the values of FPR along the horizontal axis (that is, the x-axis) and the values of TPR along the vertical axis (the y-axis).
• For each classifier, there is a unique point in this plane with coordinates (FPR;TPR).
• The ROC space is the part of the plane whose points correspond to (FPR;TPR).
• Each prediction result or instance of a confusion matrix represents one point in the
ROC space.
• The position of the point (FPR;TPR) in the ROC space gives an indication of the
performance of the classifier
Special points in ROC space
• The left bottom corner point (0; 0): Always negative prediction
A classifier which produces this point in the ROC space never classifies an example as
positive, neither rightly nor wrongly, because for this point TP = 0 and FP = 0. It always
makes negative predictions. All positive instances are wrongly predicted and all negative
instances are correctly predicted. It commits no false positive errors.
• The right top corner point (1; 1): Always positive prediction
A classifier which produces this point in the ROC space always classifies an example as
positive because for this point FN = 0 and TN = 0. All positive instances are correctly
predicted and all negative instances are wrongly predicted. It commits no false negative
errors.
• The left top corner point (0; 1): Perfect prediction
A classifier which produces this point in the ROC space may be thought as a perfect classifier. It
produces no false positives and no false negatives.
Consider a classifier where the class labels are randomly guessed, say by flipping a coin. Then
the corresponding points in the ROC space will lie very near the diagonal line joining (0; 0) and (1; 1).
• Different values of the parameter will give different classifiers and these in turn give
different values to TPR and FPR.
• The ROC curve is the curve obtained by plotting in the ROC space the points (FPR; TPR)
obtained by assigning all possible values to the parameter in the classifier.
• The closer the ROC curve is to the top left corner (0; 1) of the ROC space, the better the
accuracy of the classifier.
• Area Under the Curve (AUC) is the measure of the ability of a classifier to distinguish
between classes and is used as a summary of the ROC curve.
• The higher the AUC, the better the performance of the model at distinguishing between
the positive and negative classes.
• When AUC = 1, the classifier is able to distinguish perfectly between all the Positive
and the Negative class points.
• If, however, the AUC had been 0, then the classifier would be predicting all Negatives as
Positives, and all Positives as Negatives.
• When AUC is 0.5, the model has no discriminative ability; it is no better than random guessing.
• Asked whether an instance is positive or negative, such a model effectively answers: well, maybe it's
positive, maybe it's negative (50:50).
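A minimal sketch of computing an ROC curve and its AUC with scikit-learn; the classifier and the random data are illustrative (with random labels, the AUC comes out near 0.5):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X = np.random.rand(500, 5)             # illustrative data
y = np.random.randint(0, 2, 500)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]        # prediction scores, not hard labels

fpr, tpr, thresholds = roc_curve(y_test, scores)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_test, scores)
print("AUC:", auc)                                # ~0.5 here, since the labels are random
```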
How Does the AUC-ROC Curve Work?
• In a ROC curve, a higher value on the x-axis (FPR) indicates that a larger share of the actual negatives are being misclassified as false positives.