Guide
● Figure 1-5
Figure 1-5 shows a supervised machine learning system; in this system, the training
set fed into the algorithm includes the desired outcomes (solutions), referred to
as "labels". A typical supervised learning task is classification; an example of
this task is spam email detection, whereby unwanted emails are labelled as
spam and wanted emails are labelled as ham. Another task is prediction, where
the system must predict a target numeric value; this is referred to as the regression
task.
Some regression algorithms can be used to carry out classification tasks as well (for example, Logistic Regression).
● Figure 1-7
Figure 1-7 shows an unlabelled training set for an unsupervised machine learning system.
● Figure 1-11
Figure 1-11 shows semi-supervised learning with two classes (triangles and
squares); the unlabelled instances help classify the new instance (X) into the triangle
class. Semi-supervised learning uses plenty of unlabelled instances and only a
few labelled instances. It is a combination of supervised and unsupervised
learning algorithms.
● Figure 1-12
Figure 1-12 shows reinforcement learning, which is typical of robots. The learning
system, called an agent, observes the environment, selects and performs
actions, and gets rewards (or penalties) in return.
● Figure 1-15
In instance-based learning, the machine learns the examples by heart and then
generalizes to new cases by using a similarity measure to compare the new
instances to the learned examples.
● Figure 1-16
This shows that the model is overfitting the training set: the model/algorithm is
too complex, the validation (generalization) error is high, and the training error is
low. To address this, regularization should be applied, more training data is
needed, and the noise in the data needs to be reduced.
● Figure 1-23
This image shows the effect of regularization on the model: constraining the model
makes it simpler, so it fits the training data slightly worse but generalizes better to new instances.
Chapter 2
Once the dataset is obtained, the first thing to do is to visualize the data via
plots. Afterwards, the data is split in two: the training data and the test data. The
split is carried out by picking instances randomly, as in the sketch below.
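The code the notes refer to is not included; a minimal sketch of such a random split, assuming a pandas DataFrame (here called housing, as in the book's running example) and NumPy:

import numpy as np

def split_train_test(data, test_ratio):
    # shuffle the row indices, then carve off the first test_ratio share as the test set
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

train_set, test_set = split_train_test(housing, 0.2)  # 20% test / 80% training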
The split is usually 20% test data and 80% training data. After training the model using
the training set, the model is evaluated using the test set. By evaluating the model
on the test set, we can estimate the generalization error, and this value tells us how
the model would perform on new instances.
Every time this function is run, it produces different sets because of its
randomness; a way to lock this is to set the random number generator's seed.
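For example, either by seeding NumPy before the split or, if Scikit-Learn is used, by passing a fixed random_state to train_test_split:

np.random.seed(42)   # makes np.random.permutation reproducible across runs

from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)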
When creating a test set, it is important that it is representative of
the whole dataset. Random sampling is generally fine if the dataset is large enough
(especially relative to the number of features), but if it is not, one runs the risk of
introducing significant sampling bias. A way of avoiding this is to carry out stratified sampling:
dividing the dataset into homogeneous subgroups (strata) and sampling the right
number of instances from each stratum, so that the test set is representative of the entire dataset.
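A sketch of stratified sampling with Scikit-Learn's StratifiedShuffleSplit, assuming a categorical column (here income_cat, as in the book's housing example) has been created to stratify on:

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]   # each income category is represented proportionally
    strat_test_set = housing.loc[test_index]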
● Be able to answer questions pertaining to training and evaluating on the training set, as
discussed in "Training and Evaluating on the Training Set," pp. 72-73
When training on your data, different algorithms can be used and compared for accuracy. For a
prediction (regression) task we can use the Linear Regression model. After trying it on a few instances
from the training set, it was observed that the predictions are not very accurate, so we measured the
performance of the model using the RMSE (root mean square error). A Decision Tree model is then
trained and evaluated on the training set as well. Cross-validation techniques help to evaluate the
trained model better: there are cases where the model is overfitting the training data but this is not
evident from the training-set RMSE, which may even report a perfect model. The best thing is to
cover all bases by carrying out cross-validation.
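A sketch of that comparison, assuming the book's Chapter 2 variable names (housing_prepared for the prepared features, housing_labels for the targets):

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
lin_rmse = np.sqrt(mean_squared_error(housing_labels, lin_reg.predict(housing_prepared)))

tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)
tree_rmse = np.sqrt(mean_squared_error(housing_labels, tree_reg.predict(housing_prepared)))
# tree_rmse can come out as 0.0 -- a sign of overfitting, not of a perfect model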
One way to start is to split the training set into a smaller training set and a validation set, train on
the former and evaluate on the latter. A better alternative is to use Scikit-Learn's K-fold cross-
validation feature, which splits the training set into several folds (e.g., 10 folds), sets one
aside for evaluation each time and trains on the other 9. The result is an array containing 10
evaluation scores. The benefit of this is that it shows whether the model is overfitting
the training data.
scoring="neg_mean_squared_error", cv=10)
Chapter 3
● The MNIST dataset is very (very) widely used. As such, you should be
able to describe the MNIST dataset, as well as explain Python code
segments in the section "MNIST," pp. 85-87
The first step is to fetch the data; the dataset can be downloaded through Scikit-Learn. It has a
dictionary-like structure, including: a DESCR key describing the dataset, a data key
containing an array with one row per instance and one column per feature, and a target key
containing the array of labels.
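The fetch code is not reproduced in these notes; a minimal sketch using Scikit-Learn's fetch_openml:

from sklearn.datasets import fetch_openml
import numpy as np

mnist = fetch_openml('mnist_784', version=1, as_frame=False)  # as_frame=False: ask for NumPy arrays
X, y = mnist["data"], mnist["target"]
y = y.astype(np.uint8)   # the labels come back as strings, so cast them to integers
X.shape                  # (70000, 784)
y.shape                  # (70000,)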
The shapes above show the data size: there are 70,000 images, each having
784 features, because each image is 28 × 28 pixels (each feature is simply one pixel's intensity).
The next code snippet reshapes one instance's feature vector into a 28 × 28 array and displays it
using Matplotlib's imshow() function.
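That snippet is not included either; a sketch of what it presumably looks like, plus the standard train/test split that the later snippets rely on:

import matplotlib.pyplot as plt

some_digit = X[0]                               # take the first instance
some_digit_image = some_digit.reshape(28, 28)   # back to a 28 x 28 pixel grid
plt.imshow(some_digit_image, cmap="binary")
plt.axis("off")
plt.show()

# MNIST is already arranged as a train/test split: first 60,000 images for training, last 10,000 for testing
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]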
A binary classifier handles a classification task with two class labels, e.g. a normal
state and an abnormal state. In this section we are trying to classify which images are 5s and which
are not 5s:
y_train_5 = (y_train == 5) # True for all 5s, False for all other digits
y_test_5 = (y_test == 5)
The SGD classifier guesses that the image represents a 5; to evaluate the model's
performance we use cross-validation.
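The training and prediction steps are not shown in the notes; a sketch with Scikit-Learn's SGDClassifier, using the variables defined above:

from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)
sgd_clf.predict([some_digit])   # array([ True]): the classifier guesses this image is a 5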
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train[train_index]
    y_train_folds = y_train_5[train_index]
    X_test_fold = X_train[test_index]
    y_test_fold = y_train_5[test_index]
    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    print(sum(y_pred == y_test_fold) / len(y_pred))  # ratio of correct predictions on this fold
The StratifiedKFold class performs stratified sampling to produce folds that contain
a representative ratio of each class.
At each iteration the code creates a clone of the classifier, trains that clone on the
training folds, and then makes predictions on the test fold. It then counts the
number of correct predictions and outputs the ratio of correct predictions.
Second code snippet
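The snippet itself is not reproduced in the notes; it presumably resembles:

from sklearn.model_selection import cross_val_score

cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
# returns one accuracy score per fold (an array of 3 values)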
The code above uses the cross_val_score() function to evaluate the
SGDClassifier model. The first thing it does is create k folds, in this case 3
folds; then it trains and evaluates the model 3 times, picking a different fold for evaluation
each time and training on the other 2 folds. It gives above 93% accuracy on all cross-
validation folds.
Next, look at a dumb classifier that classifies every image as not-5 and evaluate it with the same
cross-validation:
from sklearn.base import BaseEstimator
import numpy as np

class Never5Classifier(BaseEstimator):
    def fit(self, X, y=None):
        return self                                # nothing to learn
    def predict(self, X):
        return np.zeros((len(X), 1), dtype=bool)   # always predict "not a 5"
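The notes then jump straight to interpreting a confusion matrix; the computation itself is not shown, but it presumably uses out-of-sample predictions from cross_val_predict:

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
confusion_matrix(y_train_5, y_train_pred)   # rows = actual classes, columns = predicted classes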
The rows are the actual classes and the columns are the predicted classes. The
interpretation of the result is that 53,057 instances were correctly classified as
non-5s, while 1,522 were wrongly classified as 5s. The second row
shows that 1,325 were wrongly classified as non-5s, while 4,095 were correctly
classified as 5s.
● Precision/Recall Trade-off, especially Figure 3-3 and Figure 3-4, along with
associated Python code, pp. 93-97
Precision is a measure of the accuracy of the positive predictions; recall is the ratio of
the positive instances that are correctly detected by the classifier.
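With Scikit-Learn these can be computed directly from the cross-validated predictions above:

from sklearn.metrics import precision_score, recall_score

precision_score(y_train_5, y_train_pred)   # TP / (TP + FP)
recall_score(y_train_5, y_train_pred)      # TP / (TP + FN)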
The image shows precision and recall at a few different decision thresholds; each threshold yields a different precision/recall pair.
The higher the precision, the lower the recall; you can't have both at their highest at the same
time, and that is what the trade-off means. You have to decide whether you want the model
to favour precision or recall. For each instance, the classifier computes a score based on its
decision function, and the trade-off comes from the threshold applied to that score.
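The code the next paragraph refers to is not reproduced; a sketch of thresholding the decision function manually (the 8000 value is just an illustrative large threshold):

y_scores = sgd_clf.decision_function([some_digit])   # raw score for this instance
threshold = 0
y_some_digit_pred = (y_scores > threshold)    # True: predicted as a 5

threshold = 8000                              # a much higher threshold
y_some_digit_pred = (y_scores > threshold)    # now False: this actual 5 is missed, so recall drops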
If the score is greater than the threshold, the instance is assigned to the positive class; if it is less
than the threshold, it is assigned to the negative class, as the code above shows. This confirms that
raising the threshold decreases the recall: the image is a 5, but when the threshold is raised
the classifier misses it, so it is important to know which threshold to use.
To do this, use the cross_val_predict() function to get the scores of all the instances in the
training set, specifying that you want it to return decision scores;
then use the precision_recall_curve() function to compute the precision and recall
for all possible thresholds; afterwards use Matplotlib to plot precision and
recall as functions of the threshold.
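A sketch of those three steps:

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve

y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

plt.plot(thresholds, precisions[:-1], "b--", label="Precision")  # one fewer threshold than precision values
plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
plt.legend()
plt.show()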
This is the precision and recall versus decision threshold plot; it helps to decide which threshold to
use, since it shows precision and recall for all possible thresholds.
● Understand Error Analysis, especially the included Python code and graphs,
pp. 102-105
Error analysis: this is the process of analyzing the types of errors a model
makes.
I. Make predictions using the cross_val_predict() function, then compute the confusion
matrix using the confusion_matrix() function and plot it (e.g., with Matplotlib's matshow()):
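A sketch, assuming sgd_clf has by now been trained on the full 10-class problem and X_train_scaled is the standardized training set used in the book:

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)

plt.matshow(conf_mx, cmap=plt.cm.gray)   # most images land on the main diagonal (correct classifications)
plt.show()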
The 5s look slightly darker than the other digits, which could mean that there
are fewer images of 5s in the dataset and/or that the classifier does not
perform as well on 5s as on other digits.
II. Divide each value by the number of images in the corresponding class (to compare error rates), fill the diagonal with zeros to keep only the errors, and plot the result:
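Continuing from the previous snippet:

import numpy as np

row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums         # compare error rates rather than absolute counts
np.fill_diagonal(norm_conf_mx, 0)         # keep only the errors
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()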
Results: The confusion matrix is not necessarily symmetrical. The errors
include:
I. The column for class 8 is quite bright, which tells you that many
images get misclassified as 8s.
II. However, the row for class 8 is not that bad, telling you that
actual 8s, in general, get properly classified as 8s.
III. 3s and 5s often get confused (in both directions).
Solution: Time should be spent on reducing the false 8s. For example:
I. Gather more training data for digits that look like 8s (but are not) so
that the classifier can learn to distinguish them from real 8s.
II. Engineer new features that would help the classifier, for example,
writing an algorithm to count the number of closed loops (e.g., 8 has two,
6 has one, 5 has none).
III. Preprocess the images (e.g., using Scikit-Image, Pillow, or OpenCV) to
make some patterns, such as closed loops, stand out more.
For example: plot examples of 3s and 5s (the plot_digits() function just uses
Matplotlib's imshow() function).
The reason they get confused is that the SGDClassifier is a simple linear model.
All it does is assign a weight per class to each pixel, and when it
sees a new image it just sums up the weighted pixel intensities
to get a score for each class. So since 3s and 5s differ only by
a few pixels, this model will easily confuse them.
Chapter 4
This (the Linear Regression model) is one of the simplest models there is. It is a supervised learning model.
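A minimal sketch of fitting a linear model with Scikit-Learn, on toy data in the spirit of the chapter's running example:

import numpy as np
from sklearn.linear_model import LinearRegression

X = 2 * np.random.rand(100, 1)              # one input feature
y = 4 + 3 * X + np.random.randn(100, 1)     # y = 4 + 3x + Gaussian noise

lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg.intercept_, lin_reg.coef_           # should come out close to 4 and 3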
● Gradient Descent: Explain Figure 4-3, Figure 4-4, Figure 4-5, Figure 4-6, Figure
4-7, pp. 118-121
How it works:
Gradient Descent measures the local gradient of the error function with regard
to the parameter vector θ, and it goes in the direction of descending gradient.
Once the gradient is zero, you have reached a minimum!
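A sketch of Batch Gradient Descent for the linear model above (X and y from the earlier toy example; X_b is X with a bias column of 1s prepended):

X_b = np.c_[np.ones((len(X), 1)), X]   # add x0 = 1 to each instance

eta = 0.1                              # learning rate
n_iterations = 1000
m = len(X_b)

theta = np.random.randn(2, 1)          # random initialization of the parameter vector
for iteration in range(n_iterations):
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)   # gradient of the MSE cost function
    theta = theta - eta * gradients                     # step in the direction of descending gradient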
Figure 4-3:
In this depiction of GD, the algorithm starts by filling θ with random values (this is called
random initialization). Then it gradually tweaks the model parameters to minimize
the cost function (e.g., the MSE) over the training set, until the algorithm
converges to a minimum.
Also, the steps gradually get smaller as the parameters approach the minimum,
because the step size is proportional to the slope of the cost
function.
Figure 4-4:
In this depiction of GD, the learning rate is too small, so the algorithm has to go through many
iterations to converge, which takes a long time.
Figure 4-5:
In this depiction of GD, the learning rate is too high, so GD might jump
across the valley and end up possibly even higher than it was before.
This might make the algorithm diverge, with larger and larger values, failing to
find a good solution.
Figure 4-6:
In this depiction of GD, the cost function has holes, ridges, plateaus, and all
sorts of irregularities, making convergence to the minimum difficult.
Figure 4-7:
In this depiction of GD, the cost function has the shape of a bowl.
The figure shows GD on a training set where features 1 and 2 have the same scale (on the left),
and on a training set where feature 1 has much smaller values than feature 2 (on
the right); in the latter case the bowl is elongated, so GD takes much longer to converge,
which is why features should be scaled before training.
Polynomial Regression (PR) is a more complex model that can fit non-linear datasets.
How to do PR: add powers of each feature as new features, then train a linear model on this
extended set of features.
Demerit: a high-degree polynomial regression will fit the training data
much better than a plain linear model, but it is likely to severely overfit it.
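A sketch using Scikit-Learn's PolynomialFeatures (degree 2, as in the book's example), applied here to the earlier toy data just to show the mechanics (the book uses a quadratic dataset):

from sklearn.preprocessing import PolynomialFeatures

poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)   # X_poly now contains both x and x^2 for each instance
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)                    # a plain linear model fit on the extended feature set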
Figure 4-15:
In this depiction of learning curves, the linear model's performance on the training set and the
validation set is shown as a function of the training set size. In conclusion, this
model underfits the training data.
When there are just one or two instances in the training set, the model
can fit them perfectly, which is why the curve starts at zero. But as new
instances are added to the training set, it becomes impossible for the
model to fit the training data perfectly, both because the data is noisy and
because it is not linear at all. So the error on the training data goes up
until it reaches a plateau, at which point adding new instances to the
training set doesn’t make the average error much better or worse.
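The learning curves in Figures 4-15 and 4-16 are typically produced by a helper such as the following sketch (the book defines a similar plot_learning_curves() function):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])            # train on the first m instances only
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")   # RMSE on the training set
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")        # RMSE on the validation set
    plt.legend()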
Figure 4-16:
In this depiction of learning curves, the 10th-degree polynomial model's performance on
the training set and the validation set is shown as a function of the training set size. In
conclusion, this model overfits the training data.
Solution: feed the model more training data (increase the size of the training set)
until the validation error reaches the training error.
These learning curves look a bit like the previous ones, but there are two very important
differences:
I. The error on the training data is much lower (than with the linear model).
II. There is a gap between the curves. This means that the model performs
significantly better on the training data than on the validation data, which is the
hallmark of an overfitting model.
● Early Stopping, as illustrated in Figure 4-20, p. 141, as well as the Python code
on page 142
Figure 4-20:
In this depiction of learning curves, a complex model (high-degree Polynomial
Regression) being trained with Batch Gradient Descent demonstrates early stopping.
As the epochs go by the algorithm learns, and its prediction error (RMSE) on the
training set and validation set both decrease. After a while though, the validation
error stops decreasing and starts to go back up (this indicates that the model has
started to overfit the training data). With early stopping, you just stop training as soon as
the validation error reaches the minimum (this is indicated as the best model).
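The page-142 code is not reproduced in the notes; a sketch of basic early stopping with warm starts, assuming X_train_poly_scaled / X_val_poly_scaled are the preprocessed training and validation sets and y_train / y_val their targets:

from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                       penalty=None, learning_rate="constant", eta0=0.0005)

minimum_val_error = float("inf")
best_epoch, best_model = None, None
for epoch in range(1000):
    sgd_reg.fit(X_train_poly_scaled, y_train)       # warm_start=True: each fit continues where it left off
    y_val_predict = sgd_reg.predict(X_val_poly_scaled)
    val_error = mean_squared_error(y_val, y_val_predict)
    if val_error < minimum_val_error:               # keep the model with the lowest validation error so far
        minimum_val_error = val_error
        best_epoch = epoch
        best_model = deepcopy(sgd_reg)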
● Explain Decision Boundary, as illustrated in Figure 4-23, p. 146
This figure depicts a model’s (log_reg) estimated probabilities for flowers with petal
widths varying from 0 cm to 3 cm.
The petal width of Iris virginica flowers (represented by triangles) ranges from 1.4
cm to 2.5 cm, while the other iris flowers (represented by squares) generally have a
smaller petal width, ranging from 0.1 cm to 1.8 cm.
Notice that there is a bit of overlap:
Above about 2 cm the classifier is highly confident that the flower is an Iris virginica
(it outputs a high probability for that class), while below 1 cm it is highly
confident that it is not an Iris virginica (a high probability for the "Not Iris virginica" class).
In between these extremes, the classifier is unsure. However, if you ask it to predict
the class (using the predict() method rather than the predict_proba() method), it will
return whichever class is the most likely. Therefore, there is a decision boundary
at around 1.6 cm where both probabilities are equal to 50%: if the petal width is
higher than 1.6 cm, the classifier will predict that the flower is an Iris virginica, and
otherwise it will predict that it is not (even if it is not very confident).
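A sketch of the model behind this figure, based on the book's Iris example:

import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X = iris["data"][:, 3:]                     # petal width (cm) only
y = (iris["target"] == 2).astype(int)       # 1 if Iris virginica, else 0

log_reg = LogisticRegression()
log_reg.fit(X, y)

X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)
decision_boundary = X_new[y_proba[:, 1] >= 0.5][0]   # around 1.6 cm
log_reg.predict([[1.7], [1.5]])                      # array([1, 0])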