Lecture - 6.2 - Logistic Regression - Stanford ML Andrew Ng
library(knitr)  % R Markdown setup used to render these notes
As mentioned, logistic regression is one of the main and simplest CLASSIFICATION algorithms. It extends the idea of linear regression to cases where the dependent variable, y, only has two possible outcomes, called classes. (Careful: this only applies to binary logistic regression; multinomial logistic regression deals with situations where the outcome can have three or more possible types, e.g. "disease A" vs. "disease B" vs. "disease C".)
Examples of dependent variables that could be modelled with logistic regression are whether a new business will succeed or fail, whether a loan will be approved or rejected, and whether a stock will increase or decrease in value. These are all called classification problems, since the goal is to figure out which class each observation belongs to.
For us humans it would be really easy to differentiate the groups in each of these datasets, and we would immediately be able to draw a boundary that separates them. But how can a machine do this? Moving onwards in this blog, we will focus on the example shown in the third figure, which represents a binary classification problem.
For the sake of curiosity, it is worth mentioning that the logistic function is used to describe many real-world situations, for example population growth. This is easily understood by looking at the normalised graph: the initial stages show exponential growth, but after some time, due to competition for certain resources (a bottleneck), the growth rate decreases until it plateaus and there is no further growth.
This is really cool, but how is this going to help us determine the probability of classifying data into different classes?
The coefficients, or θ values, are selected to maximize the likelihood of predicting a high probability for observations actually belonging to class 1, and a low probability for observations actually belonging to class 0. The graph gives a hint of the coefficients we would like to achieve.
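To make "maximizing the likelihood" concrete, here is a minimal Matlab/Octave sketch of the quantity being maximized; the design matrix X (one row per observation, with a leading column of ones), the 0/1 label vector y and the parameter vector theta are assumed names, not from the original notes:

% Log-likelihood of theta on a labelled dataset (names are illustrative).
h = 1 ./ (1 + exp(-X * theta));                     % predicted P(y = 1) per observation
loglik = sum(y .* log(h) + (1 - y) .* log(1 - h));  % high when class-1 rows get high h
% Fitting logistic regression means choosing theta to maximize loglik.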
2.1.3 Step 2
In the second step of logistic regression, a threshold value is used to classify each observation into one of the classes. For example, if we chose a threshold of 0.5, any observation with P(y = 1) > 0.5 would be classified into class 1, and the rest into class 0.
The threshold value is something the user can specify, so which is the best value for it? This choice often depends on error preferences, and there are two types of errors that we need to take into consideration: false positives and false negatives.
A false positive error is made when the model predicts class 1 but the observation actually belongs to class 0. A false negative error is made when the model predicts class 0 but the observation actually belongs to class 1. The perfect model would classify all classes correctly: all 1's (trues) as 1's, and all 0's (falses) as 0's. We would then have FN = FP = 0! Fantastic!
1. Higher threshold value. Classify as 1 if P(y = 1) > 0.7. The model is more restrictive when classifying observations as 1's, and therefore more false negative errors will be made.
2. Lower threshold value. Classify as 1 if P(y = 1) > 0.3. The model is now less strict and we classify more examples as class 1, therefore we make more false positive errors (a short sketch of this effect follows below).
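Here is a minimal sketch of how the threshold shifts predictions; the five probability values are made up for illustration:

% Effect of the threshold on predicted classes (illustrative probabilities).
p = [0.10 0.35 0.55 0.80 0.95];   % predicted P(y = 1) for five observations
pred_default = p > 0.5            % 0 0 1 1 1 -- balanced threshold
pred_strict  = p > 0.7            % 0 0 0 1 1 -- fewer 1's, more false negatives
pred_loose   = p > 0.3            % 0 1 1 1 1 -- more 1's, more false positives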
OK, this is all very good to know, but how do we choose the strategy to follow? How do we know if we want to make more FN errors or more FP errors? A fair answer is that the strategy to follow initially depends on your experience and on how risk-averse the problem you are trying to solve is. Let's present two examples:
One application where decision-makers often have an error preference is in disease prediction. Suppose
you built a model to predict whether or not someone will develop heart disease in the next 10 years. We
will consider class 1 to be the outcome in which the person does develop heart disease, and class 0 the
outcome in which the person does not develop heart disease. If you pick a high threshold, you will tend to
make more false negative errors, which means that you predicted that the person would not develop heart
disease, but they actually did. If you pick a lower threshold, you will tend to make more false positive
errors, which means that you predicted they would develop heart disease, but they actually did not. In this
case, a false positive error is often preferred. Unnecessary resources might be spent treating a patient who
did not need to worry, but you did not let as many patients go untreated (which is what a false negative
error does).
Now, let's consider spam filters. Almost every email provider has a built-in spam filter that tries to detect whether or not an email message is spam. Let's classify spam messages as class 1 and non-spam
messages as class 0. Then if we build a logistic regression model to predict spam, we will probably want
to select a high threshold. Why? In this case, a false positive error means that we predicted a message
was spam, and sent it to the spam folder, when it actually was not spam. We might have just sent an
important email to the junk folder! On the other hand, a false negative error means that we predicted a
message was not spam, when it actually was. This creates a slight annoyance for the user (since they have
to delete the message from the inbox themselves) but at least an important message was not missed.
To understand better how to choose the threshold value, we make use of the confusion matrix and
the ROC curve.
Numeric example
Imagine we have created a classification system that has been trained to distinguish between cats, dogs and rabbits; a confusion matrix will summarize the results of testing the algorithm for further inspection. Assuming a sample of 27 animals (8 cats, 6 dogs, and 13 rabbits), the resulting confusion matrix could look like the one below.
- SENSITIVITY: measures the proportion of positives which are correctly identified as such.
- SPECIFICITY: measures the proportion of negatives which are correctly identified as such.
- ACCURACY: measures the proportion of positives and negatives that have been correctly labelled.
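For the binary case, these three metrics fall straight out of the confusion matrix counts. A small sketch with made-up counts (not the cats/dogs/rabbits numbers):

% Metrics from a binary confusion matrix (counts are illustrative).
TP = 40; FN = 10;                 % observations actually in class 1
FP = 5;  TN = 45;                 % observations actually in class 0
sensitivity = TP / (TP + FN)      % true positive rate
specificity = TN / (TN + FP)      % true negative rate
accuracy = (TP + TN) / (TP + TN + FP + FN)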
recognition rate for the dog class. Therefore, if your goal was to build a machine that was able to classify both types, using accuracy wouldn't be useful, because dogs wouldn't be detected. This is why you always have to study the results of the data and not rely solely on a single number or parameter.
The best possible prediction method would yield a point in the upper-left corner, coordinate (0, 1), of the ROC space, representing 100% sensitivity (no false negatives) and 100% specificity (no false positives). The (0, 1) point is also called a perfect classification.
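An ROC curve is traced by sweeping the threshold and recording the true and false positive rates at each value. A minimal sketch, with made-up probabilities and labels:

% Tracing ROC points by sweeping the threshold (illustrative data).
p = [0.10 0.35 0.55 0.80 0.95];   % predicted P(y = 1)
y = [0 0 1 1 1];                  % true labels
for t = 0:0.25:1
  pred = p > t;
  TPR = sum(pred & y) / sum(y);     % sensitivity
  FPR = sum(pred & ~y) / sum(~y);   % 1 - specificity
  fprintf('t = %.2f  TPR = %.2f  FPR = %.2f\n', t, TPR, FPR);
end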
4 Hypothesis
Recall that the hypothesis expression for linear regression was:

h_θ(x) = θ^T x

Logistic regression cannot rely solely on a linear expression to classify; in addition, using a linear classifier boundary would require the user to establish a threshold at which the predicted continuous values are grouped into the different classes. This is why logistic regression makes use of the sigmoid function. The logistic regression hypothesis function is:

h_θ(x) = g(θ^T x) = 1 / (1 + e^(−θ^T x))
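A one-line Matlab/Octave sketch of the sigmoid, which maps any real input into the (0, 1) range:

% The sigmoid function as an anonymous function; works element-wise.
sigmoid = @(z) 1 ./ (1 + exp(-z));
sigmoid(0)          % 0.5 -- the midpoint of the curve
sigmoid([-5 0 5])   % approx [0.0067 0.5000 0.9933]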
Let's say we decide to establish a threshold of 0.5 (matching the value at which the sigmoid crosses the y-axis). So h_θ(x) ≥ 0.5 whenever θ^T x ≥ 0: we predict y = 1 when θ^T x ≥ 0, and y = 0 otherwise.
Using the same analysis as we did for the general sigmoid function:
IMPORTANT NOTE! From these three examples we observe that the decision boundary is not directly defined by the training dataset, but by the θ parameters.
6 Cost function
Now that we know the expression of our logistic regression hypothesis, we need to define the cost function we will use to evaluate the errors a logistic model makes. Recall the cost function for linear regression:

J(θ) = (1/2m) Σ (h_θ(x^(i)) − y^(i))²

If we minimized this function using our new hypothesis h_θ(x^(i)), we could not be sure of converging to the global minimum of the cost function! Since h_θ(x) = 1/(1 + e^(−θ^T x)) is not linear, the squared-error cost becomes non-convex and we might end up in a local minimum. Therefore, our new cost function will be:

Cost(h_θ(x), y) = −log(h_θ(x)) if y = 1;  −log(1 − h_θ(x)) if y = 0

which can be written in the simplified form:

J(θ) = −(1/m) Σ [ y^(i) log(h_θ(x^(i))) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]
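A sketch of this cost as a Matlab/Octave function (saved as costFunction.m), returning the gradient as well, since the optimizer used later can exploit it. X is assumed to be the m x (n+1) design matrix with a leading column of ones, and y the m x 1 vector of 0/1 labels:

% file: costFunction.m -- simplified logistic cost and its gradient, vectorized.
function [J, grad] = costFunction(theta, X, y)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                          % h_theta for every example
  J = (1/m) * sum(-y .* log(h) - (1 - y) .* log(1 - h));   % simplified cost
  grad = (1/m) * (X' * (h - y));                           % partial derivatives of J
end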
7 Multiclass classification
Until now we have been focusing only on binary classification, but what happens when, instead of classifying with a yes or a no, we want to classify among many classes? For example, classifying your emails into family, work, travel, bills, etc.? The answer is very simple: we run a binary classifier for each possible class (one-vs-all) and then pick the class whose classifier returns the highest probability.
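A minimal sketch of the prediction step; all_theta (one fitted parameter row per class, K x (n+1)) and the single example x (1 x (n+1), with its leading 1) are assumed names:

% One-vs-all prediction: score the example under every class's classifier.
probs = 1 ./ (1 + exp(-(all_theta * x')));   % K x 1 vector of P(class k | x)
[~, predicted_class] = max(probs);           % index of the most probable class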
7.1 Overfitting/underfitting
The best way to understand over- and underfitting is to show three simple examples:
Underfitting: clearly the model is too simple and requires additional features to adjust better to the data.
Overfitting: the model adjusts incredibly well to the training data, but when applied to unseen data it will fail to generalise to new examples.
How do we know if our model is overfitting?
1. In very simple models where we only use 2 parameters, we can plot the data (just like we did above) and check whether the model is under- or overfitting.
2. With models where the number of features is really high, we cannot take this approach, as visualising the data would be extremely complicated.
2.1. This can be solved by feature reduction, so that the data can then be plotted; this will be explained in further posts.
2.2. It can also be solved by REGULARIZATION.
8 Introducing regularization
Suppose we have the previous models, and that the regression model is the following (a fourth-order hypothesis, as in the overfitting example):

h_θ(x) = θ₀ + θ₁x + θ₂x² + θ₃x³ + θ₄x⁴

The general idea is that we want to make the high-order parameters (θ₃ and θ₄) really small, to get a simpler model that is less prone to overfitting.
Perfect: by doing this we have reduced our function to a second-order model, which is simpler. But in this case we knew which the high-order terms were. How do we deal with the general case, where we might have 100 features? How do we know which the high-order terms are? To tackle this issue, we introduce the REGULARIZATION PARAMETER, which penalizes the magnitude of every parameter (except θ₀) in the cost function:

J(θ) = −(1/m) Σ [ y^(i) log(h_θ(x^(i))) + (1 − y^(i)) log(1 − h_θ(x^(i))) ] + (λ/2m) Σ θⱼ²
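A sketch of the regularized cost in Matlab/Octave, mirroring the costFunction sketch above; by convention the intercept θ₀ (theta(1) in Matlab's 1-based indexing) is not penalized:

% file: costFunctionReg.m -- regularized logistic cost and gradient.
function [J, grad] = costFunctionReg(theta, X, y, lambda)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));
  reg = (lambda / (2*m)) * sum(theta(2:end) .^ 2);               % skip the intercept
  J = (1/m) * sum(-y .* log(h) - (1 - y) .* log(1 - h)) + reg;
  grad = (1/m) * (X' * (h - y));
  grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);       % regularize j >= 1
end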
Gradient descent:

θ₀ := θ₀ − α (1/m) Σ (h_θ(x^(i)) − y^(i)) x₀^(i)
θⱼ := θⱼ − α [ (1/m) Σ (h_θ(x^(i)) − y^(i)) xⱼ^(i) + (λ/m) θⱼ ]   for j ≥ 1
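A vectorized sketch of this update loop; the learning rate alpha, the regularization strength lambda and the iteration count are illustrative choices, not values from the original notes:

% Regularized gradient descent for logistic regression (illustrative settings).
alpha = 0.1; lambda = 1; num_iters = 400;
m = length(y);
for iter = 1:num_iters
  h = 1 ./ (1 + exp(-X * theta));
  grad = (1/m) * (X' * (h - y)) + (lambda/m) * [0; theta(2:end)];  % theta_0 unpenalized
  theta = theta - alpha * grad;
end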
Data sample:
The figure shows that our dataset cannot be separated into positive and negative examples by a straight line through the plot. Therefore, a straightforward application of logistic regression will not perform well on this dataset, since logistic regression will only be able to find a linear decision boundary.
2. To apply a non-linear decision boundary, and at the same time tackle the overfitting that a high-dimensional feature vector can introduce, we will add regularization parameters to our logistic regression model (a sketch of the feature mapping follows below).
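The high-dimensional vector alluded to above comes from mapping the two original features into polynomial terms. A sketch of the degree-6 mapping used in the course exercise; the column vectors x1 and x2 are assumed names:

% Map two features into all polynomial terms up to degree 6 (28 columns),
% so that a linear model in the mapped space yields a non-linear boundary.
degree = 6;
out = ones(size(x1));                                 % bias column of ones
for i = 1:degree
  for j = 0:i
    out(:, end+1) = (x1 .^ (i - j)) .* (x2 .^ j);     % x1^(i-j) * x2^j
  end
end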
9.2 Calculations
To show a numeric example of how the choice/optimization of the θ values affects the cost of the model, let's say we initialize all θ parameters to zero:
This is our starting point for checking whether the model is actually converging to a minimum. In linear regression we saw how gradient descent worked. Here, I want to present another way to optimize the θ parameters that minimize the cost function: a function already built into Matlab called FMINUNC. Here is the code to use it:
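The snippet in the original notes was an image; below is a minimal sketch of the typical invocation, assuming the costFunction sketched in the cost-function section above:

% Minimize the logistic cost with fminunc, supplying the gradient ourselves.
options = optimset('GradObj', 'on', 'MaxIter', 400);   % use our gradient, cap iterations
initial_theta = zeros(size(X, 2), 1);                  % start with all thetas at zero
[theta, cost] = fminunc(@(t) costFunction(t, X, y), initial_theta, options);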
fminunc will call the cost function (in this case, you have to code this function yourself; you can implement the simplified cost function for logistic regression shown above) with an initial θ and some options, such as the maximum number of iterations we want, and it returns both the optimal θ parameters and the final value of the cost function.
The results show that we have improved the model by finding θ values that drastically reduce the cost function. Let's plot the decision boundary that the logistic regression model has created:
The graph shows the linear boundary the logistic regression model has created. It divides most examples correctly, although there are some where the model incorrectly predicts the real outcome. Let's evaluate the accuracy on these training examples (in this case we haven't produced a test set on which to measure accuracy on unseen data). As we saw in the theory, the logistic regression model returns the probability of each example belonging to each class. To finally classify with a binary classification model like this one, we need to establish a threshold. Let's evaluate the performance of the model with different thresholds; these different selections will show the effect the threshold has on the model's accuracy, and which one would suit your specific problem best.
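A small sketch of that threshold sweep, reusing the fitted theta and the data names from the fminunc sketch above:

% Training-set accuracy at several candidate thresholds.
p = 1 ./ (1 + exp(-X * theta));       % predicted probabilities on the training set
for t = [0.3 0.5 0.7 0.9]
  acc = mean((p >= t) == y) * 100;    % percent of examples classified correctly
  fprintf('threshold %.1f -> accuracy %.1f%%\n', t, acc);
end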
Taking into consideration the nature of the problem, if we want to detect the microchips that have defects, we need to decide which threshold we are going to apply to the model. This decision would need to take into account many factors, such as chip manufacturing costs (in case we throw away non-defective chips that were labelled as defective), costs related to delivering bad-quality chips, etc. Let's say that we want to detect as many defective chips as possible. To do that, we would like to categorise as 1's (accepted) only the chips that have a very high probability, in other words, those whose quality parameters are nearly perfect. If we are very selective when categorising as 1's, then we are allowing more chips to be labelled as defective. Our accuracy would decrease, but we would be flagging nearly all (if not all) defective chips. Again, this is a really simplified way of thinking about quality checks, but it is useful for understanding threshold selection and the logistic regression classification model.
With a regularization parameter λ = 0, no high-order features are controlled or attenuated, and therefore the model is much more prone to overfitting.