
Introduction to Machine Learning

su wang
Winter 2016

1 Linear Regression

1.1 Prediction: Minimizing Cost

As a motivating example, we take the size of houses as an independent/input
variable, and the house price as a dependent/output variable. To introduce
some conventions:
m: the number of training examples.
(x^{(i)}, y^{(i)}): a single training example (i = 1, ..., m).
Given a batch of m pairs of xs and ys, our goal is to build a function which takes
a new x as the input and produces a y as a prediction. Specific to the current
example: given the size of a house, we would like to predict its price. The
predictive function is called a hypothesis, which takes the following form:

\[ h_\theta(x) = \theta^T x = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_p x_p \tag{1.1} \]

The $\theta_0$ here is called a bias, which is introduced to handle situations
where a reasonable prediction cannot be produced by adjusting the weights alone. $\theta_1, \ldots, \theta_p$ are weights, which indicate the relative contribution each
input variable makes. A proper weight combination gives sharp predictions, and
it is these weight combinations that our learning algorithms learn.
Then how do we find the weights? Basically, the weights in (1.1), together with
the bias term, define a fitting/regression line^1. The line that enables us to
make the best prediction possible is the one which is closest to all
the data points (i.e. the (x, y) pairs). In mathematical terms, the regression line
is under the constraint that the magnitude at which it deviates from the data
points is minimized. Call this deviation the error. The error is defined as follows^2:
\[ J(\theta_1 \ldots \theta_p) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 \tag{1.2} \]

\[ = \frac{1}{2m} \sum_{i=1}^{m} (\theta^T x^{(i)} - y^{(i)})^2 \tag{1.3} \]

Our goal, then, is:

\[ \operatorname{argmin}_\theta J(\theta_1 \ldots \theta_p) \tag{1.4} \]

^1 Note that the regression line is not necessarily a straight line. The essential difference
between a linear model and a nonlinear model is whether the regression line produced has a
constant rate of change!
^2 The $\frac{1}{2m}$ is so chosen for i) the convenience of differentiation; ii) obtaining the average error.
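As a concrete illustration (not from the original notes; a minimal sketch assuming a NumPy environment and a design matrix X whose leading column of ones absorbs the bias $\theta_0$), the cost (1.2) vectorizes as follows:

```python
import numpy as np

def cost(theta, X, y):
    """Squared-error cost J of eq. (1.2).

    X: (m, p+1) design matrix whose first column is all ones (the bias),
    y: (m,) vector of true outputs, theta: (p+1,) weight vector.
    """
    m = len(y)
    residuals = X @ theta - y            # h_theta(x^(i)) - y^(i), for all i
    return residuals @ residuals / (2 * m)
```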

1.2 Gradient Descent

Essentially, our task here is to find the set of $\theta_j$ (i.e. weights; j = 1, ..., p) such
that we have a regression line (or hypothesis) that makes the best prediction
of y for some new x. Such a regression line is defined as one that has the least
$J(\theta_1 \ldots \theta_p)$ possible (i.e. cost, or deviation from the training data points). Based
on this, we introduce a weight adjustment term:

\[ \frac{\partial}{\partial \theta_j} J(\theta_1 \ldots \theta_p), \quad j \in [1, p] \tag{1.5} \]

We know that the derivative indicates whether we are on an uphill slope or a
downhill slope. Specifically, when the derivative is positive, we are on an uphill
slope, and on a downhill slope otherwise. To find the optimal weights, we need
to reduce the weights while on an uphill slope, and increase them when on a
downhill slope. Based on this analysis, we use the following Gradient Descent
algorithm to adjust the weights:
repeat until convergence {

\[ \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_1 \ldots \theta_p), \quad j \in [1, p] \tag{1.6} \]

}
Here we can see that if we are on an uphill slope (i.e. $\frac{\partial J}{\partial \theta_j} > 0$), the weight is reduced;
otherwise it is increased. $\alpha$ here is called a learning rate. We use it to fine-tune the magnitude of the step at which we would like to adjust the weights^3.
In each iteration, we simultaneously update all the $\theta$s. In implementation terms,
we compute all the new weights, put them in temporary variables, and then
assign them to the weight variables at the same time. This type of simultaneous
updating is also called batch updating.
It turns out that, to obtain the weight adjustment terms, we do not have to do the differentiation for each weight $\theta_j$ by hand. In other words, the result for (1.5) is generalizable:

\[ \text{Adjust Bias Weight:} \quad \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \tag{1.7} \]

\[ \text{Adjust Other Weights:} \quad \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \tag{1.8} \]

^3 If the step is too large, we may overshoot; if too small, undershoot. In the former case,
we will be bouncing between hillsides and not efficiently reach a valley; in the latter case,
it will take too long for us to do so.
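A minimal sketch of the algorithm, under the same design-matrix convention as before (a leading column of ones), so that the bias update (1.7) and the weight update (1.8) collapse into one vectorized step:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Batch gradient descent for linear regression, eqs. (1.6)-(1.8)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / m   # (1/m) sum_i (h(x^(i)) - y^(i)) x^(i)
        theta = theta - alpha * grad       # simultaneous ("batch") update
    return theta
```

Note that the vectorized assignment updates all the weights at once, which is exactly the simultaneous update described above.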

1.3 Remnant Issues

1.3.1 Normalization

One way to boost the performance of the gradient descent algorithm is normalization. Normalized data yield a circular/spherical-shaped cost (hyper)surface,
which requires fewer oscillations before convergence.
The normalization is done as follows:
\[ x_{norm} = \frac{x - \mu}{sd} \tag{1.9} \]
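For instance, a column-wise sketch of (1.9), computing the sample statistics per feature:

```python
import numpy as np

def normalize(X):
    """Eq. (1.9), applied per feature: subtract the mean, divide by the
    standard deviation, so every column has mean 0 and unit spread."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```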

1.3.2 Choice of Learning Rate

We look at the following indicators for the learning rate $\alpha$ being too small or
too large:
Small $\alpha$: The gradient descent algorithm takes long to converge, and the
decrease of the cost $J(\theta)$ is minuscule ($10^{-3}$, for instance).
Large $\alpha$: The cost does not monotonically decrease at each step of gradient
descent.
By monitoring the behavior of the gradient descent algorithm, we will be able to
pinpoint a sweet spot for $\alpha$.
1.3.3 Choice of Features

Sometimes a straight-line linear regression does not fit the data well. By observing the data, we may find that a loglinear, or a power-3, curve fits the
scatter better. In these cases, we can create new variables in order to fine-tune
the shape of our fitting curve to do a better job^4.
1.3.4 Analytical Solution to Linear Regression

Of course, we will always have an analytical solution to a linear regression
problem, in the form of the following (Normal Equation):

\[ \theta = (X^T X)^{-1} X^T y \tag{1.10} \]

However, this does not mean that our gradient descent algorithm can thereby
be replaced. The following table shows the strengths/weaknesses of the two
approaches:

Gradient Descent         | Normal Equation
-------------------------|---------------------------
Need to choose $\alpha$  | No need to choose $\alpha$
Needs many iterations    | No need to iterate
Works well with large n  | Slow with large n

(A rule of thumb is to take $10^4$ as the threshold for "large" n.)

^4 But not dovetailing the data, because in that case we will have an overfitting problem (discussed
later on).
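A sketch of (1.10) in code; np.linalg.solve is used rather than an explicit matrix inverse, which is the numerically safer way to evaluate the same expression:

```python
import numpy as np

def normal_equation(X, y):
    """Eq. (1.10): theta = (X^T X)^(-1) X^T y, via a linear solve."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```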

2 Logistic Regression

2.1 Logistic Regression for Classification

In a typical classification problem (in its simplest form), the dependent variable
comes in two classes. Our task, then, is to predict the class of the dependent
variable given an independent variable.
On the surface, it seems that applying linear regression to such tasks would be
reasonable; however, this causes two major issues:
Linear regression may give out-of-bound values that are not easily interpreted.
Linear regression's prediction can easily be influenced by outliers, which
causes a major shift of the regression line.
While the outlier sabotage effect cannot be easily avoided^5, the out-of-bound problem can be fixed by mapping the range of the prediction to, for
instance, 0 to 1, if so desired^6. Specifically, we apply a method called the
sigmoid conversion to achieve our goal. Say that our model is $h_\theta(x) = \theta^T x$;
the sigmoid conversion of it will be as follows:

\[ g(h_\theta(x)) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}} \tag{2.1} \]

Apparently, the function maps $\theta^T x$, which is a real number (thus in $(-\infty, +\infty)$), to
(0, 1). When $\theta^T x = 0$, which can be interpreted as the independent variable(s)
having no predictive power, the output of the function is 0.5, which can be
interpreted as a "je ne sais quoi" (undecided) situation.
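In code, the conversion (2.1) is one line (a sketch; note that sigmoid(0.0) returns exactly 0.5, the undecided case just discussed):

```python
import numpy as np

def sigmoid(z):
    """Eq. (2.1): squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, X):
    """Logistic hypothesis: the sigmoid of the linear score theta^T x."""
    return sigmoid(X @ theta)
```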

2.2 Cost Function

We will define the cost function for logistic regression slightly differently here. To
engineer our cost function (of the hypothesis $h_\theta(x)$) such that it has the desired
properties, we put it as follows:

\[ Cost(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases} \tag{2.2} \]

Putting the cost function this way, we will have the following relationship
between the hypothesis/prediction and the cost, which is represented with a graphic.
^5 Sometimes we use other means than model choice to tackle this problem, for instance,
setting up a bar to identify and drop outliers.
^6 -1 to 1 is sometimes desired too. The [0, 1] interval, however, enables a probabilistic interpretation, and is in certain contexts more intuitive and fitting.

As we can see, when y = 0 (i.e. the actual class is 0), if the hypothesis also predicts
0 (approaching 0), the cost will be 0. On the other hand, if the prediction
approaches 1, which is incorrect, the cost will approach infinity. It is readily
checked that we also have this desired property when y = 1.
To integrate the two parts of (2.2) into one single cost function, we put down the
following:

\[ Cost(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x)) \tag{2.3} \]

It can be seen that (2.2) and (2.3) are equivalent: when y = 1, the second term in
(2.3) is equal to 0; when y = 0, the first term is equal to 0. Finally we have the
generalized form of the cost function for logistic regression:
\[ J(\theta) = \frac{1}{m} \sum_{i=1}^{m} Cost(h_\theta(x^{(i)}), y^{(i)}) \tag{2.4} \]

\[ = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] \tag{2.5} \]
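A direct transcription of (2.5) as a sketch; the small eps guards against log(0) when a prediction saturates at 0 or 1:

```python
import numpy as np

def logistic_cost(theta, X, y, eps=1e-12):
    """Cross-entropy cost of eq. (2.5) for labels y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))      # h_theta(x^(i)) for all i
    p = np.clip(p, eps, 1.0 - eps)              # avoid log(0)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```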

2.3 Gradient Descent

The gradient descent algorithm for logistic regression is almost identical to that
for linear regression:
repeat until convergence {

\[ \theta_j := \theta_j - \alpha \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \tag{2.6} \]

}
However, note that the hypothesis $h_\theta(x)$ is defined differently for linear regression and logistic regression:

\[ \text{Linear Regression:} \quad h_\theta(x) = \theta^T x \tag{2.7} \]

\[ \text{Logistic Regression:} \quad h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}} \tag{2.8} \]

Note that gradient descent is not the only game in town. The following alternatives are available^7:
Conjugate Gradient
BFGS
L-BFGS
These advanced algorithms are usually faster than basic gradient descent,
and you do not need to manually pick a learning rate (it is picked adaptively).
Therefore, they converge much faster.

^7 Although we will not get into the details of these algorithms.
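As an illustration of how one of these alternatives is used in practice, the sketch below hands the cost (2.5) and its gradient to SciPy's L-BFGS implementation; the toy data (X, y) is made up for the example:

```python
import numpy as np
from scipy.optimize import minimize

def cost_and_grad(theta, X, y):
    """Cost (2.5) and its gradient, in the form scipy's optimizers expect."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    cost = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    grad = X.T @ (p - y) / len(y)
    return cost, grad

# Made-up data: 100 examples, a bias column of ones plus 2 features.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])
y = (X[:, 1] + X[:, 2] > 0).astype(float)

# No learning rate to tune: L-BFGS picks its step sizes adaptively.
result = minimize(cost_and_grad, np.zeros(X.shape[1]),
                  args=(X, y), jac=True, method="L-BFGS-B")
theta_opt = result.x
```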

2.4 Over/Underfitting and Regularization

If the regression line is far away from the data points, then the predictions it makes
will apparently be abysmal. This is a case of underfitting, where we have
a high bias. On the other hand, having a meandering regression curve that
dovetails all the data points in the training set is equally undesirable, because
it is so specialized to the training set that we cannot expect it to generalize in any decent capacity to new data sets. This is an overfitting, or high
variance, case.
We first address overfitting. Overfitting is usually cued by a close-to-zero $J(\theta)$
(i.e. the cost), and usually sets in when we have a large number of features (e.g.
100). We have the following two options to ameliorate the problem:
Reduce the number of features:
  Manually select some features to keep.
  Use a model selection algorithm (later on).
Regularization.
In the interest of keeping as much information available as we can, regularization
is usually preferred. Essentially, the idea is to reduce the magnitude of the
weights $\theta$. This is how it is done: say we have a model $y = \theta_0 + \theta_1 x + \theta_2 x^2$, and we would
like to reduce the weight $\theta_2$ in order to reduce the contribution of the third
variable. We add a regularization term to the cost function $Cost(h_\theta(x), y)$
(which looks to find the argmin over $\theta$, i.e. the weights that minimize the cost), and
set for it a large coefficient such that the minimizing weights (i.e. $\theta_2$ here) are
forced to be small. In mathematical terms (taking the cost function of linear
regression for an instance):
\[ Cost(h_\theta(x^{(i)}), y^{(i)}) = \min_\theta \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 + Coef \cdot \theta_2^2 \tag{2.9} \]

In general, we do not have to specifically target certain weights. A better (and
simpler) strategy is to regularize all weights. In this way, our model will be less
prone to overfitting. Graphically, the regression curve will be smoother (thus
less variance). Therefore, in general, our regularized cost function will be of
the following form (for the linear regression cost function)^8:

\[ J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{p} \theta_j^2 \right] \tag{2.10} \]

Apparently, the greater the regularization coefficient $\lambda$ is, the smaller the contribution of the
variables. Note that we do not want to over-regularize, in which case
the model will simply be close to $y = \text{intercept}$ (i.e. a flat line: underfitting).
With regularization, our new gradient descent algorithm will be as follows:
repeat until convergence {

\[ \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)} \tag{2.11} \]

\[ \theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right] \tag{2.12} \]

}
To simplify the equation:

\[ \theta_j := \theta_j \left(1 - \alpha \frac{\lambda}{m}\right) - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \tag{2.13} \]

Usually, the term $1 - \alpha \frac{\lambda}{m}$ is slightly smaller than 1, such that on each
iteration of the loop, the weights are reduced slightly.
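A sketch of the regularized updates (2.11)-(2.12): only the gradient changes relative to the unregularized version, and the intercept (the first column of the design matrix) is left out of the penalty:

```python
import numpy as np

def regularized_gradient_descent(X, y, lam=1.0, alpha=0.01, n_iters=1000):
    """Eqs. (2.11)-(2.12); theta[0] (the bias) is not regularized."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / m
        grad[1:] += (lam / m) * theta[1:]   # penalty on all weights but theta_0
        theta = theta - alpha * grad        # equivalently the shrinkage (2.13)
    return theta
```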

Finally, the normal equation (i.e. the analytical alternative to gradient descent)
also comes with a regularized form:

\[ \theta = \left( X^T X + \lambda \begin{bmatrix} 0 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} \right)^{-1} X^T y \tag{2.14} \]

where the matrix is $(n+1) \times (n+1)$.

^8 $\lambda$ here is the regularization coefficient, and by convention, we do not regularize the intercept weight.

3 Neural Networks

Why yet another model? Consider the case where we have a large number of
features/variables, and we would like to build a nonlinear classification model.
In this case, generating polynomial variables out of the original variables leads to
a proliferation problem. With such a large number of variables, the model
will soon become insurmountably complicated.

3.1 Model Representation and Forward Propagation

A simple three-layered neural network is represented as follows:

[figure omitted: a three-layer neural network]

The outputs from layer 1 to layer 2 are indexed as $z_1^{(2)}$, $z_2^{(2)}$, and
$z_3^{(2)}$. These are simply linear combinations of the input variables (i.e. x). For
the first layer, we simply define $a_i^{(1)} = x_i$. As for the bias units, we always
set them to 1 by convention. This computation from inputs to outputs is called
forward propagation.

To mechanize the operations involved in forward propagation, we now look
at an example neural network which has two hidden layers.

[figure omitted: a four-layer neural network with two hidden layers]

From the architecture of the network, we can tell that the classification model
here is of the form $y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3$, where every output y is a 4-class
value. For illustrative purposes, assume we only have one training example,
which is a 3-by-1 vector. The forward propagation therefore goes through the
following pipeline^9:

a^{(1)} = x,  add bias x_0                          (dim: 4x1)
z^{(2)} = \Theta^{(1)} a^{(1)}                      (dim: 5x4 * 4x1 = 5x1)
a^{(2)} = sigmoid(z^{(2)}),  add bias a_0^{(2)}     (dim: 6x1)
z^{(3)} = \Theta^{(2)} a^{(2)}                      (dim: 5x6 * 6x1 = 5x1)
a^{(3)} = sigmoid(z^{(3)}),  add bias a_0^{(3)}     (dim: 6x1)
z^{(4)} = \Theta^{(3)} a^{(3)}                      (dim: 4x6 * 6x1 = 4x1)
a^{(4)} = h_\Theta(x) = sigmoid(z^{(4)})            (dim: 4x1)

Going back to the motivation for proposing the use of neural networks: with
an extra layer of variables (or several such layers), we are now able to incorporate the
complex relationships among variables, and are saved the trouble of generating
polynomial variables, which significantly slows down the computation.
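The pipeline above generalizes to any number of layers; a sketch in NumPy, using the $s_{j+1} \times (s_j + 1)$ weight-matrix convention of footnote 9:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Thetas):
    """Forward propagation: x is the input vector (without its bias entry),
    Thetas is a list of weight matrices, Thetas[l] of shape
    s_(l+2) x (s_(l+1) + 1), matching the convention in the text."""
    a = x
    for Theta in Thetas:
        a = np.concatenate(([1.0], a))   # add the bias unit, fixed to 1
        a = sigmoid(Theta @ a)           # z = Theta a, then squash
    return a                             # = h_Theta(x), one value per class
```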

3.2 Cost Function and Backward Propagation

The cost function for a neural network is basically a glorified version of the logistic
regression cost function. Concretely, instead of having only one classification
cost^10, we now have a sum of K classification costs when there are K classes
involved. The cost function for a neural network is described as follows:

\[ J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log(h_\Theta(x^{(i)}))_k + (1 - y_k^{(i)}) \log(1 - (h_\Theta(x^{(i)}))_k) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (\Theta_{ji}^{(l)})^2 \tag{3.1} \]

K here is the number of classes, corresponding to the last layer
of our neural network. Since we have K classes in the neural network, naturally
we are summing up K logistic regressions. The regularization term basically
adds up all the $\Theta$s in the network.

^9 Recall that the dimensions of the $\Theta$s are equal to $s_{layer\,j+1} \times (s_{layer\,j} + 1)$.
^10 As the logistic regression model is always a binary classification.
With this in mind, we now move on to backpropagation. Backpropagation computes the magnitude of the error resulting from the forward
propagation operation given certain weights. Previously (cf. section 1.1), in the
context of linear/logistic regression, we defined the error as the deviation of
our prediction^11 from the actual data. In a neural network, the error arises not
only at the output step, as is the case in linear/logistic regression, where
the inputs are directly linked to the outputs. Rather, the error in a neural
network pops up in each layer of computation (corresponding to each layer of
neurons). It is also crucial to note that the error in each layer contributes to the
error in the next layer. In other words, the error propagates through the computation
of forward propagation. As the only observable error is between the output (at
the last layer of the neural network) and the actual data, we need to go backwards in evaluating the errors in previous layers. Informally, the computation
of errors in a neural network goes as follows:
Compute the error of the output (i.e. the difference between the output
and the actual data).
Compute the error in layer l by:
  Multiplying the error in layer l + 1 by the weights between layer l
  and layer l + 1;
  Pairwise multiplying the result from the last step with the direction
  of error^12 at layer l (i.e. the derivative of the output at layer l; the z
  terms).
Adjust the weights between each pair of layers by adding the product of the activation (i.e. sigmoid of the output from the emitting/left layer l) and
the error at that layer to a D^13 term.
^11 In the current context, the predicted classes of the data.
^12 That is, whether the weight should increment or decrement. Or, graphically, whether we
are on an uphill slope, or a downhill one.
^13 Short for deviation, which can be mathematically proved to be equal to $\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)$
(i.e. the derivative of the cost).

If this all sounds awfully abstract, we now put the backpropagation algorithm
in mathematically mechanical terms. We notate the first observable error (i.e.
the difference between the output at the last layer and the actual data) with $\delta$.
Taking the example neural network in section 3.1 for an instance, this error is
at layer 4: $\delta^{(4)} = y - a^{(4)}$. With $\delta^{(4)}$ as our starting point, we now rephrase
the application of the algorithm to our current neural network as follows:
Compute the error of the output:
  $\delta^{(4)} = y - a^{(4)}$
Compute the errors in previous layers (i.e. layers 3 and 2):
  $\delta^{(3)} = (\Theta^{(3)})^T \delta^{(4)} \; .\!* \; g'(z^{(3)})$
  $\delta^{(2)} = (\Theta^{(2)})^T \delta^{(3)} \; .\!* \; g'(z^{(2)})$
Compute the adjustment term $\frac{\partial}{\partial \Theta} J(\Theta)$:
  Set all $\Delta^{(l)} = 0$, where l = 1, 2, 3:
  $\Delta^{(3)} := \Delta^{(3)} + a^{(3)} \delta^{(4)}$ ^14
  $\Delta^{(2)} := \Delta^{(2)} + a^{(2)} \delta^{(3)}$
  $\Delta^{(1)} := \Delta^{(1)} + a^{(1)} \delta^{(2)}$
Regularization:
  $D^{(l)} = \frac{1}{m} \Delta^{(l)} + \lambda \Theta^{(l)}$  if $j \neq 0$
  $D^{(l)} = \frac{1}{m} \Delta^{(l)}$  if $j = 0$
Adjust weights (gradient descent):
  $\Theta := \Theta - \alpha \frac{\partial}{\partial \Theta} J(\Theta)$ ^15

^14 In general, $\Delta^{(l)} := \Delta^{(l)} + a^{(l)} \delta^{(l+1)}$.
^15 In practice, if you are not an expert in numerical computing, just use an off-the-shelf
optimization algorithm such as L-BFGS, etc.
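The same steps in code, as a batch sketch. One caveat: the text defines $\delta^{(4)} = y - a^{(4)}$, while the sketch below uses the more common sign convention $a^{(L)} - y$, so that the returned D terms can be plugged directly into $\Theta := \Theta - \alpha D$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(X, Y, Thetas, lam):
    """Batch backpropagation for the cost (3.1).

    X: (m, n) inputs; Y: (m, K) one-hot labels; Thetas: list of weight
    matrices, Thetas[l] of shape s_(l+2) x (s_(l+1) + 1).
    Returns the gradients D^(l) with respect to each Theta^(l).
    """
    m = len(X)
    Deltas = [np.zeros_like(T) for T in Thetas]
    for x, y in zip(X, Y):
        # Forward pass; every layer but the output carries a bias unit.
        activations = [np.concatenate(([1.0], x))]
        for Theta in Thetas[:-1]:
            a = sigmoid(Theta @ activations[-1])
            activations.append(np.concatenate(([1.0], a)))
        activations.append(sigmoid(Thetas[-1] @ activations[-1]))

        # Backward pass: output error, then propagate it backwards.
        deltas = [activations[-1] - y]
        for l in range(len(Thetas) - 1, 0, -1):
            a = activations[l]                           # has a bias unit
            d = (Thetas[l].T @ deltas[0]) * a * (1 - a)  # g'(z) = a(1 - a)
            deltas.insert(0, d[1:])                      # drop the bias slot
        for l in range(len(Thetas)):
            Deltas[l] += np.outer(deltas[l], activations[l])

    # Average and regularize (the bias column is left unregularized).
    grads = []
    for Delta, Theta in zip(Deltas, Thetas):
        D = Delta / m
        D[:, 1:] += (lam / m) * Theta[:, 1:]
        grads.append(D)
    return grads
```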

4 Evaluating Learning Algorithms

Previously (cf. section 2.4), we learned that if we have a linear regression model
whose features/variables are of very low polynomial order, then it has high bias and
is underfitting the data. On the other end of the spectrum, if our model has
high-order polynomial features/variables, and performs poorly in generalizing to
new data sets, then it is said to have high variance and overfit the data. But
how do we find the sweet spot where the model strikes a nice balance between
data-fitting (i.e. the training data) and data-generalizing?
The simplest way to do a diagnostic is by first partitioning the entire data set
into i) a training set, ii) a development set, and iii) a test set, and then applying
the trained model on the development set to come to an evaluation. Concretely,


we plot a two-dimensional graph of the degree of polynomial against the magnitude of error. By observing the curves plotted, we will
be able to decide whether we have a bias problem or a variance problem.
Let us resort to intuition first, and consider what the error level will be like in the bias
case and the variance case as the degree of polynomial increases. For the
training set, obviously the error will decrease (asymptotically) with the increase
of the degree of polynomial^16. However, when we move to the development set,
the error curve will no longer decrease monotonically. Specifically, with a low
degree of polynomial, the errors on both the training set and the development
set will be pretty high (with the one on the development set being higher^17),
because the predictors are not allowed much predictive and data-adaptive
power. On the other hand, with a high degree of polynomial, the error will
be low on the training set and high on the development set. Therefore we know
that the magical degree of polynomial must lie between the two
extremes. Concretely, the graph we propose here will typically be as follows:

[figure omitted: training and development error against the degree of polynomial]

In the graph, the lowest-error point on the curve on top will then be the magical
degree of polynomial we are looking for. Now, a diagnostic is
possible:
Bias case: The errors on the training set and the development set are both
high, and the error difference between the two data sets is low.
Variance case: The error is low on the training set and high on the development set, and the error difference between the two data sets is high.
Alternatively, we may also take the error as a function of $\lambda$ (i.e. the regularization coefficient) to decide between the bias/variance cases. Plotting the error
against $\lambda$, we will have a graph similar to the following:

[figure omitted: training and development error against the regularization coefficient]
^16 Intuitively, the higher the degree of polynomial in the model, the more oscillations in
the regression curve will be possible, and thus the curve will be able to adapt to the data
better.
^17 Because the model is trained on the training set, after all.

As before, the lower curve is for the training set and the upper curve for the development set^18. We observe that, with a lower level of regularization, the training
error is relatively lower than with a higher level of regularization, because we allow
the predictors to make more of a contribution in the former case. The development
error, however, forms a concave curve. Applying the analysis from the degree-of-polynomial situation, we know that a large difference between training and
development error at a small $\lambda$ indicates a high variance problem, and on the
other end of the curve, where the error difference is small, we have a high bias
problem.
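A sketch of this diagnostic loop; fit and error are hypothetical helpers standing in for whichever training routine and unregularized error measure are in use:

```python
def lambda_diagnostic(X_tr, y_tr, X_dev, y_dev, fit, error):
    """Trace training/development error as a function of lambda.

    fit(X, y, lam) -> theta and error(theta, X, y) -> float are
    assumed to be supplied by the caller (hypothetical helpers).
    """
    for lam in [0, 0.01, 0.03, 0.1, 0.3, 1, 3, 10]:
        theta = fit(X_tr, y_tr, lam)
        print(f"lambda={lam}: train={error(theta, X_tr, y_tr):.4f}, "
              f"dev={error(theta, X_dev, y_dev):.4f}")
```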
Finally, to find out whether increasing the size of the training set would help improve the performance of our model, we may plot the number of training data
points against the error. In general, the training error goes up as the training set
expands, for the simple reason that it gets more difficult to have a curve (with
its oscillatoriness/adaptivity fixed by the model) that caters to all the data
points^19.
^18 Crucially, note that we are assuming that the weights are properly selected such that they
minimally reduce the error.
^19 Consider the case where one has 1, 2, or 3 data points: a quadratic curve will be able to
fit all the data points perfectly. This no longer necessarily holds true with a larger number of
data points.

With this in mind, let us consider what the error curves for training and development will look like in the high bias vs. high variance cases. First
consider the high bias case. If a model has high bias, graphically the regression
line will be less curvy. In the extreme case of having only one predictor, the
regression line is a straight line. As the size of the training set gets larger, the
error in both training and development increases, because there are now more
squared errors to include in computing the overall error. The give-away of a high
bias situation, however, lies in the general level of error in both the training and
the development sets, which will be relatively high. Consider the case of a straight
regression line: due to its lack of oscillation, it will still run
through the data-point cloud with a large training set just as it does with a small
training set. Thus, in this case, adding more data will not improve the predictive power of the model. A graphical representation of the discussion is as
follows^20:

[figure omitted: learning curves for the high bias case]

Then consider the high variance scenario. One aspect the high variance case
has in common with the high bias case is that the overall error increases as the
training/development set gets larger. However, given the meandering curviness of the high variance regression curve, it will have a relatively lower error
rate than its high bias counterpart on the training set. As one might expect, the
high variance curve performs as underwhelmingly as the high bias curve on the
development set, although the error climbs down as the size of the development
set goes up. The distinguishing feature of the high variance case is the big gap
between the error curves of the training and the development sets. This coincides
with our intuition that a high variance curve is one that does well in training and falls off the cliff when it goes outside to meet new data sets.
Graphically, the high variance case looks like the following^21:

[figure omitted: learning curves for the high variance case]
^20 Citation: Stanford ML, A. Ng, https://www.coursera.org/learn/machine-learning/lecture/Kont7/learning-curves.
^21 Citation: Stanford ML, A. Ng, https://www.coursera.org/learn/machine-learning/lecture/Kont7/learning-curves.

Crucially, note that in the high variance case, getting more training data actually may help improve the performance of the model. It is likely that the error
curve for the development set dips down further with a larger data set, because a
high variance model simply adapts to data better than a high bias model.
In summary, upon a high bias/variance diagnosis, we may take remedial actions
accordingly:
High Bias:
  Add more features/variables.
  Add polynomial features/variables.
  Decrease the regularization coefficient $\lambda$.
High Variance:
  Increase the size of the training set.
  Drop some features/variables.
  Increase the regularization coefficient $\lambda$.
