A Tutorial on Machine Learning

Su Wang

Winter 2016
1 Linear Regression

1.1 The Model

A linear regression model (or hypothesis) predicts an output y from input variables x_1, ..., x_p as

h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_p x_p    (1.1)
The \theta_0 here is called a bias, which is introduced to handle special situations where adjusting the weights alone cannot produce a reasonable prediction. \theta_1, ..., \theta_p are weights, which indicate the relative contribution each input variable makes. A proper weight combination gives sharp predictions, and it is this weight combination that our learning algorithms learn.
Then how do we find the weights? Basically, the weights in (1.1), together with the bias term, define a fitting/regression line^1. The line that enables us to make the best prediction possible is the one that is as close as possible to all the data points (i.e. the (x, y) pairs). In mathematical terms, the regression line is under the constraint that the magnitude at which it deviates from the data points is minimized. Call this deviation error. Error is defined as follows^2:
J(\theta_1, \ldots, \theta_p) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2    (1.2)

= \frac{1}{2m} \sum_{i=1}^{m} \left( \theta^T x^{(i)} - y^{(i)} \right)^2    (1.3)

Our learning objective is then

\hat{\theta} = \arg\min_{\theta} J(\theta_1, \ldots, \theta_p)    (1.4)
^1 Note that the regression line is not necessarily a straight line. The essential difference between a linear model and a nonlinear model is whether the regression line produced has a constant rate of change!
^2 \frac{1}{2m} is so chosen for i) the convenience of differentiation; ii) obtaining the average error.
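To make (1.2) concrete, here is a minimal numpy sketch; the function names and the toy data are ours, not part of the original text:

    import numpy as np

    def cost(theta, X, y):
        """Squared-error cost J(theta) as in (1.2)-(1.3).
        X is m x (p+1) with a leading column of 1s for the bias theta_0."""
        residuals = X @ theta - y
        return (residuals @ residuals) / (2 * len(y))

    # Toy example: a perfect fit gives J = 0.
    X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    y = np.array([1.0, 2.0, 3.0])
    print(cost(np.array([0.0, 1.0]), X, y))  # 0.0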
1.2 Gradient Descent
Essentially, our task here is to find the set of \theta_j (i.e. weights; j = 1, ..., p) such that we have a regression line (or hypothesis) that makes the best prediction of y for some new x. Such a regression line is defined as one that has the least J(\theta_1, \ldots, \theta_p) possible (i.e. cost, or deviation from the training data points). Based on this knowledge, we introduce a weight adjustment term:

repeat until convergence {
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_1, \ldots, \theta_p)    (1.5)
= \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}    (1.6)
}
Here we can see that if we are on an uphill slope (i.e. \frac{\partial J}{\partial \theta_j} > 0), the weight is reduced; otherwise it is increased. \alpha here is called a learning rate. We use it to fine-tune the magnitude of the step at which we would like to adjust the weights^3. In each iteration, we simultaneously update all the \theta's. In implementation terms, we compute all the new weights, put them in temporary variables, and then assign them to the weight variables at the same time. This type of simultaneous updating is also called batch updating.
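In code, the batch update might look like the following sketch; grad_j and the toy data are our own illustrative names:

    import numpy as np

    def grad_j(theta, X, y, j):
        """Partial derivative of J w.r.t. theta_j, cf. (1.7)-(1.8)."""
        return ((X @ theta - y) @ X[:, j]) / len(y)

    # Toy data: a bias column of 1s plus one input variable.
    X = np.array([[1.0, 1.0], [1.0, 2.0]])
    y = np.array([2.0, 3.0])
    theta = np.array([0.0, 0.0])
    alpha = 0.1

    # Batch update: every new weight is computed from the OLD weights,
    # held in temporaries, and only then assigned -- all at the same time.
    temps = [theta[j] - alpha * grad_j(theta, X, y, j) for j in range(len(theta))]
    theta[:] = temps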
It turns out that to obtain the weight adjustment terms, we do not have to do the differentiation for each weight j by hand. In other words, the result in (1.5)-(1.6) is generalizable:
Adjust the bias:
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)    (1.7)

Adjust the other weights:
\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}    (1.8)
^3 If the step is too large, we may overshoot; otherwise, undershoot. In the former case, we will be bouncing between hillsides and will not efficiently reach a valley; in the latter case, it will take too long for us to do so.
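Putting (1.7)-(1.8) together, a complete gradient descent loop might look like this sketch; the convergence tolerance and the toy data are illustrative assumptions of ours:

    import numpy as np

    def gradient_descent(X, y, alpha=0.1, tol=1e-9, max_iter=100_000):
        """Batch gradient descent for linear regression, implementing (1.7)-(1.8).
        X is m x (p+1), with a leading column of 1s standing in for the bias."""
        theta = np.zeros(X.shape[1])
        for _ in range(max_iter):
            grad = X.T @ (X @ theta - y) / len(y)  # all partial derivatives at once
            new_theta = theta - alpha * grad       # simultaneous (batch) update
            if np.max(np.abs(new_theta - theta)) < tol:
                return new_theta                   # converged
            theta = new_theta
        return theta

    X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    y = np.array([1.0, 2.0, 3.0])
    print(gradient_descent(X, y))  # approximately [0, 1]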
1.3 Remnant Issues

1.3.1 Normalization
One way to boost the performance of the gradient descent algorithm is normalization. Normalized data yield a circular/spherical-shaped gradient descent (hyper)surface, which requires fewer oscillations before convergence. The normalization is done as follows:
x_{norm} = \frac{x - \bar{x}}{sd}    (1.9)
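A sketch of (1.9) in numpy, assuming the usual column-wise sample mean and standard deviation:

    import numpy as np

    def normalize(X):
        """Column-wise normalization as in (1.9): subtract the mean, divide by sd."""
        mu = X.mean(axis=0)
        sd = X.std(axis=0)
        return (X - mu) / sd, mu, sd

    # Toy design matrix (values are ours).
    X = np.array([[2000.0, 3.0], [1600.0, 2.0], [2400.0, 4.0]])
    X_norm, mu, sd = normalize(X)

Keeping mu and sd around matters: new data must be transformed with the same statistics the model was trained under.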
1.3.2 Learning Rate

We look at the following indicators for the learning rate \alpha being too small or too large:
Small \alpha: The gradient descent algorithm takes long to converge, and the decrease of the cost J(\theta) is minuscule (10^{-3}, for instance).

Large \alpha: The cost does not monotonically decrease at each step of gradient descent.

Monitoring the behavior of the gradient descent algorithm, we will be able to pinpoint a sweet spot for \alpha.
1.3.3 Choice of Features
Sometimes a straight-line linear regression does not fit the data well. By observing the data, we may find that a loglinear or a power-3 curve fits the scattering better. In these cases, we can create new variables in order to fine-tune the shape of our fitting curve to do a better job^4, as sketched below.
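For instance, a power-3 curve can be had by adding x^2 and x^3 as new variables and running the same linear machinery as before; a hypothetical sketch:

    import numpy as np

    def cubic_features(x):
        """Build [1, x, x^2, x^3] columns from a single input variable.
        The model stays linear in the weights, but the fitted curve is cubic."""
        x = np.asarray(x, dtype=float)
        return np.column_stack([np.ones_like(x), x, x**2, x**3])

    X = cubic_features([1.0, 2.0, 3.0, 4.0])  # ready for the machinery above

Note that normalization (section 1.3.1) becomes all the more important here, since x^3 ranges over a much larger scale than x.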
1.3.4 Normal Equation

An analytical alternative to gradient descent is to solve for the weights directly with the normal equation:

\theta = (X^T X)^{-1} X^T y    (1.10)
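Computed directly, (1.10) is a one-liner; np.linalg.solve is used here instead of an explicit matrix inverse for numerical stability (a choice of ours, not mandated by the text):

    import numpy as np

    def normal_equation(X, y):
        """Solve (1.10) without explicitly forming the inverse of X^T X."""
        return np.linalg.solve(X.T @ X, X.T @ y)

    X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    y = np.array([1.0, 2.0, 3.0])
    print(normal_equation(X, y))  # [0, 1], matching gradient descent above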
However, this does not mean that our gradient descent algorithm can thereby be replaced. The following table shows the strengths/weaknesses of the two approaches:

Gradient Descent             Normal Equation
Need to choose \alpha        No need to choose \alpha
Needs many iterations        No need to iterate
Works well with large n      Slow with large n
A rule of thumb is to take 10^4 as the threshold.
^4 But not dovetailing the data, because in that case we will have an overfitting problem (discussed later on).
2 Logistic Regression

2.1 The Sigmoid Function
In a typical classification problem (in its simplest form), the dependent variable comes in two classes. Our task, then, is to predict the class of the dependent variable given an independent variable.

On the surface, it seems that applying linear regression to such tasks would be reasonable; however, this causes two major issues:

Linear regression may give out-of-bound values that are not easily interpreted.

Linear regression's prediction can easily be influenced by outliers, which causes a major shift of the regression line.
While the outlier sabotage effect cannot be easily avoided^5, the out-of-bound problem can be fixed by mapping the range of the prediction to, for instance, 0 to 1, if so desired^6. Specifically, we apply a method called a sigmoid conversion to achieve our goal. Say that our model is h_\theta(x) = \theta^T x; the sigmoid conversion of it will be as follows:

g(h_\theta(x)) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}    (2.1)
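A minimal sketch of the sigmoid conversion (2.1):

    import numpy as np

    def sigmoid(z):
        """g(z) = 1 / (1 + e^{-z}): squashes any real score into (0, 1)."""
        return 1.0 / (1.0 + np.exp(-z))

    def h(theta, X):
        """Logistic hypothesis (2.1): the sigmoid of the linear score theta^T x."""
        return sigmoid(X @ theta)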
2.2 Cost Function
We will define the cost function for logistic regression slightly differently here. To engineer our cost function (of the hypothesis h_\theta(x)) such that it has the desired properties, we put it as follows:

Cost(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}    (2.2)
Putting the cost function this way, we will have the following relationship between the hypothesis/prediction and the cost, which is represented with a graphic:

[Figure: the cost plotted against h_\theta(x), for y = 1 and y = 0.]

^5 Sometimes we use other means than model choice to tackle this problem, for instance setting up a bar to identify and drop outliers.
^6 -1 to 1 is sometimes desired too. The [0, 1] interval, however, enables a probabilistic interpretation, and is in certain contexts more intuitive and fitting.
As we can see, when y = 0 (i.e. the actual class is 0), if the hypothesis predicts
0 also (approaching 0), the cost will be 0. On the other hand, if the prediction
is approaching 1, which is incorrect, the cost will approach infinity. It is readily
checked that we also have this desired property when y = 1.
To integrate the two parts in (2.2) into one single cost function, we put down the following:

Cost(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))    (2.3)
It can be seen that (2.2) and (2.3) are equivalent: when y = 1, the second term in (2.3) is equal to 0; when y = 0, the first term is equal to 0. Finally we have the generalized form of the cost function for logistic regression:
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} Cost(h_\theta(x^{(i)}), y^{(i)})    (2.4)

= -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]    (2.5)
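In code, (2.5) might be computed as follows; the epsilon guard against log(0) is our addition, not part of the text:

    import numpy as np

    def logistic_cost(theta, X, y, eps=1e-12):
        """Cross-entropy cost J(theta) as in (2.5)."""
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))  # h_theta(x) for every example
        p = np.clip(p, eps, 1.0 - eps)          # guard the logs against 0 and 1
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))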
2.3 Gradient Descent
The gradient descent algorithm for logistic regression is almost identical to that for linear regression:

repeat until convergence {
\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}    (2.6)
}
However, note that the hypothesis h_\theta(x) is defined differently for linear regression and logistic regression:

h_\theta(x) = \theta^T x \quad \text{(linear regression)}    (2.7)

h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}} \quad \text{(logistic regression)}    (2.8)
Note that gradient descent is not the only game in town. The following alternatives are available:

Conjugate Gradient
BFGS
L-BFGS

These advanced algorithms are usually faster than base gradient descent, and you do not need to manually pick a learning rate (it is picked adaptively). Therefore, they converge much faster.
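In practice one rarely implements these by hand. As one illustration (our choice of library, not the text's), scipy exposes L-BFGS through scipy.optimize.minimize; supplying the analytic gradient alongside the cost speeds things up:

    import numpy as np
    from scipy.optimize import minimize

    def cost_and_grad(theta, X, y, eps=1e-12):
        """Logistic cost (2.5) and its gradient, in one call."""
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))
        pc = np.clip(p, eps, 1.0 - eps)
        cost = -np.mean(y * np.log(pc) + (1 - y) * np.log(1 - pc))
        grad = X.T @ (p - y) / len(y)
        return cost, grad

    X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, -0.3]])
    y = np.array([1.0, 0.0, 1.0, 0.0])
    # jac=True tells scipy the function returns (cost, gradient);
    # L-BFGS-B chooses its own step sizes -- no learning rate to tune.
    result = minimize(cost_and_grad, np.zeros(X.shape[1]), args=(X, y),
                      jac=True, method="L-BFGS-B")
    theta_opt = result.x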
2.4 Overfitting and Regularization
If the regression line is far away from the data points, then the predictions it makes will apparently be abysmal. This is a case of underfitting, where we have high bias. On the other hand, having a meandering regression curve that dovetails all the data points in the training set is equally undesirable: because it is so specialized to the training set, we cannot expect it to generalize in any decent capacity to new data sets. This is an overfitting, or high variance, case.

We first address overfitting. Overfitting is usually cued by a close-to-zero J(\theta) (i.e. the cost), and usually sets in when we have a large number of features (e.g. 100). We have the following two options to ameliorate the problem:
Reduce the number of features:
  - Manually select some features to keep
  - Model selection algorithm (later on)

Regularization
In the interest of keeping as much information available as we can, regularization is usually preferred. Essentially, the idea is to reduce the magnitude of the weights \theta. This is how it's done: say we have a model y = \theta_0 + \theta_1 x + \theta_2 x^2, and we would like to reduce the weight \theta_2 in order to reduce the contribution of the third variable. We add a regularization term to the cost function (for which we seek the argmin, i.e. the weights that minimize the cost), and set for it a large coefficient such that the minimizing weights (i.e. \theta_2 here) are pushed toward zero:
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + Coef \cdot \theta_2^2    (2.9)

Generalizing, with \lambda as the regularization coefficient^8:

J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{p} \theta_j^2 \right]    (2.10)
With regularization, the gradient descent updates become:

repeat until convergence {
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}    (2.11)

\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]    (2.12)
}
To simplify the equation:

\theta_j := \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}    (2.13)
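A sketch of one update step implementing (2.11)-(2.13), written here for the linear regression hypothesis; swapping in the sigmoid h gives the logistic version:

    import numpy as np

    def regularized_step(theta, X, y, alpha, lam):
        """One regularized gradient step, cf. (2.11)-(2.13): every weight first
        shrinks by (1 - alpha*lam/m); the bias theta_0 is left unregularized."""
        m = len(y)
        grad = X.T @ (X @ theta - y) / m            # unregularized gradient
        new_theta = theta * (1 - alpha * lam / m) - alpha * grad
        new_theta[0] = theta[0] - alpha * grad[0]   # (2.11): no shrinkage on theta_0
        return new_theta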
Finally, the normal equation (i.e. the analytical alternative to gradient descent) also comes with a regularized form:
\theta = \left( X^T X + \lambda \begin{bmatrix} 0 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} \right)^{-1} X^T y

^8 \lambda here is the regularization coefficient, and by convention, we do not regularize the intercept weight \theta_0.
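And the regularized normal equation as a sketch:

    import numpy as np

    def regularized_normal_equation(X, y, lam):
        """Regularized closed form: the matrix added to X^T X is the identity
        with its top-left entry zeroed, so the bias is not regularized."""
        D = np.eye(X.shape[1])
        D[0, 0] = 0.0
        return np.linalg.solve(X.T @ X + lam * D, X.T @ y)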
3 Neural Networks
Why yet another model? Consider the case where we have a large number of features/variables, and we would like to build a nonlinear classification model. In this case, generating polynomial variables out of the original variables gives us a proliferation problem: with such a large number of variables, the model soon becomes insurmountably complicated.
3.1 Forward Propagation

[Figure: the example neural network architecture.]
The outputs are indexed as, for the ones from layer 1 to layer 2, z_1^{(2)}, z_2^{(2)}, and z_3^{(2)}. These are simply linear combinations of the input variables (i.e. x). For the first layer, we simply define a_i^{(1)} = x_i. As for the bias units, we always set them to 1 by convention. This computation from inputs to outputs is called forward propagation.
From the architecture of the network, we can tell that the classification model here is of the form y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3, where every output y is a 4-class value. For illustrative purposes, assume we only have one training example, which is a 3-by-1 vector. The forward propagation therefore goes through the following pipeline:

a^{(1)} = x (with the bias unit added: 4x1)
z^{(2)} = \Theta^{(1)} a^{(1)}, a^{(2)} = g(z^{(2)})    dim: 5x4 * 4x1 = 5x1 (6x1 with bias)
z^{(3)} = \Theta^{(2)} a^{(2)}, a^{(3)} = g(z^{(3)})    dim: 5x6 * 6x1 = 5x1 (6x1 with bias)
z^{(4)} = \Theta^{(3)} a^{(3)}, h_\Theta(x) = a^{(4)} = g(z^{(4)})    dim: 4x6 * 6x1 = 4x1
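A sketch of this forward pass with randomly initialized weight matrices; the weights and the input are placeholder values of ours, matching the dimensions above:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def add_bias(a):
        """Prepend the bias unit, which by convention is always 1."""
        return np.vstack([[1.0], a])

    rng = np.random.default_rng(0)
    Theta1 = rng.standard_normal((5, 4))  # layer 1 -> layer 2
    Theta2 = rng.standard_normal((5, 6))  # layer 2 -> layer 3
    Theta3 = rng.standard_normal((4, 6))  # layer 3 -> output layer

    x = rng.standard_normal((3, 1))       # one 3-by-1 training example
    a1 = add_bias(x)                      # 4x1
    a2 = add_bias(sigmoid(Theta1 @ a1))   # 5x4 * 4x1 = 5x1, 6x1 with bias
    a3 = add_bias(sigmoid(Theta2 @ a2))   # 5x6 * 6x1 = 5x1, 6x1 with bias
    h  = sigmoid(Theta3 @ a3)             # 4x6 * 6x1 = 4x1: one score per class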
Going back to the motivation for proposing the use of neural networks: with an extra layer of variables (or several such layers), we are now able to incorporate the complex relationships among variables, and are saved the trouble of generating polynomial variables, which significantly slows down the computation.
3.2 Cost Function
The cost function for a neural network is basically a glorified version of the logistic regression cost function. Concretely, instead of having only one classification cost, we now have a sum of K classification costs when there are K classes involved. The cost function for a neural network is described as follows:
J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log(h_\Theta(x^{(i)}))_k + (1 - y_k^{(i)}) \log(1 - (h_\Theta(x^{(i)}))_k) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (\Theta_{ji}^{(l)})^2    (3.1)
If this all sounds awfully abstract, we now put the backpropagation algorithm in mathematically mechanical terms. We notate the first observable error (i.e. the difference between the output at the last layer and the actual data) with \delta. Taking the example neural network in section 3.1 for an instance, this error is at layer 4: \delta^{(4)} = y - a^{(4)}. With \delta^{(4)} as our starting point, we now rephrase the application of the algorithm to our current neural network as follows:

Compute the error of the output:
\delta^{(4)} = y - a^{(4)}

Compute the errors in the previous layers (i.e. layers 3 and 2):
\delta^{(3)} = (\Theta^{(3)})^T \delta^{(4)} \mathbin{.*} g'(z^{(3)})
\delta^{(2)} = (\Theta^{(2)})^T \delta^{(3)} \mathbin{.*} g'(z^{(2)})
Compute the adjustment term \frac{\partial}{\partial \Theta^{(l)}} J(\Theta), where \Delta^{(l)} accumulates \delta^{(l+1)} (a^{(l)})^T over the training examples:

D^{(l)} = \frac{1}{m} \Delta^{(l)} + \lambda \Theta^{(l)} \quad \text{if } j \neq 0
D^{(l)} = \frac{1}{m} \Delta^{(l)} \quad \text{if } j = 0

\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)}
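Continuing the forward-pass sketch from section 3.1 (a1, a2, a3, h and the Theta matrices as defined there), the backward pass for that single example might look as follows; with m = 1, D^{(l)} is just \Delta^{(l)} plus the regularization term on the non-bias entries. The label y_vec is a hypothetical one-hot vector of ours:

    # Output error, following the text's sign convention: delta(4) = y - a(4).
    y_vec = np.array([[1.0], [0.0], [0.0], [0.0]])
    delta4 = y_vec - h                                            # 4x1

    # Propagate backwards. For the sigmoid, g'(z) = g(z)(1 - g(z)) = a(1 - a);
    # the [1:] strips the bias row, which receives no incoming error.
    delta3 = (Theta3.T @ delta4)[1:] * (a3[1:] * (1 - a3[1:]))    # 5x1
    delta2 = (Theta2.T @ delta3)[1:] * (a2[1:] * (1 - a2[1:]))    # 5x1

    # Accumulate the gradients: Delta(l) += delta(l+1) (a(l))^T.
    Delta3 = delta4 @ a3.T   # 4x6, same shape as Theta3
    Delta2 = delta3 @ a2.T   # 5x6, same shape as Theta2
    Delta1 = delta2 @ a1.T   # 5x4, same shape as Theta1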
Previously (cf. section 2.4), we learned that if we have a linear regression model whose features/variables are of very low polynomial order, then it has high bias and underfits the data. On the other end of the spectrum, if our model has high-order polynomial features/variables and performs poorly in generalizing to new data sets, then it is said to have high variance and to overfit the data. But how do we find the sweet spot where the model strikes a nice balance between data-fitting (i.e. training data) and data-generalizing?
The simplest way to do a diagnostic is by first partitioning the entire data set into i) a training set, ii) a development set, and iii) a test set, and then plotting the model's error on the training and development sets against the degree of polynomial.

[Figure: training error (lower curve) and development error (upper curve) plotted against the degree of polynomial.]
In the graph, the lowest-error point on the curve on top will then be the magical number of degrees of polynomial we are looking for^16. Now, a diagnostic is possible:

Bias case: The errors of the training set and the development set are both high, and the error difference between the two data sets is low.

Variance case: The error is low for the training set^17 and high for the development set, and the error difference between the two data sets is high.
Alternatively, we may also take the error as a function of \lambda (i.e. the regularization coefficient) to decide between the bias/variance cases. Plotting the error against \lambda, we will have a graph similar to the following:

[Figure: training error (lower curve) and development error (upper curve) plotted against the regularization coefficient \lambda.]

^16 Intuitively, the higher the degree of polynomial in the model, the more oscillation will be possible in the regression curve, and thus the curve will be able to adapt to the data better.
^17 Because the model is trained on the training set, after all.
As before, the lower curve is for the training set and the upper curve for the development set^18. We observe that, with a lower level of regularization, the training error is relatively lower than with a higher level of regularization, because we allow the predictors to make more of a contribution in the former case. The development error, however, forms a U-shaped curve. Applying the analysis from the degree-of-polynomial situation, we know that the large difference between training and development error at a small \lambda indicates a high variance problem; on the other end of the curve, where the error difference is small, we have a high bias problem.
Finally, to find out whether increasing the size of the training set would help improve the performance of our model, we may plot the number of training data points against the error. In general, the error goes up as the training set expands, for the simple reason that it gets more difficult to have a curve (with its oscillatoriness/adaptivity fixed by the model) that caters to all the data points^19.
With this in mind, let us consider what the error curves for training and development will look like in the high bias vs. high variance cases. First consider the high bias case. If a model has high bias, graphically the regression line will be less curvy. In the extreme case of having only one predictor, the
regression line is a straight line. As the size of the training set gets larger, the error in both training and development increases, because there are now more squared errors to include in computing the overall error. The giveaway of a high bias situation, however, lies in the general level of error in both the training and the development sets, which will be relatively high. Consider the case of a straight regression curve: due to its lack of oscillation, it will still run through the data-point cloud with a large training set just as it does with a small training set. Thus, in this case, adding more data will not improve the predictive power of the model. A graphical representation of the discussion is as follows^20:

[Figure: learning curves for the high bias case.]

^18 Crucially, note that we are assuming that the weights are properly selected such that they minimize the error.
^19 Consider the case where one has 1, 2 or 3 data points: a quadratic curve will be able to fit all of them perfectly. This no longer necessarily holds true with a larger number of data points.
Then consider the high variance scenario. One aspect the high variance case has in common with the high bias case is that the overall error increases as the training/development set gets larger. However, given the meandering curviness of the high variance regression curve, it will have a relatively lower error rate than its high bias counterpart on the training set. As one might expect, the high variance curve performs as underwhelmingly as the high bias curve on the development set, although the error climbs down as the size of the development set goes up. The distinguishing feature of the high variance case is the big gap between the error curves of the training and the development sets. This coincides with our intuition that a high variance curve is one that does well in training and falls off the cliff when it goes out to meet new data sets. Graphically, the high variance case looks like the following^21:

[Figure: learning curves for the high variance case.]
^20 Citation: Stanford ML, A. Ng, https://www.coursera.org/learn/machine-learning/lecture/Kont7/learning-curves.
^21 Citation: Stanford ML, A. Ng, https://www.coursera.org/learn/machine-learning/lecture/Kont7/learning-curves.
Crucially, note that in the high variance case, getting more training data actually may help improve the performance of the model: it is likely that the error curve for the development set dips down further with a larger data set, because a high variance model simply adapts to data better than a high bias model.

In summary, upon a high bias/variance diagnosis, we may take remedial actions accordingly:

High Bias:
  - Add more features/variables.
  - Add polynomial features/variables.
  - Decrease the regularization coefficient \lambda.

High Variance:
  - Increase the size of the training set.
  - Drop some features/variables.
  - Increase the regularization coefficient \lambda.