Machine Learning Summary
1 Introduction
4 Logistic regression
4.1 Hypothesis function and cost function
4.2 One-vs-all classification
6 Neural network
6.1 Introduction
6.2 Examples
6.3 Cost function
6.4 Backpropagation
6.5 Summary
11 Dimensionality reduction
11.1 Formulation of PCA
11.2 Implementation of PCA
11.3 Mathematics of SVD
11.4 Choice of k
11.5 Good use v.s. bad use
12 Anomaly detection
12.1 Introduction
12.2 Algorithm
12.3 Developing and evaluating an anomaly detection system
12.4 Anomaly detection v.s. supervised learning
12.5 Choosing features to use
12.6 Multivariate Gaussian distribution
13 Recommender systems
13.1 Introduction
13.2 Collaborative filtering algorithm
13.3 Mean normalization
1 Introduction
Machine learning enables computers to complete tasks without being explicitly
programmed. With appropriate methods applied, a computer can learn to complete
a task better by learning from previous results on that task, which resembles
the learning process of human beings. By “better”, we mean better performance
as judged by some quantitative measurement.
Machine learning can be divided into supervised learning and unsuper-
vised learning according to the data set used.
In supervised learning, the correct output of each case in the data set is
already known. The learning algorithm is supposed to reveal the relationship
between the input and the output. Supervised learning problems can be cate-
gorized into regression problems and classification problems. In regression
problems, the output has continuous value, while in classification the output is
discrete.
In unsupervised learning, no correct output is provided in advance. Structure
of the data needs to be derived by clustering the data based on the relationship
among the variables in the data set.
$$h_\theta(x) = \theta_0 + \theta_1 x \tag{1}$$
Our target is to find the appropriate parameters $\theta_0, \theta_1$ that minimize the cost
function
$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \tag{2}$$
in which $(x^{(i)}, y^{(i)})$, $i = 1, 2, \ldots, m$ are the training examples, and $m$ is the size
of the training set. In mathematical language, the problem we are supposed
to solve is $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$.
until convergence.
The gradient descent algorithm does not necessarily converge to the global
minimum of an arbitrary function. If there exist several local minima, the
algorithm could wind up at any of them, depending on the initial choice of
$\theta_j$. However, the squared-error cost function of linear regression is
convex and admits only one local minimum, which is hence also the global
minimum. Thus gradient descent can be applied here without having to worry
about converging to a poor local minimum due to an unwise choice of
initial values.
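For univariate linear regression, (3) becomes
$$
\begin{aligned}
\theta_0 &:= \theta_0 - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \\
\theta_1 &:= \theta_1 - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}
\end{aligned}
\tag{4}
$$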
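The update rule (4) can be sketched in a few lines of Python. This is only an illustration: the function names, the toy data set, the learning rate and the iteration count are assumptions made for the example, not values from these notes.

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent, following (4)."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # h_theta(x^(i)) - y^(i) for every training example
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Both parameters are updated simultaneously
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


def cost(theta0, theta1, xs, ys):
    """Squared-error cost function (2)."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Toy data generated from y = 1 + 2x (an assumption for this example)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = gradient_descent(xs, ys)
```

On this noise-free data the parameters approach $\theta_0 = 1$, $\theta_1 = 2$ and the cost approaches zero. A learning rate that is too large for the data would make the same loop diverge.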
in which j = 0, 1 . . . n.
3 MULTIVARIATE LINEAR REGRESSION 3
The normal equation method does not require a learning rate to be chosen, nor
is iteration needed. However, since it requires the calculation of $(X^T X)^{-1}$,
it can become quite slow when $n$ is large, whereas gradient descent still works
fine.
What’s more, it is possible that $X^T X$ is non-invertible (singular, degenerate)
when there are redundant features or when there are too many features ($n > m$).
In such cases, the pinv (pseudo-inverse) function in Octave and MATLAB will
help avoid failure of the calculation. We could also manually delete redundant
features or consider using regularization, which will be covered later.
4 Logistic regression
4.1 Hypothesis function and cost function
Recall that in supervised learning, when the output is discretely valued, the
problem is said to be a classification problem. When there are only two classes,
the problem is a binary classification problem, and logistic regression is a
standard way to solve it. We will illustrate how to solve such a problem.
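We use the hypothesis function
$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}} \tag{12}$$
which is interpreted as the probability that $y = 1$:
$$h_\theta(x) = P(y = 1 | x, \theta)$$
If $h_\theta(x) \ge 0.5$, we predict that $y = 1$; otherwise ($h_\theta(x) < 0.5$) we predict
$y = 0$. This gives us the decision boundary $\theta^T x = 0$.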
The squared-error cost function is no longer appropriate for logistic regression.
We use the cost function
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left( h_\theta(x^{(i)}) \right) + (1 - y^{(i)}) \log\left( 1 - h_\theta(x^{(i)}) \right) \right] \tag{13}$$
The logistic regression problem can now be solved by minimizing the cost
function: $\min_\theta J(\theta)$. Again we can apply gradient descent
$$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \tag{14}$$
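As a minimal sketch of (12), (13) and (14) working together, the following Python code trains a logistic regression classifier on a tiny one-feature data set. The data, the learning rate, the iteration count and all function names are assumptions chosen for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x), equation (12); x includes the bias entry x0 = 1."""
    return sigmoid(sum(t * xj for t, xj in zip(theta, x)))


def cost(theta, X, y):
    """Cost function (13)."""
    m = len(X)
    total = 0.0
    for xi, yi in zip(X, y):
        h = hypothesis(theta, xi)
        total += yi * math.log(h) + (1 - yi) * math.log(1 - h)
    return -total / m


def gradient_step(theta, X, y, alpha):
    """One simultaneous update of every theta_j, equation (14)."""
    m = len(X)
    errors = [hypothesis(theta, xi) - yi for xi, yi in zip(X, y)]
    return [t - alpha / m * sum(e * xi[j] for e, xi in zip(errors, X))
            for j, t in enumerate(theta)]


# Toy data (assumed): the first column is the bias feature x0 = 1
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]]
y = [0, 0, 1, 1]

theta = [0.0, 0.0]
initial_cost = cost(theta, X, y)
for _ in range(2000):
    theta = gradient_step(theta, X, y, alpha=0.5)
final_cost = cost(theta, X, y)

# Predict y = 1 when h_theta(x) >= 0.5
predictions = [1 if hypothesis(theta, xi) >= 0.5 else 0 for xi in X]
```

With $\theta = 0$ every prediction is 0.5, so the initial cost equals $\log 2$. After training, the cost is lower and the four toy examples are classified correctly.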
For each class $y = i$, we fit a hypothesis function (regression classifier) $h_\theta^{(i)}(x)$
that can be interpreted as the probability of $y = i$, i.e.
$$h_\theta^{(i)}(x) = P(y = i | x, \theta)$$
Equipped with these classifiers, when asked to predict the output for a certain
$x$, we choose the class $i$ that maximizes $h_\theta^{(i)}(x)$.
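The prediction step of one-vs-all classification can be sketched as follows. The three parameter vectors are hypothetical, standing in for classifiers that have already been trained; only the "pick the class with the highest probability" logic comes from the text.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_one_vs_all(thetas, x):
    """Return the class i that maximizes h_theta^(i)(x), plus all probabilities."""
    probabilities = [sigmoid(sum(t * xj for t, xj in zip(theta, x)))
                     for theta in thetas]
    best_class = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return best_class, probabilities


# Hypothetical parameters of three already-trained classifiers (x0 = 1 is the bias)
thetas = [
    [2.0, -1.0, -1.0],   # classifier for class 0
    [-1.0, 2.0, -1.0],   # classifier for class 1
    [-1.0, -1.0, 2.0],   # classifier for class 2
]

best, probabilities = predict_one_vs_all(thetas, [1.0, 3.0, 0.5])
```

Note that the probabilities returned by the separate classifiers need not sum to 1, since each one is fitted independently.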
Intuitively, the $1 - \alpha \frac{\lambda}{m}$ factor reduces the magnitude of $\theta_j$ at each iteration.
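The solution of the normal equation changes into
$$
\theta = \left( X^T X + \lambda \begin{bmatrix} 0 & & & & \\ & 1 & & & \\ & & 1 & & \\ & & & \ddots & \\ & & & & 1 \end{bmatrix} \right)^{-1} X^T y \tag{17}
$$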
6 Neural network
6.1 Introduction
Linear regression and logistic regression may not be practical solutions to some
machine learning problems because we will end up with too many features to
deal with. As an example, suppose we want to apply logistic regression to a
100×100 greyscale image (maybe trying to tell if it is an image of a car). If we
want to take the quadratic terms ($x_i x_j$) as features, there will be roughly
$5 \times 10^7$ different features. If higher order terms were to be taken into
account, which is not uncommon, the number could become even higher. Trying to
solve such a problem with linear/logistic regression is impractical. The method
to turn to in such a case is the neural network, which is the state-of-the-art
technique for many machine learning applications.
Neural networks date back a few decades, to when people started trying to
mimic the way the human brain works with computers. Due to their high demand
for computing capacity, they did not get into the spotlight until recent years.
In a human neural cell, an electrical signal enters the cell through an
input wire called a “dendrite”. The cell processes the signal, and then sends a
new signal out via an output wire called an “axon”. In machine learning, a
logistic unit inside a neural network works in a similar way. Different signals
$x_i$ are fed into the unit, and the unit outputs a new signal
$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}} \tag{20}$$
as illustrated in Figure 1.
When a series of such logistic units, or neurons, are connected, they form a
neural network, as illustrated by Figure 2.
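Quantitatively, by denoting the “activation” of unit $i$ in layer $j$ as $a_i^{(j)}$, and the
matrix of weights controlling the function mapping from layer $j$ to layer $j + 1$ as
$\Theta^{(j)}$, we have
$$
\begin{aligned}
a_1^{(2)} &= g\left( \Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3 \right) \\
a_2^{(2)} &= g\left( \Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3 \right) \\
a_3^{(2)} &= g\left( \Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3 \right) \\
h_\theta(x) = a_1^{(3)} &= g\left( \Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)} \right)
\end{aligned}
\tag{21}
$$
in which $g(z)$ is the sigmoid function. The vectorized version of (21) can be
written as
$$a^{(j+1)} = g\left( \Theta^{(j)} \begin{bmatrix} 1 \\ a^{(j)} \end{bmatrix} \right) \tag{22}$$
If layer $j$ has $s_j$ neurons (not including the bias neuron $a_0^{(j)} = 1$), it is clear that
$\Theta^{(j)}$ is an $s_{j+1} \times (s_j + 1)$ matrix.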
6.2 Examples
With the examples given above, we can calculate a more complex logical
function XNOR. Note that
$$x_1 \text{ XNOR } x_2 = (x_1 \text{ AND } x_2) \text{ OR } (\text{NOT}(x_1) \text{ AND } \text{NOT}(x_2))$$
Thus we can calculate XNOR with the neural network shown in Figure 5.
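A minimal sketch of such a network, using the forward propagation rule $a^{(j+1)} = g(\Theta^{(j)} [1; a^{(j)}])$, is given below. The weight values are an assumption: they are a commonly used choice for illustrating AND, NOR-style and OR units with sigmoid activations, and are not read from Figure 5.

```python
import math


def g(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(thetas, x):
    """Forward propagation through all layers; x excludes the bias unit."""
    a = list(x)
    for theta in thetas:
        a_with_bias = [1.0] + a  # prepend the bias unit a0 = 1
        a = [g(sum(w * v for w, v in zip(row, a_with_bias))) for row in theta]
    return a


# Assumed weights (not taken from Figure 5)
theta1 = [[-30.0, 20.0, 20.0],    # hidden unit 1: x1 AND x2
          [10.0, -20.0, -20.0]]   # hidden unit 2: NOT(x1) AND NOT(x2)
theta2 = [[-10.0, 20.0, 20.0]]    # output unit: a1 OR a2


def xnor(x1, x2):
    return forward([theta1, theta2], [x1, x2])[0]


truth_table = {(x1, x2): round(xnor(x1, x2)) for x1 in (0, 1) for x2 in (0, 1)}
```

Here `theta1` is a 2×3 matrix and `theta2` a 1×3 matrix, matching the $s_{j+1} \times (s_j + 1)$ shape of $\Theta^{(j)}$. The outputs are not exactly 0 or 1 but lie very close to them, which is why the truth table rounds the result.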
in which $L$ is the number of layers. Note that the regularization term does not
include $\Theta_{j0}$, which is related to the bias term.
In order to minimize $J(\Theta)$ using gradient descent or other methods, we need
to compute $\frac{\partial J(\Theta)}{\partial \Theta_{ij}}$ for all $i, j$. The method to do this is called backpropagation.
6.4 Backpropagation
Consider one specific training example $(x, y)$. Intuitively speaking, in the
backpropagation algorithm, we use $\delta_j^{(l)}$ to denote the “error” of node $j$ in layer $l$.
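$$\frac{\partial}{\partial \theta_i} J(\theta) \approx \frac{J(\theta_1, \ldots, \theta_i + \epsilon, \ldots, \theta_n) - J(\theta_1, \ldots, \theta_i - \epsilon, \ldots, \theta_n)}{2\epsilon}$$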
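The two-sided difference above can be used to check an analytic gradient numerically. The sketch below is an illustration only: the cost function `J`, its analytic gradient and the value of `epsilon` are hypothetical stand-ins, not the neural network cost from these notes.

```python
def numerical_gradient(J, theta, epsilon=1e-4):
    """Approximate each partial derivative of J with a two-sided difference."""
    gradient = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += epsilon
        minus[i] -= epsilon
        gradient.append((J(plus) - J(minus)) / (2 * epsilon))
    return gradient


def J(theta):
    """Hypothetical cost function used only to demonstrate the check."""
    return theta[0] ** 2 + 3.0 * theta[0] * theta[1]


def analytic_gradient(theta):
    """Hand-derived gradient of the hypothetical J above."""
    return [2.0 * theta[0] + 3.0 * theta[1], 3.0 * theta[0]]


theta = [1.0, 2.0]
approx = numerical_gradient(J, theta)
exact = analytic_gradient(theta)
max_difference = max(abs(a - e) for a, e in zip(approx, exact))
```

If `max_difference` is tiny, the analytic gradient (for example one computed by backpropagation) is very likely implemented correctly. The numerical check evaluates `J` twice per parameter, so it is usually run only on small cases and disabled during actual training.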
7 MACHINE LEARNING DIAGNOSTICS 11
6.5 Summary
Here is a summary of the implementation of the neural network learning algorithm.
$J(\Theta)$ is not necessarily convex for a neural network, which means it is possible
that we wind up at a local minimum rather than the global minimum. In practice
this seldom causes problems, because even if a local minimum is obtained, it is
usually not too far from the global minimum and is thus an acceptable solution.
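For a logistic regression (i.e. binary classification) problem, another way to define the
test set error is
$$J_{test}(\theta) = \frac{1}{m_{test}} \sum_{i=1}^{m_{test}} \mathrm{err}\left( h_\theta(x_{test}^{(i)}), y_{test}^{(i)} \right)$$
in which
$$\mathrm{err}(h_\theta(x), y) = \begin{cases} 1, & \text{if } y = 0,\ h_\theta(x) \ge 0.5 \text{ or } y = 1,\ h_\theta(x) < 0.5 \\ 0, & \text{otherwise} \end{cases}$$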
For each candidate model $h_\theta^{(i)}(x)$, we train its parameters $\theta^{(i)}$
by minimizing $J(\theta)$. Then we calculate the cross-validation set error $J_{cv}(\theta^{(i)})$
for each of them, in which
$$J_{cv}(\theta) = \frac{1}{2m_{cv}} \sum_{i=1}^{m_{cv}} \left( h_\theta(x_{cv}^{(i)}) - y_{cv}^{(i)} \right)^2$$
for linear regression, and similar to (30) for logistic regression. Finally, the
model that minimizes $J_{cv}(\theta)$ is selected, and its performance can be measured
with the test set error.
It is obvious that in a high bias (underfit) case, both Jcv (θ) and Jtrain (θ)
are high, while in a high variance (overfit) case, Jcv (θ) is high but Jtrain (θ) is
low.
We introduced regularization as the method to combat overfit, but the choice
of the regularization parameter λ can be subtle. It turns out that an appropriate
In a high-bias case, the examples are underfitted (e.g. trying to fit examples
that follow a degree-5 polynomial with a straight line). Even if the size of
the training set is increased, the hypothesis will still fail to correctly describe
the relationship between the input and the output. Thus, both $J_{cv}(\theta)$ and
$J_{train}(\theta)$ will be high, and they will be quite close to each other, as shown in
Figure 9. In this case, collecting more training examples is not likely to
help. The hypothesis itself has to be adjusted, e.g. by adding more
features.
In a high variance case, the examples are overfitted. Jtrain (θ) is typically
lower than Jcv (θ), as shown in Figure 10. If more examples are added to the
training set, the training set becomes “less overfitted”, thus Jtrain (θ) is likely to
increase, while Jcv (θ) will decrease. In such case, collecting more training
8 MACHINE LEARNING SYSTEM DESIGN 16
7.5 Conclusion
We will now conclude this section by providing the effects of the solutions to
performance problems listed at the beginning.
Rather than randomly making a choice based on gut feeling, systematic analysis
methods are available to help choose the best option.
Table 1 provides some useful definitions for skewed data error metrics. If
the actual class is 1 and the algorithm predicts 1, we say that the algorithm has
made a true positive prediction, etc. Now we can define the precision and the
recall of the algorithm.
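$$
\begin{aligned}
\text{Precision} &= \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \\
\text{Recall} &= \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}
\end{aligned}
\tag{31}
$$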
9 SUPPORT VECTOR MACHINE (SVM) 18
Clearly, precision is the percentage of patients who actually have cancer among
patients diagnosed to have cancer, while recall is the percentage of patients who
are diagnosed to have cancer among patients who actually have cancer.
With precision and recall defined, we can introduce the tradeoff between
them. Normally we predict y = 1 when hθ (x) ≥ 0.5 and y = 0 when hθ (x) < 0.5.
If we want to predict y = 1 only when very confident, we can increase the
threshold for prediction, e.g. to 0.9. This will typically increase the precision
but decrease the recall. Conversely, if we want to alert more patients
who might have cancer, we can decrease the threshold, say to 0.3, and we will
end up with lower precision but higher recall.
Taking both precision and recall into account, we define the $F_1$ score of the
algorithm
$$F_1 = \frac{2PR}{P + R} \tag{32}$$
and use it as a measurement of the performance of the algorithm. Generally, an
algorithm with a better F1 score has better overall performance.
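The definitions (31) and (32) and the effect of the prediction threshold can be sketched as follows. The labels and predicted probabilities are made-up illustration data, and returning 0 when a denominator is zero is a convention assumed here rather than something stated in these notes.

```python
def precision_recall_f1(actual, predicted):
    """Compute precision, recall and F1 from 0/1 labels and 0/1 predictions."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    # Assumed convention: a ratio with a zero denominator is reported as 0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def predict(probabilities, threshold=0.5):
    """Predict y = 1 when h_theta(x) is at least the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Made-up labels and classifier outputs h_theta(x)
actual = [1, 1, 1, 0, 0, 0, 0, 0]
probabilities = [0.95, 0.7, 0.4, 0.6, 0.35, 0.2, 0.1, 0.05]

default = precision_recall_f1(actual, predict(probabilities, 0.5))
strict = precision_recall_f1(actual, predict(probabilities, 0.9))
lenient = precision_recall_f1(actual, predict(probabilities, 0.3))
```

On this data the threshold 0.9 gives precision 1 with recall 1/3, while the threshold 0.3 gives precision 0.6 with recall 1, which is the tradeoff described above.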
8.3 Data
Sufficient and appropriate data is essential to the performance of the algorithm.
We should always ensure that the features we use include enough information to
predict the output correctly. A useful test to determine whether the information
is enough is to ask ourselves: is a human expert in this field capable of making
a confident prediction of the output based on the information we provide?
If we are using low-bias algorithms, e.g. a neural network with a lot of hidden
layers or a linear regression with a lot of features, a large amount of data will
help avoid overfitting and is thus usually preferable.
In SVM, we will use new functions $\mathrm{cost}_1(z)$, $\mathrm{cost}_0(z)$, as depicted by the red
lines in Figure 11 and Figure 12, to substitute $f_1(z)$ and $f_0(z)$. The new cost
function is
$$J(\theta) = C \sum_{i=1}^{m} \left[ y^{(i)} \mathrm{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)}) \mathrm{cost}_0(\theta^T x^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 \tag{34}$$
Note that, by convention, the factor $\frac{1}{m}$ is dropped and, rather than writing
the function as $A + \lambda B$, we now write it as $CA + B$. $C$ plays the same role
as $\frac{1}{\lambda}$ in the original formulation when it comes to regularization.
Unlike logistic regression, in which $h_\theta(x)$ is interpreted as the probability of
$y = 1$, SVM has the following hypothesis:
$$h_\theta(x) = \begin{cases} 1, & \text{if } \theta^T x \ge 0 \\ 0, & \text{otherwise} \end{cases} \tag{35}$$
When there exist outliers, the regularization parameter C keeps the
algorithm from overfitting them. For this to work, C cannot be too large.
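trated by Figure 14. In this simple 2-d case, our target becomes
$$
\begin{aligned}
& \min_\theta \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 = \min_\theta \frac{1}{2} \|\theta\|^2 \\
& \text{s.t. } \begin{cases} \theta^T x^{(i)} \ge 1, & \text{if } y^{(i)} = 1 \\ \theta^T x^{(i)} \le -1, & \text{if } y^{(i)} = 0 \end{cases}
\end{aligned}
\tag{36}
$$
Note that
$$\theta^T x = \|\theta\| \|x\| \cos\langle\theta, x\rangle = p \cdot \|\theta\|$$
in which $p$ is the projection of $x$ along the direction of $\theta$. In order to minimize
$\|\theta\|$, the magnitude of $p$ should be as large as possible for all samples. From
Figure 14, $l_1$ is a better decision boundary because $p_1 > p_2$.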
9.3 Kernels
In order to adapt SVMs to develop complex non-linear classifiers, we have to
use kernels.
One way to develop non-linear classifiers is to use high degree polynomial
features. We will end up with a classifier that predicts y = 1 if
$$f_i = \mathrm{similarity}\left( x, l^{(i)} \right) = \exp\left( -\frac{\|x - l^{(i)}\|^2}{2\sigma^2} \right) \tag{38}$$
Here we are using the Gaussian kernel. When $x$ is close to $l^{(i)}$, this kernel returns
approximately 1, while when $x$ is far from $l^{(i)}$, it returns approximately 0.
With features $f_i$ defined as such, we now have SVMs that can conduct
non-linear classification. The SVM predicts $y = 1$ when $\theta^T f \ge 0$, and $y = 0$
otherwise.
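The Gaussian kernel (38) and the resulting feature vector can be sketched as below. The landmarks, the query point and the value of $\sigma$ are arbitrary values assumed for the example.

```python
import math


def gaussian_kernel(x, landmark, sigma):
    """similarity(x, l) = exp(-||x - l||^2 / (2 sigma^2)), equation (38)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def kernel_features(x, landmarks, sigma):
    """Map x to the feature vector f = (f_1, ..., f_k), one entry per landmark."""
    return [gaussian_kernel(x, landmark, sigma) for landmark in landmarks]


# Assumed landmarks and query point
landmarks = [[0.0, 0.0], [5.0, 5.0]]
f = kernel_features([0.1, 0.0], landmarks, sigma=1.0)
```

The query point is close to the first landmark and far from the second, so `f[0]` is close to 1 and `f[1]` is close to 0. A larger $\sigma$ makes the similarity fall off more slowly with distance.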
As for the choice of landmarks $l^{(i)}$, in practice, we use all examples in the
training set as landmarks. Thus we will end up with $m$ features. $\theta$ can be
trained with
$$\min_\theta C \sum_{i=1}^{m} \left[ y^{(i)} \mathrm{cost}_1(\theta^T f^{(i)}) + (1 - y^{(i)}) \mathrm{cost}_0(\theta^T f^{(i)}) \right] + \frac{1}{2} \sum_{i=1}^{m} \theta_i^2 \tag{39}$$
10 Clustering
Unlike supervised learning, in which the training examples are $(x, y)$ pairs, the
training set of unsupervised learning contains only $x$, without a label $y$. The first
type of unsupervised learning problem that we will introduce is clustering. The
target of clustering is to group the training examples $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$ into
a few clusters. Clustering is widely applied in scientific research and
industrial practice, such as market segmentation, social network analysis,
organization of computing clusters, and astronomical data analysis.
This cost function is also called the distortion function. Obviously, the cluster
assignment process could be interpreted as minimizing J(c, µ) with respect to
c, while the move centroids process could be interpreted as minimizing J(c, µ)
with respect to µ.
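The two alternating steps can be sketched as follows. The distortion is assumed here to be the usual average squared distance between each example and its assigned centroid, $J(c, \mu) = \frac{1}{m} \sum_i \|x^{(i)} - \mu_{c^{(i)}}\|^2$; the data, the initial centroids and the function names are illustrative assumptions.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def assign_clusters(points, centroids):
    """Cluster assignment step: minimize J over c with the centroids fixed."""
    return [min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points]


def move_centroids(points, assignments, centroids):
    """Move centroids step: minimize J over mu with the assignments fixed."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, c in zip(points, assignments) if c == k]
        if members:
            new_centroids.append([sum(col) / len(members) for col in zip(*members)])
        else:
            new_centroids.append(list(old))  # keep the centroid of an empty cluster
    return new_centroids


def distortion(points, assignments, centroids):
    """Assumed definition: average squared distance to the assigned centroid."""
    total = sum(squared_distance(p, centroids[c])
                for p, c in zip(points, assignments))
    return total / len(points)


# Toy data with two visible groups, and assumed initial centroids
points = [[1.0, 1.0], [1.5, 2.0], [1.0, 1.5], [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]]
centroids = [[0.0, 0.0], [10.0, 10.0]]

history = []
for _ in range(5):
    assignments = assign_clusters(points, centroids)
    centroids = move_centroids(points, assignments, centroids)
    history.append(distortion(points, assignments, centroids))
```

Because each step minimizes the distortion with respect to one group of variables while the other is fixed, the values recorded in `history` never increase from one iteration to the next.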
10.4 Choice of K
The choice of K could be subtle in a practical clustering problem. “Elbow
method” might work in some situations, but more often provides no optimal
option. Usually, K-means clustering is run for some later/downstream purpose.
The choice of K should aim at better serving the downstream purpose. For in-
stance, if we are running K-means on height/weight data of potential customers
of a T-shirt we intend to produce in order to figure out how to segment the
customers into groups of different sizes, K should obviously be 3 (S, M, L) or
5 (XS, S, M, L, XL), even if the choice of K is completely ambiguous at first
sight of the data.
11 Dimensionality reduction
Sometimes we may wish to reduce the dimension of the data for some reason.
Dimensionality reduction could be useful for data compression. If all 3D
samples are close to a 2D plane, we can project all samples on this plane, and
the 3D data is compressed to 2D. In practice the compression could be huge.
For example, in an 8-bit RGB image, each pixel requires 24 bits to store the
RGB values, each of them being an 8-bit integer from 0 to 255. If we can cluster
the RGB values of all pixels into 16 clusters, which could be carried out with
K-means, the RGB values of each pixel could be substituted with the values of
the centroid to which it is assigned, and we manage to encode the image with 4
bits per pixel plus some overhead (RGB values of the 16 centroids) at the price
of losing some details. Also, when trying to develop pattern recognition
machine learning algorithms such as face detection, the training examples are
often of high dimensionality (e.g. about $1.6 \times 10^4$ dimensions for a 128×128
greyscale image). With proper dimensionality reduction applied to the data
beforehand, the scale of the data can be significantly reduced (maybe from 10000
dimensions to 100 dimensions), and the algorithm can be significantly accelerated
without damaging its ability to solve the problem.
Another situation that calls for dimensionality reduction is when we want
to visualize the data. Visualization of the data can sometimes provide some
intuitive inspirations on the properties of the data set, but it requires that the
data be reduced to 2D or 3D.
Principal component analysis (PCA) is the most popular algorithm for di-
mensionality reduction.
in which
$$U_k = \left[ u^{(1)}, u^{(2)}, \ldots, u^{(k)} \right].$$
Note that z (i) could be obtained by
We end up with a matrix $U$ that contains all eigenvectors of $\Sigma$ as its columns.
The $U_k$ we want is simply its first $k$ columns.
in which the threshold t could be 1%, 5%, 10%, etc for different purposes. If
t = 1%, we say 99% of variance is retained.
Intuitively, what we should do is to choose the threshold we need, and start
from PCA with $k = 1$. If the final result does not satisfy (43), we go to
$k = 2$, and so on. However, the matrix $S$ obtained from svd provides us with a
better approach. $S$ is a diagonal matrix that satisfies, for a specific $k$,
$$\frac{\frac{1}{m} \sum_{i=1}^{m} \left\| x_{approx}^{(i)} - x^{(i)} \right\|^2}{\frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} \right\|^2} = 1 - \frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}}. \tag{44}$$
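Using the right-hand side of (44), the smallest acceptable $k$ can be found from the diagonal of $S$ alone, without rerunning PCA for each candidate. The sketch below assumes the diagonal entries are already available as a list in descending order; the numbers are hypothetical.

```python
def smallest_k(s_diagonal, threshold=0.01):
    """Smallest k such that 1 - sum_{i<=k} S_ii / sum_i S_ii <= threshold."""
    total = sum(s_diagonal)
    retained = 0.0
    for k, s in enumerate(s_diagonal, start=1):
        retained += s
        if 1.0 - retained / total <= threshold:
            return k
    return len(s_diagonal)


# Hypothetical diagonal of S, in descending order
s_diagonal = [6.0, 3.0, 0.95, 0.05]

k_99 = smallest_k(s_diagonal, threshold=0.01)   # retain 99% of the variance
k_85 = smallest_k(s_diagonal, threshold=0.15)   # retain 85% of the variance
```

With these values, two components are enough to retain 85% of the variance, while three are needed for 99%. The comparison is done in floating point, so a ratio that lands exactly on the threshold may fall on either side.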
PCA helps to reduce the dimensionality of the data, and fewer features
bring smaller possibility of overfitting. But PCA is not a good way to address
overfitting. Regularization is always a better option.
PCA helps to accelerate machine learning algorithms, but sometimes it is
unnecessary because the algorithm could have run satisfactorily fast with the
raw data. It is always wise and worthwhile to give it a try with the raw data
before implementing PCA. If the algorithm runs too slowly or exhausts the
storage (memory/disk) so that the task is unlikely to be completed with the
raw data, it would be time to turn to PCA.
12 Anomaly detection
12.1 Introduction
Anomaly detection is an unsupervised learning problem that has some aspects
similar to supervised learning problems.
Given $m$ normal (non-anomalous) examples $x^{(1)}, x^{(2)}, \ldots, x^{(m)}$, we are
supposed to tell whether a new example $x_{test}$ is anomalous. The approach we will
take is to build a density model $p(x)$ and choose a threshold $\epsilon$. If $p(x_{test}) \ge \epsilon$, the
new example $x_{test}$ is flagged as normal; otherwise it is recognized as an anomaly.
Anomaly detection could be used in fraud detection. Also it can be used to
decide whether a product is up to the normal quality standard in manufacturing.
What’s more, it can be used to monitor the status of computers in a data center.
12.2 Algorithm
Suppose the training examples $x^{(i)} \in \mathbb{R}^n$. We will assume that
$$p(x) = \prod_{i=1}^{n} p(x_i; \mu_i, \sigma_i^2)$$
we can calculate p(xtest ) for any new example xtest and tell whether it is an
anomaly.
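A minimal sketch of this procedure is given below. It assumes that each $p(x_i; \mu_i, \sigma_i^2)$ is a univariate Gaussian density and that $\mu_i$ and $\sigma_i^2$ are estimated as the per-feature mean and the per-feature variance with a $\frac{1}{m}$ factor. The training data and the threshold `epsilon` are made up for illustration.

```python
import math


def fit_gaussians(examples):
    """Estimate mu_i and sigma_i^2 for every feature (1/m variance, an assumption)."""
    m = len(examples)
    columns = list(zip(*examples))
    mus = [sum(col) / m for col in columns]
    variances = [sum((v - mu) ** 2 for v in col) / m
                 for col, mu in zip(columns, mus)]
    return mus, variances


def density(x, mus, variances):
    """p(x) as the product of univariate Gaussian densities."""
    p = 1.0
    for xi, mu, var in zip(x, mus, variances):
        p *= math.exp(-(xi - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
    return p


def is_anomaly(x, mus, variances, epsilon):
    """Flag x as an anomaly when p(x) < epsilon."""
    return density(x, mus, variances) < epsilon


# Made-up normal examples with two features
examples = [[10.0, 5.0], [10.2, 4.9], [9.8, 5.1], [10.1, 5.0], [9.9, 5.0]]
mus, variances = fit_gaussians(examples)

epsilon = 0.01  # hypothetical threshold
typical_flag = is_anomaly([10.0, 5.0], mus, variances, epsilon)
unusual_flag = is_anomaly([12.0, 5.0], mus, variances, epsilon)
```

The example at the centre of the training data is flagged as normal, while the one whose first feature lies far outside the observed range is flagged as an anomaly. A feature with zero variance would cause a division by zero in this sketch.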
Since the data is quite skewed (most examples are normal), prediction accuracy
is not a good evaluation metric. We can calculate the numbers of true positives,
false positives, true negatives and false negatives, then calculate precision and
recall, and finally evaluate the algorithm with the $F_1$ score.
We will choose the value of $\epsilon$ that provides the best $F_1$ score on the cv set.
Then the algorithm can be evaluated by applying the model to the test set.
learning should be used when there are enough positive examples to help
us get a sense of what an anomaly should “look like”, i.e. we are confident
that future anomalies will be similar to the ones we have seen.
in which Σ is the covariance matrix. This is a better approach when the features
chosen are correlated to each other closely.
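When using the multivariate Gaussian distribution, given a training set
$\{x^{(1)}, \ldots, x^{(m)}\}$, we can fit the model with
$$
\begin{aligned}
\mu &= \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \\
\Sigma &= \frac{1}{m} \sum_{i=1}^{m} (x^{(i)} - \mu)(x^{(i)} - \mu)^T.
\end{aligned}
\tag{49}
$$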
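It is not difficult to figure out that the original model used above is actually
the multivariate Gaussian model with
$$
\Sigma = \begin{bmatrix} \sigma_1^2 & & & \\ & \sigma_2^2 & & \\ & & \ddots & \\ & & & \sigma_n^2 \end{bmatrix}. \tag{50}
$$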
When using the original model, we have to manually create extra features to
capture anomalies that involve unusual combinations of existing features, while
the multivariate Gaussian model automatically captures the correlations between
features. However, we should be cautious about using the multivariate Gaussian
model when there are a lot of features, because it is computationally more
expensive (calculation of $\Sigma^{-1}$). What’s more, the multivariate Gaussian
model requires $m > n$ to ensure the invertibility of $\Sigma$. In practice, it should
only be used when $m$ is sufficiently larger than $n$, e.g. $m > 10n$.
13 Recommender systems
13.1 Introduction
A recommender system aims to recommend to its users items that are likely to
interest them, e.g. books for customers visiting an online bookstore
like Amazon, movies for a user of an online movie database like IMDB, people
that the user might want to add as friends on a social network like Facebook,
etc. It is an important application of machine learning. It is interesting for us
because it is one of the algorithms that do not call for manual choice of features.
We will illustrate the components of general recommender system with a
movie recommending system as an example. The system recommends movies
for users based on ratings given by all users in the database as well as the
user’s personal rating history. We will denote the total number of movies in
the database with $n_m$ and the total number of users with $n_u$. We will denote
$r(i, j) = 1$ if user $j$ has rated movie $i$, and $r(i, j) = 0$ otherwise. In the
case $r(i, j) = 1$, the rating of movie $i$ given by user $j$ will be denoted as $y^{(i,j)}$.
The number of movies that have been rated by user $j$ is denoted with $m^{(j)}$, i.e.
$\sum_i r(i, j) = m^{(j)}$.
In order to make correct recommendations, the system will try to predict a
user’s rating for movies that he has not rated based on his personal taste, which
is indicated by his rating history, as well as the category of the movie, which is
indicated by other users’ ratings of the movie. If there are $n$ features to describe
movies, we can use an $n$-dimensional vector $\theta^{(j)}$ to represent user $j$’s taste, and
an $n$-dimensional vector $x^{(i)}$ to represent the features of movie $i$. The rating of movie
$i$ given by user $j$ could then be predicted with $(\theta^{(j)})^T x^{(i)}$.
Suppose we already have the $x^{(i)}$ vectors; then we can learn $\theta^{(j)}$ by
minimizing the cost function
$$J(\theta^{(1)}, \ldots, \theta^{(n_u)}) = \frac{1}{2} \sum_{j=1}^{n_u} \sum_{i: r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} (\theta^{(j)})^T \theta^{(j)},$$
which could be carried out with gradient descent. Note that by convention, $\theta^{(j)}$
and $x^{(i)}$ do not contain a bias component (the 0th component).
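The cost function above and one way to minimize it with gradient descent can be sketched as follows. The movie features, the ratings, $\lambda$, the learning rate and the iteration count are all made-up values for illustration. Ratings are stored in a dictionary keyed by `(i, j)`, which is just one possible representation of the pairs with $r(i, j) = 1$.

```python
def predict_rating(theta_j, x_i):
    """Predicted rating (theta^(j))^T x^(i)."""
    return sum(t * x for t, x in zip(theta_j, x_i))


def user_cost(thetas, X, ratings, lam):
    """Cost over all user vectors; ratings maps (i, j) -> y^(i,j) where r(i, j) = 1."""
    squared_error = sum((predict_rating(thetas[j], X[i]) - y) ** 2
                        for (i, j), y in ratings.items())
    regularization = sum(t * t for theta in thetas for t in theta)
    return 0.5 * squared_error + 0.5 * lam * regularization


def gradient_step(thetas, X, ratings, lam, alpha):
    """One gradient descent step on every theta^(j), with X held fixed."""
    grads = [[lam * t for t in theta] for theta in thetas]
    for (i, j), y in ratings.items():
        error = predict_rating(thetas[j], X[i]) - y
        for k, x in enumerate(X[i]):
            grads[j][k] += error * x
    return [[t - alpha * g for t, g in zip(theta, grad)]
            for theta, grad in zip(thetas, grads)]


# Made-up movie features (e.g. degree of romance, degree of action)
X = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]]
# Made-up ratings: user 0 likes movie 0, user 1 likes movie 1; movie 2 is unrated
ratings = {(0, 0): 5.0, (1, 0): 0.5, (0, 1): 1.0, (1, 1): 4.5}

thetas = [[0.0, 0.0], [0.0, 0.0]]
initial_cost = user_cost(thetas, X, ratings, lam=0.01)
for _ in range(500):
    thetas = gradient_step(thetas, X, ratings, lam=0.01, alpha=0.1)
final_cost = user_cost(thetas, X, ratings, lam=0.01)

prediction_user0 = predict_rating(thetas[0], X[2])
prediction_user1 = predict_rating(thetas[1], X[2])
```

Movie 2 resembles movie 0, so after training the predicted rating is high for user 0 and low for user 1, even though neither of them has rated it.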
Then the $n_m \times n_u$ matrix that stores the predicted ratings of all movies
by all users could simply be expressed as $X\Theta^T$. This is called low rank matrix
factorization. In Matlab, the vectorized implementation of the cost function
and its derivatives is as follows:
14 LARGE SCALE MACHINE LEARNING 31
With $x^{(i)}$ and $\theta^{(j)}$ learned, we can recommend unwatched movies for a
specific user: simply choose the movies that he has not rated with the highest
predicted ratings. We can also find movies that are the most “similar” to a
given movie $i$: simply find movies $j$ with the smallest $\|x^{(i)} - x^{(j)}\|$.
in which $m$ is the number of examples in the training set. If the training set
is very large, say $m = 10^9$, we will have to do $10^9$ additions in each
single step of gradient descent. This needs to be addressed, otherwise the
large amount of computation will make it impossible to solve the problem in
reasonable time. One way to avoid such a problem is to check whether the
problem could be solved with a small training set, say $m = 1000$. If the learning
curve shows that the problem does not have high variance (overfit) with the
small training set, it might be unnecessary to use the large training set.
14.4 Convergence
With batch gradient descent, we check the convergence of the algorithm by
calculating Jtrain (θ) after each iteration. This does not make sense for stochas-
tic gradient descent or mini batch gradient descent because the calculation of
15 EXAMPLE: PHOTO OCR 33
$J_{train}(\theta)$ over the whole training set is exactly the computational load
that we are trying to avoid.
What we can do is to calculate $\mathrm{cost}(\theta, (x^{(i)}, y^{(i)})) = \frac{1}{2} (h_\theta(x^{(i)}) - y^{(i)})^2$ before
updating $\theta$ using $(x^{(i)}, y^{(i)})$ at each iteration. Then every 1000 iterations (just
an example), we can plot $\mathrm{cost}(\theta, (x^{(i)}, y^{(i)}))$ averaged over the last 1000 examples
processed by the algorithm to see the trend of the cost function, and adjust the value
of $\alpha$ accordingly.
The learning rate $\alpha$ is usually kept constant. We could choose to use a learning
rate that gradually diminishes over time, say $\alpha = \frac{\text{const1}}{\text{const2} + \text{iterationNumber}}$, to make
$\theta_j$ get closer to the optimum in the end, but this will introduce two more parameters
for us to tune, and is not always a good idea.
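The monitoring procedure described in this section can be sketched for univariate linear regression as follows. The data set, the constant learning rate, the number of passes and the reporting interval of 100 examples are assumptions made for the example; plotting is left out, and the averaged costs are simply collected in a list.

```python
import random


def stochastic_gradient_descent(examples, alpha=0.1, epochs=50,
                                report_every=100, seed=0):
    """Train h(x) = theta0 + theta1 * x one example at a time.

    Returns the parameters and the cost averaged over each block of
    `report_every` examples, measured before the corresponding updates.
    """
    rng = random.Random(seed)
    theta0, theta1 = 0.0, 0.0
    recent_costs, averaged_costs = [], []
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in random order
        for x, y in data:
            error = theta0 + theta1 * x - y
            recent_costs.append(0.5 * error ** 2)  # cost before updating theta
            theta0 -= alpha * error
            theta1 -= alpha * error * x
            if len(recent_costs) == report_every:
                averaged_costs.append(sum(recent_costs) / report_every)
                recent_costs = []
    return (theta0, theta1), averaged_costs


# Noise-free toy data generated from y = 2 + 3x (an assumption for this example)
examples = [(i / 100.0, 2.0 + 3.0 * (i / 100.0)) for i in range(100)]
theta, averaged_costs = stochastic_gradient_descent(examples)
```

The entries of `averaged_costs` are what would be plotted to judge convergence. A curve that goes up instead of down would suggest that $\alpha$ is too large.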
Text detection: Pick out rectangular regions in the photo where characters
appear.
Character segmentation: Divide each region found in the previous step into
small rectangular regions that each contain one single character.
Character classification: Use a classifier to recognize each small region as the
character it contains.
By dividing the task of photo OCR into a few steps as shown above, we have
built the pipeline of this problem. In a pipeline, all or some of the steps may
involve machine learning. The division into separate steps facilitates the split
of work load among groups of engineers.
Component                 Accuracy
Overall system            72%
Text detection            89%
Character segmentation    90%
Character recognition     100%