Andrew Ng Machine Learning
Examples
Database mining
Machine learning has recently become so big partly because of the huge amount of data being generated
Large datasets from growth of automation and the web
Sources of data include
Web data (click-stream or click through data)
Mine to understand users better
Huge segment of silicon valley
Medical records
Electronic records -> turn records into knowledge
Biological data
Gene sequences, ML algorithms give a better understanding of human genome
Engineering info
Data from sensors, log reports, photos etc
Applications that we cannot program by hand
Autonomous helicopter
Handwriting recognition
Makes mail routing very inexpensive - algorithms can read the handwritten address on an envelope and automatically route it through
the post
Natural language processing (NLP)
AI pertaining to language
Computer vision
AI pertaining to vision
Self customizing programs
Netflix
Amazon
iTunes genius
Take users info
Learn based on your behavior
Understand human learning and the brain
If we can build systems that mimic (or try to mimic) how the brain works, this may push our own understanding of the
associated neurobiology
Example problem: "Given this data, a friend has a house 750 square feet - how much can they be expected to get?"
Another example
Can we classify breast cancer as malignant or benign based on tumour size?
Looking at data
Five of each
Can you estimate prognosis based on tumor size?
This is an example of a classification problem
Classify data into one of two discrete classes - no in between, either malignant or not
In classification problems, can have a discrete number of possible values for the output
e.g. maybe have four values
0 - benign
1 - type 1
2 - type 2
3 - type 3
In classification problems we can plot data in a different way
Based on that data, you can try and define separate classes by
Drawing a straight line between the two groups
Using a more complex function to define the two groups (which we'll discuss later)
Then, when you have an individual with a specific tumor size and who is a specific age, you can hopefully use that
information to place them into one of your classes
You might have many features to consider
Clump thickness
Uniformity of cell size
Uniformity of cell shape
The most exciting algorithms can deal with an infinite number of features
How do you deal with an infinite number of features?
Neat mathematical trick in support vector machine (which we discuss later)
If you have an infinitely long list - we can develop an algorithm to deal with that
Summary
Supervised learning lets you get the "right" answer for each example in your data
Regression problem
Classification problem
Clustering algorithm
Basically
Can you automatically generate structure from the data?
Because we don't give it the answer, it's unsupervised learning
Record slightly different versions of the conversation depending on where your microphone is
But overlapping nonetheless
Have recordings of the conversation from each microphone
Give them to a cocktail party algorithm
Algorithm processes audio recordings
Determines there are two audio sources
Separates out the two sources
Is this a very complicated problem?
Algorithm can be done with one line of code!
[W,s,v] = svd((repmat(sum(x.*x,1), size(x,1),1).*x)*x');
Not easy to identify
But, programs can be short!
Using Octave (or MATLAB) for examples
Often prototype algorithms in Octave/MATLAB to test as it's very fast
Only when you've shown it works do you migrate it to C++
Gives much faster, more agile development
Understanding this algorithm
svd - linear algebra routine which is built into octave
In C++ this would be very complicated!
Shown that using MATLAB to prototype is a really good way to do this
Linear Regression
Housing price data example used earlier
Supervised learning regression problem
What do we start with?
Training set (this is your data set)
Notation (used throughout the course)
m = number of training examples
x's = input variables / features
y's = output or "target" variables
(x,y) - single training example
(xi, yi) - a specific example (the ith training example)
i is an index to training set
So in summary
A hypothesis takes in some variable
Uses parameters determined by a learning system
Outputs a prediction based on that input
Minimize squared difference between predicted house price and actual house price
The 1/2m term
1/m - means we determine the average
The 2 in 1/2m makes the math a bit easier, and doesn't change the θ values we determine at all (i.e. half the smallest
value is still the smallest value!)
Minimizing over θ0/θ1 means we get the values of θ0 and θ1 which give, on average, the minimal deviation of the prediction h(x) from y when we
use those parameters in our hypothesis function
More cleanly, this is a cost function
Cost function - a way to, using your training data, determine values for your θ parameters which make the hypothesis as accurate as
possible
This cost function is also called the squared error cost function
This cost function is reasonable choice for most regression functions
Probably most commonly used function
In case J(θ0,θ1) is a bit abstract, going into what it does, why it works and how we use it in the coming sections
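To make this concrete, here's a minimal Octave sketch of the squared error cost function (a sketch under assumed names: X is an [m x n+1] design matrix with a leading column of 1s, y is [m x 1], theta a column vector):
function J = computeCost(X, y, theta)
  m = length(y);                      % number of training examples
  predictions = X * theta;            % h_theta(x) for every example at once
  sqErrors = (predictions - y) .^ 2;  % squared deviation per example
  J = (1 / (2 * m)) * sum(sqErrors);  % the 1/2m average described above
end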
Let's consider some intuition about the cost function and why we want to use it
The cost function determines parameters
The value associated with the parameters determines how your hypothesis behaves - different parameter values generate
different hypotheses
Simplified hypothesis
Assumes θ0 = 0
Cost function and goal here are very similar to when we have θ0, but with a simpler parameter
Simplified hypothesis makes visualizing cost function J() a bit easier
The optimization objective for the learning algorithm is to find the value of θ1 which minimizes J(θ1)
So, here θ1 = 1 is the best value for θ1
We can see that the height (y) indicates the value of the cost function, so find where y is at a minimum
Each point (like the red one above) represents a pair of parameter values for Ɵ0 and Ɵ1
Our example here put the values at
θ0 = ~800
θ1 = ~-0.15
Not a good fit
i.e. these parameters give a value on our contour plot far from the center
If we have
θ0 = ~360
θ1 = 0
This gives a better hypothesis, but still not great - not in the center of the countour plot
Finally we find the minimum, which gives the best hypothesis
Doing this by eye/hand is a pain in the ass
What we really want is an efficient algorithm for finding the minimum for θ0 and θ1
Here we can see one initialization point led to one local minimum
The other led to a different one
Derivative term
If you implement the non-simultaneous update it's not gradient descent, and will behave weirdly
But it might look sort of right - so it's important to remember this!
To understand gradient descent, we'll return to a simpler function where we minimize one parameter to help explain the algorithm
in more detail
min θ1 J(θ1) where θ1 is a real number
Two key terms in the algorithm
Alpha
Derivative term
Notation nuances
Partial derivative vs. derivative
Use partial derivative when we have multiple variables but only differentiate with respect to one
Use derivative when we differentiate with respect to all the variables
Derivative term
Derivative says
Let's take the tangent at the point and look at the slope of the line
If we're to the left of the minimum the slope is negative, so θ1 := θ1 - α·(negative number) increases θ1 towards the minimum
(alpha is always positive)
Similarly, if we're to the right of the minimum the slope is positive, so θ1 decreases - either way J(θ1) gets smaller
Alpha term (α)
What happens if alpha is too small or too large
Too small
Take baby steps
Takes too long
Too large
Can overshoot the minimum and fail to converge
When you get to a local minimum
Gradient of tangent/derivative is 0
So derivative term = 0
alpha * 0 = 0
So θ1 = θ1 - 0
So θ1 remains the same
As you approach the global minimum the derivative term gets smaller, so your update gets smaller, even with alpha fixed
Means as the algorithm runs you take smaller steps as you approach the minimum
So no need to change alpha over time
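A minimal Octave sketch of batch gradient descent built from the update rule just described (assumed names; X is a design matrix with a leading column of 1s):
function theta = gradientDescent(X, y, theta, alpha, num_iters)
  m = length(y);
  for iter = 1:num_iters
    errors = X * theta - y;           % h_theta(x) - y for every example
    grad = (1 / m) * (X' * errors);   % vector of partial derivatives
    theta = theta - alpha * grad;     % simultaneous update of every theta
  end
end
Because the derivative shrinks near the minimum, the same fixed alpha produces smaller and smaller steps, exactly as described above.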
Matrices - overview
Rectangular array of numbers written between square brackets
2D array
Named as capital letters (A,B,X,Y)
Dimension of a matrix are [Rows x Columns]
Start at top left
To bottom left
To bottom right
R[r x c] means a matrix which has r rows and c columns
Is a [4 x 2] matrix
Matrix elements
A(i,j) = entry in ith row and jth column
Vectors - overview
Is an n by 1 matrix
Usually referred to as a lower case letter
n rows
1 column
e.g.
Is a 4 dimensional vector
Refer to this as a vector R4
Vector elements
vi = ith element of the vector
Vectors can be 0-indexed (C++) or 1-indexed (MATLAB)
In math 1-indexed is most common
But in machine learning 0-index is useful
Normally assume using 1-index vectors, but be aware sometimes these will (explicitly) be 0 index ones
Matrix manipulation
Addition
Add up elements one at a time
Can only add matrices of the same dimensions
Creates a new matrix of the same dimensions of the ones added
Multiplication by scalar
Scalar = real number
Multiply each element by the scalar
Generates a matrix of the same size as the original matrix
Division by a scalar
Same as multiplying the matrix by the reciprocal of the scalar (e.g. dividing by 4 = multiplying by 1/4)
Each element is divided by the scalar
Combination of operands
Evaluate multiplications first
Detailed explanation
A*x=y
A is m x n matrix
x is n x 1 matrix
n must match between vector and matrix
i.e. inner dimensions must match
Result is an m-dimensional vector
To get yi - multiply A's ith row with all the elements of vector x and add them up
Neat trick
Say we have a data set with four values
Say we also have a hypothesis hθ(x) = -40 + 0.25x
Create your data as a matrix which can be multiplied by a vector
Have the parameters in a vector which your matrix can be multiplied by
Means we can do
Prediction = Data Matrix * Parameters
Here we add an extra column to the data with 1s - this means our θ0 values can be calculated and expressed
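A sketch of this trick in Octave, using the hypothesis hθ(x) = -40 + 0.25x from above (the four house sizes are made-up values for illustration):
sizes = [2104; 1416; 1534; 852];  % hypothetical house sizes
X = [ones(4, 1), sizes];          % prepend the column of 1s so theta0 is included
theta = [-40; 0.25];              % [theta0; theta1]
predictions = X * theta           % one matrix-vector product gives all four predictions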
Matrix-matrix multiplication
General idea
Step through the second matrix one column at a time
Multiply each column vector from second matrix by the entire first matrix, each time generating a vector
The final product is these vectors combined (not added or summed, but literally just put together)
Details
AxB=C
A = [m x n]
B = [n x o]
C = [m x o]
With vector multiplications o = 1
Can only multiply matrix where columns in A match rows in B
Mechanism
Take column 1 of B, treat as a vector
Multiply A by that column - generates an [m x 1] vector
Repeat for each column in B
There are o columns in B, so we get o columns in C
Summary
The ith column of matrix C is obtained by multiplying A with the ith column of B
Start with an example
AxB
Initially
Take matrix A and multiply by the first column vector from B
Take the matrix A and multiply by the second column vector from B
Implementation/use
House prices, but now we have three hypotheses and the same data set
To apply all three hypotheses to all the data we can do this efficiently using matrix-matrix multiplication
Have
Data matrix
Parameter matrix
Example
Four houses, where we want to predict the price
Three competing hypotheses
Because our hypotheses are one variable, to make the matrices match up we make our data (house sizes) vector into a 4x2
matrix by adding an extra column of 1s
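Sketch of the same thing in Octave (the second and third hypotheses' parameter values are made up for illustration):
X = [ones(4, 1), sizes];     % [4 x 2] design matrix, sizes as before
Theta = [-40, 200, -150;     % row 1: the three theta0 values
         0.25, 0.1, 0.4];    % row 2: the three theta1 values
predictions = X * Theta;     % [4 x 3]: column j holds hypothesis j's predictions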
Identity matrix - has the property that any matrix A which can be multiplied by an identity matrix gives you matrix A back
So if A is [m x n] then
A * I, where I = [n x n]
I * A, where I = [m x m]
(To make inside dimensions match to allow multiplication)
Identity matrix dimensions are implicit
Remember that matrix multiplication is not commutative: AB != BA
Except when B is the identity matrix
Then AB == BA
Matrix transpose
Have matrix A (which is [n x m]) - how do you change it to become [m x n] while keeping the same values?
i.e. swap rows and columns!
How you do it:
Take first row of A - becomes 1st column of AT
Second row of A - becomes 2nd column...
A is an m x n matrix
B is a transpose of A
Then B is an n x m matrix
A(i,j) = B(j,i)
04: Linear Regression with Multiple Variables
θT is a [1 x n+1] matrix
In other words, because θ is a column vector, the transposition operation transforms it into a
row vector
So before
θ was a matrix [n + 1 x 1]
Now
θT is a matrix [1 x n+1]
Which means the inner dimensions of θT and X match, so they can be multiplied together as
[1 x n+1] * [n+1 x 1]
= hθ(x)
So, in other words, the transpose of our parameter vector * an input example X gives you a
predicted hypothesis which is [1 x 1] dimensions (i.e. a single value)
This x0 = 1 lets us write this like this
This is an example of multivariate linear regression
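As a tiny Octave sketch of this inner product (parameter and feature values are made up for illustration):
theta = [80; 0.1; 0.01];  % [n+1 x 1] column vector
x = [1; 2104; 3];         % x0 = 1 prepended to one example's features
h = theta' * x;           % [1 x n+1] * [n+1 x 1] = a single predicted value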
Similarly, instead of thinking of J as a function of the n+1 numbers, J() is just a function of the parameter vector
J(θ)
Gradient descent
Above, we have slightly different update rules for θ0 and θ1
Actually they're the same - the second rule just includes x0 (defined as 1), which previously wasn't shown
We now have an almost identical rule for multivariate gradient descent
Pathological input to gradient descent
So we need to rescale this input so it's more effective
So, if you divide each value of x1 and x2 by the max for each feature
Contours become more like circles (as scaled between 0 and 1)
May want to get everything into -1 to +1 range (approximately)
Want to avoid large ranges, small ranges or very different ranges from one another
Rule of thumb regarding acceptable ranges
-3 to +3 is generally fine - any bigger bad
-1/3 to +1/3 is ok - any smaller bad
Can do mean normalization
Take a feature xi
Replace it by (xi - mean)/max
So your values all have an average of about 0
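A minimal Octave sketch of mean normalization (applied to the raw feature columns, before adding the x0 = 1 column; names are illustrative):
mu = mean(X);            % [1 x n] vector of per-feature means
s = max(X) - min(X);     % per-feature range (std(X) also works)
X_norm = (X - mu) ./ s;  % broadcasts row-wise; features now centred near 0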
Learning Rate α
Focus on the learning rate (α)
Topics
Update rule
Debugging
How to choose α
But you overshoot, so reduce learning rate so you actually reach the minimum (green line)
Example
House price prediction
Two features
Frontage - width of the plot of land along road (x1)
Depth - depth away from road (x2)
You don't have to use just two features
Can create new features
Might decide that an important feature is the land area
So, create a new feature = frontage * depth (x3)
h(x) = θ0 + θ1x3
Area is a better indicator
Often, by defining new features you may get a better model
Polynomial regression
May fit the data better
θ0 + θ1x + θ2x² - e.g. here we have a quadratic function
For housing data could use a quadratic function
But it may not fit the data so well - the quadratic's turning point means housing prices would decrease when size gets really
big
So instead must use a cubic function
For some linear regression problems the normal equation provides a better solution
So far we've been using gradient descent
Iterative algorithm which takes steps to converge
Normal equation solves θ analytically
Solve for the optimum value of theta
Has some advantages and disadvantages
Here
m=4
n=4
To implement the normal equation
Take examples
Add an extra column (x0 feature)
Construct a matrix (X - the design matrix) which contains all the training data features in an [m x n+1]
matrix
Do something similar for y
Construct a column vector y vector [m x 1] matrix
Using the following equation: θ = (XT X)-1 XT y - i.e. (X transpose times X) inverse, times X transpose, times y
If you compute this, you get the value of theta which minimizes the cost function
General case
Vector y
Created by taking all the y values into a column vector
pinv(X'*X)*X'*y
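Putting it together, a sketch in Octave (assuming the raw feature values sit in a matrix called features):
X = [ones(m, 1), features];     % [m x n+1] design matrix with the x0 column
theta = pinv(X' * X) * X' * y;  % normal equation: solves for theta in one step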
When should you use gradient descent and when should you use the normal equation?
Gradient descent
Need to choose learning rate
Needs many iterations - could make it slower
Works well even when n is massive (millions)
Better suited to big data
What is a big n though?
100 or even a 1000 is still (relatively) small
If n is 10 000 then look at using gradient descent
Normal equation
No need to choose a learning rate
No need to iterate, check for convergence etc.
Normal equation needs to compute (XT X)-1
This is the inverse of an n x n matrix
With most implementations computing a matrix inverse grows by O(n³)
So not great
Slow if n is large
Can be much slower
06: Logistic Regression
Classification
Where y is a discrete value
Develop the logistic regression algorithm to determine what class a new input
should fall into
Classification problems
Email -> spam/not spam?
Online transactions -> fraudulent?
Tumor -> Malignant/benign
Variable in these problems is Y
Y is either 0 or 1
0 = negative class (absence of something)
1 = positive class (presence of something)
Start with binary class problems
Later look at multiclass classification problem, although this is just an extension of
binary classification
How do we develop a classification algorithm?
Tumour size vs malignancy (0 or 1)
We could use linear regression
Then threshold the classifier output (i.e. anything over some value is yes, else
no)
In our example below linear regression with thresholding seems to work
We can see above this does a reasonable job of stratifying the data points into one of two
classes
But what if we had a single Yes example with a very large tumour
This would shift the fitted line, and the threshold would now misclassify some of the existing yeses as nos
Another issue with linear regression
We know Y is 0 or 1
Hypothesis can give values larger than 1 or less than 0
So, logistic regression generates a value which always lies between 0 and 1
Logistic regression is a classification algorithm - don't be confused
Hypothesis representation
When our hypothesis (hθ(x)) outputs a number, we treat that value as the estimated probability
that y=1 on input x
Example
If X is a feature vector with x0 = 1 (as always) and x1 = tumourSize
hθ(x) = 0.7
Tells a patient they have a 70% chance of a tumor being malignant
We can write this using the following notation
hθ(x) = P(y=1|x ; θ)
What does this mean?
Probability that y=1, given x, parameterized by θ
Since this is a binary classification task we know y = 0 or 1
So the following must be true
P(y=1|x ; θ) + P(y=0|x ; θ) = 1
P(y=0|x ; θ) = 1 - P(y=1|x ; θ)
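A minimal Octave sketch of the sigmoid (logistic) function the hypothesis is built on:
function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));  % elementwise, so z can be a scalar, vector or matrix
end
% hypothesis for one example: h = sigmoid(theta' * x)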
Decision boundary
Gives a better sense of what the hypothesis function is computing
Better understanding of what the hypothesis function looks like
One way of using the sigmoid function is:
When the probability of y being 1 is greater than 0.5 then we can predict y = 1
Else we predict y = 0
When is it exactly that hθ(x) is greater than 0.5?
Look at sigmoid function
g(z) is greater than or equal to 0.5 when z is greater than or equal to 0
Decision boundary
Means we can build more complex decision boundaries by fitting complex parameters to
this (relatively) simple hypothesis
More complex decision boundaries?
By using higher order polynomial terms, we can get even more
complex decision boundaries
Which, appropriately, is the sum of all the individual costs over the training
data (i.e. the same as linear regression)
To further simplify it we can get rid of the superscripts
So
To get around this we need a different, convex Cost() function which means we can
apply gradient descent
This result is
p(y=1 | x ; θ)
Probability y = 1, given x, parameterized by θ
If you had n features, you would have an n+1 column vector for θ
This equation is the same as the linear regression rule
The only difference is that our definition for the hypothesis has changed
Previously, we spoke about how to monitor gradient descent to check it's working
Can do the same thing here for logistic regression
When implementing logistic regression with gradient descent, we have to update all
the θ values (θ0 to θn) simultaneously
Could use a for loop
Better would be a vectorized implementation
Feature scaling for gradient descent for logistic regression also applies here
Advanced optimization
Previously we looked at gradient descent for minimizing the cost function
Here look at advanced concepts for minimizing the cost function for logistic regression
Good for large machine learning problems (e.g. huge feature set)
What is gradient descent actually doing?
We have some cost function J(θ), and we want to minimize it
We need to write code which can take θ as input and compute the following
J(θ)
Partial derivative of J(θ) with respect to θj (where j = 0 to n)
Given code that can do these two things
Gradient descent repeatedly does the following update
Example above
θ1 and θ2 (two parameters)
Cost function here is J(θ) = (θ1 - 5)² + (θ2 - 5)²
The derivative of J(θ) with respect to either θ1 or θ2 turns out to be 2(θi - 5)
First we need to define our cost function, which should have the following signature
Input for the cost function is THETA, which is a vector of the θ parameters
Two return values from costFunction are
jval
How we compute the cost function J(θ)
In this case = (θ1 - 5)² + (θ2 - 5)²
gradient
2 by 1 vector
2 elements are the two partial derivative terms
i.e. this is an n-dimensional vector
Each indexed value gives the partial derivative of J(θ) with respect to θi
Where i is the index position in the gradient vector
With the cost function implemented, we can call the advanced algorithm using
Here
options is a data structure giving options for the algorithm
fminunc
function to minimize the cost function (find the minimum of an unconstrained
multivariable function)
@costFunction is a pointer to the costFunction function to be used
For the octave implementation
initialTheta must be a matrix of at least two dimensions
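A sketch of the pieces just described, in Octave (this mirrors the worked example above; the exact option values are illustrative):
function [jVal, gradient] = costFunction(theta)
  jVal = (theta(1) - 5) ^ 2 + (theta(2) - 5) ^ 2;  % the cost itself
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);                % dJ/dtheta1
  gradient(2) = 2 * (theta(2) - 5);                % dJ/dtheta2
end

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);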
Here
theta is a n+1 dimensional column vector
Octave indexes from 1, not 0
Write a cost function which captures the cost function for logistic regression
Given a dataset with three classes, how do we get a learning algorithm to work?
Use one vs. all classification to make binary classification work for multiclass
classification
One vs. all classification
Split the training set into three separate binary classification problems
i.e. create a new fake training set
Triangle (1) vs crosses and squares (0): hθ(1)(x)
P(y=1 | x; θ(1))
Crosses (1) vs triangles and squares (0): hθ(2)(x)
P(y=1 | x; θ(2))
Squares (1) vs crosses and triangles (0): hθ(3)(x)
P(y=1 | x; θ(3))
Overall
Train a logistic regression classifier hθ(i)(x) for each class i to predict the probability
that y = i
On a new input, x to make a prediction, pick the class i that maximizes the
probability that hθ(i)(x) = 1
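A sketch of that prediction step in Octave (assuming all_theta is a [K x n+1] matrix with one row of learned parameters per class, plus the sigmoid helper from earlier):
probs = sigmoid(X * all_theta');       % [m x K] matrix: confidence for each class k
[~, predictions] = max(probs, [], 2);  % per row, the index of the most confident class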
07: Regularization
To recap, if we have too many features then the learned hypothesis may give a cost function of
exactly zero
But this tries too hard to fit the training set
Fails to provide a general solution - unable to generalize (apply to new examples)
Overfitting with logistic regression
Addressing overfitting
The addition in blue is a modification of our cost function to help penalize θ3 and θ4
So here we end up with θ3 and θ4 being close to zero (because the constants are massive)
So we're basically left with a quadratic function
Regularization
Small values for parameters corresponds to a simpler hypothesis (you effectively get rid
of some of the terms)
A simpler hypothesis is less prone to overfitting
Another example
Have 100 features x1, x2, ..., x100
Unlike the polynomial example, we don't know which terms are the high order ones
How do we pick which ones to shrink?
With regularization, take cost function and modify it to shrink all the parameters
Add a term at the end
This regularization term shrinks every parameter
By convention you don't penalize θ0 - minimization is from θ1 onwards
Previously, gradient descent would repeatedly update the parameters θj, where j = 0,1,2...n
simultaneously
Shown below
The term
We saw earlier that logistic regression can be prone to overfitting with lots of features
Logistic regression cost function is as follows;
Again, to modify the algorithm we simply need to modify the update rule for θ1 onwards
Looks cosmetically the same as linear regression, except obviously the hypothesis is very
different
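A minimal Octave sketch of the regularized logistic regression cost and gradient (assumes X is [m x n+1] with x0 = 1, the sigmoid helper from earlier, and that theta(1) plays the role of θ0, which is not penalized):
h = sigmoid(X * theta);
reg = (lambda / (2 * m)) * sum(theta(2:end) .^ 2);
J = (-1 / m) * sum(y .* log(h) + (1 - y) .* log(1 - h)) + reg;
grad = (1 / m) * (X' * (h - y));
grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);  % skip theta0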
use fminunc
Pass it an @costFunction argument
Minimizes in an optimized manner using the cost function
jVal
Need code to compute J(θ)
Need to include regularization term
Gradient
Needs to be the partial derivative of J(θ) with respect to θi
Adding the appropriate term here is also necessary
08: Neural Networks - Representation
Neural networks (NNs) were originally motivated by looking at machines which replicate
the brain's functionality
Looked at here as a machine learning technique
Origins
To build learning systems, why not mimic the brain?
Used a lot in the 80s and 90s
Popularity diminished in late 90s
Recent major resurgence
NNs are computationally expensive, so only recently have large scale neural networks
become computationally feasible
Brain
Does loads of crazy things
Hypothesis is that the brain has a single learning algorithm
Evidence for hypothesis
Auditory cortex --> takes sound signals
If you cut the wiring from the ear to the auditory cortex
Re-route optic nerve to the auditory cortex
Auditory cortex learns to see
Somatosensory cortex (touch processing)
If you re-route the optic nerve to the somatosensory cortex then it learns to see
With different tissue learning to see, maybe they all learn in the same way
Brain learns by itself how to learn
Other examples
Seeing with your tongue
Brainport
Grayscale camera on head
Run wire to array of electrodes on tongue
Pulses onto tongue represent image signal
Lets people see with their tongue
Human echolocation
Blind people being trained in schools to interpret sound and echo
Lets them move around
Haptic belt direction sense
Belt which buzzes towards north
Gives you a sense of direction
Brain can process and learn from data from any source
Model representation I
How do we represent neural networks (NNs)?
Neural networks were developed as a way to simulate networks of neurones
What does a neurone look like
For example
Ɵ131 means:
1 - we're mapping to node 1 in layer l+1
3 - we're mapping from node 3 in layer l
1 - we're mapping from layer 1
Model representation II
Here we'll look at how to carry out the computation efficiently through a vectorized
implementation. We'll also consider
why NNs are good and how we can use them to learn complex non-linear things
z2 is a 3x1 vector
We can vectorize the computation of the neural network as as follows in two steps
z2 = Ɵ(1)x
i.e. Ɵ(1) is the matrix defined above
x is the feature vector
a(2) = g(z(2))
Example on the right shows a simplified version of the more complex problem we're dealing
with (on the left)
We want to learn a non-linear decision boundary to separate the positive and negative
examples
y = x1 XOR x2
x1 XNOR x2
Positive examples when both are true or both are false
Let's start with something a little more straight forward...
Don't worry about how we're determining the weights (Ɵ values) for now - just get
a flavor of how NNs work
Can we get a one-unit neural network to compute this logical AND function? (probably...)
Add a bias unit
Add some weights for the networks
What are weights?
Weights are the parameter values which multiply into the input nodes (i.e. Ɵ)
So, as we can see, when we evaluate each of the four possible inputs, only (1,1) gives a positive
output
Negation is achieved by putting a large negative weight in front of the variable you want to
negate
Simplez!
Multiclass classification
Multiclass classification is, unsurprisingly, when you distinguish between more than two
categories (i.e. more than 1 or 0)
With the handwritten digit recognition problem - 10 possible categories (0-9)
How do you do that?
Done using an extension of one vs. all classification
Recognizing pedestrian, car, motorbike or truck
Build a neural network with four output units
Output a vector of four numbers
1 is 0/1 pedestrian
2 is 0/1 car
3 is 0/1 motorcycle
4 is 0/1 truck
When image is a pedestrian get [1,0,0,0] and so on
Just like one vs. all described earlier
Here we have four logistic regression classifiers
09: Neural Networks - Learning
So here
L = 4 (number of layers)
s1 = 3
s2 = 5
s3 = 5
s4 = 4
For neural networks our cost function is a generalization of this equation above, so instead of one output we generate k outputs
There are basically two halves to the neural network logistic regression cost function
First half
Second half
This is a massive regularization summation term, which I'm not going to walk through, but it's a
fairly straightforward triple nested summation
This is also called a weight decay term
As before, the lambda value determines the relative importance of the two halves
Remember that the partial derivative term we calculate above is a REAL number (not a vector or a matrix)
Ɵ is the input parameters
Ɵ1 is the matrix of weights which define the function mapping from layer 1 to layer 2
Ɵ101 is the real number parameter which you multiply the bias unit (i.e. 1) with for the bias unit input into the
first unit in the second layer
Ɵ111 is the real number parameter which you multiply the first (real) unit with for the first input into the first
unit in the second layer
Ɵ211 is the real number parameter which you multiply the first (real) unit with for the first input into the second
unit in the second layer
As discussed, in Ɵijl:
i here represents the unit in layer l+1 you're mapping to (destination node)
j is the unit in layer l you're mapping from (origin node)
l is the layer you're mapping from (to layer l+1) (origin layer)
NB
The terms destination node, origin node and origin layer are terms I've made up!
So - this partial derivative term is
The partial derivative of a 3-way indexed dataset with respect to a real number (which is one of the values in that
dataset)
Gradient computation
One training example
Imagine we just have a single pair (x,y) as our entire training set
How would we deal with this example?
The forward propagation algorithm operates as follows
Layer 1
a1 = x
z2 = Ɵ1a1
Layer 2
a2 = g(z2) (add a02)
z3 = Ɵ2a2
Layer 3
a3 = g(z3) (add a03)
z4 = Ɵ3a3
Output
a4 = hƟ(x) = g(z4)
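A sketch of those forward propagation steps in Octave for one example (assumes weight matrices Theta1/Theta2/Theta3 and the sigmoid helper; bias units are prepended explicitly):
a1 = [1; x];                              % input plus bias unit
z2 = Theta1 * a1;  a2 = [1; sigmoid(z2)]; % layer 2 activations plus bias
z3 = Theta2 * a2;  a3 = [1; sigmoid(z3)]; % layer 3 activations plus bias
z4 = Theta3 * a3;  h = sigmoid(z4);       % a4 = h_theta(x)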
And if we take a second to consider the vector dimensionality (with our example above [3-5-5-4])
Ɵ3 is a matrix which is [4 x 5] (if we don't include the bias term; [4 x 6] if we do)
(Ɵ3)T is therefore a [5 x 4] matrix
δ4 is a [4 x 1] vector
So when we multiply a [5 X 4] matrix with a [4 X 1] vector we get a [5 X 1] vector
Which, lo and behold, is the same dimensionality as the a3 vector, meaning we can run our pairwise multiplication
Why do we do this?
We do all this to get all the δ terms, and we want the δ terms because through a very complicated derivation you can use δ to get the
partial derivative of Ɵ with respect to individual parameters (if you ignore regularization, or regularization is 0, which we deal with
later)
∂J(Ɵ)/∂Ɵijl = ajl δi(l+1)
By doing back propagation and computing the delta terms you can then compute the partial derivative terms
We need the partial derivatives to minimize the cost function!
i.e. for each example in the training set (dealing with each example as (x,y))
Set a1 (activation of input layer) = xi
Perform forward propagation to compute al for each layer (l = 1,2, ... L)
i.e. run forward propagation
Then, use the output label for the specific example we're looking at to calculate δL where δL = aL - yi
So we initially calculate the delta value for the output layer
Then, using back propagation we move back through the network from layer L-1 down to layer 2
Finally, use Δ to accumulate the partial derivative terms
Note here
l = layer
j = node in that layer
i = the error of the affected node in the target layer
You can vectorize the Δ expression too, as
Finally
After executing the body of the loop, exit the for loop and compute
The sigmoid function applied to the z values gives the activation values
Below we show exactly how the z value is calculated for an example
Back propagation
This function cycles over each example, so the cost for one example really boils down to this
Which, we can think of as a sigmoidal version of the squared difference (check out the derivation if you don't believe me)
So, basically saying, "how well is the network doing on example i "?
We can think about a δ term on a unit as the "error" of cost for the activation value associated with a unit
More formally (don't worry about this...), δ is
Looking at another example to see how we actually calculate the delta value;
So, in effect,
Back propagation calculates the δ, and those δ values are the weighted sum of the next layer's delta values, weighted by
the parameter associated with the links
Forward propagation calculates the activation (a) values, which are the weighted sums of the previous layer's activations passed through the sigmoid
Depending on how you implement you may compute the delta values of the bias values
However, these aren't actually used, so it's a bit inefficient, but not a lot more!
Example
Use the thetaVec = [ Theta1(:); Theta2(:); Theta3(:)]; notation to unroll the matrices into a long vector
To go back you use
Theta1 = reshape(thetaVec(1:110), 10, 11)
Gradient checking
Backpropagation has a lot of details, small bugs can be present and ruin it :-(
This may mean it looks like J(Ɵ) is decreasing, but in reality it may not be decreasing by as much as it should
So using a numeric method to check the gradient can help diagnose a bug
Gradient checking helps make sure an implementation is working correctly
Example
Have a function J(Ɵ)
Estimate derivative of function at point Ɵ (where Ɵ is a real number)
How?
Numerically
Compute Ɵ + ε
Compute Ɵ - ε
Join them by a straight line
Use the slope of that line as an approximation to the derivative
So, in Octave we use the following code to numerically compute the derivatives
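The original code isn't reproduced in these notes, but a sketch of the two-sided difference just described looks like this (assumes J is a function handle computing the cost):
EPSILON = 1e-4;
n = length(theta);
gradApprox = zeros(n, 1);
for i = 1:n
  thetaPlus = theta;   thetaPlus(i) = thetaPlus(i) + EPSILON;   % bump one element up
  thetaMinus = theta;  thetaMinus(i) = thetaMinus(i) - EPSILON; % and one element down
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
end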
So on each loop thetaPlus = theta except for thetaPlus(i)
Resets thetaPlus on each loop
Create a vector of partial derivative approximations
Using the vector of gradients from backprop (DVec)
Check that gradApprox is basically equal to DVec
Gives confidence that the backprop implementation is correct
Implementation note
Implement back propagation to compute DVec
Implement numerical gradient checking to compute gradApprox
Check they're basically the same (up to a few decimal places)
Before using the code for learning turn off gradient checking
Why?
The gradApprox stuff is very computationally expensive
In contrast backprop is much more efficient (just more fiddly)
Random initialization
Pick random small initial values for all the theta values
If you start them all at zero (which does work for linear regression) then the algorithm fails - all activation values for each layer
are the same
So choose random values!
Between 0 and 1, then scale by epsilon (where epsilon is a constant)
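A sketch of that initialization in Octave (the 10x11 shape and the epsilon value are illustrative):
INIT_EPSILON = 0.12;
Theta1 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;  % values in [-eps, +eps]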
for i = 1:m {
Forward propagation on (xi, yi) --> get activation (a) terms
Back propagation on (xi, yi) --> get delta (δ) terms
Compute Δ(l) := Δ(l) + δ(l+1)(a(l))T
}
Notes on implementation
Usually done with a for loop over training examples (for forward and back propagation)
Can be done without a for loop, but this is a much more complicated way of doing things
Be careful
2.5) Use gradient checking to compare the partial derivatives computed using the above algorithm and numerical estimation of the gradient of J(Ɵ)
Disable the gradient checking code for when you actually run it
2.6) Use gradient descent or an advanced optimization method with back propagation to try to minimize J(Ɵ) as a function of
parameters Ɵ
Here J(Ɵ) is non-convex
Can be susceptible to local minimum
In practice this is not usually a huge problem
Can't guarantee it will find the global optimum, but it should find a good local optimum at least
e.g. above pretending data only has two features to easily display what's going on
Our minimum here represents a hypothesis output which is pretty close to y
If you took one of the peaks hypothesis is far from y
Gradient descent will start from some random point and move downhill
Back propagation calculates gradient down that hill
10: Advice for applying Machine Learning
So, say you've implemented regularized linear regression to predict housing prices
Trained it
But, when you test on new data you find it makes unacceptably large errors in its predictions
:-(
What should you try next?
There are many things you can do;
Get more training data
Sometimes more data doesn't help
Often it does though, although you should always do some preliminary testing to make sure more data will
actually make a difference (discussed later)
Try a smaller set of features
Carefully select small subset
You can do this by hand, or use some dimensionality reduction technique (e.g. PCA - we'll get to this later)
Try getting additional features
Sometimes this isn't helpful
LOOK at the data
Can be very time consuming
Adding polynomial features
You're grasping at straws, aren't you...
Building your own, new, better features based on your knowledge of the problem
Can be risky if you accidentally over fit your data by creating new features which are inherently specific/relevant
to your training data
Try decreasing or increasing λ
Change how important the regularization term is in your calculations
These changes can become MAJOR projects/headaches (6 months +)
Sadly, the most common method for choosing one of these avenues is to go by gut feeling (randomly)
Many times, see people spend huge amounts of time only to discover that the avenue is fruitless
No apples, pears, or any other fruit. Nada.
There are some simple techniques which can let you rule out half the things on the list
Save you a lot of time!
Machine learning diagnostics
Tests you can run to see what is/what isn't working for an algorithm
See what you can change to improve an algorithm's performance
These can take time to implement and understand (a week or so)
But, they can also save you spending months going down an avenue which will never work
Evaluating a hypothesis
When we fit parameters to training data, try and minimize the error
We might think a low error is good - doesn't necessarily mean a good parameter set
Could, in fact, be indicative of overfitting
This means your model will fail to generalize
How do you tell if a hypothesis is overfitting?
Could plot hθ(x)
But with lots of features may be impossible to plot
Standard way to evaluate a hypothesis is
Split data into two portions
1st portion is training set
2nd portion is test set
Typical split might be 70:30 (training:test)
i.e. it's the fraction of the test set the hypothesis mislabels
These are the standard techniques for evaluating a learned hypothesis
So
Minimize cost function for each of the models as before
Test these hypotheses on the cross validation set to generate the cross validation error
Pick the hypothesis with the lowest cross validation error
e.g. pick θ5
Finally
Estimate generalization error of model using the test set
Final note
In machine learning as practiced today - many people will select the model using the test set and then check the model is
OK for generalization using the test error (which we've said is bad because it gives a biased estimate)
With a MASSIVE test set this is maybe OK
But considered much better practice to have separate training and validation sets
The equation above describes fitting a high order polynomial with regularization (used to keep parameter values small)
Consider three cases
λ = large
All θ values are heavily penalized
So most parameters end up being close to zero
So hypothesis ends up being close to 0
So high bias -> under fitting data
λ = intermediate
Only intermediate values give a fit which is reasonable
λ = small
Lambda = 0
So we make the regularization term 0
So high variance -> Get overfitting (minimal regularization means it obviously doesn't do what it's meant to)
Define cross validation error and test set errors as before (i.e. without regularization term)
So they are 1/2 average squared error of various sets
Choosing λ
Have a set or range of values to use
Often increment by factors of 2 so
model(1)= λ = 0
model(2)= λ = 0.01
model(3)= λ = 0.02
model(4) = λ = 0.04
model(5) = λ = 0.08
.
.
.
model(p) = λ = 10
This gives a number of models which have different λ
With these models
Take each one (the pth)
Minimize the cost function
This will generate some parameter vector
Call this θ(p)
So now we have a set of parameter vectors corresponding to models with different λ values
Take all of the hypothesis and use the cross validation set to validate them
Measure average squared error on cross validation set
Pick the model which gives the lowest error
Say we pick θ(5)
Finally, take the one we've selected (θ(5)) and test it with the test set
Bias/variance as a function of λ
Plot λ vs.
Jtrain
When λ is small you get a small value (regularization basically goes to 0)
When λ is large you get a large value corresponding to high bias
Jcv
When λ is small we see high variance
Too small a value means we overfit the data
When λ is large we end up underfitting, so this is bias
So cross validation error is high
Such a plot can help show you you're picking a good value for λ
Learning curves
A learning curve is often useful to plot for algorithmic sanity checking or improving performance
What is a learning curve?
Plot Jtrain (average squared error on training set) or Jcv (average squared error on cross validation set)
Plot against m (number of training examples)
m is normally fixed for your data set
So artificially reduce m and recalculate errors with the smaller training set sizes
Jtrain
Error on smaller sample sizes is smaller (as less variance to accommodate)
So as m grows error grows
Jcv
Error on cross validation set
When you have a tiny training set you generalize badly
But as the training set grows your hypothesis generalizes better
So cv error will decrease as m increases
So if an algorithm is already suffering from high variance, more data will probably help
Maybe
These are clean curves
In reality the curves you get are far dirtier
But, learning curve plotting can help diagnose the problems your algorithm will be suffering from
Try adding additional features --> fixes high bias (because hypothesis is too simple, make hypothesis more
specific)
11: Machine Learning System Design
Choose 100 words which are indicative of an email being spam or not spam
Spam --> e.g. buy, discount, deal
Non spam --> Andrew, now
All these words go into one long vector
In practice it's more common to have a training set and pick the n most frequent words,
where n is 10,000 to 50,000
So here you're not specifically choosing your own features, but you are choosing
how you select them from the training set data
Error analysis
When faced with a ML problem you'll have lots of ideas for how to improve it
Talk about error analysis - how to better make decisions
If you're building a machine learning system often good to start by building a simple algorithm
which you can implement quickly
Spend at most 24 hours developing an initially bootstrapped algorithm
Implement and test on cross validation data
Plot learning curves to decide if more data, features etc will help algorithmic optimization
Hard to tell in advance what is important
Learning curves really help with this
Way of avoiding premature optimization
We should let evidence guide decision making regarding development trajectory
Error analysis
Manually examine the samples (in cross validation set) that your algorithm made
errors on
See if you can work out why
Systematic patterns - help design new features to avoid these shortcomings
e.g.
Built a spam classifier with 500 examples in CV set
Here, error rate is high - gets 100 wrong
Manually look at 100 and categorize them depending on features
e.g. type of email
Looking at those emails
May find most common type of spam emails are pharmacy
emails, phishing emails
See which type is most common - focus your work on those ones
What features would have helped classify them correctly
e.g. deliberate misspelling
Unusual email routing
Unusual punctuation
May find some "spammer technique" is causing a lot of your misses
Guide a way around it
Importance of numerical evaluation
Have a way of numerically evaluating the algorithm
If you're developing an algorithm, it's really good to have some performance
calculation which gives a single real number to tell you how well its doing
e.g.
Say we're deciding if we should treat a set of similar words as the same word
This is done by stemming in NLP (e.g. "Porter stemmer" looks at
the etymological stem of a word)
This may make your algorithm better or worse
Also worth considering how to weight errors (false positive vs. false negative)
e.g. is a false positive really bad, or is it worth having a few to
improve performance a lot
Can use numerical evaluation to compare the changes
See if a change improves an algorithm or not
A single real number may be hard/complicated to compute
But makes it much easier to evaluate how changes impact your algorithm
You should do error analysis on the cross validation set instead of the test set
This curve can take many different shapes depending on classifier details
Is there a way to automatically choose the threshold?
Or, if we have a few algorithms, how do we compare different algorithms or parameter
sets?
How do we decide which of these algorithms is best?
We spoke previously about using a single real number evaluation metric
By switching to precision/recall we have two numbers
Now comparison becomes harder
Better to have just one number
How can we convert P & R into one number?
One option is the average - (P + R)/2
This is not such a good solution
Means if we have a classifier which predicts y = 1 all the time you get a
high recall and low precision
Similarly, if we predict y = 1 rarely we get high precision and low recall
So averages here would be 0.45, 0.4 and 0.51
0.51 is best, despite having a recall of 1 - i.e. predict y=1 for
everything
So average isn't great
F1Score (fscore)
F1 = 2 * (PR / (P + R))
Fscore is like taking the average of precision and recall giving a higher
weight to the lower value
Many formulas exist for combining precision and recall into a single comparable value
If P = 0 or R = 0 the Fscore = 0
If P = 1 and R = 1 then Fscore = 1
The remaining values lie between 0 and 1
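A sketch of the F-score in Octave, with the P = 0 / R = 0 edge case handled:
function f = fscore(P, R)
  if P + R == 0
    f = 0;                      % avoid 0/0 when both are zero
  else
    f = 2 * (P * R) / (P + R);  % dominated by the lower of P and R
  end
end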
Threshold offers a way to control trade-off between precision and recall
Fscore gives a single real number evaluation metric
If you're trying to automatically set the threshold, one way is to try a range of threshold
values and evaluate them on your cross validation set
Then pick the threshold which gives the best fscore.
Varied training set size and tried algorithms on a range of sizes
12: Support Vector Machines (SVMs)
Start with logistic regression, see how we can modify it to get the SVM
As before, the logistic regression hypothesis is as follows
Similarly, when y = 0
Then we hope hθ(x) is close to 0
With hθ(x) close to 0, (θT x) must be much less than 0
This is our classic view of logistic regression
Let's consider another way of thinking about the problem
Alternative view of logistic regression
If you look at cost function, each example contributes a term like the one below to the overall cost
function
For the overall cost function, we sum over all the training examples using the above function,
and have a 1/m term
If you then plug in the hypothesis definition (hθ(x)), you get an expanded cost function equation;
So each training example contributes that term to the cost function for logistic regression
Similarly
When y = 0
Do the equivalent with the y=0 function plot
If this looks unfamiliar it's because we previously had the minus sign outside the expression
For the SVM we take our two logistic regression y=1 and y=0 terms described previously and replace
with
cost1(θT x)
cost0(θT x)
6/16/2019 12_Support_Vector_Machines
So we get
Unlike logistic regression, hθ(x) doesn't give us a probability; instead we get a direct prediction of 1 or 0
So if θT x is equal to or greater than 0 --> hθ(x) = 1
Else --> hθ(x) = 0
Turns out when you solve this problem you get interesting decision boundaries
The green and magenta lines are functional decision boundaries which could be chosen by logistic
regression
But they probably don't generalize too well
The black line, by contrast, is the one chosen by the SVM because of the safety margin imposed by the
optimization objective
More robust separator
Mathematically, that black line has a
larger minimum distance (margin) from any of the training
examples
By separating with the largest margin you incorporate robustness into your decision making
process
We looked at this when C is very large
SVM is more sophisticated than the large margin view might suggest
If you were just using large margin then SVM would be very sensitive to outliers
A single outlier could have a ridiculous, huge impact on your classification boundary
A single example might not represent a good reason to change an algorithm
If C is very large then we do use this quite naive maximize-the-margin approach
Have two (2D) vectors u and v - what is the inner product (uT v)?
Plot u on graph
i.e. u1 vs. u2
So here p is negative
Use the vector inner product theory to try and understand SVMs a little better
Although we haven't been thinking about examples as vectors, they can be described as such
Now, say we have our parameter vector θ and we plot that on the same axis
The next question is what is the inner product of these two vectors
So, given we've redefined these functions let us now consider the training example below
Given this data, what boundary will the SVM choose? Note that we're still assuming θ0 = 0, which
means the boundary has to pass through the origin (0,0)
Green line - small margins
θ is always at 90 degrees to the decision boundary (can show with linear algebra,
although we're not going to!)
So now lets look at what this implies for the optimization objective
Look at first example (x1)
Now if you look at the projection of the examples to θ we find that p1 becomes large and ||θ|| can
become small
So with some values drawn in
This means that by choosing this second decision boundary we can make ||θ|| smaller
Which is why the SVM chooses this hypothesis as better
This is how we generate the large margin effect
These points l1, l2, and l3, were chosen manually and are called landmarks
Given x, define f1 as the similarity between (x, l1)
= exp(- ||x - l1||² / (2σ²))
||x - l1||² is the squared euclidean distance between the point x and the landmark l1
Discussed more later
If we remember our statistics, we know that
σ is the standard deviation
σ2 is commonly called the variance
Remember, that as discussed
So, f2 is defined as
f2 = similarity(x, l2) = exp(- ||x - l2||² / (2σ²))
And similarly
f3 = similarity(x, l3) = exp(- ||x - l3||² / (2σ²))
This similarity function is called a kernel
This function is a Gaussian Kernel
So, instead of writing similarity between x and l we might write
f1 = k(x, l1)
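A minimal Octave sketch of the Gaussian kernel for column vectors x and l:
function sim = gaussianKernel(x, l, sigma)
  sim = exp(-sum((x - l) .^ 2) / (2 * sigma ^ 2));  % 1 when x == l, -> 0 far away
end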
So let's see what these kernels do and why the functions defined make sense
Say x is close to a landmark
Then the squared distance will be ~0
So
We see here that as you move away from 3,5 the feature f1 falls to zero much more rapidly
The inverse can be seen if σ2 = 3
Given this definition, what kinds of hypotheses can we learn?
With training examples x we predict "1" when
θ0+ θ1f1+ θ2f2 + θ3f3 >= 0
For our example, let's say we've already run an algorithm and got the following values:
θ0 = -0.5
θ1 = 1
θ2 = 1
θ3 = 0
Given our placement of three examples, what happens if we evaluate an example at the
magenta dot below?
Looking at our formula, we know f1 will be close to 1, but f2 and f3 will be close to 0
So if we look at the formula we have
θ0+ θ1f1+ θ2f2 + θ3f3 >= 0
-0.5 + 1 + 0 + 0 = 0.5
0.5 is greater than or equal to 0, so we predict y = 1
If we had another point far away from all three landmarks, all the f values would be close to 0, so we'd predict y = 0
Inside we predict y = 1
Outside we predict y = 0
So this shows how we can create a non-linear decision boundary with landmarks and the kernel function in the support vector machine - a small sketch of this prediction step follows below
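As a hedged sketch of that prediction step (the θ values are the ones above; the landmark positions, σ², and the test point are made-up stand-ins for the figure):

gaussianKernel = @(x, l, sigma2) exp(-sum((x - l).^2) / (2 * sigma2));
theta = [-0.5; 1; 1; 0];                  % [θ0; θ1; θ2; θ3] from above
l1 = [3; 2]; l2 = [1; 4]; l3 = [5; 5];    % made-up landmark positions
x = [3; 2.2];                             % a point near l1 (the "magenta dot")
sigma2 = 1;
f = [1; gaussianKernel(x, l1, sigma2); ...
        gaussianKernel(x, l2, sigma2); ...
        gaussianKernel(x, l3, sigma2)];   % f0 = 1 plus the three kernel features
prediction = (theta' * f >= 0)            % 1, i.e. predict y = 1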
But
How do we get/choose the landmarks?
What other kernels can we use (other than the Gaussian kernel)?
Kernels II
Filling in missing detail and practical implications regarding kernels
Spoke about picking landmarks manually, defining the kernel, and building a hypothesis function
Where do we get the landmarks from?
For complex problems we probably want lots of them
Choosing a kernel
When should you use SVM and when is logistic regression more applicable
If n (features) is large vs. m (training set)
e.g. text classification problem
Feature vector dimension is 10 000
Training set is 10 - 1000
Then use logistic regression or SVM with a linear kernel
If n is small and m is intermediate
n = 1 - 1000
m = 10 - 10 000
Gaussian kernel is good
If n is small and m is large
n = 1 - 1000
m = 50 000+
SVM will be slow to run with the Gaussian kernel
In that case
Manually create or add more features
Use logistic regression or SVM with a linear kernel
Logistic regression and SVM with a linear kernel are pretty similar
Do similar things
Get similar performance
A lot of SVM's power is using different kernels to learn complex non-linear functions
For all these regimes a well designed NN should work
But, for some of these problems a NN might be slower - SVM well implemented would be faster
SVM has a convex optimization problem - so you get a global minimum
It's not always clear how to choose an algorithm
Often more important to get enough data
Designing new features
Debugging the algorithm
SVM is widely perceived as a very powerful learning algorithm
13: Clustering
Previous Next Index
Unsupervised learning - introduction
Talk about clustering
Learning from unlabeled data
Unsupervised learning
Useful to contrast with supervised learning
Compare and contrast
Supervised learning
Given a set of labels, fit a hypothesis to it
Unsupervised learning
Try to determine structure in the data
Clustering algorithm groups data together based on data features
What is clustering good for
Market segmentation - group customers into different market segments
Social network analysis - Facebook "smartlists"
Organizing computer clusters and data centers for network layout and location
Astronomical data analysis - Understanding galaxy formation
K-means algorithm
Want an algorithm to automatically group the data into coherent clusters
K-means is by far the most widely used clustering algorithm
Overview
Algorithm overview
1) Randomly allocate two points as the cluster centroids
Have as many cluster centroids as clusters you want to find (K cluster centroids, in fact)
In our example we just have two clusters
2) Cluster assignment step
Go through each example and depending on if it's closer to the red or blue centroid assign each point to one of
the two clusters
To demonstrate this, we've gone through the data and "coloured" each point red or blue
Loop 1
This inner loop repeatedly sets the c(i) variable to be the index of the cluster centroid closest to xi
i.e. take the ith example, measure the squared distance to each cluster centroid, and assign c(i) to the closest cluster
Loop 2
Loops over each centroid and calculates the mean of all the points assigned to that centroid in c(i) (both loops are sketched below)
What if there's a centroid with no data
Remove that centroid, so end up with K-1 classes
Or, randomly reinitialize it
Not sure when though...
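A minimal Octave sketch of the two loops just described (X is the [m x n] data matrix; K and the iteration count are assumptions):

K = 2;                                  % number of clusters we want
m = size(X, 1);
centroids = X(randperm(m, K), :);       % initialize centroids to K random examples
c = zeros(m, 1);
for iter = 1:10
  for i = 1:m                           % loop 1: cluster assignment step
    [~, c(i)] = min(sum((centroids - X(i, :)).^2, 2));
  end
  for k = 1:K                           % loop 2: move centroid step
    if any(c == k)                      % an empty cluster would need removing or reinitializing
      centroids(k, :) = mean(X(c == k, :), 1);
    end
  end
end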
i.e. the squared distances between each training example xi and the cluster centroid to which xi has been assigned
This is just what we've been doing, as the visual description below shows;
The red line here shows the distances between the example xi and the cluster to which that example has been
assigned
Means that when the example is very close to the cluster, this value is small
When the cluster is very far away from the example, the value is large
This is sometimes called the distortion (or distortion cost function)
So we are finding the values which minimize this function;
Random initialization
These local optima are valid convergence points, but they are only local optima, not global ones
If this is a concern
We can do multiple random initializations
See if we get the same result - if many runs give the same result it is likely to be a global optimum
Algorithmically we can do this as follows;
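A sketch of that loop, assuming hypothetical helpers runKmeans (one full K-means run) and computeDistortion (the cost function above):

bestCost = Inf;
for t = 1:100                                % e.g. 100 random initializations
  [c, centroids] = runKmeans(X, K);          % hypothetical helper
  J = computeDistortion(X, c, centroids);    % hypothetical helper
  if J < bestCost                            % keep the lowest-distortion clustering
    bestCost = J;
    bestC = c;
    bestCentroids = centroids;
  end
end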
Elbow method
14: Dimensionality Reduction
Motivation 1: Compression
Speeds up algorithms
Reduces space used by the data
What is dimensionality reduction?
So you've collected many features - maybe more than you need
Can you "simplify" your data set in a rational and useful way?
Example
Redundant data set - different units for same attribute
Reduce data to 1D (2D->1D)
Another example is 3D -> 2D. Because the points all lie in a relatively shallow area, we can basically ignore one of the dimensions, so we draw two new lines (z1 and z2) along the x and y planes of the box, and plot the locations in that box
i.e. we lose the data in the z-dimension of our "shallow box" (NB "z-dimension" here refers to the dimension relative to the box (i.e. its depth) and NOT the z dimension of the axes we've drawn above) but because the box is shallow it's OK to lose this. Probably....
Plot values along those projections
Motivation 2: Visualization
It's hard to visualize highly dimensional data
Dimensionality reduction can improve how we display information in a tractable manner for human consumption
Why do we care?
Often helps to develop algorithms if we can understand our data better
Dimensionality reduction helps us do this - it lets us see the data in a helpful way
Good for explaining something to someone if you can "show" it in the data
Example;
Collect a large data set about many facets of countries around the world
So
x1 = GDP
...
x6 = mean household
Say we have 50 features per country
How can we understand this data better?
Very hard to plot 50 dimensional data
Using dimensionality reduction, instead of each country being represented by a 50-dimensional feature vector
Come up with a different feature representation (z values) which summarize these features
In other words, find a single line onto which to project this data
How do we determine this line?
The distance between each point and the projected version should be small (blue lines below are short)
PCA tries to find a lower dimensional surface so the sum of squared projection distances onto that surface is minimized
The blue lines are sometimes called the projection error
PCA tries to find the surface (a straight line in this case) which has the minimum projection error
As an aside, you should normally do mean normalization and feature scaling on your data before
PCA
A more formal description is
For 2D->1D, we must find a vector u(1) (a direction in the original space)
Onto which you can project the data so as to minimize the projection error
PCA Algorithm
Before applying PCA must do data preprocessing
Given a set of m unlabeled examples we must do
Mean normalization
Replace each xji with xji - μj
In other words, determine the mean of each feature set, and then for each feature subtract the
mean from the value, so we re-scale the mean to be 0
Feature scaling (depending on data)
If features have very different scales then scale so they all have a comparable range of values
e.g. xji is set to (xji - μj) / sj
Where sj is some measure of the range, so could be
Biggest - smallest
Standard deviation (more commonly)
With preprocessing done, PCA finds the lower dimensional sub-space which minimizes the sum of the squared projection errors
In summary, for 2D->1D we'd be doing something like this;
Next we compute the covariance matrix: Σ = (1/m) * sum over i=1..m of (xi)(xi)T
This is commonly denoted as Σ (greek upper case sigma) - NOT the summation symbol
This is an [n x n] matrix
Remember that xi is an [n x 1] vector, so (xi)(xi)T is [n x n]
In MATLAB or octave we can implement this as follows;
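The snippet itself wasn't captured in these notes, but based on the steps described it would be something like this (X is the [m x n] matrix of preprocessed examples, one per row):

Sigma = (1 / m) * X' * X;    % [n x n] covariance matrix
[U, S, V] = svd(Sigma);      % columns of U are the principal components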
Next we need to find some way to change x (which is n dimensional) to z (which is k dimensional)
(reduce the dimensionality)
Take the first k columns of the U matrix and stack them in columns
n x k matrix - call this Ureduce
We calculate z as follows
z = (Ureduce)T * x
So [k x n] * [n x 1]
Generates a vector which is [k x 1]
If that's not witchcraft I don't know what is!
Exactly the same as with supervised learning except we're now doing it with unlabeled data
So in summary
Preprocessing
Calculate sigma (covariance matrix)
Calculate eigenvectors with svd
Take k vectors from U (Ureduce= U(:,1:k);)
Calculate z (z =Ureduce' * x;)
No mathematical derivation
Very complicated
But it works
We lose some of the information (i.e. everything is now perfectly on that line) but it is now projected into 2D space
Total variation in the data can be defined as the average over the data of how far the training examples are from the origin: (1/m) * sum over i of ||xi||²
Choose k by looking at the ratio between the average squared projection error and the total variation in the data
Want the ratio to be small (e.g. <= 0.01) - this means we retain 99% of the variance
If it's small (~0) then this is because the numerator is small
The numerator is small when xi ≈ xapproxi
i.e. we lose very little information in the dimensionality reduction, so when we decompress we
regenerate the same data
So we choose k in terms of this ratio
Often we can significantly reduce the data dimensionality while retaining most of the variance
How do you do this? (see the sketch below)
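A sketch of that procedure, reusing the S matrix returned by the svd call above (the 0.99 target is the usual choice):

[U, S, V] = svd(Sigma);
s = diag(S);                           % singular values of the covariance matrix
for k = 1:length(s)
  if sum(s(1:k)) / sum(s) >= 0.99      % variance retained at this k
    break;                             % smallest such k is the one we keep
  end
end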
Applications of PCA
Compression
Why
Reduce memory/disk needed to store data
Speed up learning algorithm
How do we choose k?
% of variance retained
Visualization
Typically choose k = 2 or k = 3
Because we can plot these values!
One thing often done wrong regarding PCA
A bad use of PCA: Use it to prevent over-fitting
Reasoning
If xi has n features, then zi has k features, where k can be lower
If we only have k features then maybe we're less likely to over fit...
This doesn't work
BAD APPLICATION
Might work OK, but not a good way to address over fitting
Better to use regularization
PCA throws away some data without knowing what values it's losing
Probably OK if you're keeping most of the variance
But if you're throwing away some crucial data, that's bad
So you have to go to like 95-99% variance retained
So here regularization will give you AT LEAST as good a way to solve over fitting
A second PCA myth
Used for compression or visualization - good
Sometimes used
Design ML system with PCA from the outset
But, what if you did the whole thing without PCA
See how a system performs without PCA
ONLY if you have a reason to believe PCA will help should you then add PCA
PCA is easy enough to add on as a processing step
Try without first!
15: Anomaly Detection
Previous Next Index
Say you are QA testing aircraft engines coming off a production line; the next day you have a new engine
An anomaly detection method is used to see if the new engine is anomalous (when compared to the previous engines)
If the new engine looks like this;
More formally
We have a dataset which contains normal data
How we ensure they're normal is up to us
In reality it's OK if there are a few which aren't actually normal
Using that dataset as a reference point we can see if other examples are anomalous
How do we do this?
First, using our training dataset we build a model
We can access this model using p(x)
This asks, "What is the probability that example x is normal"
Having built a model
if p(xtest) < ε --> flag this as an anomaly
if p(xtest) >= ε --> this is OK
ε is some threshold probability value which we define, depending on how sure we need/want to be
We expect our model to (graphically) look something like this;
Applications
Fraud detection
Users have activity associated with them, such as
Length on time on-line
Location of login
Spending frequency
Using this data we can build a model of what normal users' activity is like
What is the probability of "normal" behavior?
Identify unusual users by sending their data through the model
Flag up anything that looks a bit weird
Automatically block cards/transactions
Manufacturing
Already spoke about aircraft engine example
Monitoring computers in data center
If you have many machines in a cluster
Computer features of machine
x1 = memory use
x2 = number of disk accesses/sec
x3 = CPU load
In addition to the measurable features you can also define your own complex features
x4 = CPU load/network traffic
If you see an anomalous machine
Maybe about to fail
Look at replacing bits from it
What is it?
Say we have a data set of m examples
Each example is a real number - we can plot the data on the x axis as shown below
Seems like a reasonable fit - data seems like a higher probability of being in the central region, lower probability of
being further away
Estimating μ and σ²
μj = average of the examples for feature j: μj = (1/m) * sum over i of xji
σj² = the variance (standard deviation squared): σj² = (1/m) * sum over i of (xji - μj)²
As a side comment
These parameters are the maximum likelihood estimation values for μ and σ2
You can use 1/m or 1/(m-1) - it doesn't make too much difference
Slightly different mathematical problems, but in practice it makes little difference
x1
Mean is about 5
Standard deviation looks to be about 2
x2
Mean is about 3
Standard deviation about 1
So we have the following system
With this surface plot, the height of the surface is the probability - p(x)
We can't always do surface plots, but for this example it's quite a nice way to show the probability of a 2D feature vector
Check if a value is anomalous
Set epsilon as some value
Say we have two new data points with the values
x1test
x2test
We compute
p(x1test) = 0.436 >= epsilon (~40% chance it's normal)
Normal
p(x2test) = 0.0021 < epsilon (~0.2% chance it's normal)
Anomalous
What this is saying is if you look at the surface plot, all values above a certain height are normal, all the values below that threshold
are probably anomalous
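A minimal Octave sketch of the whole procedure (X holds the normal examples, one per row; xtest is a [1 x n] row vector; the epsilon value is a made-up assumption that would normally be tuned on a validation set):

mu = mean(X);                       % [1 x n] per-feature means
sigma2 = mean((X - mu).^2);         % [1 x n] per-feature variances (the 1/m version)
% p(x) = product over features of the univariate Gaussian densities
p = @(x) prod(exp(-(x - mu).^2 ./ (2 * sigma2)) ./ sqrt(2 * pi * sigma2));
epsilon = 0.02;                     % assumed threshold
isAnomaly = p(xtest) < epsilon;     % flag xtest if its probability is below epsilon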
Anomaly detection vs. supervised learning
Can play with different transformations of the data to make it look more Gaussian
Might take a log transformation of the data
i.e. if you have some feature x1, replace it with log(x1)
Here we have one dimension, and our anomalous value is sort of buried in it (in green - Gaussian superimposed in blue)
Look at data - see what went wrong
Can looking at that example help develop a new feature (x2) which can help distinguish further anomalous examples?
Example - data center monitoring
Features
x1 = memory use
x2 = number of disk access/sec
x3 = CPU load
x4 = network traffic
We suspect CPU load and network traffic grow linearly with one another
If server is serving many users, CPU is high and network is high
Fail case is infinite loop, so CPU load grows but network traffic is low
New feature - CPU load/network traffic
May need to do feature scaling
Say you can fit a Gaussian distribution to CPU load and memory use
Let's say in the test set we have an example which looks like an anomaly (e.g. x1 = 0.4, x2 = 1.5)
Looks like most of data lies in a region far away from this example
Here memory use is high and CPU load is low (if we plot x1 vs. x2 our green example looks miles away from the others)
Problem is, if we look at each feature individually they may fall within acceptable limits - the issue is we know we shouldn't get those kinds of values together
But individually, they're both acceptable
This is because our function makes probability predictions in concentric circles around the means of both features
Probability of the two red circled examples is basically the same, even though we can clearly see the green one as an outlier
i.e. the model doesn't capture the correlation between the features
For the sake of completeness, the formula for the multivariate Gaussian distribution is as follows
p(x; μ, Σ) = (1 / ((2π)^(n/2) |Σ|^(1/2))) * exp(-(1/2) (x - μ)T Σ^-1 (x - μ))
Where |Σ| is the determinant of the covariance matrix Σ
Using these values we can, therefore, define the shape of this to better fit the data, rather than assuming symmetry in every dimension
One of the cool things is you can use it to model correlation between data
If you start to change the off-diagonal values in the covariance matrix you can control how strongly the various dimensions correlate
So we see here the final example gives a very tall thin distribution, showing a strong positive correlation
We can also make the off-diagonal values negative to show a negative correlation
Hopefully this shows an example of the kinds of distribution you can get by varying sigma
We can, of course, also move the mean (μ) which varies the peak of the distribution
1) Fit model - take data set and calculate μ and Σ using the formula above
2) We're next given a new example (xtest) - see below
For it compute p(x) using the following formula for multivariate distribution
If you plug your variance values into the diagonal of the covariance matrix (with zeros off the diagonal) the models are actually identical - the original model is a special case of the multivariate Gaussian (a sketch follows below)
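A sketch of fitting and evaluating the multivariate model (X holds the examples as rows; epsilon again assumed):

m = size(X, 1);
mu = mean(X)';                                % [n x 1] mean vector
Xc = X - mu';                                 % mean-centred data
Sigma = (1 / m) * (Xc' * Xc);                 % [n x n] covariance matrix
n = length(mu);
p = @(x) (2 * pi)^(-n / 2) * det(Sigma)^(-1 / 2) * ...
         exp(-0.5 * (x - mu)' * (Sigma \ (x - mu)));   % x is an [n x 1] vector
% flag x as anomalous when p(x) < epsilon, exactly as before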
16: Recommender Systems
Previous Next Index
If we have features like these, each film can be represented by a feature vector
Add an extra feature which is x0 = 1 for each film
So for each film we have a [3 x 1] vector, which for film number 1 ("Love at Last") would be
Create some parameters which give values as close as those seen in the data when applied
Sum over all values of i (all movies the user has used) when r(i,j) = 1 (i.e. all the films that the user has rated)
This is just like linear regression with least-squared error
We can also add a regularization term to make our equation look as follows
The regularization term goes from k=1 through to n, so (θj) ends up being an n+1 dimensional vector
Don't regularize over the bias term (θ0)
If you do this you get a reasonable value
We're rushing through this a bit, but it's just a linear regression problem
To make this a little bit clearer you can get rid of the mj term (it's just a constant so shouldn't make any difference
to minimization)
So to learn (θj)
But for our recommender system we want to learn parameters for all users, so we add an extra summation
term to this which means we determine the minimum (θj) value for every user
When you do this as a function of each (θj) parameter vector you get the parameters for each user
So this is our optimization objective -> J(θ1, ..., θnu)
In order to do the minimization we have the following gradient descent
Using that same approach we should then be able to determine the remaining feature vectors for the other
films
So we're summing over all the indices j for where we have data for movie i
We're minimizing this squared error
Like before, the above equation gives us a way to learn the features for one film
We want to learn all the features for all the films - so we need an additional summation term
If we're given the users' preferences we can use them to work out the film's features
Algorithm Structure
1) Initialize θ1, ..., θnu and x1, ..., xnm to small random values
A bit like neural networks - initialize all parameters to small random numbers
2) Minimize cost function (J(x1, ..., xnm, θ1, ...,θnu) using gradient descent
We find that the update rules look like this
Where the top term is the partial derivative of the cost function with respect to xki while the bottom is the
partial derivative of the cost function with respect to θki
So here we regularize EVERY parameter (there is no longer an x0 = 1 feature) so there is no special case update rule
3) Having minimized the values, given a user (user j) with parameters θ and movie (movie i) with learned features x, we predict a star rating of (θj)T xi
This is the collaborative filtering algorithm, which should give pretty good predictions for how users like new movies (a rough sketch of the update step follows below)
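As a rough vectorized sketch of one gradient descent update (Y is the [nm x nu] ratings matrix and R the binary did-rate matrix; X is [nm x n] movie features, Theta is [nu x n] user parameters; alpha and lambda are assumed values):

alpha = 0.005; lambda = 1;                 % assumed learning rate and regularization
E = (X * Theta' - Y) .* R;                 % prediction errors, only where ratings exist
Xgrad = E * Theta + lambda * X;            % partial derivatives w.r.t. movie features
ThetaGrad = E' * X + lambda * Theta;       % partial derivatives w.r.t. user parameters
X = X - alpha * Xgrad;                     % simultaneous gradient descent updates
Theta = Theta - alpha * ThetaGrad;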
5 movies
4 users
Get a [5 x 4] matrix
Given [Y] there's another way of writing out all the predicted ratings: the matrix product X θT (stacking the movie features in X and the user parameters in θ) - this is why the algorithm is also called low rank matrix factorization
Finally, having run the collaborative filtering algorithm, we can use the learned features to find related films
When you learn a set of features you don't know what the features will be - lets you identify the features
which define a film
Say we learn the following features
x1 - romance
x2 - action
x3 - comedy
x4 - ...
So we have n features all together
After you've learned features it's often very hard to come in and apply a human understandable metric to
what those features are
Usually it learns features which are very meaningful for capturing what users like
Say you have movie i
Find movies j which is similar to i, which you can recommend
Our features allow a good way to measure movie similarity
If we have two movies xi and xj
We want to minimize ||xi - xj||
i.e. the distance between those two movies
Provides a good indicator of how similar two films are in the sense of user perception
NB - Maybe ONLY in terms of user perception
Why - If there's no data to pull the values away from 0 this gives the min value
So this means we'd predict Eve rates ANY movie as zero
Presumably Eve doesn't hate all movies...
So if we're doing this we can't recommend any movies to her either
Mean normalization should let us fix this problem
Now we compute the average rating each movie obtained and store it in an nm-dimensional column vector
If we look at all the movie ratings in [Y] we can subtract off the mean rating
As an aside - we spoke here about mean normalization for users with no ratings
If you have some movies with no ratings you can also play with versions of the algorithm where you
normalize the columns
BUT this is probably less relevant - probably shouldn't recommend an unrated movie
To summarize, this shows how you do mean normalization preprocessing to allow your system to deal with users
who have not yet made any ratings
Means that for a user we know little about, we recommend the best average-rated products (a sketch of this preprocessing follows below)
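A short sketch of that preprocessing (Y is [nm x nu]; R marks which entries are actual ratings):

mu = sum(Y .* R, 2) ./ max(sum(R, 2), 1);  % per-movie mean over *rated* entries only
Ynorm = (Y - mu) .* R;                     % subtract the means, keep unrated entries at 0
% after learning X and Theta on Ynorm, predict user j's rating of movie i as
%   X(i, :) * Theta(j, :)' + mu(i)
% a user with no ratings gets theta ~ 0, so their prediction is just mu(i)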
17: Large Scale Machine Learning
Previous Next Index
One of the best ways to get high performance is to take a low bias algorithm and train it on a lot of data
e.g. Classification between confusable words
We saw that so long as you feed an algorithm lots of data they all perform pretty similarly
For example, say we have a data set where m = 100,000,000
This is pretty realistic for many datasets
Census data
Website traffic data
How do we train a logistic regression model on such a big system?
So you have to sum over 100,000,000 terms per step of gradient descent
Because of the computational cost of this massive summation, we'll look at more efficient ways around
this
- Either using a different approach
- Optimizing to avoid the summation
First thing to do is ask if we can train on 1000 examples instead of 100,000,000
Randomly pick a small selection
Can you develop a system which performs as well?
Sometimes yes - if this is the case you can avoid a lot of the headaches associated with big data
To see if taking a smaller sample works, you can sanity check by plotting error vs. training set size
If our plot looked like a high variance learning curve (error still falling as m grows), more data is likely to help; if it has flattened out (high bias), a small sample will do just as well
Cost function
If we plot our two parameters vs. the cost function we get something like this
So the function represents the cost of θ with respect to a specific example (xi, yi)
And we calculate this value as one half times the squared error on that example
Measures how well the hypothesis works on a single example
The overall cost function can now be re-written in the following form;
Jtrain(θ) = (1/m) * sum over i of cost(θ, (xi, yi))
2) - Algorithm body
Is the same as that found in the summation for batch gradient descent
It's possible to show that this term is equal to the partial derivative with respect to the
parameter θj of the cost(θ, (xi, yi))
What stochastic gradient descent algorithm is doing is scanning through each example
The inner for loop does something like this...
Looking at example 1, take a step with respect to the cost of just the 1st training example
Having done this, we go on to the second training example
Now take a second step in parameter space to try and fit the second training example better
Now move onto the third training example
And so on...
Until it gets to the end of the data
We may now repeat this whole procedure and take multiple passes over the data
The random shuffling at the start means we ensure the data is in a random order so we don't bias the movement
Randomization should speed up convergence a little bit
Although stochastic gradient descent is a lot like batch gradient descent, rather than waiting to sum up the gradient terms over all m examples, we take just one example and make progress in improving the parameters straight away
Means we update the parameters on EVERY step through data, instead of at the end of each loop
through all the data
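A sketch of the full procedure for linear regression (X is the [m x (n+1)] design matrix with a leading column of 1s; alpha and the pass count are assumptions):

theta = zeros(size(X, 2), 1);
alpha = 0.01;                                  % assumed learning rate
m = size(X, 1);
idx = randperm(m);                             % 1) randomly shuffle the data
X = X(idx, :); y = y(idx);
for pass = 1:10                                % 1-10 passes over the data is typical
  for i = 1:m                                  % 2) step using one example at a time
    err = X(i, :) * theta - y(i);
    theta = theta - alpha * err * X(i, :)';    % update on EVERY example
  end
end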
What does the algorithm do to the parameters?
As we saw, batch gradient descent does something like this to get to a global minimum
With stochastic gradient descent every iteration is much faster, but every iteration is fitting a single example
What you find is that you "generally" move in the direction of the global minimum, but not
always
Never actually converges like batch gradient descent does, but ends up wandering around
some region close to the global minimum
In practice, this isn't a problem - as long as you're close to the minimum that's probably
OK
One final implementation note
May need to loop over the entire dataset 1-10 times
If you have a truly massive dataset it's possible that by the time you've taken a single pass through
the dataset you may already have a perfectly good hypothesis
In which case the inner loop might only need to happen once if m is very very large
If we contrast this to batch gradient descent
We have to make k passes through the entire dataset, where k is the number of steps needed to
move through the data
Mini-batch algorithm
To be honest, stochastic gradient descent and batch gradient descent are just specific forms of mini-batch gradient descent (b = 1 and b = m respectively)
For mini-batch gradient descent, b is somewhere in between 1 and m and you can try to optimize for
it!
With batch gradient descent, we could plot cost function vs number of iterations
Should decrease on every iteration
This works when the training set size was small because we could sum over all examples
Doesn't work when you have a massive dataset
With stochastic gradient descent
We don't want to have to pause the algorithm periodically to do a summation over all data
Moreover, the whole point of stochastic gradient descent is to avoid those whole-data
summations
For stochastic gradient descent, we have to do something different
Take cost function definition
The disadvantage of averaging over a larger number of examples is that you get less frequent feedback
Sometimes you may get a plot that looks like this
Learning rate
We saw that with stochastic gradient descent we get this wandering around the minimum
In most implementations the learning rate is held constant
However, if you want to converge to a minimum you can slowly decrease the learning rate over time
A classic way of doing this is to calculate α as follows
α = const1/(iterationNumber + const2)
Which means you're guaranteed to converge somewhere
You also need to determine const1 and const2
BUT if you tune the parameters well, you can get something like this
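A sketch of that decay schedule (the constants and iteration count are made-up values that need tuning):

const1 = 5; const2 = 1000;                       % assumed constants
for iterationNumber = 1:100000                   % one count per stochastic update step
  alpha = const1 / (iterationNumber + const2);   % alpha slowly shrinks towards 0
  % ... perform the usual single-example parameter update using this alpha ...
end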
Online learning
New setting
Allows us to model problems where you have a continuous stream of data you want an algorithm to
learn from
Similar idea of stochastic gradient descent, in that you do slow updates
Web companies use various types of online learning algorithms to learn from traffic
Can (for example) learn about user preferences and hence optimize your website
Example - Shipping service
Users come and tell you origin and destination
You offer to ship the package for some amount of money ($10 - $50)
Based on the price you offer, sometimes the user uses your service (y = 1), sometimes they don't (y =
0)
Build an algorithm to optimize what price we offer to the users
Capture
Information about user
Origin and destination
Work out
What the probability of a user selecting the service is
We want to optimize the price
To model this probability we have something like
p(y = 1|x; θ)
Probability that y =1, given x, parameterized by θ
Build this model with something like
Logistic regression
Neural network
If you have a website that runs continuously an online learning algorithm would do something like
this
User comes - is represented as an (x,y) pair where
x - feature vector including price we offer, origin, destination
y - if they chose to use our service or not
The algorithm updates θ using just the (x,y) pair
So we basically update all the θ parameters every time we get some new data
While in previous examples we might have described the data example as (xi, yi) for an online
learning problem we discard this idea of a data "set" - instead we have a continuous stream of data
so indexing is largely irrelevant as you're not storing the data (although presumably you could store
it)
If you have a major website where you have a massive stream of data then this kind of algorithm is pretty
reasonable
You're free of the need to deal with all your training data
If you had a small number of users you could save their data and then run a normal algorithm on a
dataset
An online algorithm can adapt to changing user preferences
So over time users may become more price sensitive
The algorithm adapts and learns to this
So your system is dynamic
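A sketch of that loop for the shipping example, using logistic regression (getNextUser is a hypothetical helper that blocks until the next user arrives):

n = 3;                                     % e.g. price, origin, destination features
theta = zeros(n + 1, 1);                   % n features plus the bias term
alpha = 0.1;                               % assumed learning rate
while true                                 % runs for as long as the site is up
  [x, y] = getNextUser();                  % hypothetical: next (x, y) pair, x is [(n+1) x 1]
  h = 1 / (1 + exp(-theta' * x));          % logistic regression hypothesis
  theta = theta - alpha * (h - y) * x;     % one update, then the example is discarded
end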
Assume m = 400
Normally m would be more like 400,000,000
If m is large this is really expensive
Split training sets into different subsets
So split training set into 4 pieces
Machine 1 : use (x1, y1), ..., (x100, y100)
Uses first quarter of training set
Just does the summation for the first 100
So now we have these four temp values, and each machine does 1/4 of the work
Once we've got our temp variables
Send to a centralized master server
Put them back together
Update θ using
θj := θj - (α/400) * (temp1j + temp2j + temp3j + temp4j)
This equation is doing the same as our original batch gradient descent algorithm
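Conceptually (partialGradient is a hypothetical helper; in reality each temp would be computed on a different machine, and theta, alpha, X and y are assumed already defined):

temp1 = partialGradient(theta, X(1:100, :),   y(1:100));
temp2 = partialGradient(theta, X(101:200, :), y(101:200));
temp3 = partialGradient(theta, X(201:300, :), y(201:300));
temp4 = partialGradient(theta, X(301:400, :), y(301:400));
% the master combines the four partial sums into one batch gradient descent step
theta = theta - (alpha / 400) * (temp1 + temp2 + temp3 + temp4);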
More generally map reduce uses the following scheme (e.g. where you split into 4)
More broadly, by taking algorithms which compute sums you can scale them to very large datasets
through parallelization
Parallelization can come from
Multiple machines
Multiple CPUs
Multiple cores in each CPU
So even on a single computer you can implement parallelization
The advantage of thinking about Map Reduce here is because you don't need to worry about network
issues
It's all internal to the same machine
Finally, a caveat/thought
Depending on implementation detail, certain numerical linear algebra libraries can
automatically parallelize your calculations across multiple cores
So, if this is the case and you have a good vectorization implementation you need not worry about local parallelization - the libraries sort the optimization out for you
Hadoop is a widely used open source implementation of Map Reduce
18: Application Example - Photo OCR
Previous Next Index
OCR pipeline
Pedestrian detection
This is a slightly simpler problem because the aspect ratio remains pretty constant
Building our detection system
Have 82 x 36 aspect ratio
This is a typical aspect ratio for a standing human
Collect training set of positive and negative examples
This example misses a piece of text on the door because the aspect ratio is wrong
Very hard to read
We train a classifier to try and classify between positive and negative examples
Run that classifier on the regions detected as containing text in the previous section
Use a 1-dimensional sliding window to move along text regions
Does each window snapshot look like the split between two characters?
If yes insert a split
If not move on
So we have something that looks like this
Character classification
Standard OCR, where you apply standard supervised learning which takes an input and identifies which character it is
Multi-class classification problem
If we go and collect a large labeled data set it will look like this
Goal is to take an image patch and have the system recognize the character
Treat the images as gray-scale (makes it a bit easier)
Take characters from different fonts and place them on a random background
Maybe some blurring/distortion filters
Takes thought and work to make it look realistic
If you do a sloppy job this won't help!
So unlimited supply of training examples
This is an example of creating new data from scratch
Other way is to introduce distortion into existing data
e.g. take a character and warp it
16 new examples
Allows you to amplify the existing training set
This, again, takes thought and insight in terms of deciding how to amplify
Before creating new data, make sure you have a low bias classifier
Plot learning curve
If it's not a low bias classifier, increase the number of features
Then create large artificial training set
Very important question: How much work would it be to get 10x data as we currently have?
Often the answer is, "Not that hard"
This is often a huge way to improve an algorithm
Good question to ask yourself or ask the team
How many minutes/hours does it take to get a certain number of examples
Say we have 1000 examples
10 seconds to label an example
So we need another 9,000 examples - at 10 seconds each that's 90,000 seconds
Comes to about 25 hours - a few days of work
Crowd sourcing is also a good way to get data
Risk of reliability issues
Cost
Example
E.g. Amazon mechanical turks
Three modules
Each one could have a small team on it
Where should you allocate resources?
Good to have a single real number as an evaluation metric
So, character accuracy for this example
Find that our test set has 72% accuracy