ML: Introduction
What Is Machine Learning?
Tom Mitchell's definition: a program is said to learn from experience E with respect to some task T and performance measure P if its performance on T, as measured by P, improves with experience E. For a checkers-playing program, E is the experience of playing many games of checkers, T is the task of playing checkers, and P is the probability that the program will win the next game.
In general, any machine learning problem can be assigned to one of two broad
classifications:
supervised learning, or
unsupervised learning.
Supervised Learning
In supervised learning, we are given a data set and already know what our correct
output should look like, having the idea that there is a relationship between the input
and the output.
Supervised learning problems are categorized into "regression" and "classification"
problems. In a regression problem, we are trying to predict results within
a continuous output, meaning that we are trying to map input variables to
some continuous function. In a classification problem, we are instead trying to
predict results in a discrete output. In other words, we are trying to map input
variables into discrete categories. See the description of continuous and discrete data on Math is Fun.
Example 1:
Given data about the size of houses on the real estate market, try to predict their
price. Price as a function of size is a continuous output, so this is a regression problem.
We could turn this example into a classification problem by instead making our output
about whether the house "sells for more or less than the asking price." Here we are
classifying the houses based on price into two discrete categories.
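As a rough sketch of the two framings (not part of the original notes), the snippet below fits the same house-size data once as regression and once as classification using scikit-learn; all sizes, prices, and labels are invented for illustration.

# Sketch: the same house-size data framed as regression (predict price)
# and as classification (predict above/below asking price).
# All numbers here are made-up illustrative values, not real market data.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

sizes = np.array([[70.0], [85.0], [100.0], [120.0], [150.0]])      # square meters
prices = np.array([200_000, 240_000, 275_000, 330_000, 410_000])   # sale prices
above_asking = np.array([0, 1, 0, 1, 1])                           # 1 = sold above asking

# Regression: map size -> a continuous price.
reg = LinearRegression().fit(sizes, prices)
print("predicted price for 110 m^2:", reg.predict([[110.0]])[0])

# Classification: map size -> one of two discrete categories.
clf = LogisticRegression().fit(sizes, above_asking)
print("predicted class for 110 m^2:", clf.predict([[110.0]])[0])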
Example 2:
(a) Regression: given a picture of a person, predict his or her age on the basis of the picture.
(b) Classification: given a picture of a person, predict whether he or she is of high school, college, or graduate age.
Another example of classification: a bank has to decide whether or not to give a loan to someone on the basis of their credit history.
Unsupervised Learning
Unsupervised learning, on the other hand, allows us to approach problems with little or
no idea what our results should look like. We can derive structure from data where we
don't necessarily know the effect of the variables.
We can derive this structure by clustering the data based on relationships among the
variables in the data.
With unsupervised learning there is no feedback based on the prediction results, i.e.,
there is no teacher to correct you.
Example:
Clustering: Take a collection of 1000 essays written on the US Economy, and find a way
to automatically group these essays into a small number that are somehow similar or
related by different variables, such as word frequency, sentence length, page count,
and so on.
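As a sketch of how such a grouping might be computed in practice (this is not from the notes themselves), each essay could be represented by a few numeric features and clustered with k-means; the feature values below are invented for illustration.

# Sketch: cluster essays by simple numeric features (invented values).
# Each row is one essay: [word frequency of "economy", avg sentence length, page count]
import numpy as np
from sklearn.cluster import KMeans

essays = np.array([
    [0.8, 14.0, 3],
    [0.7, 15.5, 4],
    [0.1, 22.0, 10],
    [0.2, 21.0, 12],
    [0.9, 13.0, 2],
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(essays)
print(labels)  # e.g. [0 0 1 1 0]: essays grouped by similarity, with no labels given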
Non-clustering: The "Cocktail Party Algorithm", which can find structure in messy data
(such as the identification of individual voices and music from a mesh of sounds at
a cocktail party). Here is an answer on Quora to enhance your understanding. : What
is the difference between supervised and unsupervised learning algorithms?
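One common way to approach this kind of source-separation problem (not necessarily the exact algorithm referenced above) is independent component analysis; below is a minimal sketch using scikit-learn's FastICA on synthetic signals.

# Sketch: separate two mixed signals with FastICA (synthetic data,
# not the course's actual "cocktail party" one-liner).
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 1, 2000)
voice = np.sin(2 * np.pi * 5 * t)            # stand-in for a voice
music = np.sign(np.sin(2 * np.pi * 11 * t))  # stand-in for music

mixing = np.array([[0.60, 0.40],
                   [0.45, 0.55]])             # two "microphones" hear both sources
observed = np.column_stack([voice, music]) @ mixing.T

ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(observed)       # estimated independent sources
print(recovered.shape)                        # (2000, 2)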
Model Representation
Recall that in regression problems, we are taking input variables and trying to fit the
output onto a continuous expected result function.
Linear regression with one variable is also known as "univariate linear regression."
Univariate linear regression is used when you want to predict a single output value y from a single input value x. We're doing supervised learning here, so that means we already have an idea about what the input/output cause and effect should be.
\hat{y} = h_\theta(x) = \theta_0 + \theta_1 x
Note that this is like the equation of a straight line. We give hθ(x) values for θ0 and θ1 to get our estimated output ŷ. In other words, we are trying to create a function called hθ that maps our input data (the x's) to our output data (the y's).
Example:
Suppose we have the following set of training data:

input x | output y
   0    |    4
   1    |    7
   2    |    7
   3    |    8

Now we can make a random guess about our hθ function: θ0 = 2 and θ1 = 2. The hypothesis function becomes hθ(x) = 2 + 2x.
So for an input of 1 to our hypothesis, ŷ will be 4, which is off by 3 from the actual output of 7. Note that we will be trying out various values of θ0 and θ1 to find the values that provide the best possible "fit", or the most representative "straight line", through the data points mapped on the x-y plane.
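A minimal sketch of evaluating this hypothesis in Python, using the training pairs and the parameter guess from the example above:

# Sketch: evaluate the hypothesis h(x) = theta0 + theta1 * x on the small
# training set from the example above.
xs = [0, 1, 2, 3]          # inputs
ys = [4, 7, 7, 8]          # actual outputs
theta0, theta1 = 2.0, 2.0  # a (rough) initial guess for the parameters

def h(x):
    """Univariate linear hypothesis."""
    return theta0 + theta1 * x

for x, y in zip(xs, ys):
    print(f"x={x}: predicted {h(x)}, actual {y}, error {h(x) - y}")
# For x=1 the prediction is 4 while the actual value is 7, i.e. off by 3.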
Cost Function
We can measure the accuracy of our hypothesis function by using a cost function.
This takes an average (actually a fancier version of an average) of all the results of the
hypothesis with inputs from x's compared to the actual output y's.
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right)^2 = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right)^2
To break it apart, it is 1/2 times the mean of the squares of hθ(xᵢ) − yᵢ, i.e. of the differences between the predicted values and the actual values. This function is otherwise called the "squared error function" or "mean squared error". The mean is halved (the 1/2m factor) as a convenience for the computation of gradient descent, as the derivative term of the square function will cancel out the 1/2 term.
Now we are able to concretely measure the accuracy of our predictor function against
the correct results we have so that we can predict new results we don't have.
If we try to think of it in visual terms, our training data set is scattered on the x-y plane. We are trying to make a straight line (defined by hθ(x)) that passes through this scattered set of data. Our objective is to get the best possible line. The best possible line will be the one for which the average squared vertical distance of the scattered points from the line is smallest. In the best case, the line passes through all the points of our training data set; in such a case the value of J(θ0, θ1) will be 0.
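A minimal sketch of computing this cost function in Python, using the training set and the θ0 = 2, θ1 = 2 guess from the example above:

# Sketch: compute the squared-error cost J(theta0, theta1) for the
# example data and parameter guess used above.
xs = [0, 1, 2, 3]
ys = [4, 7, 7, 8]

def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = (1 / (2m)) * sum((h(x_i) - y_i)^2)."""
    m = len(xs)
    total = sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)

print(cost(2.0, 2.0, xs, ys))  # cost of the rough guess
print(cost(1.0, 2.0, xs, ys))  # a different guess, for comparison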
Q: Why is the cost function based on the sum of squares, rather than on (hθ(x) − y) or abs(hθ(x) − y)?
A: It might be easier to think of this as measuring the distance between two points. In this case, we are measuring the distance between two multi-dimensional values (i.e. the observed output values yᵢ and the estimated output values ŷᵢ). The distance between two points is \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}; that's where the sum of squares comes from. (See also Euclidean distance.)
The sum of squares isn't the only possible cost function, but it has many nice properties. Squaring the error means that an overestimate is "punished" just the same as an underestimate: an error of -1 is treated just like +1, and two equal but opposite errors can't cancel each other. If we cube the error (or just use the difference), we lose this property. Also, in the case of cubing, big errors are punished even more heavily than with squaring: an error of 2 contributes 8 rather than 4.
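A quick numeric illustration of this symmetry property:

# Squared error treats +1 and -1 identically; cubed error does not,
# and equal-but-opposite errors would cancel out.
errors = [-1, 1, -2, 2]
print([e ** 2 for e in errors])     # [1, 1, 4, 4]   -> symmetric, all positive
print([e ** 3 for e in errors])     # [-1, 1, -8, 8] -> sign is kept
print(sum(e ** 3 for e in errors))  # 0: opposite errors wrongly cancel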
The squaring function is smooth (it can be differentiated) and yields linear forms after differentiation, which is nice for optimization. It also has the property of being convex. A convex cost function guarantees that any local minimum is also a global minimum, so gradient-based algorithms will converge to it (given a suitable learning rate).
If you throw in absolute value, then you get a non-differentiable function. If you try to take the derivative of abs(x) and set it equal to zero to find the minimum, you won't get an answer, since the derivative is undefined at 0.
Q: Why can't I use 4th powers in the cost function? Don't they have the nice properties of squares?
A: Imagine that you are throwing darts at a dartboard, or firing arrows at a target. If you use the sum of squares as the error (where the center of the bull's-eye is the origin of the coordinate system), the error is the distance from the center. Now rotate the coordinates by 30 degrees, or 45 degrees, or anything: the distance, and hence the error, remains unchanged. 4th powers lack this property, which is known as rotational invariance.
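A small numeric check of this rotational-invariance claim (a sketch using numpy, with an arbitrary error vector):

# Sketch: the sum of squared errors is unchanged by a rotation of the
# coordinate system, while the sum of 4th powers is not.
import numpy as np

error = np.array([3.0, 1.0])          # an error vector in 2D
angle = np.deg2rad(30)
rot = np.array([[np.cos(angle), -np.sin(angle)],
                [np.sin(angle),  np.cos(angle)]])
rotated = rot @ error

print(np.sum(error ** 2), np.sum(rotated ** 2))  # equal (up to rounding)
print(np.sum(error ** 4), np.sum(rotated ** 4))  # generally different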
Q: Why does 1/(2 * m) make the math easier?
A: When we differentiate the cost to calculate the gradient, we get a factor of 2 in the numerator, due to the exponent inside the sum. This '2' in the numerator cancels out with the '2' in the denominator, saving us one math operation in the formula.
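Concretely, differentiating the cost with respect to θ1 (using the same notation as the cost function above) shows the cancellation:

\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)
  = \frac{\partial}{\partial \theta_1} \, \frac{1}{2m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x_i - y_i \right)^2
  = \frac{1}{2m} \sum_{i=1}^{m} 2 \left( \theta_0 + \theta_1 x_i - y_i \right) x_i
  = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right) x_i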