PATTERN RECOGNITION

SUPERVISED LEARNING: CLASSIFICATION
LEARNING A CLASS FROM EXAMPLES
Class C of a “family car”
Prediction: Is car x a family car?
Knowledge extraction: What do people expect from a family car?
Positive (+) and negative (–) examples
Input representation: x1: price, x2: engine power
TRAINING SET X
For each car, the input is $\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$ and the label is
$$r = \begin{cases} 1 & \text{if } \mathbf{x} \text{ is positive} \\ 0 & \text{if } \mathbf{x} \text{ is negative} \end{cases}$$
For N training examples:
$$\mathcal{X} = \{\mathbf{x}^t, r^t\}_{t=1}^{N}$$
CLASS C
$(p_1 \le \text{price} \le p_2)$ AND $(e_1 \le \text{engine power} \le e_2)$
For suitable values of $p_1, p_2, e_1$ and $e_2$, class C is defined by a rectangle in the price–engine power space.
CLASS C
$(p_1 \le \text{price} \le p_2)$ AND $(e_1 \le \text{engine power} \le e_2)$
This expression fixes the hypothesis class H – the set of axis-aligned rectangles.
The learning algorithm finds a particular hypothesis h ∈ H that approximates C as closely as possible.
The expert defines the hypothesis class; the algorithm finds the parameters.
HYPOTHESIS CLASS H
$$h(\mathbf{x}) = \begin{cases} 1 & \text{if } h \text{ classifies } \mathbf{x} \text{ as positive} \\ 0 & \text{if } h \text{ classifies } \mathbf{x} \text{ as negative} \end{cases}$$
Training error: the predictions of h which do not match the required values in $\mathcal{X}$:
$$E(h \mid \mathcal{X}) = \sum_{t=1}^{N} \mathbf{1}\bigl(h(\mathbf{x}^t) \neq r^t\bigr)$$
HYPOTHESIS CLASS H – How to read?
$$E(h \mid \mathcal{X}) = \sum_{t=1}^{N} \mathbf{1}\bigl(h(\mathbf{x}^t) \neq r^t\bigr)$$
Read as: the error of hypothesis h given the training set $\mathcal{X}$.
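A minimal sketch of the rectangle hypothesis and its training error, assuming NumPy; the price/engine-power bounds and the tiny training set are made-up values for illustration, not from the slides:

```python
import numpy as np

def h(x, p1, p2, e1, e2):
    """Rectangle hypothesis: 1 if price and engine power fall inside the rectangle."""
    price, power = x
    return int(p1 <= price <= p2 and e1 <= power <= e2)

def training_error(X, r, p1, p2, e1, e2):
    """E(h | X): count of training examples whose prediction does not match the label."""
    return sum(h(x, p1, p2, e1, e2) != label for x, label in zip(X, r))

# Hypothetical training set: (price, engine power) pairs with 1 = family car, 0 = not
X = np.array([[15000, 150], [32000, 210], [18000, 130], [45000, 320]])
r = np.array([1, 1, 1, 0])
print(training_error(X, r, p1=10000, p2=35000, e1=100, e2=250))  # -> 0
```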
S, G, AND THE VERSION SPACE
Most specific hypothesis, S
Most general hypothesis, G
Any h ∈ H between S and G is consistent; these hypotheses make up the version space.
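As one concrete piece of this, a sketch of how the most specific hypothesis S could be computed for the rectangle class: the tightest axis-aligned rectangle around the positive examples (G, the most general consistent rectangle, would also need the negatives). This is an illustrative sketch, not a procedure given in the slides:

```python
import numpy as np

def most_specific_hypothesis(X, r):
    """S: the tightest axis-aligned rectangle containing every positive example."""
    positives = X[r == 1]
    p1, e1 = positives.min(axis=0)   # lower-left corner (min price, min engine power)
    p2, e2 = positives.max(axis=0)   # upper-right corner (max price, max engine power)
    return p1, p2, e1, e2
```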
MULTIPLE CLASSES
Classes $C_i$, $i = 1, \dots, K$
$$\mathcal{X} = \{\mathbf{x}^t, \mathbf{r}^t\}_{t=1}^{N}, \qquad r_i^t = \begin{cases} 1 & \text{if } \mathbf{x}^t \in C_i \\ 0 & \text{if } \mathbf{x}^t \in C_j,\ j \neq i \end{cases}$$
Train K hypotheses $h_i(\mathbf{x})$, $i = 1, \dots, K$:
$$h_i(\mathbf{x}^t) = \begin{cases} 1 & \text{if } \mathbf{x}^t \in C_i \\ 0 & \text{if } \mathbf{x}^t \in C_j,\ j \neq i \end{cases}$$
MULTIPLE CLASSES
Classes $C_i$, $i = 1, \dots, K$
A K-class problem is treated as K two-class problems.
Example: positive examples for the class “Luxury Sedan”; ALL the rest are negative examples.
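A minimal one-vs-rest sketch of this idea; the slides do not fix a base classifier, so scikit-learn's LogisticRegression is used here purely as a stand-in two-class learner:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_one_vs_rest(X, y, num_classes):
    """Train K two-class hypotheses h_i: positive for class i, negative for all others."""
    hypotheses = []
    for i in range(num_classes):
        r_i = (y == i).astype(int)              # r_i^t = 1 if x^t in C_i, else 0
        hypotheses.append(LogisticRegression().fit(X, r_i))
    return hypotheses

def predict(hypotheses, x):
    """Assign x to the class whose hypothesis is most confident."""
    scores = [h_i.predict_proba([x])[0, 1] for h_i in hypotheses]
    return int(np.argmax(scores))
```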
LINEAR REGRESSION
EXAMPLE
David Beckham: 1.83 m; Brad Pitt: 1.83 m; George Bush: 1.81 m
Victoria Beckham: 1.68 m; Angelina Jolie: 1.70 m; Laura Bush: ?
Goal: predict the height of the wife in a couple, based on the husband’s height.
Response (outcome or dependent) variable (Y): height of the wife
Predictor (explanatory or independent) variable (X): height of the husband
WHAT IS LINEAR
Remember this?
WHAT IS LINEAR
A slope of 2 means that every 1-unit change in X yields a 2-unit change in Y.
EXAMPLE
Dataset giving the living areas and prices of 50 houses
EXAMPLE
We can plot this data.
Given data like this, how can we learn to predict the prices of other houses as a function of the size of their living areas?
NOTATIONS
The “input” variables – $x^{(i)}$ (living area in this example)
The “output” or target variable that we are trying to predict – $y^{(i)}$ (price)
A pair $(x^{(i)}, y^{(i)})$ is called a training example
A list of m training examples $\{(x^{(i)}, y^{(i)});\ i = 1, \dots, m\}$ is called a training set
X denotes the space of input values, and Y the space of output values
REGRESSION
Given a training set, the goal is to learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis.
CHOICE OF HYPOTHESIS
Decision: how to represent the hypothesis h
For linear regression we assume that the hypothesis is linear:
$$h(x) = \theta_0 + \theta_1 x$$
HYPOTHESIS
Generally we’ll have more than one input feature, e.g. x1 = living area, x2 = # of bedrooms:
$$h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$
HYPOTHESIS
$$h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$
To show the dependence on θ:
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 \quad \text{or} \quad h(x \mid \theta) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$
This is the price that the hypothesis predicts for a given house with living area x1 and number of bedrooms x2.
HYPOTHESIS
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$
For conciseness, define $x_0 = 1$:
$$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 = \sum_{i=0}^{2} \theta_i x_i$$
For n features:
$$h_\theta(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x$$
The θs are called the parameters and are real numbers. The job of the learning algorithm is to find, or learn, these parameters.
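A small sketch of this hypothesis in code, with $x_0 = 1$ prepended for the intercept; the θ values and the example house below are made up for illustration:

```python
import numpy as np

def h(theta, x):
    """Linear hypothesis h_theta(x) = theta^T x, with x_0 = 1 prepended."""
    x = np.concatenate(([1.0], x))    # define x0 = 1
    return theta @ x

# Hypothetical parameters theta_0, theta_1, theta_2 and a house (living area, bedrooms)
theta = np.array([50.0, 0.1, 20.0])
print(h(theta, np.array([2104.0, 3.0])))  # predicted price: 50 + 0.1*2104 + 20*3 = 320.4
```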
CHOOSING THE REGRESSION LINE
Which of these lines to choose?
[Figure: two candidate lines $y = h_\theta(x) = \theta_0 + \theta_1 x$ fit to the same data]
CHOOSING THE REGRESSION LINE
Consider a point $x_i$. The predicted value is:
$$\hat{y}_i = h_\theta(x_i) = \theta_0 + \theta_1 x_i$$
The true value for $x_i$ is $y_i$, so the error or residual is $\hat{y}_i - y_i$.
CHOOSING THE REGRESSION LINE
How to choose the best-fit line, in other words how to choose the θs:
$$\min_\theta \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2$$
Minimize the sum of the squared (why squared?) distances of the points $y_i$ from the line over the m training examples.
CHOOSING THE REGRESSION LINE
$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2$$
The sum runs over the m training examples; each term is the difference between what the hypothesis predicted and the actual value, squared so that we don’t get negative values; the factor 1/2 simplifies the calculations.
Find the θ which minimizes this expression.
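A minimal sketch of this cost function, assuming NumPy and a design matrix X with one column per input feature:

```python
import numpy as np

def J(theta, X, y):
    """Cost J(theta) = 1/2 * sum over m examples of (h_theta(x) - y)^2."""
    X1 = np.column_stack([np.ones(len(X)), X])   # prepend x0 = 1 for theta_0
    residuals = X1 @ theta - y                   # h_theta(x^(i)) - y^(i)
    return 0.5 * np.sum(residuals ** 2)
```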
GRADIENT DESCENT
Goal: $\min_\theta J(\theta)$
Choose initial values of θ0 and θ1 and keep moving in the direction of steepest descent.
[Figure: the surface of J(θ) plotted over θ0 and θ1]
GRADIENT DESCENT
Choose initial values of θ0 and θ1 and keep moving in the direction of steepest descent.
The step size is controlled by a parameter called the learning rate.
The starting point is important.
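A minimal batch gradient descent sketch for the linear hypothesis above; the learning rate, iteration count, and tiny dataset are illustrative choices, not values from the slides:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.05, iterations=5000):
    """Batch gradient descent on J(theta); alpha is the learning rate (step size)."""
    X1 = np.column_stack([np.ones(len(X)), X])   # x0 = 1 for the intercept theta_0
    theta = np.zeros(X1.shape[1])                # starting point: all zeros
    for _ in range(iterations):
        gradient = X1.T @ (X1 @ theta - y)       # gradient of J = 1/2 * sum of squared errors
        theta -= alpha * gradient                # step in the direction of steepest descent
    return theta

# Hypothetical 1-D data lying roughly on the line y = 1 + 2x
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.1, 4.9, 7.2])
print(gradient_descent(X, y))  # -> approximately [0.99, 2.04]
```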
MODEL SELECTION
Life is not as simple as the linear model $g(x) = w_1 x + w_0$.
Non-linear regression uses a higher-order polynomial, e.g. $g(x) = w_2 x^2 + w_1 x + w_0$.
MODEL SELECTION
Inductive bias: the set of assumptions we make to make learning possible is called the inductive bias of the learning algorithm.
Examples:
Choosing the hypothesis class – rectangles
Regression – assuming the function is linear
Learning requires choosing a bias. How do we choose the right bias? Model selection.
GENERALIZATION
Generalization: how well a model performs on new data
Overfitting: the chosen hypothesis is too complex, e.g. fitting a 3rd-order polynomial to linear data
Underfitting: the chosen hypothesis is too simple, e.g. fitting a line to a quadratic function
CROSS VALIDATION
To estimate generalization error, we need data unseen during training. We split the data as:
Training set (50%)
Validation set (25%)
Test (publication) set (25%)
Choose the hypothesis that is best on the validation set – cross validation.
CROSS VALIDATION
Example: find the right order of polynomial in regression.
Use the training set to estimate the coefficients.
Calculate the errors on the validation set.
Choose the order with the least validation error.
Question: what is the expected error of the chosen model?
We can NOT use the validation error: the validation data has been used to choose the model, so it is effectively part of training.
Use the TEST data set, as sketched below.
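A minimal sketch of this selection procedure, assuming NumPy and using np.polyfit/np.polyval for the polynomial fits; the 50/25/25 split follows the slides, while the seed and maximum order are illustrative choices:

```python
import numpy as np

def select_polynomial_order(x, y, max_order=5, seed=0):
    """Pick the polynomial order with the smallest validation error,
    then report the test error of that chosen model."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_train, n_val = len(x) // 2, len(x) // 4              # 50% train, 25% validation, 25% test
    train, val, test = idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

    best_order, best_val_err, best_coeffs = None, np.inf, None
    for order in range(1, max_order + 1):
        coeffs = np.polyfit(x[train], y[train], order)     # estimate coefficients on training data
        val_err = np.mean((np.polyval(coeffs, x[val]) - y[val]) ** 2)
        if val_err < best_val_err:
            best_order, best_val_err, best_coeffs = order, val_err, coeffs

    test_err = np.mean((np.polyval(best_coeffs, x[test]) - y[test]) ** 2)  # error on unseen data
    return best_order, test_err
```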
SUMMARY
Model:
$$h_\theta(x) \quad \text{or} \quad h(x \mid \theta)$$
Loss function:
$$E(\theta \mid \mathcal{X}) = J(\theta) = \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2$$
Optimization:
$$\min_\theta E(\theta \mid \mathcal{X})$$
COVARIANCE
$$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y})}{n - 1}$$
cov(X,Y) > 0: X and Y are positively correlated
cov(X,Y) < 0: X and Y are negatively (inversely) correlated
cov(X,Y) = 0: X and Y are uncorrelated (zero covariance does not by itself imply independence)
CORRELATION COEFFICIENT
Pearson’s correlation coefficient is the standardized covariance (unitless):
$$r = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}}$$
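A minimal sketch of both formulas, assuming NumPy; the small dataset is made up for illustration:

```python
import numpy as np

def covariance(x, y):
    """Sample covariance: sum((x_i - x_bar)(y_i - y_bar)) / (n - 1)."""
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

def pearson_r(x, y):
    """Pearson's r: covariance standardized by the two standard deviations."""
    return covariance(x, y) / np.sqrt(np.var(x, ddof=1) * np.var(y, ddof=1))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
print(pearson_r(x, y))  # close to +1 for this nearly linear data
```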
CORRELATION COEFFICIENT
Measures the relative strength of the linear relationship between two variables
Unit-less
Ranges between –1 and 1
The closer to –1, the stronger the negative linear relationship
The closer to +1, the stronger the positive linear relationship
The closer to 0, the weaker the linear relationship
CORRELATION COEFFICIENT
[Scatter plots of Y against X with r = –0.8, r = –0.6, r = +0.8, and r = +0.2]
CORRELATION COEFFICIENT
[Scatter plots contrasting strong and weak relationships]
ACKNOWLEDGEMENTS
Machine Intelligence, Dr M. Hanif, UET, Lahore
Machine Learning, Andrew Ng, Stanford University
Lecture Slides, Introduction to Machine Learning, E. Alpaydın, MIT Press.