Class15 18 Regression - 26sept 10oct2019

The document discusses regression and prediction models. It defines regression as predicting continuous or ordered values given input variables. Regression involves building a model from training data and then using the model to make predictions. Linear regression fits a linear function to relate one or more independent variables to a dependent variable. It finds the coefficients that minimize the squared errors between predictions and actual values in the training data.


Regression (Prediction)

Prediction (Regression)
• Numeric prediction: Task of predicting continuous (or
ordered) values for given input
• Example:
– Predicting potential sales of a new product given its
price
– Predicting rain fall given the temperature and humidity
in the atmosphere

[Figure: scatter plots of sales t = f(x) vs. price x, and rain fall t = f(x) vs. temperature (x1) and humidity (x2), with x = [x1, x2]^T]
• Regression and prediction are synonymous terms


Prediction (Regression)
• Regression analysis is used to model the relationship between one or more independent (predictor) variables and a dependent (response) variable
– Dependent variable is always continuous valued or
ordered valued
– Example: Dependent variable: Rain fall
Independent variable(s): temperature, humidity
• The values of predictor variables are known
• The response variable is what we want to predict
• Regression analysis can be viewed as a mapping function y = f(x):
  – Single independent variable (x ∈ R) and a single dependent variable (y), or
  – Multiple independent variables (x ∈ R^d) and a single dependent variable (y)

Prediction (Regression)
• Regression is a two-step process
  – Step 1: Building a regression model
    • Learning from data (training phase)
    • The regression model is built by analysing or learning from a training data set made up of one or more independent variables and their dependent labels
    • Supervised learning: each example is a pair consisting of an input example (independent variables) and a desired output value (dependent variable)
  – Step 2: Using the regression model for prediction
    • Testing phase
    • Predicting the dependent variable
• Accuracy of a predictor:
  – How well a given predictor can predict for new values
• Target of learning techniques: good generalization ability

Illustration of Training Set: Salary Prediction

• Independent variable: Years of experience (x)
• Dependent variable: Salary in Rs 1000 (y)

  Years of experience (x)   Salary (in Rs 1000) (y)
  3                         30
  8                         57
  9                         64
  13                        72
  3                         36
  6                         43
  11                        59
  21                        90
  1                         20
  16                        83

[Figure: scatter plot of Salary vs. Years of experience]

Illustration of Training Set: Temperature Prediction

• Independent variables: Humidity (x1), Pressure (x2)
• Dependent variable: Temperature (y)

  Humidity (x1)   Pressure (x2)   Temp (y)
  82.19           1036.35         25.47
  83.15           1037.60         26.19
  85.34           1037.89         25.17
  87.69           1036.86         24.30
  87.65           1027.83         24.07
  95.95           1006.92         21.21
  96.17           1006.57         23.49
  98.59           1009.42         21.79
  88.33           991.65          25.09
  90.43           1009.66         25.39
  94.54           1009.27         23.89
  99.00           1009.80         22.51
  98.00           1009.90         22.90
  99.00           996.29          21.72
  98.97           800.00          23.18

[Figure: 3D scatter plot of Temp vs. Humidity and Pressure]


Linear Regression
• Linear approach to model the relationship between a scalar response y (dependent variable) and one or more predictor variables, x or x (independent variables)
• The response is a linear function of the input (one or more independent variables)
• Simple linear regression (straight-line regression):
  – Single independent variable (x ∈ R)
  – Single dependent variable (y)
  – Fitting a straight line
• Multiple linear regression:
  – One or more independent variables (x ∈ R^d)
  – Single dependent variable (y)
  – Fitting a hyperplane

Straight-Line (Simple Linear) Regression


• Given: Training data: D = {(x_n, y_n)}_{n=1}^{N}, x_n ∈ R and y_n ∈ R
  – x_n: nth input example (independent variable)
  – y_n: dependent variable (output) corresponding to the nth independent variable
• Function governing the relationship between input and output:
      y_n = f(x_n, w, w_0) = w x_n + w_0
  – The coefficients w_0 and w are the parameters of the straight line (regression coefficients) - unknown
• The function f(x_n, w, w_0) is a linear function of x_n and it is a linear function of the coefficients w and w_0
  – Linear model for regression

[Figure: straight line fitted to data points in the x-y plane]


Straight-Line (Simple Linear) Regression: Training Phase
• The values of the coefficients will be determined by fitting the linear function (straight line) to the training data
• Method of least squares: minimizes the squared error between the actual data (y_n), i.e. the actual dependent variable, and the estimate of the line (predicted dependent variable ŷ_n), i.e. the function f(x_n, w, w_0):
      ŷ_n = f(x_n, w, w_0) = w x_n + w_0
      minimize_{w, w_0}  E(w, w_0) = (1/2) Σ_{n=1}^{N} (ŷ_n − y_n)²
• The derivatives of the error function with respect to the coefficients are linear in the elements of w and w_0
• Hence the minimization of the error function has a unique solution, which can be found in closed form

Straight-Line (Simple Linear) Regression: Training Phase

• Cost function for optimization:
      E(w, w_0) = (1/2) Σ_{n=1}^{N} (f(x_n, w, w_0) − y_n)²
• Conditions for optimality:
      ∂E(w, w_0)/∂w = 0    and    ∂E(w, w_0)/∂w_0 = 0
  i.e.
      ∂/∂w [ (1/2) Σ_{n=1}^{N} (w x_n + w_0 − y_n)² ] = 0
      ∂/∂w_0 [ (1/2) Σ_{n=1}^{N} (w x_n + w_0 − y_n)² ] = 0
• Solving these gives the optimal ŵ and ŵ_0 as
      ŵ = Σ_{n=1}^{N} (x_n − μ_x)(y_n − μ_y) / Σ_{n=1}^{N} (x_n − μ_x)²
      ŵ_0 = μ_y − ŵ μ_x
  – μ_x: sample mean of the independent variable x
  – μ_y: sample mean of the dependent variable y


Straight-Line (Simple Linear) Regression: Testing

• For any test example x, the predicted value is given by:
      ŷ = f(x, ŵ, ŵ_0) = ŵ x + ŵ_0
• The prediction accuracy for a single example is measured in terms of squared error:
      E = (ŷ − y)²
• Let N_t be the total number of test samples
• The prediction accuracy of the regression model is measured in terms of root mean squared error:
      E_RMS = sqrt( (1/N_t) Σ_{n=1}^{N_t} (ŷ_n − y_n)² )

Illustration of Simple Linear Regression: Salary Prediction - Training

  Years of experience (x)   Salary (in Rs 1000) (y)
  3                         30
  8                         57
  9                         64
  13                        72
  3                         36
  6                         43
  11                        59
  21                        90
  1                         20
  16                        83

• ŵ = Σ_{n=1}^{N} (x_n − μ_x)(y_n − μ_y) / Σ_{n=1}^{N} (x_n − μ_x)²,  ŵ_0 = μ_y − ŵ μ_x
• μ_x: 9.1,  μ_y: 55.4
• ŵ: 3.54,  ŵ_0: 23.21

[Figure: fitted straight line over the Salary vs. Years of experience scatter plot]
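The closed-form estimates above can be reproduced with a few lines of NumPy; the following sketch is illustrative and not part of the original slides.

    import numpy as np

    # Salary training data from the slide: years of experience (x) and salary in Rs 1000 (y)
    x = np.array([3, 8, 9, 13, 3, 6, 11, 21, 1, 16], dtype=float)
    y = np.array([30, 57, 64, 72, 36, 43, 59, 90, 20, 83], dtype=float)

    # Closed-form least-squares estimates for the straight line y = w*x + w0
    mu_x, mu_y = x.mean(), y.mean()
    w = np.sum((x - mu_x) * (y - mu_y)) / np.sum((x - mu_x) ** 2)
    w0 = mu_y - w * mu_x
    print(mu_x, mu_y, w, w0)          # approx. 9.1, 55.4, 3.54, 23.21

    # Prediction and squared error for the test example on the next slide (x = 10, actual y = 58)
    y_hat = w * 10 + w0
    print(y_hat, (y_hat - 58.0) ** 2)  # approx. 58.58 and 0.34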


Illustration of Simple Linear Regression: Salary Prediction - Test

• ŵ: 3.54,  ŵ_0: 23.21
• Test example: Years of experience (x) = 10, Salary (y) = unknown
• Predicted salary: 58.584
• Actual salary: 58.000
• Squared error: 0.34

[Figure: predicted point at x = 10 on the fitted line over the Salary vs. Years of experience plot]

Multiple Linear Regression


• Multiple linear regression:
  – One or more independent variables (x ∈ R^d)
  – Single dependent variable (y)
• Given: Training data: D = {(x_n, y_n)}_{n=1}^{N}, x_n ∈ R^d and y_n ∈ R
  – d: dimension of the input example (number of independent variables)
• Function governing the relationship between input and output:
      y_n = f(x_n, w) = w^T x_n + w_0 = Σ_{i=0}^{d} w_i x_ni   (with x_n0 = 1)
  – The coefficients w_0, w_1, …, w_d are collectively denoted by the vector w - unknown
• The function f(x_n, w) is a linear function of x_n and it is a linear function of the coefficients w
  – Linear model for regression


Linear Regression: Linear Function Approximation

• Linear function:
  – 2-dimensional space: The mapping function is a line specified by
        f(x, w) = w_1 x_1 + w_2 x_2 + w_0 = 0
        x_2 = −(w_1/w_2) x_1 − (w_0/w_2)
  – d-dimensional space: The mapping function is a hyperplane specified by
        f(x, w) = w_d x_d + … + w_2 x_2 + w_1 x_1 + w_0 = Σ_{i=0}^{d} w_i x_i = w^T x = 0
    where w = [w_0, w_1, …, w_d]^T and x = [1, x_1, …, x_d]^T

Multiple Linear Regression


• The values of the coefficients will be determined by fitting the linear function to the training data
• Method of least squares: minimizes the squared error between the actual data (y_n), i.e. the actual dependent variable, and the predicted dependent variable (ŷ_n), i.e. the estimate of the linear function f(x_n, w), for any given value of w:
      ŷ_n = f(x_n, w) = w^T x_n + w_0 = Σ_{i=0}^{d} w_i x_ni
      minimize_w  E(w) = (1/2) Σ_{n=1}^{N} (ŷ_n − y_n)²
• The error function is
  – a quadratic function of the coefficients w, and
  – its derivatives with respect to the coefficients are linear in the elements of w
• Hence the minimization of the error function has a unique solution, which can be found in closed form


Multiple Linear Regression


• Cost function for optimization:
      E(w) = (1/2) Σ_{n=1}^{N} (f(x_n, w) − y_n)²
• Conditions for optimality:
      ∂E(w)/∂w = 0
• Application of the optimality conditions gives the optimal ŵ:
      ∂/∂w [ (1/2) Σ_{n=1}^{N} (Σ_{i=0}^{d} w_i x_ni − y_n)² ] = 0
      ∂/∂w [ (1/2) Σ_{n=1}^{N} (w^T x_n − y_n)² ] = 0

Multiple Linear Regression


• Cost function for optimization:
      E(w) = (1/2) Σ_{n=1}^{N} (f(x_n, w) − y_n)²
• Conditions for optimality:
      ∂E(w)/∂w = 0
• Application of the optimality conditions gives the optimal ŵ:
      ∂/∂w [ (1/2) Σ_{n=1}^{N} (w^T x_n − y_n)² ] = 0
      ŵ = (X^T X)^{-1} X^T y
  where X is the N × (d+1) data matrix with rows [1, x_n1, x_n2, …, x_nd] and y = [y_1, y_2, …, y_N]^T
  – Assumption: d < N
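A minimal NumPy sketch of this closed-form solution on the temperature training table shown earlier; it is illustrative only, and np.linalg.solve is used in place of an explicit matrix inverse for numerical stability.

    import numpy as np

    # Temperature training data from the slides: humidity (x1), pressure (x2), temperature (y)
    H = [82.19, 83.15, 85.34, 87.69, 87.65, 95.95, 96.17, 98.59, 88.33, 90.43, 94.54, 99.00, 98.00, 99.00, 98.97]
    P = [1036.35, 1037.60, 1037.89, 1036.86, 1027.83, 1006.92, 1006.57, 1009.42, 991.65, 1009.66, 1009.27, 1009.80, 1009.90, 996.29, 800.00]
    y = np.array([25.47, 26.19, 25.17, 24.30, 24.07, 21.21, 23.49, 21.79, 25.09, 25.39, 23.89, 22.51, 22.90, 21.72, 23.18])

    # Data matrix X with a leading column of ones for the bias term w0
    X = np.column_stack([np.ones(len(H)), H, P])

    # Closed-form least-squares solution w_hat = (X^T X)^(-1) X^T y
    w_hat = np.linalg.solve(X.T @ X, X.T @ y)

    # Prediction for a test example, e.g. humidity 99.00 and pressure 1009.21 (from the test slide)
    x_test = np.array([1.0, 99.00, 1009.21])
    print(w_hat, x_test @ w_hat)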


Multiple Linear Regression: Testing


• The optimal coefficient vector ŵ is given by
      ŵ = (X^T X)^{-1} X^T y = X⁺ y
  where X⁺ = (X^T X)^{-1} X^T is the pseudo-inverse of the matrix X
• For any test example x, the predicted value is given by:
      ŷ = f(x, ŵ) = ŵ^T x = Σ_{i=0}^{d} ŵ_i x_i
• The prediction accuracy for a single example is measured in terms of squared error:
      E = (ŷ − y)²
• Let N_t be the total number of test samples
• The prediction accuracy of the regression model is measured in terms of root mean squared error:
      E_RMS = sqrt( (1/N_t) Σ_{n=1}^{N_t} (ŷ_n − y_n)² )

Illustration of Multiple Linear Regression: Temperature Prediction

• Training: ŵ = (X^T X)^{-1} X^T y

  Humidity (x1)   Pressure (x2)   Temp (y)
  82.19           1036.35         25.47
  83.15           1037.60         26.19
  85.34           1037.89         25.17
  87.69           1036.86         24.30
  87.65           1027.83         24.07
  95.95           1006.92         21.21
  96.17           1006.57         23.49
  98.59           1009.42         21.79
  88.33           991.65          25.09
  90.43           1009.66         25.39
  94.54           1009.27         23.89
  99.00           1009.80         22.51
  98.00           1009.90         22.90
  99.00           996.29          21.72
  98.97           800.00          23.18

[Figure: fitted plane over the 3D scatter plot of Temp vs. Humidity and Pressure]


Illustration of Multiple Linear Regression: Temperature Prediction - Test

• ŵ = (X^T X)^{-1} X^T y
• Test example: Humidity (x1) = 99.00, Pressure (x2) = 1009.21, Temp (y) = unknown
• Prediction: ŷ = f(x, ŵ) = ŵ^T x
• Predicted temperature: 21.72
• Actual temperature: 21.24
• Squared error: 0.2347

[Figure: test point and fitted plane over the Temp vs. Humidity and Pressure plot]

Application of Regression:
A Method to Handle Missing Values
• Use most probable value to fill the missing value:
– Use regression techniques to predict the missing value
(regression imputation)
• Let x1, x2, …, xd be a set of d attributes
• Regression (multivariate): The nth value is predicted as
yn = f(xn1, xn2, …, xnd )

• Simple or multiple linear regression: y_n = w_1 x_n1 + w_2 x_n2 + … + w_d x_nd
• Popular strategy
• It uses the most information from the present data to
predict the missing values
• It preserves the relationship with other variables


Application of Regression:
A Method to Handle Missing Values
• Training process:
  – Let y be the attribute whose missing values are to be predicted
  – Training examples: all x = [x1, x2, …, xd]^T, a set of d independent attributes, for which the dependent variable y is available
  – The values of the coefficients will be determined by fitting the linear function to the training data
  – Example - Dependent variable: Temperature; Independent variables: Humidity and Rainfall

Application of Regression:
A Method to Handle Missing Values
• Testing process (prediction):
  – The optimal coefficient vector ŵ is given by
        ŵ = (X^T X)^{-1} X^T y
  – For any test example x, the predicted value is given by:
        ŷ = f(x, ŵ) = ŵ^T x = Σ_{i=0}^{d} ŵ_i x_i


Nonlinear Regression
• Nonlinear approach to model the relationship between a scalar response y (dependent variable) and one or more predictor variables, x or x (independent variables)
• The response is a nonlinear function of the input (one or more independent variables)
• Simple nonlinear regression (polynomial curve fitting):
  – Single independent variable (x ∈ R)
  – Single dependent variable (y)
  – Fitting a curve
• Nonlinear regression (polynomial regression):
  – One or more independent variables (x ∈ R^d)
  – Single dependent variable (y)
  – Fitting a surface

Polynomial Curve Fitting


• Given: Training data: D = {(x_n, y_n)}_{n=1}^{N}, x_n ∈ R and y_n ∈ R
• Function governing the relationship between input and output, given by a polynomial function of degree p:
      y_n = f(x_n, w) = w_0 + w_1 x_n + w_2 x_n² + … + w_p x_n^p = Σ_{j=0}^{p} w_j x_n^j
• The coefficients w = [w_0, w_1, …, w_p]^T are the parameters of the polynomial curve (regression coefficients) - unknown
• The polynomial function f(x_n, w) is a nonlinear function of x_n and it is a linear function of the coefficients w
  – Linear model for regression

[Figure: polynomial curve y = f(x, w) fitted to data points in the x-y plane]


Polynomial Curve Fitting: Training Phase


• The values of the coefficients will be determined by fitting the polynomial curve to the training data
• Method of least squares: minimizes the squared error between the actual data (y_n), i.e. the actual dependent variable, and the estimate of the curve (predicted dependent variable ŷ_n), i.e. the function f(x_n, w):
      ŷ_n = f(x_n, w) = w_0 + w_1 x_n + w_2 x_n² + … + w_p x_n^p
      minimize_w  E(w) = (1/2) Σ_{n=1}^{N} (ŷ_n − y_n)²
• The error function is a quadratic function of the coefficients w, and
• the derivatives of the error function with respect to the coefficients are linear in the elements of w
• Hence the minimization of the error function has a unique solution, which can be found in closed form

Polynomial Curve Fitting: Training Phase


      ŷ_n = f(x_n, w) = w_0 + w_1 x_n + w_2 x_n² + … + w_p x_n^p = Σ_{j=0}^{p} w_j x_n^j

• Let us consider the substitution (p is the degree of the polynomial):
      z_n1 = x_n,  z_n2 = x_n²,  z_n3 = x_n³,  …,  z_np = x_n^p
      ŷ_n = f(z_n, w) = w_0 + w_1 z_n1 + w_2 z_n2 + … + w_p z_np
      ŷ_n = f(z_n, w) = Σ_{j=0}^{p} w_j z_nj = w^T z_n
  where w = [w_0, w_1, …, w_p]^T and z_n = [1, z_n1, …, z_np]^T


Polynomial Curve Fitting: Training Phase


• Cost function for optimization:
      E(w) = (1/2) Σ_{n=1}^{N} (f(z_n, w) − y_n)²
• Conditions for optimality:
      ∂E(w)/∂w = 0
• Application of the optimality conditions gives the optimal ŵ:
      ∂/∂w [ (1/2) Σ_{n=1}^{N} (Σ_{j=0}^{p} w_j z_nj − y_n)² ] = 0
      ∂/∂w [ (1/2) Σ_{n=1}^{N} (w^T z_n − y_n)² ] = 0

Polynomial Curve Fitting: Training Phase


• Cost function for optimization:
      E(w) = (1/2) Σ_{n=1}^{N} (f(z_n, w) − y_n)²
• Conditions for optimality:
      ∂E(w)/∂w = 0
• Application of the optimality conditions gives the optimal ŵ:
      ∂/∂w [ (1/2) Σ_{n=1}^{N} (w^T z_n − y_n)² ] = 0
      ŵ = (Z^T Z)^{-1} Z^T y
  where Z is the N × (p+1) Vandermonde matrix with rows [1, z_n1, z_n2, …, z_np], z_nj = x_n^j, and y = [y_1, y_2, …, y_N]^T
  – Assumption: p < N
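A minimal NumPy sketch of polynomial curve fitting via the Vandermonde matrix, using the temperature-humidity pairs from the humidity-prediction illustration later in the slides; the degree and the test value are assumptions for illustration only.

    import numpy as np

    # Humidity (y) as a polynomial function of temperature (x)
    x = np.array([25.47, 26.19, 25.17, 24.30, 24.07, 21.21, 23.49, 21.79,
                  25.09, 25.39, 23.89, 22.51, 22.90, 21.72, 23.18])
    y = np.array([82.19, 83.15, 85.34, 87.69, 87.65, 95.95, 96.17, 98.59,
                  88.33, 90.43, 94.54, 99.00, 98.00, 99.00, 98.97])

    p = 3                                        # degree of the polynomial
    Z = np.vander(x, p + 1, increasing=True)     # rows [1, x_n, x_n^2, ..., x_n^p]
    w_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)    # w = (Z^T Z)^(-1) Z^T y

    # Predict humidity for a test temperature, e.g. 22.98 from the test slide
    z_test = np.array([22.98 ** j for j in range(p + 1)])
    print(w_hat, z_test @ w_hat)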


Polynomial Curve Fitting: Testing


• The optimal coefficient vector ŵ is given by
      ŵ = (Z^T Z)^{-1} Z^T y = Z⁺ y
  where Z⁺ = (Z^T Z)^{-1} Z^T is the pseudo-inverse of the matrix Z
• For any test example x, the predicted value is given by:
      ŷ = f(x, ŵ) = ŵ^T z = Σ_{j=0}^{p} ŵ_j x^j
• The prediction accuracy for a single example is measured in terms of squared error:
      E = (ŷ − y)²
• Let N_t be the total number of test samples
• The prediction accuracy of the regression model is measured in terms of root mean squared error:
      E_RMS = sqrt( (1/N_t) Σ_{n=1}^{N_t} (ŷ_n − y_n)² )

Determining p, Degree of Polynomial


• The degree p is determined experimentally
• Starting with p = 1, the test set is used to estimate the accuracy (in terms of error) of the regression model
• This process is repeated, each time incrementing p
• The regression model with the p that gives the minimum error on the test set may be selected (a sketch of this loop follows)
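An illustrative sketch of this model-selection loop (the function names are assumptions, not from the slides):

    import numpy as np

    def fit_poly(x, y, p):
        """Least-squares polynomial fit of degree p; returns the coefficient vector w."""
        Z = np.vander(x, p + 1, increasing=True)
        return np.linalg.lstsq(Z, y, rcond=None)[0]

    def rmse(w, x, y):
        """Root mean squared error of the degree-(len(w)-1) polynomial w on data (x, y)."""
        Z = np.vander(x, len(w), increasing=True)
        return np.sqrt(np.mean((Z @ w - y) ** 2))

    def select_degree(x_train, y_train, x_test, y_test, max_p=9):
        """Try p = 1..max_p and return the degree with the smallest test RMSE."""
        errors = {p: rmse(fit_poly(x_train, y_train, p), x_test, y_test)
                  for p in range(1, max_p + 1)}
        return min(errors, key=errors.get), errors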


Illustration of Polynomial Curve Fitting: Humidity Prediction - Training

• Degree of polynomial p: 1
• ŵ = (Z^T Z)^{-1} Z^T y

  Temp (x)   Humidity (y)
  25.47      82.19
  26.19      83.15
  25.17      85.34
  24.30      87.69
  24.07      87.65
  21.21      95.95
  23.49      96.17
  21.79      98.59
  25.09      88.33
  25.39      90.43
  23.89      94.54
  22.51      99.00
  22.90      98.00
  21.72      99.00
  23.18      98.97

[Figure: degree-1 fit over the Humidity vs. Temperature scatter plot]

Illustration of Polynomial Curve Fitting: Humidity Prediction - Test

• Degree of polynomial p: 1
• Test example: Temp (x) = 22.98, Humidity (y) = unknown
• Predicted humidity: 95.05
• Actual humidity: 98.76
• Squared error: 13.77

[Figure: degree-1 fit with predicted (95.05) and actual (98.76) humidity marked at x = 22.98]


Illustration of Polynomial Curve Fitting: Humidity Prediction - Training

• Degree of polynomial p: 2
• ŵ = (Z^T Z)^{-1} Z^T y
• Training data: same Temp (x) and Humidity (y) table as above

[Figure: degree-2 fit over the Humidity vs. Temperature scatter plot]

Illustration of Polynomial Curve Fitting: Humidity Prediction - Test

• Degree of polynomial p: 2
• Test example: Temp (x) = 22.98, Humidity (y) = unknown
• Predicted humidity: 96.21
• Actual humidity: 98.76
• Squared error: 6.49

[Figure: degree-2 fit with predicted (96.21) and actual (98.76) humidity marked at x = 22.98]


Illustration of Polynomial Curve Fitting: Humidity Prediction - Training

• Degree of polynomial p: 3
• ŵ = (Z^T Z)^{-1} Z^T y
• Training data: same Temp (x) and Humidity (y) table as above

[Figure: degree-3 fit over the Humidity vs. Temperature scatter plot]

Illustration of Polynomial Curve Fitting: Humidity Prediction - Test

• Degree of polynomial p: 3
• Test example: Temp (x) = 22.98, Humidity (y) = unknown
• Predicted humidity: 97.71
• Actual humidity: 98.76
• Squared error: 1.11

[Figure: degree-3 fit with predicted (97.71) and actual (98.76) humidity marked at x = 22.98]


Illustration: Polynomial Curve Fitting


• Condition: small number of training examples (N = 10)
• Effect of increasing the degree of the polynomial (p)

[Figure: polynomial fits for p = 0, 1, 3, 9 on the same data, and a plot of E_RMS versus p; from C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006]

Illustration: Polynomial Curve Fitting


• Increasing the size of the data set reduces the over-fitting problem

[Figure: degree-9 polynomial fits for N = 15 and N = 100 training examples; from C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006]


Nonlinear Regression:
Polynomial Regression
• Polynomial regression:
  – One or more independent variables (x ∈ R^d)
  – Single dependent variable (y)
• Given: Training data: D = {(x_n, y_n)}_{n=1}^{N}, x_n ∈ R^d and y_n ∈ R
• Function governing the relationship between input and output, given by a polynomial function of degree p:
      y_n = f(x_n, w) = f(φ(x_n), w) = Σ_{j=0}^{D−1} w_j φ_j(x_n)
  – D is the number of monomials of the polynomial up to degree p
  – φ_j(x_n) is the jth monomial (of degree up to p) of x_n
• For 2-dimensional input, x_n = [x_n1, x_n2]^T, and degree p = 2:
      φ(x_n) = [φ_0(x_n), φ_1(x_n), φ_2(x_n), φ_3(x_n), φ_4(x_n), φ_5(x_n)]^T,  D = 6
      φ(x_n) = [1, √2 x_n1, √2 x_n2, x_n1², x_n2², √2 x_n1 x_n2]^T

Nonlinear Regression:
Polynomial Regression
• Polynomial regression (as above): y_n = f(x_n, w) = f(φ(x_n), w) = Σ_{j=0}^{D−1} w_j φ_j(x_n)
• The number of monomials D for a polynomial of degree p in dimension d is given by
      D = (d + p)! / (d! p!)
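A hedged sketch (not part of the slides) of building the monomial feature map and solving the least-squares problem. It enumerates plain monomials without the √2 scaling used in the feature-map example above, which changes the coefficients but not the fitted predictions; np.linalg.lstsq is used instead of an explicit inverse for numerical stability.

    import numpy as np
    from itertools import combinations_with_replacement

    def monomial_features(X, p):
        """All monomials of the columns of X up to degree p (including the constant 1).
        The number of columns returned equals D = (d + p)! / (d! p!)."""
        N, d = X.shape
        cols = []
        for degree in range(p + 1):
            for idx in combinations_with_replacement(range(d), degree):
                cols.append(np.prod(X[:, idx], axis=1) if idx else np.ones(N))
        return np.column_stack(cols)

    # Illustrative usage on the humidity/pressure -> temperature table, degree p = 3 (D = 10 < N = 15)
    H = [82.19, 83.15, 85.34, 87.69, 87.65, 95.95, 96.17, 98.59, 88.33, 90.43, 94.54, 99.00, 98.00, 99.00, 98.97]
    P = [1036.35, 1037.60, 1037.89, 1036.86, 1027.83, 1006.92, 1006.57, 1009.42, 991.65, 1009.66, 1009.27, 1009.80, 1009.90, 996.29, 800.00]
    y = np.array([25.47, 26.19, 25.17, 24.30, 24.07, 21.21, 23.49, 21.79, 25.09, 25.39, 23.89, 22.51, 22.90, 21.72, 23.18])
    Phi = monomial_features(np.column_stack([H, P]), p=3)
    w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)            # same fit as (Phi^T Phi)^(-1) Phi^T y
    phi_test = monomial_features(np.array([[99.00, 1009.21]]), p=3)
    print(phi_test @ w_hat)                                     # predicted temperature for the test example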


Nonlinear Regression:
Polynomial Regression
• Given: Training data: D = {(x_n, y_n)}_{n=1}^{N}, x_n ∈ R^d and y_n ∈ R
• Function governing the relationship between input and output, given by a polynomial function of degree p:
      y_n = f(x_n, w) = f(φ(x_n), w) = Σ_{j=0}^{D−1} w_j φ_j(x_n)
• The coefficients w = [w_0, w_1, …, w_{D−1}]^T are the parameters of the surface (polynomial function) (regression coefficients) - unknown
• The polynomial function f(x_n, w) is a nonlinear function of x_n and it is a linear function of the coefficients w
  – Linear model for regression

[Figure: polynomial surface y = f(x_n, w), with x = [x1, x2]^T, fitted to data points - fitting a surface]

Nonlinear Regression:
Polynomial Regression
• The values of the coefficients will be determined by fitting the polynomial to the training data
• Method of least squares: minimizes the squared error between the actual data (y_n), i.e. the actual dependent variable, and the estimate of the surface (predicted dependent variable ŷ_n), i.e. the function f(x_n, w):
      ŷ_n = f(x_n, w) = f(φ(x_n), w) = Σ_{j=0}^{D−1} w_j φ_j(x_n)
      minimize_w  E(w) = (1/2) Σ_{n=1}^{N} (ŷ_n − y_n)²
• The error function is a quadratic function of the coefficients w
• The derivatives of the error function with respect to the coefficients are linear in the elements of w
• Hence the minimization of the error function has a unique solution, which can be found in closed form


Polynomial Regression : Training Phase

      ŷ_n = f(x_n, w)
      ŷ_n = f(φ(x_n), w)
      ŷ_n = Σ_{j=0}^{D−1} w_j φ_j(x_n)
      ŷ_n = w^T φ(x_n)
  where w = [w_0, w_1, …, w_{D−1}]^T and
      φ(x_n) = [φ_0(x_n), φ_1(x_n), φ_2(x_n), …, φ_{D−1}(x_n)]^T
45

Polynomial Regression : Training Phase


• Cost function for optimization:
      E(w) = (1/2) Σ_{n=1}^{N} (f(φ(x_n), w) − y_n)²
• Conditions for optimality:
      ∂E(w)/∂w = 0
• Application of the optimality conditions gives the optimal ŵ:
      ∂/∂w [ (1/2) Σ_{n=1}^{N} (Σ_{j=0}^{D−1} w_j φ_j(x_n) − y_n)² ] = 0
      ∂/∂w [ (1/2) Σ_{n=1}^{N} (w^T φ(x_n) − y_n)² ] = 0


Polynomial Regression : Training Phase


• Cost function for optimization:
      E(w) = (1/2) Σ_{n=1}^{N} (f(φ(x_n), w) − y_n)²
• Conditions for optimality:
      ∂E(w)/∂w = 0
• Application of the optimality conditions gives the optimal ŵ:
      ∂/∂w [ (1/2) Σ_{n=1}^{N} (w^T φ(x_n) − y_n)² ] = 0
      ŵ = (Φ^T Φ)^{-1} Φ^T y
  where Φ is the N × D design matrix with rows [φ_0(x_n), φ_1(x_n), …, φ_{D−1}(x_n)]
  – Assumption: D < N

Polynomial Regression: Testing


• The optimal coefficient vector ŵ is given by
      ŵ = (Φ^T Φ)^{-1} Φ^T y = Φ⁺ y
  where Φ⁺ = (Φ^T Φ)^{-1} Φ^T is the pseudo-inverse of the matrix Φ
• For any test example x, the predicted value is given by:
      ŷ = f(x, ŵ) = ŵ^T φ(x) = Σ_{j=0}^{D−1} ŵ_j φ_j(x)
• The prediction accuracy for a single example is measured in terms of squared error:
      E = (ŷ − y)²
• Let N_t be the total number of test samples
• The prediction accuracy of the regression model is measured in terms of root mean squared error:
      E_RMS = sqrt( (1/N_t) Σ_{n=1}^{N_t} (ŷ_n − y_n)² )


Determining p, Degree of Polynomial


• The degree p is determined experimentally
• Starting with p = 1, the test set is used to estimate the accuracy (in terms of error) of the regression model
• This process is repeated, each time incrementing p
• The regression model with the p that gives the minimum error on the test set may be selected


Illustration of Polynomial Regression: Temperature Prediction

• Training data: Humidity (x1), Pressure (x2), Temp (y) - same table as in the multiple linear regression illustration

[Figure: 3D scatter plot of Temp vs. Humidity and Pressure]


Illustration of Polynomial Regression: Temperature Prediction

• Training:
  – Polynomial degree p = 3
  – ŵ = (Φ^T Φ)^{-1} Φ^T y
• Training data: same Humidity (x1), Pressure (x2), Temp (y) table as above

[Figure: fitted polynomial surface over the Temp vs. Humidity and Pressure scatter plot]

Illustration of Polynomial Regression: Temperature Prediction - Test

• Degree of polynomial p = 3
• ŵ = (Φ^T Φ)^{-1} Φ^T y
• Test example: Humidity (x1) = 99.00, Pressure (x2) = 1009.21, Temp (y) = unknown
• Prediction: ŷ = f(x, ŵ) = ŵ^T φ(x) = Σ_{j=0}^{D−1} ŵ_j φ_j(x)
• Predicted temperature: 21.05
• Actual temperature: 21.24
• Squared error: 0.035

[Figure: test point on the fitted polynomial surface]


Multiple Linear Regression vs Polynomial Regression: Temperature Prediction

• Test example: Humidity = 99.00, Pressure = 1009.21
• Multiple linear regression:
  – Predicted temperature: 21.72
  – Actual temperature: 21.24
  – Squared error: 0.2347
• Polynomial regression (degree of polynomial p = 3):
  – Predicted temperature: 21.05
  – Actual temperature: 21.24
  – Squared error: 0.035

[Figure: fitted plane (multiple linear regression) vs. fitted surface (polynomial regression) over the Temp vs. Humidity and Pressure plots]

Autoregression (AR)


Autoregression (AR)
• Regression on the values of the same attribute
• Autoregression is a time series model that
  – uses observations from previous time steps as input to a linear regression equation to predict the value at the next time step

Time Series Data


• Time series is a sequential set of data points,
measured typically over successive times
• Time series data are simply a collection of
observations gathered over time
• Time series data is given as:
      X = (x1, x2, …, xt, …, xT)
  – xt is the observation at time t
  – T is the number of observations
• Example:
  – Weekly sales – the time interval is a week
  – Daily temperature in Kamand – the time interval is a day
• Time series analysis comprises methods for analysing time series data in order to extract meaningful statistics and other characteristics of the data
• Scope: We consider a single variable xt


Time Series Data and Dependence


• Time series data is given as:
      X = (x1, x2, …, xt, …, xT)
  – xt is the observation at time t
  – T is the number of observations
• In time series data, the value of each element at time t (xt) is dependent on the values of the elements at the previous p time steps (xt-1, xt-2, …, xt-p) – p is the time lag

Time Series Data and Dependence


• Example: i.i.d. data series
  – xt is a random number drawn from N(0,1)
• Each element at time t (xt) is not dependent on the values of the elements at the previous p time steps (xt-1, xt-2, …, xt-p)
      0.54  1.83  -2.26  0.86  0.32  -1.31  -0.43  0.34  3.58  2.77  -1.35  3.03  0.73  -0.06  0.71


Time Series Data and Dependence


• Example: Daily temperature at Kamand
• Each element at time t (xt) is dependent on the values of the elements at the previous p time steps (xt-1, xt-2, …, xt-p) – p is the time lag
      25.47  26.19  25.17  24.3  24.07  21.21  23.49  21.79  25.09  25.39  23.89  22.51  22.9  21.72  23.18

Checking Dependency
• It’s not always easy to just look at a time-series plot and say whether or not the series is independent
• Saying that xt in a series is independent means that knowing previous values doesn’t help you to predict the next value
  – Knowing xt-1 doesn’t help to predict xt
  – More generally, knowing xt-1, xt-2, …, xt-p doesn’t help to predict xt
• p is the number of previous time steps (time lag)
• The dependency of each element at time t (xt) on the values of the elements at the previous p time steps (xt-1, xt-2, …, xt-p) is observed using autocorrelation


Checking Dependency - Autocorrelation


• The relationship between variables is called
correlation
• Autocorrelation: The correlation calculated between
the variable and itself at previous time steps
• Example: i.i.d. data series
  – Autocorrelation between xt and xt-1 – Pearson correlation coefficient

      xt    0.54  1.83  -2.26  0.86  0.32  -1.31  -0.43  0.34  3.58  2.77  -1.35  3.03  0.73  -0.06  0.71
      xt-1        0.54   1.83  -2.26  0.86  0.32  -1.31  -0.43  0.34  3.58  2.77  -1.35  3.03  0.73  -0.06

  – Autocorrelation matrix:

              xt        xt-1
      xt      1         -0.1242
      xt-1    -0.1242   1

[Figure: scatter plot of xt vs. xt-1 showing no clear relationship]

Checking Dependency - Autocorrelation


• The relationship between variables is called
correlation
• Autocorrelation: The correlation calculated between
the variable and itself at previous time steps
• Example: Daily temperature at Kamand
  – Autocorrelation between xt and xt-1

      xt    25.47  26.19  25.17  24.3  24.07  21.21  23.49  21.79  25.09  25.39  23.89  22.51  22.9  21.72  23.18
      xt-1         25.47  26.19  25.17  24.3  24.07  21.21  23.49  21.79  25.09  25.39  23.89  22.51  22.9  21.72

  – Autocorrelation matrix:

              xt       xt-1
      xt      1        0.4054
      xt-1    0.4054   1

[Figure: scatter plot of xt vs. xt-1 showing a positive relationship]
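A small sketch (not from the slides) of computing the lag-1 autocorrelation shown above with NumPy:

    import numpy as np

    def lag_autocorrelation(x, lag=1):
        """Pearson correlation between x_t and x_{t-lag}."""
        x = np.asarray(x, dtype=float)
        return np.corrcoef(x[lag:], x[:-lag])[0, 1]

    temps = [25.47, 26.19, 25.17, 24.3, 24.07, 21.21, 23.49, 21.79,
             25.09, 25.39, 23.89, 22.51, 22.9, 21.72, 23.18]
    print(lag_autocorrelation(temps, lag=1))   # the slide reports 0.4054 for this series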


Autoregression (AR) Model


• Autoregression (AR) is a linear regression model that
uses observations from previous time steps as input
to predict the value at the next time step
• An autoregression (AR) model makes an assumption
that the observations at previous time steps are
useful to predict the value at the next time step
• The autocorrelation statistics help to choose which lag
variables (p) will be useful in a model
• Interestingly, if all lag variables (xt-1 , xt-2, …, xt-p )
show low or no correlation with the output variable
(xt), then it suggests that the time series problem
may not be predictable
• This can be very useful when getting started on a new
dataset


Autoregression (AR) Model


• AR(1) model: AR model using one time lag (p = 1)
  – uses xt-1, i.e. the value of the previous time step, to predict xt
• Given: Time series data: X = (x1, x2, …, xt, …, xT)
  – xt is the observation at time t
  – T is the number of observations
• The AR(1) model is given as:
      xt = f(xt-1, w0, w1) = w0 + w1 xt-1
  – The coefficients w0 and w1 are the parameters of the straight line (regression coefficients) - unknown
• The regression coefficients are obtained as in simple linear regression (straight-line regression) using the least-squares method


AR(1) Model - Training


• The regression coefficients are obtained as in simple linear regression (straight-line regression) using the least-squares method
• Minimize the squared error between the actual data (xt) at time t and the estimate of the linear function (predicted variable x̂t), i.e. the function f(xt-1, w0, w1):
      x̂t = f(xt-1, w0, w1) = w0 + w1 xt-1
      minimize_{w0, w1}  E(w0, w1) = (1/2) Σ_{t=2}^{T} (x̂t − xt)²
• The optimal ŵ0 and ŵ1 are given as
      ŵ1 = Σ_{t=2}^{T} (xt-1 − μt-1)(xt − μt) / Σ_{t=2}^{T} (xt-1 − μt-1)²
      ŵ0 = μt − ŵ1 μt-1
  – μt-1: sample mean of the variables at time t-1, xt-1
  – μt: sample mean of the variables at time t, xt
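A hedged NumPy sketch of fitting AR(1) by these formulas; the function and variable names are illustrative, not from the slides.

    import numpy as np

    def fit_ar1(series):
        """Least-squares AR(1) fit: x_t ~ w0 + w1 * x_{t-1}."""
        x = np.asarray(series, dtype=float)
        x_prev, x_curr = x[:-1], x[1:]
        mu_prev, mu_curr = x_prev.mean(), x_curr.mean()
        w1 = np.sum((x_prev - mu_prev) * (x_curr - mu_curr)) / np.sum((x_prev - mu_prev) ** 2)
        w0 = mu_curr - w1 * mu_prev
        return w0, w1

    # Illustrative usage on a short temperature sample; the slides' full 61-day series
    # gives w1 approx. 0.523 and w0 approx. 10.861
    temps = [25.47, 26.19, 25.17, 24.3, 24.07, 21.21, 23.49, 21.79, 25.09, 25.39]
    w0, w1 = fit_ar1(temps)
    x_next = w0 + w1 * temps[-1]   # one-step-ahead prediction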

AR(1) Model: Testing


• For any test example at time t-1, xt-1, the predicted value at time t, x̂t, is given by:
      x̂t = f(xt-1, ŵ0, ŵ1) = ŵ0 + ŵ1 xt-1
• The prediction accuracy for a single example is measured in terms of squared error:
      E = (x̂t − xt)²
• Let T_test be the total number of test samples
• The prediction accuracy of the regression model is measured in terms of root mean squared error:
      E_RMS = sqrt( (1/T_test) Σ_{t=1}^{T_test} (x̂t − xt)² )


Autoregression Model
• AR(p) model: AR model using p time lags (p < T)
  – uses xt-1, xt-2, …, xt-p, i.e. the values of the previous p time steps, to predict xt
• Given: Time series data: X = (x1, x2, …, xt, …, xT)
  – xt is the observation at time t
  – T is the number of observations
• The AR(p) model is given as:
      xt = f(xt-1, w0, w1, …, wp) = w0 + w1 xt-1 + … + wp xt-p
      xt = f(x, w) = w0 + Σ_{j=1}^{p} wj xt-j = w^T x
  where w = [w0, w1, …, wp]^T and x = [1, xt-1, xt-2, …, xt-p]^T
  – The coefficients w0, w1, …, wp are the parameters of the hyperplane (regression coefficients) - unknown

AR (p) Model - Training


• The regression coefficients are obtained as in multiple linear regression with p input variables using the least-squares method
• Minimize the squared error between the actual data (xt) at time t and the estimate of the linear function (predicted variable x̂t), i.e. the function f(x, w):
      x̂t = f(x, w) = w0 + Σ_{j=1}^{p} wj xt-j = w^T x
      minimize_w  E(w) = (1/2) Σ_{t=p+1}^{T} (x̂t − xt)²
• The autocorrelation statistics help to choose which lag variables (p) will be useful in a model


AR (p) Model - Training


• The optimal ŵ is given as
      ŵ = (X^T X)^{-1} X^T x(t)
  where X is the data matrix with time lag, with rows [1, xt-1, xt-2, …, xt-p] for t = p+1, …, T,
  and x(t) = [xp+1, …, xt, …, xT]^T is the vector of target values
• The autocorrelation statistics help to choose which lag variables (p) will be useful in a model
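An illustrative NumPy sketch of the AR(p) least-squares fit; the names are assumptions, not from the slides.

    import numpy as np

    def fit_ar(series, p):
        """AR(p) least-squares fit; returns w = [w0, w1, ..., wp]."""
        x = np.asarray(series, dtype=float)
        T = len(x)
        # Row for time t: [1, x_{t-1}, x_{t-2}, ..., x_{t-p}], target x_t
        X = np.column_stack([np.ones(T - p)] + [x[p - j:T - j] for j in range(1, p + 1)])
        target = x[p:]
        w, *_ = np.linalg.lstsq(X, target, rcond=None)   # same as (X^T X)^(-1) X^T x(t)
        return w

    def predict_next(series, w):
        """One-step-ahead prediction from the last p observations."""
        p = len(w) - 1
        lags = np.asarray(series, dtype=float)[-1:-p - 1:-1]   # [x_{t-1}, ..., x_{t-p}]
        return w[0] + np.dot(w[1:], lags)

    temps = [25.47, 26.19, 25.17, 24.3, 24.07, 21.21, 23.49, 21.79, 25.09, 25.39, 23.89]
    w = fit_ar(temps, p=2)
    print(predict_next(temps, w))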

AR (p) Model: Testing


• The value at time t, x̂t, is predicted by taking the values from the past p time steps (xt-1, xt-2, …, xt-p) as input:
      x̂t = f(x, ŵ) = ŵ0 + Σ_{j=1}^{p} ŵj xt-j = ŵ^T x
• The prediction accuracy for a single example is measured in terms of squared error:
      E = (x̂t − xt)²
• Let T_test be the total number of test samples
• The prediction accuracy of the regression model is measured in terms of root mean squared error:
      E_RMS = sqrt( (1/T_test) Σ_{t=1}^{T_test} (x̂t − xt)² )


Illustration AR(1) Model – Prediction of Temperature: Training

• T, the number of observations = 61

  Date      Temp (xt)
  Sept 1    25.47
  Sept 2    26.19
  Sept 3    25.17
  Sept 4    24.30
  Sept 5    24.07
  Sept 6    21.21
  Sept 7    23.49
  Sept 8    21.79
  Sept 9    25.09
  Sept 10   25.39
  ---       ---
  Oct 29    23.06
  Oct 30    23.72
  Oct 31    23.02

Illustration AR(1) Model – Prediction of Temperature: Training

• Number of training pairs (xt-1, xt) = 60
• ŵ1 = Σ (xt-1 − μt-1)(xt − μt) / Σ (xt-1 − μt-1)²,  ŵ0 = μt − ŵ1 μt-1
• μt-1: 22.81,  μt: 22.85
• ŵ1: 0.523,  ŵ0: 10.861

  Date      Temp (xt-1)   Temp (xt)
  Sept 1    -             25.47
  Sept 2    25.47         26.19
  Sept 3    26.19         25.17
  Sept 4    25.17         24.30
  Sept 5    24.30         24.07
  Sept 6    24.07         21.21
  Sept 7    21.21         23.49
  Sept 8    23.49         21.79
  Sept 9    21.79         25.09
  Sept 10   25.09         25.39
  ---       ---           ---
  Oct 29    22.76         23.06
  Oct 30    23.06         23.72
  Oct 31    23.72         23.02


Illustration AR(1) Model – Prediction of Temperature: Test

• ŵ1: 0.523,  ŵ0: 10.861
• Test example: Nov 2, with previous-day temperature xt-1 = 22.30
• Predicted temperature for Nov 2: 22.52
• Actual temperature on Nov 2: 21.43
• Squared error: 1.19

Illustration AR(p) Model – Prediction of Temperature: Checking Dependency

• p = 3
• T, the number of observations = 61

  Date      Temp (xt)
  Sept 1    25.47
  Sept 2    26.19
  Sept 3    25.17
  Sept 4    24.30
  Sept 5    24.07
  Sept 6    21.21
  Sept 7    23.49
  Sept 8    21.79
  Sept 9    25.09
  ---       ---
  Oct 28    22.76
  Oct 29    23.06
  Oct 30    23.72
  Oct 31    23.02


Illustration AR(p) Model – Prediction of Temperature: Checking Dependency

• p = 3
• T, the number of observations = 61
• Autocorrelation between xt and xt-1: 0.54
• Autocorrelation between xt and xt-2: 0.25
• Autocorrelation between xt and xt-3: -0.08
• An autocorrelation is deemed significant if |autocorrelation| ≥ 2/√T ≈ 0.25
• Time lag p = 2 is sufficient, as xt is significantly correlated with xt-1 and xt-2

  Date      Temp (xt-3)   Temp (xt-2)   Temp (xt-1)   Temp (xt)
  Sept 1    -             -             -             25.47
  Sept 2    -             -             25.47         26.19
  Sept 3    -             25.47         26.19         25.17
  Sept 4    25.47         26.19         25.17         24.30
  Sept 5    26.19         25.17         24.30         24.07
  Sept 6    25.17         24.30         24.07         21.21
  Sept 7    24.30         24.07         21.21         23.49
  Sept 8    24.07         21.21         23.49         21.79
  Sept 9    21.21         23.49         21.79         25.09
  ---       ---           ---           ---           ---
  Oct 28    22.83         23.98         24.47         22.76
  Oct 29    23.98         24.47         22.76         23.06
  Oct 30    24.47         22.76         23.06         23.72
  Oct 31    22.76         23.06         23.72         23.02

Illustration AR(p) Model – Prediction of Temperature: Training

• p = 2
• Number of training examples = 59
• Multiple linear regression with number of input variables = 2
• ŵ = (X^T X)^{-1} X^T x(t);  ŵ ∈ R³

  Date      Temp (xt-2)   Temp (xt-1)   Temp (xt)
  Sept 1    -             -             25.47
  Sept 2    -             25.47         26.19
  Sept 3    25.47         26.19         25.17
  Sept 4    26.19         25.17         24.30
  Sept 5    25.17         24.30         24.07
  Sept 6    24.30         24.07         21.21
  Sept 7    24.07         21.21         23.49
  Sept 8    21.21         23.49         21.79
  Sept 9    23.49         21.79         25.09
  ---       ---           ---           ---
  Oct 28    23.98         24.47         22.76
  Oct 29    24.47         22.76         23.06
  Oct 30    22.76         23.06         23.72
  Oct 31    23.06         23.72         23.02


Illustration AR(p) Model – Prediction of Temperature: Test

• ŵ = (X^T X)^{-1} X^T x(t);  ŵ ∈ R³
• Test example: Nov 2, with xt-2 = 23.02 and xt-1 = 22.30
• Predicted temperature for Nov 2: 22.49
• Actual temperature on Nov 2: 21.43
• Squared error: 1.13

Summary: Regression
• Regression analysis is used to model the relationship between one or more independent (predictor) variables and a dependent (response) variable
• The response is some function of one or more input variables
• Linear regression: the response is a linear function of one or more input variables
• Nonlinear regression: the response is a nonlinear function of one or more input variables
  – Polynomial regression: the response is a nonlinear function approximated using a polynomial function up to degree p of one or more input variables


Summary: Regression
• Autoregression (AR): regression on the values of the same attribute
  – It is a time series model
  – A linear regression model that uses observations from the previous p time steps as input to predict the value at the next time step
  – It makes the assumption that the observations at previous time steps are useful to predict the value at the next time step
  – The autocorrelation statistics help to choose which lag variables (p) will be useful in a model
• An AR model can be applied to time series data with a single variable or with multiple variables
• In this course we consider only time series data with a single variable

Evaluation Metric for Regression


Squared Error
• The prediction accuracy for a single example is measured in terms of squared error:
      E = (ŷ − y)²
  – y: actual value
  – ŷ: predicted value
• Let N_t be the total number of test samples
• The prediction accuracy of the regression model is measured in terms of root mean squared error:
      E_RMS = sqrt( (1/N_t) Σ_{n=1}^{N_t} (ŷ_n − y_n)² )

R Squared (R2)
• Coefficient of determination
• Statistical measure
• It is the proportion of the variation (variance) in the dependent variable that is predictable from one or more independent variable(s)
• It provides a measure of how well observed outcomes (actual values of the dependent variable) are replicated by the model, based on the proportion of the total variation of the outcomes (dependent variable) explained by the model


R Squared (R2)
• Let N be the total number of samples:
      D = {(x_n, y_n)}_{n=1}^{N}, x_n ∈ R^d and y_n ∈ R
• y_n is the actual value of the nth dependent variable
• ŷ_n is the predicted value corresponding to y_n
• The mean of the observed data (actual values of the dependent variable):
      μ_y = (1/N) Σ_{n=1}^{N} y_n
• The residual (error) for the nth dependent variable:
      E_n = y_n − ŷ_n
• The total sum of squares, proportional to the variance of the observed data (actual values of the dependent variable):
      SS_tot = Σ_{n=1}^{N} (y_n − μ_y)²

R Squared (R2)
• The total sum of squares of the residuals (residual sum of squares):
      SS_res = Σ_{n=1}^{N} (y_n − ŷ_n)² = Σ_{n=1}^{N} E_n²
• Coefficient of determination (R²):
      R² = 1 − SS_res / SS_tot = 1 − (SS_res / N) / (SS_tot / N)
• The value of R² is typically in the range 0 to 1
• R² is interpreted as the proportion of the response variation explained by the independent variable(s) in the model
• It reflects the linear relationship between the dependent and independent variable(s)
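A small NumPy sketch (illustrative, not part of the slides) of computing E_RMS and R² from actual and predicted values:

    import numpy as np

    def rmse(y_true, y_pred):
        """Root mean squared error E_RMS."""
        y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
        return np.sqrt(np.mean((y_pred - y_true) ** 2))

    def r_squared(y_true, y_pred):
        """Coefficient of determination R^2 = 1 - SS_res / SS_tot."""
        y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
        ss_res = np.sum((y_true - y_pred) ** 2)
        ss_tot = np.sum((y_true - y_true.mean()) ** 2)
        return 1.0 - ss_res / ss_tot

    # Illustrative usage with made-up actual and predicted values
    y_true = [21.24, 25.47, 22.51]
    y_pred = [21.72, 25.10, 22.90]
    print(rmse(y_true, y_pred), r_squared(y_true, y_pred))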


R Squared (R2)
• R² = 1 − SS_res / SS_tot
• R² = 1: the fitted model explains all variability in y
• R² = 0: indicates no linear relationship between the response variable and the independent variable(s)
• R² = 0.9: 90% of the variability in the response variable (dependent variable) is explained by the independent variables
• It is more suitable for linear regression
• It captures the linear correlation between the dependent and independent variable(s)
• For simple linear regression, R² can be interpreted as the square of the correlation coefficient
• R² is not interpretable when the regression is nonlinear (the independent variables have a nonlinear relationship with the dependent variable)
  – The values may become negative (smaller than 0)

Text Books
1. J. Han and M. Kamber, Data Mining: Concepts and Techniques, Third Edition, Morgan Kaufmann Publishers, 2011.
2. C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
