Class 15-18: Regression (26 Sept to 10 Oct 2019)
Regression (Prediction)
Prediction (Regression)
• Numeric prediction: Task of predicting continuous (or
ordered) values for given input
• Example:
– Predicting potential sales of a new product given its
price
– Predicting rainfall given the temperature and humidity
in the atmosphere
[Figures: sales t = f(x) as a function of price x; rainfall t = f(x) as a function of temperature and humidity, x = [x1, x2]^T]
• Regression and prediction are synonymous terms
Prediction (Regression)
• Regression analysis is used to model the relationship
between one or more independent (predictor) variables
and a dependent (response) variable
– The dependent variable is always continuous valued or
ordered valued
– Example: Dependent variable: rainfall;
Independent variables: temperature, humidity
• The values of predictor variables are known
• The response variable is what we want to predict
• Regression analysis can be viewed as a mapping
function:
[Diagrams: input x → f(·) → output y, for the single-input and the multi-input case]
Prediction (Regression)
• Regression is a two-step process
– Step 1: Building a regression model
• Learning from data (training phase)
• The regression model is built by analysing, or learning from, a
training data set made up of one or more independent
variables and their dependent labels
• Supervised learning: In supervised learning, each example
is a pair consisting of an input example (independent
variables) and a desired output value (dependent variable)
– Step 2: Using the regression model for prediction
• Testing phase
• Predicting the dependent variable
• Accuracy of a predictor:
– How well a given predictor can predict for new values
• Target of learning techniques: Good generalization
ability
Example data: salary (y) versus years of experience (x)

x (years of experience) | y (salary)
6 | 43
11 | 59
21 | 90
1 | 20
16 | 83

[Scatter plot: salary versus years of experience]
Linear Regression
• Linear approach to model the relationship between a
scalar response y (the dependent variable) and one
or more predictor variables (x, or a vector x) (the
independent variables)
• The response is a linear function of the input (one or
more independent variables)
• Simple linear regression (straight-line regression):
– Single independent variable (x)
– Single dependent variable (y)
– Fitting a straight line
[Diagram: input x → f(·) → output y; plot of a straight line fitted to data points in the x-y plane]
• The function f(xn, w, w0) is a linear function of xn, and
it is a linear function of the coefficients w and w0
– Linear model for regression
• The least-squares estimates of the coefficients are

$$\hat{w} = \frac{\sum_{n=1}^{N} (x_n - \mu_x)(y_n - \mu_y)}{\sum_{n=1}^{N} (x_n - \mu_x)^2}, \qquad \hat{w}_0 = \mu_y - \hat{w}\,\mu_x$$

• μx: sample mean of the independent variable x
• μy: sample mean of the dependent variable y
Example: training data, salary (y) versus years of experience (x)

x (years of experience) | y (salary)
3 | 30
8 | 57
9 | 64
13 | 72
3 | 36
6 | 43
11 | 59
21 | 90
1 | 20
16 | 83

• μx: 9.1    • ŵ: 3.54
• μy: 55.4   • ŵ0: 23.21
[Scatter plot: salary versus years of experience]
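The closed-form estimates above can be checked in a few lines of Python. A minimal sketch (not part of the original slides); it reproduces ŵ ≈ 3.54 and ŵ0 ≈ 23.21 from the salary data:

```python
import numpy as np

# Training data from the example: years of experience (x), salary (y)
x = np.array([3, 8, 9, 13, 3, 6, 11, 21, 1, 16], dtype=float)
y = np.array([30, 57, 64, 72, 36, 43, 59, 90, 20, 83], dtype=float)

mu_x, mu_y = x.mean(), y.mean()   # sample means: 9.1 and 55.4

# Closed-form least-squares estimates for simple linear regression
w_hat = np.sum((x - mu_x) * (y - mu_y)) / np.sum((x - mu_x) ** 2)
w0_hat = mu_y - w_hat * mu_x

print(w_hat, w0_hat)              # approximately 3.54 and 23.21
```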
[Figure: fitted regression line over the salary data; x-axis: years of experience, y-axis: salary]
• Method of least squares: minimize the error function

$$\underset{\mathbf{w}}{\text{minimize}}\;\; E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} (\hat{y}_n - y_n)^2$$

• The error function is a quadratic function of the
coefficients w
• The derivatives of the error function with respect to the
coefficients are linear in the elements of w
• Hence the minimization of the error function has a
unique solution, which can be found in closed form
• Setting the derivative of the error function with respect to w to zero:

$$\frac{\partial E(\mathbf{w})}{\partial \mathbf{w}} = \frac{\partial}{\partial \mathbf{w}}\left[\frac{1}{2}\sum_{n=1}^{N}\left(\mathbf{w}^T\mathbf{x}_n - y_n\right)^2\right] = 0$$
[Figure: rainfall as a function of pressure and humidity, with fitted regression surface]
• Predicted rainfall: 21.72
• Actual rainfall: 21.24
• Squared error: 0.2347
Application of Regression:
A Method to Handle Missing Values
• Use most probable value to fill the missing value:
– Use regression techniques to predict the missing value
(regression imputation)
• Let x1, x2, …, xd be a set of d attributes
• Regression (multivariate): The nth value is predicted as
yn = f(xn1, xn2, …, xnd )
[Diagram: d-dimensional input x → f(·) → output y]
Application of Regression:
A Method to Handle Missing Values
• Training process:
– Let y be the attribute whose missing values are to be
predicted
– Training examples: all x = [x1, x2, …, xd]T, the set of d
independent attributes, for which the dependent variable
y is available
– The values of the coefficients are determined by
fitting the linear function to the training data
• Example: Dependent variable: Temperature;
Independent variables: Humidity and Rainfall
Application of Regression:
A Method to Handle Missing Values
• Testing process (Prediction):
– The optimal coefficient vector w is given by

$$\hat{\mathbf{w}} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y}$$

– For any test example x, the predicted value is given by

$$\hat{y} = f(\mathbf{x}, \hat{\mathbf{w}}) = \hat{\mathbf{w}}^T\mathbf{x} = \sum_{i=0}^{d} \hat{w}_i x_i$$
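As a rough illustration of regression imputation with the equations above, here is a minimal Python sketch (not from the slides; the attribute values are made-up assumptions, with humidity and rainfall as independent attributes and temperature as the attribute to impute):

```python
import numpy as np

# Hypothetical training rows: columns are [humidity, rainfall];
# temperature (y) is known for these rows
X_attrs = np.array([[0.81, 2.1],
                    [0.64, 0.0],
                    [0.92, 5.3],
                    [0.55, 0.2]])
y = np.array([24.1, 29.5, 22.3, 31.0])   # temperature

# Prepend a column of ones so that w[0] acts as the bias term w0
X = np.hstack([np.ones((X_attrs.shape[0], 1)), X_attrs])

# Normal equations: solve (X^T X) w = X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Impute the missing temperature for a test example [humidity, rainfall]
x_test = np.array([1.0, 0.70, 1.5])      # leading 1 for the bias term
print(w_hat @ x_test)                    # predicted (imputed) temperature
```

Solving the linear system with `np.linalg.solve` is numerically preferable to explicitly forming the inverse of X^T X.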
Nonlinear Regression
• Nonlinear approach to model the relationship between
a scalar response y (the dependent variable) and
one or more predictor variables (x, or a vector x) (the
independent variables)
• The response is a nonlinear function of the input
(one or more independent variables)
• Simple nonlinear regression (polynomial curve fitting):
– Single independent variable (x)
– Single dependent variable (y)
– Fitting a curve
• Nonlinear regression (polynomial regression):
– One or more independent variables (x)
– Single dependent variable (y)
– Fitting a surface
• Let zn = [1, zn1, zn2, zn3, …, znp]T be the transformed input variables. The model is

$$\hat{y}_n = f(\mathbf{z}_n, \mathbf{w}) = w_0 + w_1 z_{n1} + w_2 z_{n2} + \dots + w_p z_{np}$$

$$\hat{y}_n = f(\mathbf{z}_n, \mathbf{w}) = \sum_{j=0}^{p} w_j z_{nj} = \mathbf{w}^T \mathbf{z}_n \qquad (z_{n0} = 1)$$
• Setting the derivative of the error function with respect to w to zero:

$$\frac{\partial}{\partial \mathbf{w}}\left[\frac{1}{2}\sum_{n=1}^{N}\left(\mathbf{w}^T\mathbf{z}_n - y_n\right)^2\right] = 0$$
Squared Error
• The prediction accuracy is measured in terms of
squared error: $E = (\hat{y} - y)^2$
• Let Nt be the total number of test samples
• The prediction accuracy of a regression model is
measured in terms of root mean squared error:

$$E_{RMS} = \sqrt{\frac{1}{N_t}\sum_{n=1}^{N_t} (\hat{y}_n - y_n)^2}$$
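A small Python helper (a sketch, not from the slides) computing the squared error and E_RMS exactly as defined above:

```python
import numpy as np

def rms_error(y_pred, y_true):
    """Root mean squared error over Nt test samples."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# Usage with illustrative predicted/actual values
print((21.7 - 21.2) ** 2)                                 # squared error, one sample
print(rms_error([21.7, 95.1, 96.2], [21.2, 98.8, 98.8]))  # RMSE, Nt = 3 samples
```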
[Figure: polynomial fit of humidity as a function of temperature; predicted value marked at 95.05]
• Predicted humidity: 95.05
• Actual humidity: 98.76
• Squared error: 13.77
[Figure: polynomial fit of humidity as a function of temperature; predicted value marked at 96.21]
• Predicted humidity: 96.21
• Actual humidity: 98.76
• Squared error: 6.49
[Figure: polynomial fit of humidity as a function of temperature; predicted value marked at 97.71]
• Predicted humidity: 97.71
• Actual humidity: 98.76
• Squared error: 1.11
[Figure: polynomial fits of degree p = 3 and p = 9 to the same data, and a plot of ERMS versus the degree p. From C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.]
• Condition: small number of training examples (N = 10)
• Effect of increasing the degree of the polynomial (p):
the high-degree fit over-fits the training data
[Figure: polynomial fit of degree p = 9 with N = 100 training examples. From C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.]
• Increasing the size of the data set reduces the over-fitting problem
Nonlinear Regression:
Polynomial Regression
• Polynomial regression:
– One or more independent variables (x)
– Single dependent variable (y)
[Diagram: d-dimensional input x → f(·) → output y]
• Example basis expansion for d = 2 and degree 2 (D = 6):

$$\varphi(\mathbf{x}_n) = \left[\,1,\; \sqrt{2}\,x_{n1},\; \sqrt{2}\,x_{n2},\; x_{n1}^2,\; x_{n2}^2,\; \sqrt{2}\,x_{n1}x_{n2}\,\right]^T$$
Nonlinear Regression:
Polynomial Regression
• Given: training data $D = \{\mathbf{x}_n, y_n\}_{n=1}^{N}$, $\mathbf{x}_n \in \mathbb{R}^d$ and $y_n \in \mathbb{R}$
• The function governing the relationship between input and
output is a polynomial function of degree p:

$$y_n = f(\mathbf{x}_n, \mathbf{w}) = f(\varphi(\mathbf{x}_n), \mathbf{w}) = \sum_{j=0}^{D-1} w_j\,\varphi_j(\mathbf{x}_n)$$

[Figure: surface y = f(xn, w) fitted over inputs x = [x1, x2]^T]
• The polynomial function f(xn, w) is a nonlinear function of xn,
and it is a linear function of the coefficients w
– Linear model for regression
– Fitting a surface
Nonlinear Regression:
Polynomial Regression
• The values of the coefficients are determined by
fitting the polynomial to the training data
• Method of least squares: minimizes the squared
error between the actual dependent variable (yn) and the
predicted dependent variable (ŷn), i.e. the function f(xn, w)

$$\hat{y}_n = f(\mathbf{x}_n, \mathbf{w}) = f(\varphi(\mathbf{x}_n), \mathbf{w}) = \sum_{j=0}^{D-1} w_j\,\varphi_j(\mathbf{x}_n)$$

$$\underset{\mathbf{w}}{\text{minimize}}\;\; E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N} (\hat{y}_n - y_n)^2$$

• The error function is a quadratic function of the
coefficients w
• The derivatives of the error function with respect to the
coefficients are linear in the elements of w
• Hence the minimization of the error function has a
unique solution, which can be found in closed form
• Equivalent forms of the prediction:

$$\hat{y}_n = f(\mathbf{x}_n, \mathbf{w}) = f(\varphi(\mathbf{x}_n), \mathbf{w}) = \sum_{j=0}^{D-1} w_j\,\varphi_j(\mathbf{x}_n) = \mathbf{w}^T\varphi(\mathbf{x}_n)$$

where $\mathbf{w} = [w_0, w_1, \dots, w_{D-1}]^T$ and
$\varphi(\mathbf{x}_n) = [\varphi_0(\mathbf{x}_n), \varphi_1(\mathbf{x}_n), \varphi_2(\mathbf{x}_n), \dots, \varphi_{D-1}(\mathbf{x}_n)]^T$
• Setting the derivative of the error function with respect to w to zero:

$$\frac{\partial}{\partial \mathbf{w}}\left[\frac{1}{2}\sum_{n=1}^{N}\left(\mathbf{w}^T\varphi(\mathbf{x}_n) - y_n\right)^2\right] = 0$$
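A minimal Python sketch (not from the slides) of this least-squares fit for a single input variable, using the monomial basis φj(x) = x^j; the data and degree are illustrative assumptions:

```python
import numpy as np

def fit_polynomial(x, y, degree):
    """Least-squares fit with design matrix Phi, phi_j(x) = x**j."""
    Phi = np.vander(x, degree + 1, increasing=True)   # columns [1, x, x^2, ...]
    # Normal equations: (Phi^T Phi) w = Phi^T y
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

def predict(x, w):
    return np.vander(x, len(w), increasing=True) @ w

# Illustrative data: noisy quadratic relationship
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.05 * rng.standard_normal(x.size)

w_hat = fit_polynomial(x, y, degree=2)
print(w_hat)                    # close to [1.0, 2.0, -3.0]
print(predict(x, w_hat)[:3])    # fitted values for the first three points
```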
• The optimal coefficient vector is given by

$$\hat{\mathbf{w}} = \left(\mathbf{\Phi}^T\mathbf{\Phi}\right)^{-1}\mathbf{\Phi}^T\mathbf{y}$$
[Figures: polynomial regression example relating humidity and pressure]
Autoregression (AR)
Autoregression (AR)
• Regression on the values of the same attribute
• Autoregression is a time series model that
– uses observations from previous time steps as input to a
linear regression equation to predict the value at the
next time step
• Example time series: xt = 25.47, 26.19, 25.17, 24.30, 24.07, 21.21, 23.49, 21.79, 25.09, 25.39, 23.89, 22.51, 22.90, 21.72, 23.18
Checking Dependency
• It’s not always easy to just look at a time-series plot
and say whether or not the series is independent
• xt in a series is independent means that knowing
previous values doesn’t help you to predict the next
value
– Knowing xt-1 doesn’t help to predict xt
– More generally, knowing xt-1, xt-2, …, xt-p doesn’t help to
predict xt
• p is the number of previous time steps (the time lag)
• Dependency of each element at time t (xt) with the
values of elements at previous p time steps (xt-1 , xt-2,
…, xt-p ) is observed using autocorrelation
– Example (an independent series), lagged values:
xt−1: 0.54, 1.83, −2.26, 0.86, 0.32, −1.31, −0.43, 0.34, 3.58, 2.77, −1.35, 3.03, 0.73, −0.06
– Autocorrelation between xt and xt−1:

     | xt      | xt−1
xt   | 1       | −0.1242
xt−1 | −0.1242 | 1

– The near-zero autocorrelation indicates that xt−1 does not help to predict xt
– Example time series from above, lagged values:
xt−1: 25.47, 26.19, 25.17, 24.30, 24.07, 21.21, 23.49, 21.79, 25.09, 25.39, 23.89, 22.51, 22.90, 21.72
– Autocorrelation between xt and xt−1:

     | xt     | xt−1
xt   | 1      | 0.4054
xt−1 | 0.4054 | 1

– The larger autocorrelation indicates that xt depends on xt−1
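A short Python sketch (not part of the slides) that computes the lag-1 correlation for the example series; depending on the normalization convention used for autocorrelation, the result may differ slightly from the 0.4054 quoted above:

```python
import numpy as np

# Example series from the slides
x = np.array([25.47, 26.19, 25.17, 24.30, 24.07, 21.21, 23.49, 21.79,
              25.09, 25.39, 23.89, 22.51, 22.90, 21.72, 23.18])

def lag_correlation(x, p=1):
    """Pearson correlation between x_t and x_{t-p}."""
    return np.corrcoef(x[p:], x[:-p])[0, 1]

print(lag_correlation(x, p=1))   # lag-1 dependency of the series
```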
• Least-squares estimates of the AR(1) coefficients:

$$\hat{w}_1 = \frac{\sum_{t=2}^{T} (x_{t-1} - \mu_{t-1})(x_t - \mu_t)}{\sum_{t=2}^{T} (x_{t-1} - \mu_{t-1})^2}, \qquad \hat{w}_0 = \mu_t - \hat{w}_1\,\mu_{t-1}$$

• μt−1: sample mean of the variables at time t−1, xt−1
• μt: sample mean of the variables at time t, xt
Autoregression Model
• AR(p) model: AR model using p time lags (p < T)
– uses xt−1, xt−2, …, xt−p, i.e. the values of the previous p
time steps, to predict xt
• Given: time series data X = (x1, x2, …, xt, …, xT)
– xt is the observation at time t
– T is the number of observations
• The AR(p) model is given as:

$$x_t = f(x_{t-1}, \dots, x_{t-p};\, w_0, w_1, \dots, w_p) = w_0 + w_1 x_{t-1} + \dots + w_p x_{t-p}$$

$$x_t = f(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^{p} w_j x_{t-j} = \mathbf{w}^T \mathbf{x}$$

where $\mathbf{w} = [w_0, w_1, \dots, w_p]^T$ and $\mathbf{x} = [1, x_{t-1}, x_{t-2}, \dots, x_{t-p}]^T$
– The coefficients w0, w1, …, wp are the parameters of the
hyperplane (regression coefficients) and are unknown
• Arrange the series into a lagged data matrix X and target vector x(t):

$$\mathbf{X} = \begin{bmatrix}
1 & x_{t-1} & x_{t-2} & \dots & x_{t-p} \\
1 & x_{t} & x_{t-1} & \dots & x_{(t+1)-p} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{t+n-1} & x_{t+n-2} & \dots & x_{(t+n)-p} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{T-1} & x_{T-2} & \dots & x_{T-p}
\end{bmatrix}, \qquad
\mathbf{x}(t) = \begin{bmatrix} x_t \\ x_{t+1} \\ \vdots \\ x_{t+n} \\ \vdots \\ x_T \end{bmatrix}$$

• X is the data matrix with time lags; each row holds the p lagged values used to predict the corresponding entry of x(t)
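A minimal Python sketch (not from the slides) that builds this lagged data matrix and fits the AR(p) coefficients by least squares, using the example series from earlier:

```python
import numpy as np

def fit_ar(x, p):
    """Fit an AR(p) model x_t = w0 + sum_j w_j * x_{t-j} by least squares."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    # Each row of X is [1, x_{t-1}, ..., x_{t-p}]; the target is x_t
    X = np.array([[1.0] + [x[t - j] for j in range(1, p + 1)]
                  for t in range(p, T)])
    y = x[p:]
    w, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares solution
    return w

def predict_next(x, w):
    """One-step-ahead prediction from the last p observations."""
    p = len(w) - 1
    lags = [1.0] + [x[-j] for j in range(1, p + 1)]
    return float(np.dot(w, lags))

x = [25.47, 26.19, 25.17, 24.30, 24.07, 21.21, 23.49, 21.79,
     25.09, 25.39, 23.89, 22.51, 22.90, 21.72, 23.18]
w = fit_ar(x, p=1)
print(w)                    # [w0_hat, w1_hat]
print(predict_next(x, w))   # predicted value at the next time step
```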
AR(1) example: lag-1 pairs built from the daily series

Date | xt−1 | xt
Sept 5 | 24.30 | 24.07
Sept 6 | 24.07 | 21.21
Sept 7 | 21.21 | 23.49
Sept 8 | 23.49 | 21.79
Sept 9 | 21.79 | 25.09
Sept 10 | 25.09 | 25.39
… | … | …
Oct 29 | 22.76 | 23.06
Oct 30 | 23.06 | 23.72
Oct 31 | 23.72 | 23.02

• μt−1: 22.81    • ŵ1: 0.523
• μt: 22.85     • ŵ0: 10.861
Summary: Regression
• Regression analysis is used to model the relationship
between one or more independent (predictor) variables
and a dependent (response) variable
• The response is some function of one or more input
variables
• Linear regression: the response is a linear function of one
or more input variables
• Nonlinear regression: the response is a nonlinear function
of one or more input variables
– Polynomial regression: the response is a nonlinear function
approximated using a polynomial function up to degree p
of one or more input variables
Summary: Regression
• Autoregression (AR): regression on the values of the
same attribute
– It is a time series model
– A linear regression model that uses observations from the
previous p time steps as input to predict the value at the
next time step
– It assumes that the observations at previous time steps
are useful for predicting the value at the next time step
– Autocorrelation statistics help to choose which lag
variables (p) will be useful in the model
• An AR model can be applied to time series data with a
single variable or with multiple variables
• In this course we restrict ourselves to time series
data with a single variable
Squared Error
• The prediction accuracy is measured in terms of
squared error: $E = (\hat{y} - y)^2$
– y: actual value
– ŷ: predicted value
• Let Nt be the total number of test samples
• The prediction accuracy of a regression model is
measured in terms of root mean squared error:

$$E_{RMS} = \sqrt{\frac{1}{N_t}\sum_{n=1}^{N_t} (\hat{y}_n - y_n)^2}$$
R Squared (R²)
• Coefficient of determination
• A statistical measure
• It is the proportion of the variation (variance) in the
dependent variable that is predictable from the one
or more independent variable(s)
• It provides a measure of how well observed
outcomes (actual values of the dependent variable) are
replicated by the model, based on the proportion of the
total variation of outcomes (dependent variable)
explained by the model
R Squared (R²)
• Let N be the total number of samples:
$D = \{\mathbf{x}_n, y_n\}_{n=1}^{N}$, $\mathbf{x}_n \in \mathbb{R}^d$ and $y_n \in \mathbb{R}$
• yn is the actual value of the nth dependent variable
• ŷn is the predicted value corresponding to yn
• The mean of the observed data (actual values of the
dependent variable):

$$\mu_y = \frac{1}{N}\sum_{n=1}^{N} y_n$$
R Squared (R²)
• The total sum of squares (the total variation of the
observed data):

$$SS_{tot} = \sum_{n=1}^{N} (y_n - \mu_y)^2$$

• The sum of squares of the residuals (residual
sum of squares):

$$SS_{res} = \sum_{n=1}^{N} (y_n - \hat{y}_n)^2 = \sum_{n=1}^{N} E_n^2$$

• Coefficient of determination (R²):

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{SS_{res}/N}{SS_{tot}/N}$$

• The value of R² is in the range 0 to 1
• R² is interpreted as the proportion of response
variation explained by the independent variables in the
model
• It reflects the linear relationship between the dependent
and independent variable(s)
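A short Python sketch (not from the slides) computing R² exactly as defined above:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

# Usage with illustrative actual/predicted values
print(r_squared([30, 57, 64, 72], [33.8, 51.5, 55.1, 69.2]))
```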
R Squared (R²)

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

• R² = 1: the fitted model explains all the variability in y
• R² = 0: indicates no linear relationship between the
response variable and the independent variable(s)
• R² = 0.9: 90% of the variability in the response variable
(dependent variable) is explained by the independent variables
• R² is most suitable for linear regression
• It captures the linear correlation between the dependent and
independent variable(s)
• For simple linear regression, R² can be interpreted as the
square of the correlation coefficient
• R² is not interpretable when the regression is nonlinear
(the independent variables have a nonlinear relationship
with the dependent variable)
– Its value may become negative (smaller than 0)
Text Books
1. J. Han and M. Kamber, Data Mining: Concepts and
Techniques, Third Edition, Morgan Kaufmann Publishers,
2011.