Regression Analysis
Lecture Outline
Regression Analysis
Principles
Estimation
Testing
Practical Issues
Basic Idea of Regression Analysis
The key goal of regression analysis is to determine the strength of impact of one
or more independent variables on a dependent variable
[Scatter plot: INCOME (dependent variable) against EDUCATION (independent variable)]
Basic Model
[Path diagram: independent variables x1, x2, x3 point to the dependent variable y with coefficients β1, β2, β3]

e.g. y = sales
x1 = unit price
x2 = advertising budget
x3 = salesman's visits
ε = error term

y = α + β1·x1 + β2·x2 + … + βj·xj + ε
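To make the model concrete, here is a minimal sketch of fitting such a model in Python with statsmodels. All data values are simulated and the variable names (price, advertising, visits, sales) are hypothetical, chosen only to mirror the example above.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data mirroring the slide's example:
# y = sales, x1 = unit price, x2 = advertising budget, x3 = salesman's visits
rng = np.random.default_rng(0)
n = 100
price = rng.uniform(5, 15, n)
advertising = rng.uniform(0, 10, n)
visits = rng.integers(0, 20, n)
sales = 50 - 2.0 * price + 1.5 * advertising + 0.8 * visits + rng.normal(0, 3, n)

X = sm.add_constant(np.column_stack([price, advertising, visits]))  # adds the intercept alpha
model = sm.OLS(sales, X).fit()
print(model.params)     # estimates of alpha, beta1, beta2, beta3
print(model.summary())  # full regression output (R-squared, t and F statistics)
```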
Applications
Bivariate Regression
Derive a mathematical relationship between a single metric dependent variable and a single metric independent variable.
Bivariate Regression – Basic Concepts
[Figure: regression line of INCOME on EDUCATION]

b = Slope = (Change in income) / (Change in education)
Graphical Representation of Price-Sales Function
[Figure: price-sales function. Scatter of observations with fitted regression function; y (quantity) on the vertical axis, x (price) on the horizontal axis]

x3 = price of observation no. 3
y3 = quantity of observation no. 3
ŷ3 = estimated quantity for no. 3 based on the regression function
e3 = estimation error (residual), e3 = y3 − ŷ3
General Linear Regression Model
y = α + β1x1 + β2x2 + … + βjxj + ε

y: dependent variable
α: intercept term
xj: independent variable
βj: unknown parameter
ε: error term

Coefficient βj indicates how much y changes (in units) when xj increases by one unit.
Lecture Outline
Regression Analysis
Principles
Estimation
Testing
Practical Issues
Regression Example - Where Is the Regression Line?

[Scatter plot: INCOME against EDUCATION, repeated across three build-up slides]
Data Matrix
Objective: Explain the variation in y through the regression relationship and estimate its parameters.
Estimation
Minimize the sum of squared residuals:

Σᵢ eᵢ² = e′e = (y − Xβ̂)′(y − Xβ̂) = y′y − 2y′Xβ̂ + β̂′X′Xβ̂ → min (over β̂)
Explanatory Power of Regression
Variation Decomposition – Graphical Representation
[Figure: for each observation, the total deviation yᵢ − ȳ splits into the explained part ŷᵢ − ȳ and the residual yᵢ − ŷᵢ]
Variation Decomposition - Formula
Total variation = explained variation + unexplained variation

Σᵢ (yᵢ − ȳ)² = Σᵢ (ŷᵢ − ȳ)² + Σᵢ (yᵢ − ŷᵢ)²

Least squares estimation: min over α̂, β̂1, …, β̂j of Σᵢ (yᵢ − ŷᵢ)²

Solution: β̂ = (X′X)⁻¹X′y and α̂ = ȳ − Σj β̂j x̄j

Note: The least squares method is also referred to as OLS (ordinary least squares).
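As a sanity check on these formulas, here is a small numpy sketch (data simulated, names hypothetical) that computes β̂ via the normal equations and verifies the variation decomposition numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 regressors
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.0, size=n)

# Normal equations: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

total = np.sum((y - y.mean()) ** 2)          # total variation
explained = np.sum((y_hat - y.mean()) ** 2)  # explained variation
residual = np.sum((y - y_hat) ** 2)          # unexplained variation
print(np.isclose(total, explained + residual))  # True: decomposition holds
print("R^2 =", explained / total)
```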
Regression Analysis – Statistics
Regression coefficient βᵢ: Slope of the regression function; signifies how much the dependent variable changes for a one-unit change of the independent variable.

Standardized regression coefficient βᵢ*: Slope obtained from standardized data; signifies how important an independent variable is in explaining the dependent variable.

Coefficient of determination R²: Measures the strength of association, R² ∈ [0; 1]; signifies the proportion of the total variance in Y that is accounted for by X.

Adjusted R²: Coefficient of determination adjusted for the number of independent variables included in the model and the sample size, to account for diminishing returns.

t statistic: Used to test the null hypothesis that no linear relationship exists between Xᵢ and Y; H₀: βᵢ = 0.

F statistic: Used to test the null hypothesis that the coefficient of determination is zero; H₀: β₁ = β₂ = … = βᵢ = 0.
Lecture Outline
Regression Analysis
Principles
Estimation
Testing
Practical Issues
Test the Regression Equation
Coefficient of determination:

R² = Σᵢ₌₁ⁿ (ŷᵢ − ȳ)² / Σᵢ₌₁ⁿ (yᵢ − ȳ)²

Adjusted R²:

R̄² = 1 − (1 − R²) · (n − 1) / (n − K)

K = number of explanatory variables
n = total number of observations
Test the Regression Coefficients
F statistic:

F_emp = (R² / K) / ((1 − R²) / (n − K − 1)), with degrees of freedom K and n − K − 1

F_emp is compared to the critical F-value (F_tab) at significance level α.

K = number of independent variables
n = total number of observations
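A short sketch (simulated data, hypothetical names) computing R², adjusted R², and the F statistic from the formulas above, with the critical value taken from scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, K = 120, 3                                    # observations, explanatory variables
X = np.column_stack([np.ones(n), rng.normal(size=(n, K))])
y = X @ np.array([1.0, 0.5, 0.0, -0.3]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

r2 = np.sum((y_hat - y.mean())**2) / np.sum((y - y.mean())**2)
r2_adj = 1 - (1 - r2) * (n - 1) / (n - K)        # adjusted R^2 as on the slide
f_emp = (r2 / K) / ((1 - r2) / (n - K - 1))      # overall F statistic
f_tab = stats.f.ppf(0.95, dfn=K, dfd=n - K - 1)  # critical value at alpha = 0.05
print(r2, r2_adj, f_emp, f_tab, f_emp > f_tab)
```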
Test the Regression Coefficients
Standardize the coefficients to neutralize the impact of different scales and scale units and to make the coefficients comparable:

b*j = bj · s(xj) / s(y)

− independent of linear transformations
− an absolute measure of the importance of influence factors
− influence expressed as a share of the standard deviation of y
− similar to a path coefficient in path analysis

By how many standard deviation units does the dependent variable change if the independent variable changes by one standard deviation?

Can only be interpreted with and relative to the other variables in the equation; will likely change if additional variables are added.
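A minimal sketch (arrays hypothetical) of the rescaling b*j = bj · s(xj)/s(y):

```python
import numpy as np

def standardized_coefficients(b, X, y):
    # b: unstandardized slopes (no intercept); X: regressor matrix; y: response
    return b * X.std(axis=0, ddof=1) / y.std(ddof=1)

# Example with two hypothetical regressors and their unstandardized slopes
X = np.array([[1.0, 10], [2.0, 30], [3.0, 20], [4.0, 50]])
y = np.array([2.0, 4.0, 5.0, 9.0])
print(standardized_coefficients(np.array([0.8, 0.05]), X, y))
```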
Test the Regression Model – Example
ANOVA
Model 1          Sum of Squares    df    Mean Square        F     Sig.
  Regression            564.445     3        188.148   44.723     .000
  Residual              929.750   221          4.207
  Total                1494.196   224

Coefficients
                                          B    Std. Error    Beta        t    Sig.
  (Constant)                         -1.232          .671             -1.836   .068
  x3  I have stylish clothes           .252          .099    .136      2.542   .012
  x5  Life is too short not to
      take some gambles                .928          .089    .566     10.382   .000
  x7  Government is doing too
      much to control pollution        .101          .094    .058      1.079   .282
Confidence Interval around βi
Like all parameter estimates, the estimates of the βᵢ population parameters are point estimates.
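For reference, the same intervals can be obtained outside SPSS; statsmodels reports them via conf_int (a sketch with simulated data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
X = sm.add_constant(rng.normal(size=(80, 2)))
y = X @ np.array([1.0, 0.5, -0.2]) + rng.normal(size=80)

fit = sm.OLS(y, X).fit()
print(fit.conf_int(alpha=0.05))  # 95% confidence interval for each beta_i
```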
Confidence Intervals of a Linear Regression - SPSS

[SPSS screenshots: sequence of dialog steps for requesting confidence intervals]

Type in the confidence interval you select, e.g. for 95% write 95, for 99% write 99.
Testing Subsets of Regression Coefficients
F_h = ((R²_UR − R²_R) / h) / ((1 − R²_UR) / (n − K − 1))

Example: H₀: β₃ = β₄ = 0; H₁: at least one coefficient is unequal to zero.

R²_UR: R² of the unrestricted model; R²_R: R² of the restricted model; h: number of restrictions.

Note: one can also test whether βᵢ = βⱼ.
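A sketch (simulated data) of this subset test: fit the unrestricted and restricted models, then form F_h as above. All names and values are hypothetical.

```python
import numpy as np
from scipy import stats

def r_squared(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ b
    return np.sum((y_hat - y.mean())**2) / np.sum((y - y.mean())**2)

rng = np.random.default_rng(3)
n, K = 150, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, K))])  # const, x1..x4
y = X @ np.array([1, 0.6, -0.4, 0.0, 0.0]) + rng.normal(size=n)

h = 2                          # restrictions tested: beta3 = beta4 = 0
r2_ur = r_squared(X, y)        # unrestricted model (all regressors)
r2_r = r_squared(X[:, :3], y)  # restricted model (x3, x4 dropped)
f_h = ((r2_ur - r2_r) / h) / ((1 - r2_ur) / (n - K - 1))
print(f_h, stats.f.ppf(0.95, dfn=h, dfd=n - K - 1))
```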
Lecture Outline
Regression Analysis
Principles
Estimation
Testing
Practical Issues
  Multicollinearity
  Nonlinearity
  Heteroscedasticity
Assumptions of Regression Analysis
3. Effects of different IVs are additive, i.e. the total effect of the X's on the expected value of Y is the sum of their separate effects.

A2 Constant variance of error terms (homoscedasticity)
1. over time (in case of time-series data)
2. over predictions
Normality of Error Term - Diagnosis
[Normal Q-Q plot of residuals: actual vs. theoretical quantiles]

Note: Other tests for normality, such as the Kolmogorov-Smirnov test or the Jarque-Bera test, are also available.
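A minimal sketch of both tests on a vector of residuals (scipy; the residuals are simulated stand-ins here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
residuals = rng.normal(size=300)  # stand-in for regression residuals

# Kolmogorov-Smirnov test against a normal with the residuals' own mean/std
ks_stat, ks_p = stats.kstest(residuals, "norm",
                             args=(residuals.mean(), residuals.std(ddof=1)))
# Jarque-Bera test based on skewness and kurtosis
jb_stat, jb_p = stats.jarque_bera(residuals)
print(ks_p, jb_p)  # small p-values suggest non-normality
```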
Normality of Error Term - Therapy
Non-normality can be the result of a few data points that deviate significantly from the regression line; thus, investigating the normal quantile plot can help identify influential outliers.

Non-linear transformations (e.g. log transformations) can help.
Assumptions‘ Violations – Multicollinearity
Assumption: Independent variables are not substantially correlated.

Effect:
• The least-squares (LS) estimator for the regression coefficients is given by β̂ = (X′X)⁻¹X′y.
• In the case of perfect multicollinearity, the matrix X′X is singular and its inverse does not exist → β̂ cannot be estimated.
• In cases of a high degree of multicollinearity, a solution for the LS estimator can still be found.
• But: the higher the linear association, the smaller the determinant of X′X. The consequence is inflation of the variances and covariances of the estimated coefficients → coefficients become imprecise and unstable.
Multicollinearity – Visualization
Ideal situation: X1 is independent of X2 → allows a good regression analysis.

[Venn diagram: X1 and X2 each overlap Y, but not each other]

Multicollinearity: correlation > 0.7 between X1 and X2.
• If the correlation between X1 and X2 is < 1, SPSS considers both variables (even if the correlation is high).
• If the correlation between X1 and X2 is equal to 1, SPSS drops the variable from the model.

[Venn diagram: X1 and X2 overlap each other and Y; shared area a]

Difficulty of assigning the shared variance a to either X1 or X2.
Multicollinearity - Diagnosis
Regress each independent variable on all remaining independent variables, e.g.:

x1 = α + β1·x2 + β2·x3 + … + βᵢ₋₁·xᵢ

A high R² in such an auxiliary regression indicates multicollinearity.
Multicollinearity - Diagnosis
Tolerance value: Tj = 1 − Rj²

Variance inflation factor: VIFj = 1 / (1 − Rj²) = 1 / Tj

where Rj² is the coefficient of determination from the auxiliary regression of xj on the other independent variables.
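A sketch computing tolerance and VIF by hand via auxiliary regressions (data simulated; statsmodels also provides variance_inflation_factor for the same purpose):

```python
import numpy as np

def tolerance_and_vif(X):
    # X: regressor matrix without intercept; one auxiliary regression per column
    out = []
    n = X.shape[0]
    for j in range(X.shape[1]):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        fitted = others @ b
        xj = X[:, j]
        r2_j = np.sum((fitted - xj.mean())**2) / np.sum((xj - xj.mean())**2)
        tol = 1 - r2_j
        out.append((tol, 1 / tol))  # (tolerance T_j, VIF_j)
    return out

rng = np.random.default_rng(5)
x1 = rng.normal(size=100)
x2 = x1 + 0.1 * rng.normal(size=100)  # nearly collinear with x1 -> high VIF
x3 = rng.normal(size=100)
print(tolerance_and_vif(np.column_stack([x1, x2, x3])))
```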
Multicollinearity – Therapy
Therapy for multicollinearity:
• Elimination of an interdependent variable, based on correlation and tolerance values.
Assumptions‘ Violations – Nonlinearity Diagnosis I
[Plot of residuals against predicted values; a systematic pattern indicates nonlinearity]
Assumptions‘ Violations – Nonlinearity Diagnosis II
[Second residual plot for nonlinearity diagnosis]
Assumptions‘ Violations – Nonlinearity Effects and Therapy
Effect:
Parameters are biased: even as the sample size increases, parameter estimates tend to over- or underestimate the true value. Predictions based on a linear model that is in fact non-linear can be substantially wrong.

Therapy:
Data transformation, or nonlinear regression with higher-order terms of the IVs (see the diagnosis-and-therapy summary below).
Example Non-Linear Regression Model with Sensory Data
With sensory data there are acceptable ranges of attribute values that lead to quasi-equivalent preferences (non-monotonic relations). Outside this acceptable range, preference declines substantially.

[Figure: preference against bitterness, with an acceptable range flanked by unacceptable ranges]
Example Non-Linear Regression Model - Results
[Figure: liking rating against attribute level, with acceptable and unacceptable ranges]

R² = 15%
b1 = 0.378; b2 = −0.055
β1 = 2.210; β2 = −2.070
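The reported pair of coefficients (b1 positive, b2 negative) is consistent with an inverted-U shape, e.g. a quadratic specification y = α + b1·x + b2·x². A sketch of fitting such a model (data simulated here, not the slide's sensory data):

```python
import numpy as np

rng = np.random.default_rng(6)
bitterness = rng.uniform(0, 10, 200)
liking = 1 + 0.4 * bitterness - 0.05 * bitterness**2 + rng.normal(0, 0.5, 200)

# Linear regression with a higher-order term: columns 1, x, x^2
X = np.column_stack([np.ones_like(bitterness), bitterness, bitterness**2])
alpha, b1, b2 = np.linalg.lstsq(X, liking, rcond=None)[0]
print(alpha, b1, b2)                  # b2 < 0 -> inverted-U shape
print("peak at x =", -b1 / (2 * b2))  # vertex of the parabola
```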
Heteroscedasticity – Diagnosis
[Two plots of residuals against ŷᵢ or xᵢ: constant spread (homoscedastic) vs. changing spread (heteroscedastic)]
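A minimal matplotlib sketch of such a diagnosis plot (fitted values and residuals are simulated stand-ins for output of a previously estimated model):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
y_hat = np.sort(rng.uniform(0, 10, 200))      # stand-in for fitted values
residuals = rng.normal(0, 0.2 + 0.1 * y_hat)  # spread grows with y_hat

plt.scatter(y_hat, residuals, s=10)
plt.axhline(0, color="grey", linewidth=1)
plt.xlabel("fitted values $\\hat{y}_i$")
plt.ylabel("residuals")
plt.title("Funnel shape suggests heteroscedasticity")
plt.show()
```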
Assumptions‘ Violations – Heteroscedasticity
Homoscedasticity: the variance of the residuals is constant and does not depend on ŷᵢ or xᵢ.
Heteroscedasticity: the variance of the residuals is not constant.

Effect:
OLS estimators are unbiased but no longer efficient; standard errors of the estimates might be wrong, confidence intervals might be too narrow or too wide, and predictions are affected.
Heteroscedasticity – Therapy
Heteroscedasticity can be a byproduct of violations of the linearity or independence assumptions; thus, e.g., conduct variable transformations or add higher-order terms.
Autocorrelation – Diagnosis I
[Plot of residuals in observation order showing positive autocorrelation]
Autocorrelation – Diagnosis II
[Plot of residuals in observation order showing negative autocorrelation]
Assumptions‘ Violations – Autocorrelation
Effect
• β̂ remains unbiased.
• But the error variance is no longer minimized: estimators are no longer efficient.
• All tests for regression coefficients and confidence intervals are incorrect.
• Estimators of the error variance underestimate the real error variance.
Autocorrelation – Durbin-Watson Test (I)
Durbin-Watson test: examine correlations between the error terms. Test statistic (1st-order serial correlation):

DW = Σₜ₌₂ᵀ (eₜ − eₜ₋₁)² / Σₜ₌₁ᵀ eₜ²

It is difficult to interpret the value of the test statistic exactly, since it depends on the number of independent variables and the number of observations.

Two types of serial correlation (positive & negative) → two interpretation intervals.

Rule of thumb: DW ≈ 2 means no autocorrelation.

The DW test only tests for autocorrelation of lag 1; weekly/seasonal effects will not be detected.
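A sketch computing the statistic directly and via statsmodels (residuals simulated with positive serial correlation, so DW falls well below 2):

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(8)
e = np.zeros(200)
for t in range(1, 200):          # AR(1) errors -> positive autocorrelation
    e[t] = 0.7 * e[t - 1] + rng.normal()

dw_manual = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(dw_manual, durbin_watson(e))  # both well below 2
```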
Autocorrelation – Durbin-Watson Test (II)
Interpretation intervals:

0 … d_L: positive autocorrelation
d_L … d_U: unclear area
d_U … 4 − d_U: no autocorrelation
4 − d_U … 4 − d_L: unclear area
4 − d_L … 4: negative autocorrelation

d_L;α/2: lower critical value at significance level α/2.
d_U;α/2: upper critical value at significance level α/2.
Autocorrelation – Therapy
Testing Assumptions – Sequence of Analysis
Assumptions‘ Violations – Diagnosis and Therapy (I)
Violation           Diagnosis                            Therapy
Nonlinearity        Scatter plot, linearity test         Data transformation; nonlinear regression with higher-order terms of IVs
Multicollinearity   Correlation matrix, tolerance test   Eliminate variables; factor scores; increase information base; ridge regression; Shapley values
Lecture Outline
Regression Analysis
Principles
Estimation
Testing
Practical Issues
Sample Size
Example: If you have a regression with 6 predictors, you need 50 + (8×6) = 98 cases to test the regression and 104 + 6 = 110 cases to test individual predictors.

Note: With large samples almost any multiple correlation will become significant. For stepwise regression, even more cases are needed (cases-to-IV ratio 40:1).
Non-Metrical Independent Variables
The independent variable can be dummy coded. That is, for each of its levels a zero/one variable is created that indicates whether the level is present (1) or absent (0).

Obs  y  Color          Obs  y  Color    Red  Yellow  Blue
1    5  Red       →    1    5  Red      1    0       0
2    2  Yellow    →    2    2  Yellow   0    1       0
3    4  Blue      →    3    4  Blue     0    0       1
4    3  Red       →    4    3  Red      1    0       0
5    1  Yellow    →    5    1  Yellow   0    1       0
…    …  …              …    …  …        …    …       …
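In practice this coding is a one-liner, e.g. with pandas (column names hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"y": [5, 2, 4, 3, 1],
                   "color": ["Red", "Yellow", "Blue", "Red", "Yellow"]})
dummies = pd.get_dummies(df["color"])  # one 0/1 column per level
print(pd.concat([df, dummies], axis=1))
# Note: to avoid perfect multicollinearity with an intercept, drop one
# level, e.g. pd.get_dummies(df["color"], drop_first=True)
```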
Major Types of Multiple Regression
Standard multiple regression
Sequential regression
Statistical/stepwise regression
Major Types of Multiple Regression
Observed correlations:
• X1 with X2 (high)
• X1 with Y
• X2 with Y
• X3 with Y (low)
• X3 with X2 (negligible)

[Venn diagram: Y overlapping X1, X2 and X3; overlap areas labeled a–e]

• Area a comes from X1
• Area b comes from ? (ambiguity between X1 and X2)
• Area c comes from X2
• Area d comes from ? (ambiguity between X2 and X3)
Standard Multiple Regression
Attribution
• Each X is assigned only the area of its unique contribution.
• The overlapping areas b and d contribute to R², but are not assigned to any of the individual X's.

[Venn diagram as above]

Interpretation
Sequential Multiple Regression
Attribution
• X1…Xn enter the equation in an order specified by the researcher. In this case, we assume that the researcher assigned X1 = first entry, X2 = second entry and X3 = third entry.
• X1 has priority and is assigned areas a and b.
• X2 arrives second and is assigned areas c and d.
• X3 is assigned e, the only area remaining.
• If X2 had entered first, its importance would have increased dramatically, as it would have been assigned b, c and d.

[Venn diagram as above]

The researcher normally assigns the order of entry of variables according to logical or theoretical considerations, e.g. "height" can be considered a more important factor than "amount of training" in assessing success as a basketball player and would therefore be assigned first.

R² = area (a + b + c + d + e)
Statistical/Stepwise Regression
Attribution
• Controversial procedure.
• Order of entry of variables is based solely on statistical criteria.

[Venn diagram: Y overlapping X1, X2 and X3; areas a–e]