Simplelinearregression NBC
Regression
Slide-1
Correlation vs. Regression
▪ A scatter diagram can be used to show the relationship between two variables
▪ Correlation analysis is used to measure the strength of the association (linear relationship) between two variables
▪ Correlation is only concerned with the strength of the relationship
▪ No causal effect is implied with correlation
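The strength of a linear association is commonly summarized by the sample correlation coefficient. The slides do not state a formula, so the form below is an assumption: the standard Pearson definition, with $\bar{X}$ and $\bar{Y}$ denoting the sample means.

```latex
r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}
         {\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}},
\qquad -1 \le r \le 1
```

Values of r near +1 or -1 indicate a strong linear association; values near 0 indicate a weak one.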
Slide-2
What Is Regression?
Slide-3
Introduction to Regression Analysis
▪ Regression analysis is used to:
▪ Predict the value of a dependent variable based on the value of at least one independent variable
▪ Explain the impact of changes in an independent variable on the dependent variable
Slide-4
Simple Linear Regression Model
▪ Only one independent variable, X
▪ Relationship between X and Y is described by a linear function
▪ Changes in Y are assumed to be caused by changes in X
Slide-5
Types of Relationships
[Figure: scatter plots of Y against X contrasting linear relationships with curvilinear relationships]
Slide-6
Types of Relationships
[Figure: scatter plots of Y against X contrasting strong relationships with weak relationships]
Slide-7
Types of Relationships
[Figure: scatter plot against X showing no relationship]
Slide-8
Slide-9
Simple Linear Regression Model
Simple Linear Regression Model (continued)
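For reference, the simple linear regression model is conventionally written as shown below. This is assumed to be the standard textbook form; $\beta_0$, $\beta_1$, and $\varepsilon_i$ are the usual symbols for the population intercept, the population slope, and the random error term.

```latex
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i
```

The sample estimates of $\beta_0$ and $\beta_1$ are the coefficients b0 and b1 interpreted later in the deck.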
Slide-11
Simple Linear Regression Equation (Prediction Line)
Slide-12
Least Squares Method
Slide-13
Finding the Least Squares Equation
Slide-14
Interpretation of the Slope and the Intercept
Slide-15
The least squares line has two components: the slope m and the y-intercept b. We will solve for m first, and then solve for b. The equations for m and b are:
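A standard closed form for these two quantities, in the m and b notation used here and for n data points $(x_i, y_i)$, is sketched below. It is the usual least squares solution and is an assumption about the form intended; any algebraically equivalent form gives the same line.

```latex
m = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2},
\qquad
b = \frac{\sum y_i - m \sum x_i}{n}
```

Here m plays the role of the slope b1, and b the role of the intercept b0, used elsewhere in these slides.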
Simple Linear Regression Example
▪ A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)
Slide-20
Sample Data for House Price Model
House Price in $1000s (Y)    Square Feet (X)
245                          1400
312                          1600
279                          1700
308                          1875
199                          1100
219                          1550
405                          2350
324                          2450
319                          1425
255                          1700
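As a cross-check on the spreadsheet approach, the least squares line for this sample can be computed directly. The following is a minimal sketch in plain Python; the function name `least_squares_fit` and the variable names are illustrative choices, not part of the slides.

```python
# Sample data from the table above: price in $1000s (Y) and size in square feet (X).
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]


def least_squares_fit(x, y):
    """Return (intercept, slope) of the least squares line y = b0 + b1 * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sum of cross-deviations and sum of squared X deviations.
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return intercept, slope


b0, b1 = least_squares_fit(sqft, prices)
print(f"house price = {b0:.5f} + {b1:.5f} (square feet)")
```

With the ten observations above, this yields an intercept of about 98.248 and a slope of about 0.1098.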
Slide-21
Graphical Presentation
[Figure: scatter plot of the sample data; x-axis: Square Feet]
Slide-22
Regression Using Excel
▪ Tools / Data Analysis / Regression
Slide-23
Excel Output
Regression Statistics
Multiple R           0.76211
R Square             0.58082
Adjusted R Square    0.52842
Standard Error       41.33032
Observations         10
ANOVA
             df   SS            MS            F         Significance F
Regression   1    18934.9348    18934.9348    11.0848   0.01039
Residual     8    13665.5652    1708.1957
Total        9    32600.5000
The regression equation is:
house price = 98.24833 + 0.10977 (square feet)
Slide-25
Graphical Presentation
[Figure: scatter plot of the sample data with the fitted regression line; x-axis: Square Feet; slope = 0.10977, intercept = 98.248]
Slide-26
Interpretation of the Intercept, b0
Slide-27
Interpretation of the Slope Coefficient, b1
Slide-28
Predictions using Regression Analysis
Predict the price for a house with 2000 square feet:
house price = 98.25 + 0.1098(2000)
            = 317.85
The predicted price for a house with 2000 square feet is 317.85 ($1000s) = $317,850
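The same arithmetic can be written as a small function. This is a sketch using the rounded coefficients from the slide; the names `B0`, `B1`, and `predict_price` are illustrative.

```python
# Rounded coefficients as used on the slide (price in $1000s, size in square feet).
B0 = 98.25
B1 = 0.1098


def predict_price(square_feet):
    """Predicted house price in $1000s from the fitted line."""
    return B0 + B1 * square_feet


predicted = predict_price(2000)
print(f"{predicted:.2f} ($1000s) = ${predicted * 1000:,.0f}")
```

Using the unrounded coefficients from the Excel output (98.24833 and 0.10977) would give a slightly different value, so the rounding is worth stating whenever a prediction is reported.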
Slide-29
Interpolation vs. Extrapolation
▪ When using a regression model for prediction,
only predict within the relevant range of data
[Figure: scatter plot of House Price ($1000s) against Square Feet, with the relevant range for interpolation marked]
Do not try to extrapolate beyond the range of observed X’s
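One way to enforce this rule in code is to refuse predictions outside the observed range of X. The sketch below assumes the sample data and rounded coefficients from the earlier slides; `predict_within_range` is a hypothetical helper name, and raising an error is just one possible design (a warning would be another).

```python
# Observed square-footage values from the sample data.
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
X_MIN, X_MAX = min(sqft), max(sqft)  # relevant range: 1100 to 2450

# Rounded fitted coefficients from the slides (price in $1000s).
B0, B1 = 98.25, 0.1098


def predict_within_range(square_feet):
    """Predict price in $1000s, refusing to extrapolate outside the observed X range."""
    if not X_MIN <= square_feet <= X_MAX:
        raise ValueError(
            f"{square_feet} sq ft is outside the observed range "
            f"[{X_MIN}, {X_MAX}]; prediction would be an extrapolation"
        )
    return B0 + B1 * square_feet
```

A house of 2000 square feet falls inside the range and gets a prediction; a house of 3000 square feet does not.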
Department of Statistics, ITS Slide-24
Measures of Variation
$SST = \sum (Y_i - \bar{Y})^2$
$SSR = \sum (\hat{Y}_i - \bar{Y})^2$
$SSE = \sum (Y_i - \hat{Y}_i)^2$
where:
$\bar{Y}$ = Average value of the dependent variable
$Y_i$ = Observed values of the dependent variable
$\hat{Y}_i$ = Predicted value of Y for the given $X_i$ value
Slide-31
Measures of Variation (continued)
▪ SST = total sum of squares
▪ Variation of the $Y_i$ values around their mean $\bar{Y}$
▪ SSR = regression sum of squares
▪ Explained variation attributable to the relationship between X and Y
▪ SSE = error sum of squares
▪ Variation attributable to factors other than the relationship between X and Y
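The three sums of squares can be computed for the house price sample as follows. This is a minimal sketch in plain Python that refits the line from the sample data; the variable names are illustrative.

```python
# Sample data: price in $1000s (Y) and size in square feet (X).
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

n = len(prices)
x_mean = sum(sqft) / n
y_mean = sum(prices) / n

# Least squares slope and intercept.
b1 = sum((x - x_mean) * (y - y_mean) for x, y in zip(sqft, prices)) / sum(
    (x - x_mean) ** 2 for x in sqft
)
b0 = y_mean - b1 * x_mean
fitted = [b0 + b1 * x for x in sqft]

sst = sum((y - y_mean) ** 2 for y in prices)             # total variation
ssr = sum((f - y_mean) ** 2 for f in fitted)             # explained variation
sse = sum((y - f) ** 2 for y, f in zip(prices, fitted))  # unexplained variation

print(f"SST = {sst:.4f}, SSR = {ssr:.4f}, SSE = {sse:.4f}")
```

For a least squares fit with an intercept, SST equals SSR plus SSE up to floating-point rounding.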
Slide-32
Measures of Variation (continued)
[Figure: for a given $X_i$, the distances on the Y axis that make up $SST = \sum (Y_i - \bar{Y})^2$, $SSE = \sum (Y_i - \hat{Y}_i)^2$, and $SSR = \sum (\hat{Y}_i - \bar{Y})^2$]
Slide-33
Coefficient of Determination, $r^2$
▪ The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable
▪ The coefficient of determination is also called r-squared and is denoted as $r^2$
note: $0 \le r^2 \le 1$
Slide-34
Examples of Approximate $r^2$ Values
[Figure: two scatter plots of Y against X, each with $r^2 = 1$]
Slide-35
Examples of Approximate $r^2$ Values
[Figure: scatter plot of Y against X with $0 < r^2 < 1$]
Slide-36
Examples of Approximate $r^2$ Values
$r^2 = 0$
No linear relationship between X and Y
[Figure: scatter plot of Y with $r^2 = 0$]
Slide-37
Excel Output
Regression Statistics
Multiple R           0.76211
R Square             0.58082
Adjusted R Square    0.52842
Standard Error       41.33032
Observations         10
ANOVA
             df   SS            MS            F         Significance F
Regression   1    18934.9348    18934.9348    11.0848   0.01039
Residual     8    13665.5652    1708.1957
Total        9    32600.5000
$r^2 = \frac{SSR}{SST} = \frac{18934.9348}{32600.5000} = 0.58082$
58.08% of the variation in house prices is explained by variation in square feet
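Several entries in this output can be recomputed from the ANOVA sums of squares alone. The sketch below does so in plain Python. The adjusted R Square and F formulas are the standard ones for a model with k independent variables and are an assumption, since the slides report the values without stating the formulas.

```python
import math

# Values copied from the ANOVA table above.
ssr = 18934.9348  # regression sum of squares
sse = 13665.5652  # error (residual) sum of squares
sst = 32600.5000  # total sum of squares
n = 10            # observations
k = 1             # independent variables

r_squared = ssr / sst
multiple_r = math.sqrt(r_squared)
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
f_statistic = (ssr / k) / (sse / (n - k - 1))  # MSR / MSE

print(f"R Square          = {r_squared:.5f}")
print(f"Multiple R        = {multiple_r:.5f}")
print(f"Adjusted R Square = {adjusted_r_squared:.5f}")
print(f"F                 = {f_statistic:.4f}")
```

The results agree with the Regression Statistics and ANOVA entries to the precision shown in the output.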
Slide-38
Extra
Standard Error of Estimate
▪ The standard deviation of the variation of observations around the regression line is estimated by
$S_{YX} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}}$
Where
SSE = error sum of squares
n = sample size
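Applied to the house price example, the formula is a one-line calculation. This sketch takes SSE and n from the Excel ANOVA output shown earlier; the variable names are illustrative.

```python
import math

sse = 13665.5652  # error sum of squares from the ANOVA table
n = 10            # sample size

# Standard error of the estimate: square root of SSE over its degrees of freedom.
s_yx = math.sqrt(sse / (n - 2))
print(f"S_YX = {s_yx:.5f}")
```

Because price is measured in $1000s, the result is in $1000s as well.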
Slide-40
Excel Output
Regression Statistics
Multiple R           0.76211
R Square             0.58082
Adjusted R Square    0.52842
Standard Error       41.33032
Observations         10
ANOVA
             df   SS            MS            F         Significance F
Regression   1    18934.9348    18934.9348    11.0848   0.01039
Residual     8    13665.5652    1708.1957
Total        9    32600.5000
$S_{YX} = 41.33032$
Slide-41
Comparing Standard Errors
$S_{YX}$ is a measure of the variation of observed Y values from the regression line
[Figure: two scatter plots of Y against X, one with small $S_{YX}$ and one with large $S_{YX}$]
▪ Independence of Errors
▪ Error values are statistically independent
▪ Normality of Error
▪ Error values (ε) are normally distributed for any given value of X
Slide-43
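Residual Analysis
$e_i = Y_i - \hat{Y}_i$
[Figure: scatter plots of Y against x with the corresponding plots of residuals against x, for a relationship that is not linear and for one that is linear (✔)]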
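The residuals defined above can be computed for the house price sample before plotting them against X. This is a minimal sketch in plain Python that refits the line from the sample data; it only computes the residuals and leaves the plotting step out.

```python
# Sample data: price in $1000s (Y) and size in square feet (X).
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

n = len(prices)
x_mean = sum(sqft) / n
y_mean = sum(prices) / n

# Least squares slope and intercept.
b1 = sum((x - x_mean) * (y - y_mean) for x, y in zip(sqft, prices)) / sum(
    (x - x_mean) ** 2 for x in sqft
)
b0 = y_mean - b1 * x_mean

# Residual e_i = observed Y_i minus predicted Y_i.
residuals = [y - (b0 + b1 * x) for x, y in zip(sqft, prices)]

for x, y, e in zip(sqft, prices, residuals):
    print(f"X = {x:5d}  Y = {y:4d}  residual = {e:9.4f}")
```

For a least squares line with an intercept the residuals sum to zero up to rounding, so it is the pattern of the residuals against X, not their total, that carries the diagnostic information.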
Slide-45
Residual Analysis for Independence
[Figure: plots of residuals against X for errors that are not independent and for errors that are independent (✔)]
Slide-46
Residual Analysis for Normality
[Figure: plot with axes Residual (-3 to 3) and Percent (0 to 100)]
Slide-47
Residual Analysis for Equal Variance
[Figure: two scatter plots of Y against x with the corresponding plots of residuals against x]
Slide-48
Excel Residual Output
Slide-50