Chapter 2 Simple Linear Regression
CHAPTER 2
At the end of this chapter, students should be able to understand simple linear regression.
2.1 BACKGROUND - WHAT IS LINEAR REGRESSION (LR)?
Linear regression is a linear model: a model that assumes a linear
relationship between the input variables (x) and the single output
variable (y).
LR describes a relation between variables in which changes in some
variables may "explain" the changes in other variables.
An LR model estimates the nature of the relationship between
independent and dependent variables.
Examples:
• Does a change in class size affect students' marks?
• Does cholesterol level depend on age, sex, or amount of exercise?
2.1 What is Linear Regression (LR)?
Investigating the dependence of one variable (the dependent
variable) on one or more other variables (the independent
variables) using a straight line.
[Figure: four scatter plots of Y against X, each with a fitted straight line]
2.1 What is the LR model used for?
Linear regression models are used to show or predict the
relationship between two variables or factors. The factor that is
being predicted is called the dependent variable.
Example of Linear Regression
Regression analysis is used in statistics to find trends in data.
For example, you might guess that there is a connection between how
much you eat and how much you weigh; regression analysis can help
you quantify that.
2.1 LR model - how does it work?
Linear regression is the process of finding the line that best fits
the data points available on the plot, so that we can use it to
predict output values for inputs that are not present in the data
set, with the expectation that those outputs would fall on the line.
2.1 REGRESSION APPLICATIONS
THREE MAJOR APPLICATIONS
• description
• control
• prediction
Linear regression has many practical uses. Most applications fall
into one of two broad categories: if the goal is prediction,
forecasting, or error reduction, linear regression can be used to
fit a predictive model to an observed data set of values of the
response and explanatory variables.
2.1 WHAT IS REGRESSION USED FOR?
1. Predictive Analytics: forecasting future opportunities and risks
is the most prominent application of regression analysis in
business.
2. Operation Efficiency: regression models can also be used to
optimize business processes.
3. Supporting Decisions: businesses today are overloaded with data
on finances, operations, and customer purchases. Executives are now
leaning on data analytics to make informed business decisions that
have statistical significance, rather than relying on intuition and
gut feel.
4. Correcting Errors: regression is not only great for lending
empirical support to management decisions but also for identifying
errors in judgment.
5. New Insights: over time, businesses have gathered a large volume
of unorganized data that has the potential to yield valuable
insights.
2.1 EXAMPLE LR IN FORECASTING
2.1 TYPES OF REGRESSION
Regression models:
• Simple (1 explanatory variable): Linear or Non-linear
• Multiple (2+ explanatory variables): Linear or Non-linear
2.1 TYPES OF REGRESSION
(EDUCATION) Y (Income)
(EDUCATION)
(SEX)
(EXPERIENCE)
(AGE) Y (Income)
12
2.1 CORRELATION
Correlation is a statistical technique that can show whether
and how strongly pairs of variables are related.
r = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / √[ Σᵢ(xᵢ − x̄)² · Σᵢ(yᵢ − ȳ)² ]
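The formula above can be checked numerically. A minimal sketch in Python (the data used here are the production/electricity sample from Example 1 later in this chapter):

```python
# Pearson correlation coefficient, computed directly from the formula above.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5

# Production (x) and electricity usage (y) from Example 1.
x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]
r = pearson_r(x, y)  # strong positive correlation, about 0.896
```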
2.1 CORRELATION VS LINEAR REGRESSION
2.2 INTRODUCTION – SIMPLE LINEAR REGRESSION
• Quantitative analysis uses current information to predict future
behavior.
• Current information is usually in the form of a set of data.
• When the data form a set of pairs of numbers, we may interpret
them as representing the observed values of an independent
(predictor) variable x and a dependent (response) variable y.
2.2 INTRODUCTION – SIMPLE LINEAR REGRESSION
The goal is to find a functional relation between the response
variable y and the predictor variable x:
𝑦 = 𝑓(𝑥)
SELECTION of independent variable(s)
- choose the most important predictor variable(s).
SCOPE of model
- we may need to restrict the coverage of the model to some
interval or region of values of the independent variable(s),
depending on the needs/requirements.
2.3 REGRESSION -
POPULATION & SAMPLE
2.3 REGRESSION - REGRESSION MODEL
General regression model:
𝑌 = 𝛽₀ + 𝛽₁𝑋 + 𝜀
where 𝛽₀ and 𝛽₁ are unknown parameters, X is a known constant, and
the deviations 𝜀 are independent N(0, 𝜎²).
The values of the regression parameters 𝛽₀ and 𝛽₁ are not known; we
estimate them from data.
2.3 REGRESSION - REGRESSION LINE
• If the scatter plot of our sample data suggests a linear
relationship between the two variables, i.e.
ŷ = β̂₀ + β̂₁x
the relationship can be summarized by a straight-line plot.
• The least squares method gives us the "best" estimated line for
our set of sample data.
• The least squares method is a statistical procedure to find the
best fit for a set of data points by minimizing the sum of the
offsets, or residuals, of the points from the plotted curve. Least
squares regression is used to predict the behavior of dependent
variables.
2.4 Least Squares Method
2.4 Least Squares Method
• 'Best fit' means the differences between the actual y values and
the predicted y values are at a minimum.
• But positive differences offset negative ones, so square the
errors!
2.4 ASSUMPTIONS IN SLR
◼ Independent observations: observations are independent of each
other.
◼ Linear relationship: the relationship between X and the mean of Y
is linear.
◼ Normal distribution of error terms: the residuals 𝜀ᵢ are normally
distributed.
◼ No auto-correlation: the residuals are independent of each other;
no serial correlation in the values of the residuals 𝜀ᵢ.
Side note (multiple regression):
• Multivariate normality – multiple regression assumes that the
residuals are normally distributed.
• No multicollinearity – multiple regression assumes that the
independent variables are not highly correlated with each other.
2.5 SLR - COMPUTATION
• Write the estimated regression line based on sample data as
ŷ = β̂₀ + β̂₁x
• The method of least squares chooses the values of β̂₀ and β̂₁ that
minimize the sum of squared errors (SSE).
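To illustrate what "choosing the values that minimize SSE" means, the minimum can also be found numerically. This is only a sketch, not the method used in this chapter (Example 1 applies the closed-form formulas): plain gradient descent on SSE for the Example 1 data.

```python
# Minimizing SSE(b0, b1) = sum((y - b0 - b1*x)^2) by gradient descent.
# Data: production (x) and electricity usage (y) from Example 1.
x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]

b0, b1, lr = 0.0, 0.0, 0.003
for _ in range(20_000):
    # Partial derivatives of SSE with respect to b0 and b1.
    g0 = -2 * sum(yi - b0 - b1 * xi for xi, yi in zip(x, y))
    g1 = -2 * sum(xi * (yi - b0 - b1 * xi) for xi, yi in zip(x, y))
    b0 -= lr * g0
    b1 -= lr * g1
# Converges to the same line the closed-form solution gives:
# b0 ≈ 0.4091, b1 ≈ 0.4988.
```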
Example 1:
The manager of a car plant wishes to investigate how the plant's
electricity usage depends upon the plant's production. The data are
given below.

Production ($million) (x): 4.51 3.58 4.31 5.06 5.64 4.99 5.29 5.83 4.70 5.61 4.90 4.20
Electricity Usage (y):     2.48 2.26 2.47 2.77 2.99 3.05 3.18 3.46 3.03 3.26 2.67 2.53

xy: 11.18 8.09 10.65 14.02 16.86 15.22 16.82 20.17 14.24 18.29 13.08 10.63
x²: 20.34 12.82 18.58 25.60 31.81 24.90 27.98 33.99 22.09 31.47 24.01 17.64

Σx = 58.62   Σy = 34.15   Σxy = 169.25   Σx² = 291.23
S_XX = Σxᵢ² − (Σxᵢ)²/n = 291.23 − (58.62)²/12 = 4.8723
S_XY = Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n = 169.25 − (58.62)(34.15)/12 = 2.43045
β̂₁ = S_XY / S_XX = 2.43045 / 4.8723 = 0.4988
β̂₀ = ȳ − β̂₁x̄ = (34.15/12) − (0.4988)(58.62/12) = 0.4091
Estimated Regression Line: ŷ = 0.4091 + 0.4988x
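The hand computation above can be reproduced in a few lines of Python:

```python
# Closed-form least squares fit for Example 1 (production x, electricity y).
x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]
n = len(x)

sxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n          # S_XX = 4.8723
sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n  # S_XY = 2.43045
b1 = sxy / sxx                      # slope,     β̂₁ ≈ 0.4988
b0 = sum(y) / n - b1 * sum(x) / n   # intercept, β̂₀ ≈ 0.4091
```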
2.5 SLR - ESTIMATION OF MEAN RESPONSE
• The fitted regression line can be used to estimate the mean value
of y for a given value of x.
• Example: the weekly advertising expenditure (x) and weekly sales
(y) are presented in the following table.

x:   41   54   63   54   48   46   62   61   64   71
y: 1250 1380 1425 1425 1450 1300 1400 1510 1575 1650
2.5 SLR – ESTIMATION OF MEAN RESPONSE
• From the previous table:
n = 10   Σx = 564   Σx² = 32604   Σy = 14365   Σxy = 818755
β̂₁ = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²)
    = (10(818755) − (564)(14365)) / (10(32604) − (564)²) = 10.8
ŷ = 828 + 10.8x
Sales = 828 + 10.8 × Expenditure
This means that if the weekly advertising expenditure is increased
by $1, we would expect weekly sales to increase by $10.8.
For $50 of expenditure, the estimated sales are:
ŷ = 828 + 10.8(50) = 1368
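A minimal sketch of the same fit and mean-response prediction, using only the summary figures quoted above (n = 10, Σx = 564, Σx² = 32604, Σy = 14365, Σxy = 818755):

```python
# Fit the advertising example from summary statistics and predict at x = 50.
n, sx, sxx, sy, sxy = 10, 564, 32604, 14365, 818755

b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # slope ≈ 10.79, quoted as 10.8
b0 = sy / n - b1 * sx / n                        # intercept ≈ 828
y_at_50 = b0 + b1 * 50                           # estimated mean sales at x = 50
```

Note that using the unrounded slope gives an estimate just under 1368; the slide's value of 1368 comes from the rounded coefficients.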
2.6 ANOVA – SST, SSE & SSR
2.6 ANOVA – SST, SSE & SSR
▪ Sum of Squares Total (SST):
- measures how much variance is in the dependent variable;
- made up of the SSE and SSR:
SST = Σᵢ(yᵢ − ȳ)²   SSR = Σᵢ(ŷᵢ − ȳ)²   SSE = Σᵢ(yᵢ − ŷᵢ)²
SST = SSR + SSE
2.6 ANOVA – SST, SSE & SSR
• Mean square regression (MSR): MSR = SSR / 1
• Mean square error (MSE): MSE = SSE / (n − 2)
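Dividing these mean squares gives the ANOVA F statistic, F = MSR/MSE. A quick check using the sums of squares from the Example 1 Excel output shown later in this chapter (SSR ≈ 1.2124, SSE ≈ 0.2991, n = 12):

```python
# ANOVA mean squares and F statistic for Example 1.
ssr, sse, n = 1.212381668, 0.299109998, 12

msr = ssr / 1          # 1 regression degree of freedom in SLR
mse = sse / (n - 2)    # n - 2 = 10 error degrees of freedom
f_stat = msr / mse     # Excel reports F ≈ 40.53
```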
2.7 Model Evaluation
SLR model evaluation uses software output.
2.7 Model Evaluation
(i) Standard error of estimate (s)
➢ Compute the standard error of estimate from
𝜎̂² = SSE / (n − 2),   s = √(SSE / (n − 2))
where SSE = Σᵢ(yᵢ − ŷᵢ)².
2.7 Model Evaluation
(ii) Coefficient of Determination
➢ Coefficient of determination:
R² = SSR/SST = 1 − SSE/SST
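Both the standard error of estimate and R² can be computed directly from the Example 1 data and its fitted line:

```python
# Standard error of estimate and coefficient of determination, Example 1.
x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]
n = len(x)
b0, b1 = 0.4091, 0.4988          # fitted line from Example 1

y_hat = [b0 + b1 * xi for xi in x]
y_bar = sum(y) / n
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # error sum of squares
sst = sum((yi - y_bar) ** 2 for yi in y)                # total sum of squares
s = (sse / (n - 2)) ** 0.5       # standard error of estimate ≈ 0.173
r2 = 1 - sse / sst               # coefficient of determination ≈ 0.802
```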
2.7 MODEL EVALUATION
(iii) Hypothesis test
• A process that uses sample statistics to test a claim about the
value of a population parameter.
• Example: an automobile manufacturer advertises that its new
hybrid car has a mean mileage of 50 miles per gallon. To test this
claim, a sample would be taken. If the sample mean differs enough
from the advertised mean, you can decide the advertisement is
wrong.
© 2019 Petroliam Nasional Berhad (PETRONAS)
2.7 MODEL EVALUATION
(III) THE HYPOTHESIS TEST
• One-sided (tailed) lower-tail test: H₀: β₁ ≥ 0 vs H₁: β₁ < 0
• One-sided (tailed) upper-tail test: H₀: β₁ ≤ 0 vs H₁: β₁ > 0
• Two-sided (tailed) test: H₀: β₁ = 0 vs H₁: β₁ ≠ 0
2.7 Model Evaluation
(iii) Model evaluation – t-test
Electricity Usage (y)(kWh): 2.48 2.26 2.47 2.77 2.99 3.05 3.18 3.46 3.03 3.26 2.67 2.53
Set up the hypothesis: H₀: β₁ = 0 vs H₁: β₁ ≠ 0
Excel Results
Regression Statistics
Multiple R 0.895605603
R Square 0.802109396
Adjusted R Square 0.782320336
Standard Error 0.172947969
Observations 12
ANOVA
df SS MS F Significance F
Regression 1 1.212381668 1.21238 40.53297031 8.1759E-05
Residual 10 0.299109998 0.02991
Total 11 1.511491667
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 0.409048191 0.385990515 1.05974 0.314189743 -0.450992271 1.269088653 -0.45099227 1.269088653
X Variable 1 0.498830121 0.078351706 6.36655 8.1759E-05 0.324251642 0.673408601 0.32425164 0.673408601
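The slope t statistic in the Excel table can be reproduced from first principles, using t = β̂₁ / se(β̂₁) with se(β̂₁) = √(MSE / S_XX):

```python
# Reproducing Excel's "t Stat" for the slope in Example 1.
x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]
n = len(x)

sxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
b1 = sxy / sxx
b0 = sum(y) / n - b1 * sum(x) / n
sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)
se_b1 = (mse / sxx) ** 0.5       # standard error of the slope ≈ 0.0784
t_stat = b1 / se_b1              # Excel reports t Stat ≈ 6.3666
```

Since t far exceeds the two-sided critical value for 10 degrees of freedom, H₀: β₁ = 0 is rejected, matching the p-value of 8.18e-05 in the table.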
Excel Results: Regression Line
[Figure: Production line fit plot — electricity usage (y) vs production (x)
with observed and predicted values; fitted line y = 0.4988x + 0.409, R² = 0.8021]
2.8 Example 1 - Summary
Estimated Regression Line: ŷ = 0.4091 + 0.4988x
Electricity usage = 0.4091 + 0.4988 × Production
Standard Error of Estimate = 0.173
Coefficient of Determination R² = 0.802
2.9 Example 2 - Application of SLR to Hydraulic-calibration data
Example: given data on Permeability and Reservoir Quality Index
(RQI), investigate the dependence of RQI (Y) on Permeability (X).
Set up the hypothesis: H₀: β₁ = 0 vs H₁: β₁ ≠ 0
Excel Results – Example 2
Regression Statistics
Multiple R 0.680322
R Square 0.462837
Adjusted R Square 0.461716
Standard Error 0.40947
Observations 481
ANOVA
df SS MS F Significance F
Regression 1 69.19926 69.19926 412.7226 1.22E-66
Residual 479 80.31167 0.167665
Total 480 149.5109
Excel Results – Example 2
[Figure: Permeability (md) line fit plot — RQI (y) vs Permeability (x) with
observed and predicted values; fitted line y = 0.3097 + 0.0017x, R² = 0.4628]
2.9 Example 2 - Interpretation of the results
• Permeability (md) coefficient (β̂₁ = 0.0017): each unit increase
in Permeability adds 0.0017 to the RQI value when all other
variables are fixed.
• β̂₁ > 0 (positive relationship): RQI increases with the increase
in Permeability.
• Intercept coefficient (β̂₀ = 0.309): the value of RQI when
Permeability equals zero.
• R Square = 0.462837: the model explains about 46% of the total
variability in the RQI values around their mean.
• P-value < 0.05: the regression is significant.