Introduction to Regression Analysis
Lecturer: Wilhemina Adoma Pels
KNUST
January 24, 2024
SIMPLE LINEAR REGRESSION
REGRESSION
Regression is a statistical method used to describe the nature of the relationship between variables, that is, positive or negative, linear or nonlinear.
Regression analysis is used to predict the value of one variable (the dependent variable) on the basis of other variables (the independent variables).
The variable we are trying to predict is called the response or dependent variable, denoted Y.
The variable used to predict it is called the explanatory or independent variable, denoted X.
If we only have one independent variable, the model is
y = β0 + β1 x + ε (1)
This model is referred to as simple linear regression.
Applications
Economics
Social Science
Engineering
Management
Life & Biological Sciences
SIMPLE LINEAR REGRESSION
A model that estimates the linear relationship between a single dependent variable Y and an independent variable X.
Model
Yi = β0 + β1 Xi + εi i = 1, · · · , n (2)
Variables:
X = independent variable (we provide this)
Y = dependent variable (we observe this)
Parameters:
β0 = Y-intercept
β1 = slope
ε = random error
In this model β0 and β1 are the parameters and εi is the random error term; Yi and Xi are measured values.
SIMPLE LINEAR REGRESSION
Required Conditions (Assumptions)
For these regression methods to be valid, the following four conditions for the error variable ε must be met:
The probability distribution of ε is normal.
The mean of the distribution is 0; that is, E (ε) = 0.
The standard deviation of ε is σε, which is constant regardless of the value of x.
The value of ε associated with any particular value of y is
independent of ε associated with any other value of y .
LEAST SQUARE ESTIMATION OF THE PARAMETERS
Estimating the Coefficients
In much the same way we base estimates of µ on x̄, we estimate β0 with β̂0 and β1 with β̂1, the y-intercept and slope respectively of the least squares or regression line given by:

ŷ = β̂0 + β̂1x (3)

This is an application of the least squares method, and it produces the straight line that minimizes the sum of the squared differences between the observations yi and the fitted line.
The Least Squares Line
Figure: Least Squares Line
LEAST SQUARE ESTIMATION OF THE PARAMETERS
\[ L = \min \sum_{i=1}^{n} \hat{\varepsilon}_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \tag{4} \]

\[ L = \min \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 \tag{5} \]

\[ \frac{\partial L}{\partial \hat{\beta}_0} = -2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 \tag{6} \]

\[ \frac{\partial L}{\partial \hat{\beta}_1} = -2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\, x_i = 0 \tag{7} \]

Simplifying the equations yields

\[ n \hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i \tag{8} \]
LEAST SQUARE ESTIMATION OF THE PARAMETERS
Cont’d
\[ \hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i x_i \tag{9} \]

The solution to these equations gives the least squares estimators of β0 and β1. The least squares estimates of the intercept and slope in the simple linear regression model are

\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \tag{10} \]

and

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} y_i x_i - \frac{\left(\sum_{i=1}^{n} y_i\right)\left(\sum_{i=1}^{n} x_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}} \tag{11} \]
LEAST SQUARE ESTIMATION OF THE PARAMETERS
Cont’d
The formula for the slope can also be written using sums of squares:

\[ \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} \tag{12} \]

where

\[ S_{xy} = \sum_{i=1}^{n} y_i x_i - \frac{\left(\sum_{i=1}^{n} y_i\right)\left(\sum_{i=1}^{n} x_i\right)}{n} \tag{13} \]

and

\[ S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} \tag{14} \]
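As an illustration, the estimates in equations (10)–(14) can be computed directly from the raw sums. A minimal Python sketch (the function name `least_squares_fit` is ours, not part of the slides):

```python
def least_squares_fit(x, y):
    """Least squares estimates (b0, b1) using the sum-of-squares
    formulas: b1 = S_xy / S_xx and b0 = y_bar - b1 * x_bar."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    s_xx = sum(xi ** 2 for xi in x) - sum_x ** 2 / n
    b1 = s_xy / s_xx                  # slope estimate
    b0 = sum_y / n - b1 * sum_x / n   # intercept estimate
    return b0, b1
```

Calling `least_squares_fit` on any paired data returns the intercept and slope of the fitted line ŷ = β̂0 + β̂1x.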
SIMPLE LINEAR REGRESSION
Regression Equation
The regression equation describes the regression line mathematically through β̂0 and β̂1, the intercept and the slope. In the graph below, a corresponds to β̂0 and b to β̂1.
REGRESSION
Cont’d
Example
The amount of a chemical compound y, dissolved in 100 grams of water at various temperatures x, was recorded as follows:

x (°C)      6   5  10   7   8  12   5   9   7  11
y (grams)  21  19  31  25  28  33  20  29  22  32
1. Fit the linear regression model y = β0 + β1x + ε to these data, using the method of least squares.
2. Estimate the amount of the chemical compound that will dissolve in 100 grams of water at 7.5 °C.
Solution
xi   yi   xi·yi   xi²
6 21 126 36
5 19 95 25
10 31 310 100
7 25 175 49
8 28 224 64
12 33 396 144
5 20 100 25
9 29 261 81
7 22 154 49
11 32 352 121
Σ = 80 260 2193 694
Solution
Sxy = 2193 − (260 × 80)/10 = 113

Sxx = 694 − (80)²/10 = 54

β̂1 = Sxy/Sxx = 113/54 = 2.093

With x̄ = 80/10 = 8 and ȳ = 260/10 = 26:

β̂0 = ȳ − β̂1x̄ = 26 − (2.093 × 8) = 9.259

The regression model is ŷ = 9.259 + 2.093x

2. When x = 7.5:

ŷ = 9.259 + 2.093(7.5) = 24.954
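As a quick check, the computation above can be reproduced in a few lines of Python (a sketch, not part of the original slides):

```python
x = [6, 5, 10, 7, 8, 12, 5, 9, 7, 11]
y = [21, 19, 31, 25, 28, 33, 20, 29, 22, 32]
n = len(x)

s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n  # 2193 - 2080 = 113
s_xx = sum(a ** 2 for a in x) - sum(x) ** 2 / n                # 694 - 640 = 54
b1 = s_xy / s_xx                     # slope, about 2.093
b0 = sum(y) / n - b1 * sum(x) / n    # intercept, about 9.259

print(round(b0 + b1 * 7.5, 3))       # predicted amount at 7.5 degrees C -> 24.954
```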
Interpretation of Coefficients in Regression Analysis
The coefficients describe the mathematical relationship between each
independent variable and the dependent variable.
The size of the coefficient for each independent variable gives you the
size of the effect that variable is having on your dependent variable,
and the sign on the coefficient (positive or negative) gives you the
direction of the effect.
1 A positive coefficient indicates that as the value of the independent
variable increases, the dependent variable also tends to increase.
2 A negative coefficient suggests that as the independent variable
increases, the dependent variable tends to decrease or vice versa.
The intercept is the average value of the dependent variable when the independent variable is zero.
Interpretation of Coefficients in Regression Analysis
Cont’d
Now interpret this Regression Equation;
ŷ = 4.692 + 0.923x (15)
SIMPLE LINEAR REGRESSION
Line of best fit Plot
SIMPLE LINEAR REGRESSION
Estimating the Variance of the error term ε
The residual

\[ \hat{\varepsilon}_i = y_i - \hat{y}_i \tag{16} \]

is used to obtain an estimate of the error term. The sum of squares of the residuals (the error sum of squares) is

\[ SSE = \sum_{i=1}^{n} \hat{\varepsilon}_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{17} \]

The expected value of the error sum of squares is

\[ E(SSE) = (n-2)\sigma^2 \tag{18} \]
REGRESSION
Cont’d
Therefore the unbiased estimator of σ² is

\[ \hat{\sigma}^2 = \frac{SSE}{n-2} \tag{19} \]

Also, the standard error of estimate is

\[ S_\varepsilon = \sqrt{\frac{SSE}{n-2}} \tag{20} \]
If Sε is zero, all the points fall on the regression line. If Sε is small, the fit is excellent and the linear model can be used for forecasting. If Sε is large, the model fits poorly.
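As a sketch, Sε can be computed from the observations and fitted values (the function name and the variable names `y`/`y_hat` are ours, not part of the slides):

```python
import math

def standard_error_of_estimate(y, y_hat):
    """S_eps = sqrt(SSE / (n - 2)), where SSE is the sum of
    squared residuals (y_i - y_hat_i)^2."""
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return math.sqrt(sse / (len(y) - 2))
```

For the chemical-compound example earlier, this gives Sε ≈ 1.30, which is small relative to the y values (19 to 33), consistent with a good fit.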
Example
The following measurements of the specific heat of a certain chemical were
made in order to investigate the variation in specific heat with
temperature.
Temperature °C (x)   0     10    20    30    40
Specific heat (y)   0.51  0.55  0.57  0.59  0.63

Find the least squares regression line of specific heat on temperature, and hence estimate the value of the specific heat when the temperature is 25 °C.
Solution
x     y     xy     x²
0    0.51    0      0
10   0.55   5.5    100
20   0.57  11.4    400
30   0.59  17.7    900
40   0.63  25.2   1600
Σ   100    2.85  59.8  3000

Sxy = Σxy − (Σx)(Σy)/n = 59.8 − (100)(2.85)/5 = 2.8

Sxx = Σx² − (Σx)²/n = 3000 − (100)²/5 = 1000

β̂1 = Sxy/Sxx = 2.8/1000 = 0.0028
Solution
β̂0 = ȳ − β̂1x̄

ȳ = 2.85/5 = 0.57

x̄ = 100/5 = 20

β̂0 = 0.57 − 0.0028(20) = 0.514

The fitted least squares regression line is ŷ = β̂0 + β̂1x:

ŷ = 0.514 + 0.0028x
Solution
At 25 °C:

ŷ = 0.514 + 0.0028(25) = 0.584
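As a check, the fitted line can be reproduced in Python (a sketch, not part of the original slides):

```python
x = [0, 10, 20, 30, 40]
y = [0.51, 0.55, 0.57, 0.59, 0.63]
n = len(x)

s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n  # 59.8 - 57 = 2.8
s_xx = sum(a ** 2 for a in x) - sum(x) ** 2 / n                # 3000 - 2000 = 1000
b1 = s_xy / s_xx                     # slope, 0.0028
b0 = sum(y) / n - b1 * sum(x) / n    # intercept, 0.514

print(round(b0 + b1 * 25, 3))        # predicted specific heat at 25 degrees C -> 0.584
```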
REGRESSION
Testing the slope
If no linear relationship exists between the two variables, we would
expect the regression line to be horizontal, that is, to have a slope of
zero.
We want to see if there is a linear relationship, i.e. we want to see if the slope (β1) is something other than zero. Our hypotheses become:
H0 : β1 = 0 [no linear relationship]
H1 : β1 ≠ 0 [there is a linear relationship]
REGRESSION
Cont’d
We can use this test statistic to test the hypothesis:

\[ t = \frac{\hat{\beta}_1 - \beta_1}{S_{\hat{\beta}_1}} \tag{21} \]

where S_{β̂1} is the standard deviation of β̂1, defined as

\[ S_{\hat{\beta}_1} = \sqrt{\frac{\hat{\sigma}^2}{S_{xx}}} \tag{22} \]

where

\[ S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} \tag{23} \]
REGRESSION
Cont’d
If the error term ε is normally distributed, the test statistic has a Student t-distribution with n − 2 degrees of freedom. The rejection region depends on whether we are doing a one-tailed or two-tailed test (a two-tailed test is most typical).
We reject the null hypothesis H0 if |tcal| > tα/2, n−2.
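Using the chemical-compound data from the earlier example, the test can be sketched in Python (the critical value 2.306 is t at α/2 = 0.025 with 8 degrees of freedom, taken from a t-table; a library such as scipy could compute it, but it is hard-coded here to stay dependency-free):

```python
import math

x = [6, 5, 10, 7, 8, 12, 5, 9, 7, 11]
y = [21, 19, 31, 25, 28, 33, 20, 29, 22, 32]
n = len(x)

s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
s_xx = sum(a ** 2 for a in x) - sum(x) ** 2 / n
s_yy = sum(b ** 2 for b in y) - sum(y) ** 2 / n
b1 = s_xy / s_xx

sse = s_yy - b1 * s_xy            # shortcut: SSE = S_yy - b1 * S_xy
sigma2 = sse / (n - 2)            # unbiased estimate of sigma^2
s_b1 = math.sqrt(sigma2 / s_xx)   # standard deviation of b1-hat
t = b1 / s_b1                     # test statistic under H0: beta_1 = 0

t_crit = 2.306                    # t_{0.025, 8} from a t-table
print(round(t, 2), abs(t) > t_crit)
```

Here t ≈ 11.82 exceeds 2.306, so H0 is rejected: the data support a linear relationship between temperature and amount dissolved.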
Properties of the OLS estimates
These can be summarized as: the OLS estimator is BLUE
B - Best
L - Linear
U - Unbiased
E - Estimator
Note: The Gauss–Markov theorem is required for the proof.
GROUP ASSIGNMENT
1 PROVE THAT OLS IS BLUE
2 Estimate β0 and β1
Show working
Trial Questions
A study was made on the amount of converted sugar (y) in a certain process at various temperatures (x). The data were coded and recorded as follows:

(x) 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
(y) 8.1 7.8 8.5 9.8 9.5 8.9 8.6 10.2 9.3 9.2 10.5

a. Find the equation of the least squares regression line.
b. Estimate the converted sugar when the coded temperature is 1.75.
Trial Question
Regression methods were used to analyze the data from a study investigating the relationship between roadway surface temperature (x) and pavement deflection (y). Summary quantities were:

n = 20, Σyi = 12.75, Σyi² = 8.86, Σxi = 1478, Σxi² = 143215.8, and Σxi yi = 1083.67

a. Calculate the least squares estimates of the slope and intercept of the linear regression line.
b. Use the equation of the fitted regression line to predict the pavement deflection when the surface temperature is 75 °F.
Thank You.