
STAT22209 - Chapter 02 - Regression Analysis - 2022

Regression analysis

Advanced Statistics II

( PST22209/ FST 22209/ ESNRM22209)

R.M. KAPILA RATHNAYAKA


B.Sc. Special (Math. & Stat. ) (Ruhuna), M.Sc. (Industrial Mathematics) (USJ),
M.Sc. (Stat. ) (WHUT, China),
Ph.D. (Applied Statistics, WHUT)
Introduction to Correlation and Regression Analysis

Chapter 01
The Regression Analysis
• Regression analysis is a powerful statistical method that allows you to
examine the relationship between two or more variables of interest.

• A regression analysis generates an equation that describes the
statistical relationship between one or more predictors and the response
variable, and that can be used to predict new observations.

• It is a statistical tool used to determine the probable change in one
variable for a given amount of change in another. This means that the
value of the unknown variable can be estimated from the known value
of another variable.
Regression Equation

• The regression equation is the algebraic expression of the
regression lines.

Y = a + bX

• X - independent variable; Y - dependent variable

• a - intercept on the Y axis; b - slope of the line

• Dependent variable: this is the main factor that you are trying
to understand or predict.

• Independent variables: these are the factors that you
hypothesize have an impact on your dependent variable.
• There are two methods of obtaining a regression line:
1) The scatter diagram method

2) The method of least squares

The scatter diagram method
• A scatter diagram is the simplest method for representing data.

• Suppose the two variables are X and Y and there are n pairs
of values (x1, y1), (x2, y2), ..., (xn, yn).

• Generally, the independent variable is plotted along the
horizontal (X) axis and the dependent variable along the
vertical (Y) axis.

• Plotting your data is the first step in figuring out whether there
is a relationship between your independent and dependent
variables.
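• Calculate the mean values (X̄, Ȳ).
• Plot the paired observations.
• Then draw the line through the mean point (X̄, Ȳ).

[Figure: scatter diagram of Variable Y against Variable X with the line
Y = a + bX drawn through the mean point (X̄, Ȳ); a is the intercept on
the Y axis and b = ΔY/ΔX is the slope.]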
Example:
The data given below were collected from 6 employees of a
Department of Physical Sciences and Technology and show their
years of service and monthly income. Plot the values and
obtain the regression line of Y on X.

Employee                    A    B    C    D    E    F
Years of service (X)        2    3    5    6    8    9
Income (in Rs. 1000) (Y)    5    6    7    8   12   14

Why should your organization use regression analysis?

• Regression analysis is a helpful statistical method that can be
used across an organization to determine the degree to
which particular independent variables influence
dependent variables.
The Method of Ordinary Least Squares

• In ordinary least squares (OLS) regression, the estimated
equation is calculated by determining the equation that
minimizes the sum of the squared distances between the
sample's data points and the values predicted by the
equation.
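
The least-squares criterion can be illustrated numerically. The sketch below, in Python, computes the sum of squared distances for a few candidate lines on the years-of-service and income data from the example above. The function name `rss` and the three candidate (a, b) pairs are illustrative assumptions, not values from the slides.

```python
# Years of service (X) and income in Rs. 1000 (Y) from the example table.
X = [2, 3, 5, 6, 8, 9]
Y = [5, 6, 7, 8, 12, 14]


def rss(a, b, xs, ys):
    """Sum of squared vertical distances between the points and the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical candidate lines (intercept a, slope b), chosen only for illustration.
candidates = [(0.0, 1.5), (1.0, 1.4), (2.0, 1.2)]

scores = {(a, b): rss(a, b, X, Y) for a, b in candidates}
for (a, b), value in scores.items():
    print(f"a = {a}, b = {b}: sum of squares = {value:.2f}")

# Among these guesses, the line with the smallest sum of squares fits best;
# OLS finds the line that minimizes this quantity over all possible (a, b).
best = min(scores, key=scores.get)
print("best of the candidates:", best)
```

None of the three guesses need be the OLS line itself; the point is only that the criterion ranks any two lines against each other.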
The Classical Assumptions
Assumption 1: The disturbances u_i have zero mean, i.e., E(u_i) = 0 for
every i = 1, 2, ..., n.

• This assumption is needed to ensure that, on average, we are on
the true line.
Assumption 2: The disturbances have a constant variance, i.e.,
var(u_i) = σ² for every i = 1, 2, ..., n. This ensures that every
observation is equally reliable.

Assumption 3: The disturbances are not correlated, i.e., E(u_i u_j) = 0
for i ≠ j, with i, j = 1, 2, ..., n.

Assumption 4: The explanatory variable X is non-stochastic, i.e., fixed
in repeated samples, and hence not correlated with the
disturbances. Also, Σ(X_i - X̄)²/n ≠ 0 and has a finite limit as n
tends to infinity.
Least squares Estimation

• Least squares minimizes the residual sum of squares, where
the residuals are given by

e_i = Y_i - a - bX_i,   i = 1, 2, ..., n,

and a and b denote guesses on the regression parameters α (intercept)
and β (slope), respectively.

• The residual sum of squares, denoted by

RSS = Σ e_i² = Σ (Y_i - a - bX_i)²,

is minimized by the two first-order conditions:

∂RSS/∂a = -2 Σ (Y_i - a - bX_i) = 0
∂RSS/∂b = -2 Σ (Y_i - a - bX_i) X_i = 0

• The equations ΣY = na + bΣX and ΣXY = aΣX + bΣX² are called the
least-squares equations for estimating the parameters of a line.

• The least-squares equations are linear in a and b and hence can
be solved simultaneously. The solutions are

b = [ΣXY - (ΣX)(ΣY)/n] / [ΣX² - (ΣX)²/n]   and   a = (ΣY - bΣX)/n
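
As a rough check on the algebra, the following Python sketch builds the two least-squares equations from the sums and solves them simultaneously (Cramer's rule for the slope, back-substitution for the intercept). The function name `solve_normal_equations` and the small data set, which lies exactly on the line Y = 1 + 2X, are hypothetical and chosen so that the answer is known in advance.

```python
def solve_normal_equations(xs, ys):
    """Solve sum(Y) = n*a + b*sum(X) and sum(XY) = a*sum(X) + b*sum(X^2) for (a, b)."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)

    # Determinant of the 2x2 system; it is zero only when all X values are equal.
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("all X values are identical; the slope is undefined")

    b = (n * sxy - sx * sy) / det
    a = (sy - b * sx) / n
    return a, b


# Hypothetical data lying exactly on Y = 1 + 2X.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
a, b = solve_normal_equations(xs, ys)

# First-order conditions: the residuals sum to zero, and so do residuals times X.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
foc_a = sum(residuals)
foc_b = sum(e * x for e, x in zip(residuals, xs))
print(a, b, foc_a, foc_b)
```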
Example:
The data given below were collected from 6 employees of SUSL and
show their years of service and monthly income.

Employee                    A    B    C    D    E    F
Years of service (X)        2    3    5    6    8    9
Income (in Rs. 1000) (Y)    5    6    7    8   12   14

08/04/2024 17
  X     Y     XY     X²
  2     5     10      4
  3     6     18      9
  5     7     35     25
  6     8     48     36
  8    12     96     64
  9    14    126     81

ΣX = 33,  ΣY = 52,  ΣXY = 333,  ΣX² = 219,  n = 6

b = [ΣXY - (ΣX)(ΣY)/n] / [ΣX² - (ΣX)²/n]   and   a = (ΣY - bΣX)/n

b = [333 - (33 × 52)/6] / [219 - (33)²/6] = 47/37.5 ≈ 1.2533

a = (52 - 1.2533 × 33)/6 ≈ 1.77

∴ Y = 1.77 + 1.25X
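
Recomputing the sums directly from the six (X, Y) pairs in the data table is a useful check on a hand-built working table, since a slip in any row carries through to a and b. This is a minimal Python sketch; the variable names are illustrative.

```python
# Data from the example table: years of service (X), income in Rs. 1000 (Y).
X = [2, 3, 5, 6, 8, 9]
Y = [5, 6, 7, 8, 12, 14]
n = len(X)

# Column totals of the working table.
sum_x = sum(X)
sum_y = sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_x2 = sum(x * x for x in X)
print(sum_x, sum_y, sum_xy, sum_x2)  # 33 52 333 219

# Slope and intercept from the least-squares formulas.
b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
a = (sum_y - b * sum_x) / n
print(f"Y = {a:.3f} + {b:.3f} X")  # Y = 1.773 + 1.253 X
```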

Example: Test scores and sales data of salesmen.

Salesman            A    B    C    D    E    F    G    H    I    J
Test score (X)     50   80   60   70   90   60   80   50   70   90
Sales ('000) (Y)  3.5  7.0  5.0  6.0  5.0  4.0  6.0  4.0  5.5  4.0

From the above data, calculate the regression line of Y on X
and estimate the probable weekly sales volume for a score
of 100 in the intelligence test.
n  10  x 700  y  50   2000  xy  70
x 2

 Y  0.035 x  2.55

x  100  Y  0.035 (100)  2.55  6.05

Thus, the most probable weekly sales volume if


a salesman makes a score of 100 in the
intelligence test is 6050 or 6.05 thousands.
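
The same result can be reproduced from deviations about the means. The sketch below (Python, with illustrative variable names) recomputes the sums for the ten salesmen and the prediction at a test score of 100.

```python
# Test scores (X) and weekly sales in thousands (Y) for salesmen A to J.
X = [50, 80, 60, 70, 90, 60, 80, 50, 70, 90]
Y = [3.5, 7.0, 5.0, 6.0, 5.0, 4.0, 6.0, 4.0, 5.5, 4.0]
n = len(X)

x_bar = sum(X) / n
y_bar = sum(Y) / n

# Sums of squares and cross-products of deviations from the means.
s_xx = sum((x - x_bar) ** 2 for x in X)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))

b = s_xy / s_xx
a = y_bar - b * x_bar
predicted = a + b * 100
print(f"Y = {b:.3f} X + {a:.2f}; predicted sales at X = 100: {predicted:.2f} thousand")
```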

Example
A sample of 6 persons was selected; their ages (x variable) and
weights (y variable) are shown in the following table.

Find the regression equation and the predicted weight
when age is 8.5 years.

Serial no.   Age (x)   Weight (y)
1            7         12
2            6          8
3            8         12
4            5         10
5            6         11
6            9         13
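
One way to work this example is with the least-squares formulas for a and b. The following Python sketch is not part of the original slides; the helper name `fit_line` is an assumption made for illustration.

```python
# Age in years (x) and weight (y) for the six persons in the table.
ages = [7, 6, 8, 5, 6, 9]
weights = [12, 8, 12, 10, 11, 13]


def fit_line(xs, ys):
    """Return (a, b) of the least-squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    a = (sum_y - b * sum_x) / n
    return a, b


a, b = fit_line(ages, weights)
weight_at_8_5 = a + b * 8.5
print(f"weight = {a:.2f} + {b:.2f} * age")
print(f"predicted weight at age 8.5: {weight_at_8_5:.2f}")
```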
Exercise 2
• The following are the ages (in years) and systolic
blood pressures of 20 apparently healthy adults.

B.P. (y)   Age (x)     B.P. (y)   Age (x)
128        46          120        20
136        53          128        43
146        60          141        63
124        20          126        26
143        63          134        53
130        43          128        31
124        26          136        58
121        19          132        46
126        31          140        58
123        23          144        70

1. Find the correlation between age and
blood pressure using the simple (Pearson) and
Spearman's correlation coefficients, and
comment.

2. Find the regression equation.

3. What is the predicted blood pressure for a
25-year-old man?
Regression validation
• Model validation is possibly the most important step in the
model-building sequence.

• Many statistical tools for model validation can be found
in the literature.

• However, the primary tool for most process modeling applications
is graphical residual analysis.
Residual Plots
• A residual plot is a graph that shows the residuals on the
vertical axis and the independent variable on the horizontal
axis.

• If the points in a residual plot are randomly dispersed around
the horizontal axis, a linear regression model is appropriate
for the data; otherwise, a non-linear model is more
appropriate.
The chart displays the residual (e) and independent variable (X) as a residual plot.

• The residual plot shows a fairly random pattern:
– The first residual is positive,
– the next two are negative,
– the fourth is positive,
– and the last residual is negative.

• This random pattern indicates that a linear
model provides a decent fit to the data.
Residual Plots
• The residual plots show three typical patterns.

• The first plot shows a random pattern, indicating a good fit
for a linear model.

• The other plot patterns are non-random (U-shaped and
inverted U), suggesting a better fit for a non-linear model.
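
Residuals are straightforward to compute once a line has been fitted. As a rough illustration, the Python sketch below fits a least-squares line to the years-of-service and income data from the earlier example and lists each residual with its sign. The variable names are illustrative, and with only six points any apparent pattern should be read cautiously.

```python
# Years of service (X) and income in Rs. 1000 (Y) from the earlier example.
X = [2, 3, 5, 6, 8, 9]
Y = [5, 6, 7, 8, 12, 14]
n = len(X)

# Least-squares slope and intercept.
x_bar = sum(X) / n
y_bar = sum(Y) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sum((x - x_bar) ** 2 for x in X)
a = y_bar - b * x_bar

# Residual = observed value minus the value predicted by the fitted line.
residuals = [y - (a + b * x) for x, y in zip(X, Y)]
signs = ["+" if e > 0 else "-" for e in residuals]

for x, e, s in zip(X, residuals, signs):
    print(f"X = {x}: residual = {e:+.3f} ({s})")

# For a least-squares line with an intercept, the residuals sum to zero
# (up to floating-point rounding).
print(f"sum of residuals = {sum(residuals):.6f}")
```

Plotting these residuals against X, for example with a plotting library, gives a residual plot of the kind described above.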
What Is R-squared?
• R-squared is a statistical measure of how close the data are
to the fitted regression line.
• The definition of R-squared is fairly straightforward: it is
the percentage of the response variable variation that is
explained by a linear model.

R-squared = Explained Variation / Total Variation

Total Variation = Σ (Y_i - Ȳ)²

Explained Variation = Σ (Ŷ_i - Ȳ)²
What Is R-squared?
• R-squared is always between 0 and 100%:

• 0% indicates that the model explains none of the variability
of the response data around its mean.

• 100% indicates that the model explains all the variability of
the response data around its mean.

• In general, the higher the R-squared, the better the model fits
your data.
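
To make the definition concrete, R-squared can be computed as explained variation divided by total variation. The Python sketch below does this for the test-score and sales data from the earlier example; the variable names are illustrative assumptions.

```python
# Test scores (X) and weekly sales in thousands (Y) from the salesmen example.
X = [50, 80, 60, 70, 90, 60, 80, 50, 70, 90]
Y = [3.5, 7.0, 5.0, 6.0, 5.0, 4.0, 6.0, 4.0, 5.5, 4.0]
n = len(X)

# Least-squares line fitted to the data.
x_bar = sum(X) / n
y_bar = sum(Y) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sum((x - x_bar) ** 2 for x in X)
a = y_bar - b * x_bar
fitted = [a + b * x for x in X]

# Total variation: spread of the observed Y values around their mean.
total_variation = sum((y - y_bar) ** 2 for y in Y)
# Explained variation: spread of the fitted values around the same mean.
explained_variation = sum((f - y_bar) ** 2 for f in fitted)

r_squared = explained_variation / total_variation
print(f"R-squared = {r_squared:.1%}")
```

For these data the value comes out at roughly 21%, which suggests that test score alone accounts for only a modest share of the variation in sales.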
• The regression model on the left accounts for 38.0% of the variance, while the one
on the right accounts for 87.4%.

• The more variance that is accounted for by the regression model, the closer the
data points will fall to the fitted regression line.

• Theoretically, if a model could explain 100% of the variance, the fitted values
would always equal the observed values and, therefore, all the data points would
fall on the fitted regression line.
Exercise
• The following are the ages (in years) and systolic
blood pressures of 20 apparently healthy adults.

B.P. (y)   Age (x)     B.P. (y)   Age (x)
128        46          120        20
136        53          128        43
146        60          141        63
124        20          126        26
143        63          134        53
130        43          128        31
124        26          136        58
121        19          132        46
126        31          140        58
123        23          144        70

1. Find the correlation between age and
blood pressure using the simple (Pearson) or Spearman's
correlation coefficient, and comment.

2. Find the regression equation.

3. Calculate R-squared and comment.

4. What is the predicted blood pressure for a
25-year-old man?
