
Arba Minch University
College of Medicine and Health Sciences
School of Public Health

MULTIPLE LINEAR REGRESSION

Mesfin Kote (MPH/Epidemiology & Biostatistics, PhD)

December 26, 2024


Multivariable Analysis
Multivariable analysis refers to the analysis of data that takes into account a number of explanatory variables and one outcome variable simultaneously.
It allows for the efficient estimation of measures of association while controlling for a number of confounding factors.
All types of multivariable analysis involve the construction of a mathematical model to describe the association between independent and dependent variables.
Multivariable Analysis
A large number of multivariate models have
been developed for specialized purposes,
each with a particular set of assumptions
underlying its applicability.
The choice of the appropriate model is
based on the underlying design of the
study, the nature of the variables, as well
as assumptions regarding the inter-
relationship between the exposures and
outcomes under investigation.
Multiple Linear Regression
The simple linear regression model is easily extended to the case of two or more explanatory variables.
Multiple regression is a statistical technique used to measure and describe the function relating two or more predictor (independent) variables to a single response (dependent) variable.
The model has the form: y = a + b1x1 + b2x2 + … + bnxn
Where:
The regression coefficients b1 … bn represent the independent contributions of each explanatory variable to the prediction of the dependent variable.
x1 … xn represent the individual's particular set of values for the independent variables.
n is the number of independent (predictor) variables.
Multiple Linear Regression …
For example, if birth weight depends on maternal age, sex, gestational age and parity, a model which includes these variables might better explain the variation in birth weight. We could fit a multiple regression of the form:
Birth weight = a + b1·Maternal age + b2·Sex + b3·Gestational age + b4·Parity
Multiple Linear Regression …
After fitting a multiple regression model, we obtain a point estimate for each 'b' and for the intercept 'a'.
Interpretation of the b coefficients is the same as for the slope in simple linear regression: a change in xi of one unit will produce a change in y of bi units, with all the other variables held constant.
We can also test the null hypothesis H0: bi = 0 for each coefficient.
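A minimal Python sketch of this step (illustrative only): the data below are simulated and the variable names simply echo the birth weight example; statsmodels then reports the point estimate, standard error and the t-test of H0: bi = 0 for each coefficient.

```python
# Minimal sketch: fitting y = a + b1*x1 + ... + bn*xn with statsmodels.
# The data are simulated purely for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "maternal_age":    rng.uniform(18, 40, n),   # years (hypothetical)
    "sex":             rng.integers(0, 2, n),    # 0 = female, 1 = male
    "gestational_age": rng.uniform(36, 42, n),   # weeks (hypothetical)
    "parity":          rng.integers(0, 5, n),
})
# Simulate birth weight (grams) from an assumed "true" model plus noise.
df["birth_weight"] = (500 + 5 * df["maternal_age"] + 100 * df["sex"]
                      + 70 * df["gestational_age"] + 20 * df["parity"]
                      + rng.normal(0, 150, n))

# Birth weight = a + b1*Maternal age + b2*Sex + b3*Gestational age + b4*Parity
model = smf.ols("birth_weight ~ maternal_age + sex + gestational_age + parity",
                data=df).fit()
print(model.params)    # intercept a and coefficients b1..b4
print(model.bse)       # their standard errors
print(model.pvalues)   # t-test of H0: bi = 0 for each coefficient
```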
Multiple linear regression

Relation between a continuous variable and a set of i continuous variables:

Y = α + β1x1 + β2x2 + … + βixi

Partial regression coefficient βi: the amount by which Y changes on average when xi changes by one unit and all the other x's remain constant. It measures the association between xi and Y adjusted for all the other x's.

Example: SBP versus age, weight, height, etc.
Multiple linear regression

Y = α + β1x1 + β2x2 + … + βixi

Y is variously called the predicted, response, outcome, or dependent variable.
x1 … xi are the predictor, explanatory, co- (covariate), or independent variables.
Assumptions

a) First of all, as is evident in the name multiple linear regression, it is assumed that the relationship between the dependent variable and each continuous explanatory variable is linear.
We can examine this assumption for any variable by plotting the residuals (the differences between the observed values of the dependent variable and those predicted by the regression equation) against that variable (i.e., by using bivariate scatter plots).
Any curvature in the pattern will indicate that a non-linear relationship is more appropriate.
Assumptions …

b) It is assumed in multiple regression that the residuals follow a normal distribution and have the same variability throughout the range.
c) The observations are independent.
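As a hedged sketch of these checks (continuing the simulated birth weight example above, so the `model` and `df` objects are assumed to exist), residuals can be plotted against a predictor, against the fitted values, and on a normal Q-Q plot:

```python
# Residual diagnostics for the (hypothetical) fitted model above.
import matplotlib.pyplot as plt
import statsmodels.api as sm

resid = model.resid            # observed y minus predicted y
fitted = model.fittedvalues

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# (a) Linearity: residuals vs one continuous predictor; curvature would
#     suggest a non-linear relationship is more appropriate.
axes[0].scatter(df["gestational_age"], resid)
axes[0].axhline(0, linestyle="--")
axes[0].set_xlabel("gestational_age")
axes[0].set_ylabel("residual")

# (b) Constant variance: residuals vs fitted values should show no funnel shape.
axes[1].scatter(fitted, resid)
axes[1].axhline(0, linestyle="--")
axes[1].set_xlabel("fitted value")
axes[1].set_ylabel("residual")

# (c) Normality: points near the reference line support normally distributed residuals.
sm.qqplot(resid, line="s", ax=axes[2])

plt.tight_layout()
plt.show()
```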
Predicted and Residual
Scores
The regression line expresses the best
prediction of the dependent variable (Y),
given the independent variables (X).
However, nature is rarely (if ever) perfectly
predictable, and usually there is substantial
variation of the observed points around the
fitted regression line.
The deviation of a particular point from the
regression line (its predicted value) is
called the residual value.
Residual Variance and R-square
The smaller the variability of the residual
values around the regression line relative to
the overall variability, the better is our
prediction.
For example, if there is no relationship
between the X and Y variables, then
the ratio of the residual variability of
the Y variable to the original variance
is equal to 1.0.
If X and Y are perfectly related then there
is no residual variance and the ratio of
variance would be 0.0.
Residual Variance and R-square …

In most cases, the ratio would fall somewhere between these extremes, that is, between 0.0 and 1.0.
One minus this ratio is referred to as R-square, or the coefficient of determination.
This value is immediately interpretable in the following manner: if we have an R-square of 0.4, then we know that the variability of the Y values around the regression line is 1 − 0.4 = 0.6 times the original variance.
In other words, we have explained 40% of the original variability and are left with 60% residual variability.
Ideally, we would like to explain most, if not all, of the original variability.
Residual Variance and R-square

♣ The R-square value is an indicator of how well the model fits the
data
♣ An R-square close to 1.0 indicates that we have accounted for
almost all of the variability with the variables specified in the model.
N.B. a) The sources of variation are:
i) Due to regression ii) residual (about regression)
b) The sum of squares due to regression (SSR) over the total
sum of squares (TSS) is the proportion of the variability accounted
for by the regression.
♣ Therefore, the percentage variability accounted for or explained by
the regression is 100 times this proportion.
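A small numerical sketch of this decomposition (made-up data, numpy only): the regression and residual sums of squares add up to the total sum of squares, and SSR/TSS is the proportion of variability explained.

```python
# Sketch: decompose variability for a least-squares line fitted to made-up data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.8, 4.2, 5.1, 5.8])

slope, intercept = np.polyfit(x, y, 1)   # least-squares fit of y = a + b*x
y_hat = intercept + slope * x            # predicted values

tss = np.sum((y - y.mean()) ** 2)        # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)    # sum of squares due to regression
sse = np.sum((y - y_hat) ** 2)           # residual sum of squares

print(f"SSR + SSE = {ssr + sse:.3f},  TSS = {tss:.3f}")   # equal for OLS with an intercept
print(f"R^2 = SSR/TSS = {ssr / tss:.3f} = 1 - SSE/TSS = {1 - sse / tss:.3f}")
print(f"Percent of variability explained: {100 * ssr / tss:.0f}%")
```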
Interpretation

In a multiple regression model, we say that the effect of an independent variable xi on the dependent variable y has been adjusted for the other explanatory variables in the model.
Adjusted estimates are less affected by confounding between the factors.
In the birth weight example, after adjusting for sex, gestational age and parity, the effect of maternal age on birth weight is to change birth weight by b1 grams for every additional one year of maternal age.
In other words, if we take two newborns who have the same sex, similar gestational age and equal parity, but one mother is one year older than the other, then the newborn of the older mother will have a birth weight that is b1 grams greater (for a positive b1) or lower (for a negative b1).
Interpreting the Correlation
Coefficient R
Customarily, the degree to which two or more
predictors (independent or X variables) are related
to the dependent (Y) variable is expressed in the
correlation coefficient R, which is the square root of R-
square.
In multiple regression, R takes values between 0 and 1 (never negative), because no meaning can be given to the direction of correlation in the multivariate case.
The larger R is, the more closely the set of predictor variables is correlated with the outcome variable.
When R = 1, the variables are perfectly correlated in the sense that the outcome variable can be predicted exactly from the predictor variables.
Types of multiple regression

There are three types of multiple regression,


each of which is designed to answer a
different question:
Standard multiple regression is used to
evaluate the relationships between a set of
independent variables and a dependent
variable.
Stepwise, or statistical, regression is
used to identify the subset of independent
variables that has the strongest relationship
to a dependent variable.
Standard multiple
regression
In standard multiple regression, all of the
independent variables are entered into the
regression equation at the same time
Multiple R and R² measure the strength of the
relationship between the set of independent
variables and the dependent variable.
An F test is used to determine if the
relationship can be generalized to the
population represented by the sample.
A t-test is used to evaluate the individual
relationship between each independent
variable and the dependent variable.
Stepwise multiple
regression
Stepwise regression is designed to find the
most parsimonious set of predictors that are
most effective in predicting the dependent
variable.
Variables are added to the regression
equation one at a time, using the statistical
criterion of maximizing the R² of the
included variables.
When none of the possible additions makes a statistically significant improvement in R², the analysis stops.
Choice of the Number of Variables

Multiple regression is a seductive technique: "plug in" as


many predictor variables as you can think of and usually at
least a few of them will come out significant.
This is because one is capitalizing on chance when simply
including as many variables as one can think of as predictors
of some other variable of interest. This problem is
compounded when, in addition, the number of observations
is relatively low.
Intuitively, it is clear that one can hardly draw conclusions
from an analysis of 100 questionnaire items based on 10
respondents.
Most authors recommend that one should have at least 10 to 20
times as many observations (cases, respondents) as one has
variables, otherwise the estimates of the regression line will
probably be unstable.
Sometimes we know in advance which variables we wish to
include in a multiple regression model.
Here it is straightforward to fit a regression model containing all
of those variables. Variables that are not significant can be omitted
and the analysis redone.
There is no hard rule about this, however. Sometimes it is
desirable to keep a variable in a model because past experience
shows that it is important.
 In large samples the omission of non-significant variables will
have little effect on the other regression coefficients.
The statistical significance of each variable in the
multiple regression model is obtained simply by calculating
the ratio of the regression coefficient to its standard error
and relating this value to the t distribution with n-k-1
degrees of freedom, where n is the sample size and k is
the number of variables in the model.
Usually it makes sense to omit variables that do not contribute much to the model (i.e., those not significant at the P < .05 level).
t-tests
t-tests are used to assess the significance of individual b coefficients, specifically testing the null hypothesis that the regression coefficient is zero.
A common rule of thumb is to drop from the equation all variables not significant at the 0.05 level or better.

t = bi / SE(bi), which follows a t distribution with (n − k − 1) degrees of freedom,

where n = sample size and k = the number of variables in the model.
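A minimal sketch of this calculation with made-up numbers (the coefficient, standard error and sample size below are hypothetical):

```python
# t-test of H0: b_i = 0, using t = b_i / SE(b_i) on n - k - 1 degrees of freedom.
from scipy import stats

b_i = 4.6       # hypothetical regression coefficient
se_b_i = 1.8    # hypothetical standard error
n, k = 120, 4   # sample size and number of variables in the model

t_stat = b_i / se_b_i
dof = n - k - 1
p_value = 2 * stats.t.sf(abs(t_stat), dof)   # two-sided p-value
print(f"t = {t_stat:.2f}, df = {dof}, p = {p_value:.4f}")
```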
Stepwise regression
Stepwise regression is a technique for choosing predictor variables
from a large set.
The stepwise approach can be used with multiple linear, logistic
and Cox regressions. There are two basic strategies of applying this
technique known as forward and backward stepwise regression.
Forward stepwise regression :- The first step in many analyses of
multivariate data is to examine the simple relation between each
potential explanatory variable and the outcome variable of interest
ignoring all the other variables.
Forward stepwise regression analysis uses this analysis as its
starting point. Steps in applying this method are:
Forward stepwise regression:
Find the single variable that has the strongest
association with the dependent variable and
enter it into the model (i.e., the variable with
the smallest p-value).
Find the variable among those not in the model
that, when added to the model so far obtained,
explains the largest amount of the remaining
variability.
Repeat step (2) until the addition of an extra
variable is not statistically significant at some
chosen level such as P=.05.

N.B. You have to stop the process at some point, otherwise you will end up with all the variables in the model.
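A rough, illustrative sketch of forward selection along these lines, on simulated data; statistical packages implement more refined entry and removal criteria, so this only demonstrates the idea:

```python
# Forward stepwise selection (illustrative): at each step add the candidate
# variable giving the largest R^2, and stop when the newly added variable is
# not significant at the 5% level.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 150
X = pd.DataFrame(rng.normal(size=(n, 4)), columns=["x1", "x2", "x3", "x4"])
y = 1.0 + 2.0 * X["x1"] + 1.5 * X["x2"] + rng.normal(size=n)   # x3, x4 are pure noise

selected, remaining = [], list(X.columns)
while remaining:
    best_var, best_fit = None, None
    for var in remaining:                        # try each candidate in turn
        fit = sm.OLS(y, sm.add_constant(X[selected + [var]])).fit()
        if best_fit is None or fit.rsquared > best_fit.rsquared:
            best_var, best_fit = var, fit
    if best_fit.pvalues[best_var] >= 0.05:       # stop: no significant improvement
        break
    selected.append(best_var)
    remaining.remove(best_var)

print("Variables retained:", selected)           # typically ['x1', 'x2'] here
```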
Backward stepwise
regression :
As its name indicates, with the backward stepwise
method we approach the problem from the other
direction.
The argument given is that we have collected data on
these variables because we believe them to be
potentially important explanatory variables.
Therefore, we should fit the full model, including all
of these variables, and then remove unimportant
variables one at a time until all those remaining in
the model contribute significantly.
We use the same criterion, say P < .05, to determine significance. At each step we remove the variable with the smallest contribution to the model (i.e., the one with the largest P-value), as long as that P-value is greater than the chosen cut-off.
What do you do when you have a lot of independent variables (say, 30 or more)?

(Hint: start with the classical bivariate analysis)
Multi-collinearity
This is a common problem in many correlation
analyses. Imagine that you have two predictors (X
variables) of a person's height:
(1) weight in pounds and (2) weight in ounces.
Obviously, our two predictors are completely
redundant; weight is one and the same variable,
regardless of whether it is measured in pounds or
ounces.
Trying to decide which one of the two measures is a better predictor of height would be rather silly; however, this is exactly what one would try to do if one were to perform a multiple regression analysis with height as the dependent (Y) variable and the two weight measures as the independent (X) variables.
Multi-collinearity
Multi-collinearity is the main research
concern as it inflates standard errors and
makes assessment of the relative importance
of the independents unreliable.

However, if sheer prediction is the research purpose (as opposed to causal analysis), it may be noted that high multicollinearity of the independents does not affect the efficiency of the regression estimates.
Ways to detect
Multicollinearity
Scatter plots of the predictor variables.
Compute the correlation matrix for the predictor variables – the higher the correlations, the worse the problem.
Examine the variance inflation factors (or
VIFs) that are automatically computed by
many computer packages that perform
regression – values of the VIF that are
larger than 10 usually signal substantial
amounts of collinearity.
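A sketch of these checks with simulated, deliberately collinear predictors, using the correlation matrix and statsmodels' variance_inflation_factor (the names and data are made up):

```python
# Detecting multicollinearity: correlation matrix and variance inflation factors.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
X = pd.DataFrame({
    "x1": x1,
    "x2": 0.95 * x1 + rng.normal(scale=0.1, size=n),   # nearly collinear with x1
    "x3": rng.normal(size=n),
})

print(X.corr().round(2))                 # high pairwise correlations flag trouble

exog = sm.add_constant(X)                # VIFs are computed from the design matrix
for i, name in enumerate(exog.columns):
    if name != "const":
        vif = variance_inflation_factor(exog.values, i)
        print(f"VIF({name}) = {vif:.1f}")   # values above ~10 signal substantial collinearity
```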
What can be done to handle
multicollinearity?
Increasing the sample size is a common
first step since when sample size is
increased, standard error decreases (all
other things equal).
Use centering: transform the offending independents by subtracting the mean from each case. The resulting centered data may well display considerably lower multicollinearity (illustrated in the sketch after this list).
Combine variables into a composite variable.
Remove the most inter-correlated variable(s) from the analysis.
Drop the inter-correlated variables from the analysis but substitute their cross-product.
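A small illustration of the centering remedy mentioned above, with made-up data. Note that centering two distinct predictors leaves their mutual correlation unchanged; where it helps most is collinearity between a variable and terms built from it (a square or an interaction):

```python
# Centering: subtract the mean before forming higher-order terms.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(10, 20, 200)      # a predictor whose values are far from zero
x_sq = x ** 2                     # quadratic term built from the raw variable

xc = x - x.mean()                 # centered predictor
xc_sq = xc ** 2                   # quadratic term built from the centered variable

print(np.corrcoef(x, x_sq)[0, 1])     # close to 1: raw variable and its square are collinear
print(np.corrcoef(xc, xc_sq)[0, 1])   # near 0 after centering
```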
Importance of the residual
analysis

Even though most assumptions of multiple


regression cannot be tested explicitly, gross
violations can be detected and should be
dealt with appropriately.
In particular, outliers (i.e., extreme cases)
can seriously bias the results by "pulling" or
"pushing" the regression line in a particular
direction thereby leading to biased
regression coefficients.
Outliers are extreme cases.
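As a hedged sketch (again reusing the `model` fitted in the earlier simulated birth weight example), studentized residuals and Cook's distance are common ways of flagging extreme or influential cases:

```python
# Outlier / influence checks for the (hypothetical) fitted model above.
import numpy as np

influence = model.get_influence()
student_resid = influence.resid_studentized_external   # externally studentized residuals
cooks_d = influence.cooks_distance[0]                   # Cook's distance per observation

# Flag observations with |studentized residual| > 3 as potential outliers.
outliers = np.where(np.abs(student_resid) > 3)[0]
print("Potential outliers (row indices):", outliers)
print("Largest Cook's distance:", float(cooks_d.max()))
```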
How Good is the Model?

 One of the measures of how well the model


explains the data is the R2 value.
Differences between observations that are
not explained by the model remain in the
error term.
 The R2 value tells you what percent of
those differences is explained by the model.
An R² of 0.68 means that 68% of the variance in the observed values of the dependent variable is explained by the model, and 32% of that variance remains unexplained in the error term.
Coefficient of determination
Explained variation + unexplained variation = Total variation
The ratio of the explained variation to the total variation measures how well the linear regression line fits the given pairs of scores. It is called the coefficient of determination, and is denoted by R².

R² = explained variation / total variation

The explained variation is never negative and is never larger than the total variation. Therefore, R² is always between 0 and 1. If the explained variation equals 0, R² = 0.
Table: Birth weight in relation to some attributes, multiple regression analysis.

Characteristic              β-coefficient    P-value
Maternal age                     4.6          <0.05
Gestational age                 92.0          <0.001
Period (years)
  1976-1979 (reference)           -              -
  1980-1989                     15.4          <0.1
  1990-1996                    -81.4          <0.001
Sex of the baby
  Males (reference)               -              -
  Females                      -88.9          <0.001
Exercise
Data on FEV1 (forced expiratory volume in one second) (Y)
and height (X) of 20 male medical students are given
below:

Height (cm)   FEV1 (litres)
164.0 3.54
167.0 3.54
170.4 3.19
171.2 2.85
171.2 3.42
171.3 3.20
172.0 3.60
172.0 3.78
174.0 4.32
176.0 3.75
177.0 3.09
177.0 4.05
177.0 5.43
177.4 3.60
178.0 2.98
180.7 4.80
181.0 3.96
183.1 4.78
183.6 4.56
183.7 4.68
Questions
Find the regression of Y on X.
What is the expected FEV1 for a male student
whose height is 175 cm ?
What is the expected FEV1 for a female student
whose height is 166 cm ?
What is the expected FEV1 for a male student
whose height is 270 cm?
Determine the Karl Pearson’s linear correlation
coefficient.
Compute the coefficient of determination and
give an explanation for it.
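One possible way to attempt questions 1, 2, 5 and 6 in software (a sketch, not the intended hand calculation):

```python
# Simple linear regression of FEV1 (Y) on height (X) for the 20 students above.
import numpy as np

height = np.array([164.0, 167.0, 170.4, 171.2, 171.2, 171.3, 172.0, 172.0,
                   174.0, 176.0, 177.0, 177.0, 177.0, 177.4, 178.0, 180.7,
                   181.0, 183.1, 183.6, 183.7])
fev1 = np.array([3.54, 3.54, 3.19, 2.85, 3.42, 3.20, 3.60, 3.78, 4.32, 3.75,
                 3.09, 4.05, 5.43, 3.60, 2.98, 4.80, 3.96, 4.78, 4.56, 4.68])

slope, intercept = np.polyfit(height, fev1, 1)       # least-squares line
print(f"FEV1 = {intercept:.2f} + {slope:.3f} * height")

print("Predicted FEV1 at height 175 cm:", round(intercept + slope * 175, 2))

r = np.corrcoef(height, fev1)[0, 1]                  # Pearson correlation
print(f"r = {r:.3f},  R^2 = {r ** 2:.3f}")
# Caution: the line was estimated from male students 164-184 cm tall, so using it
# for a female student or for a height of 270 cm extrapolates beyond the data.
```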
Example 2

The following data were taken from a survey of women attending an antenatal clinic. The objectives of the study were to identify the factors responsible for low birth weight and to predict women 'at risk' of having a low birth weight baby.
Notations:
BW      = Birth weight (kg) of the child  = X1
HEIGHT  = Height of mother (cm)           = X2
AGEMOTH = Age of mother (years)           = X3
AGEFATH = Age of father (years)           = X4
FAMINC  = Monthly family income (Birr)    = X5
GESTAT  = Period of gestation (days)      = X6
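A hedged sketch of how such a model might be fitted once the survey data are in a pandas DataFrame with the column names listed above; the values generated below are purely illustrative stand-ins, not the study's data:

```python
# Fitting BW on the listed predictors with statsmodels (stand-in data only).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 120
df = pd.DataFrame({
    "HEIGHT":  rng.uniform(145, 175, n),    # cm
    "AGEMOTH": rng.uniform(16, 40, n),      # years
    "AGEFATH": rng.uniform(18, 50, n),      # years
    "FAMINC":  rng.uniform(200, 3000, n),   # Birr per month
    "GESTAT":  rng.uniform(230, 290, n),    # days
})
# Stand-in birth weights (kg); the real survey data would be read in instead.
df["BW"] = (-3.0 + 0.02 * df["HEIGHT"] + 0.012 * df["GESTAT"]
            + rng.normal(0, 0.4, n))

model = smf.ols("BW ~ HEIGHT + AGEMOTH + AGEFATH + FAMINC + GESTAT", data=df).fit()
print(model.summary())
```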
