Correlation With Example

Regression

Simple linear regression


• is a statistical method that allows us to
summarize and study relationships between
two continuous (quantitative) variables:
• One variable, denoted x, is regarded as the
predictor, explanatory, or independent
variable.
• The other variable, denoted y, is regarded as
the response, outcome, or dependent
variable.
Linear Regression Basics

• A linear regression is a statistical model that attempts to show the relationship between two variables with a linear equation.
• A regression analysis involves graphing a line over a
set of data points that most closely fits the overall
shape of the data.
• A regression shows the extent to which changes in a
"dependent variable," which is put on the y-axis,
can be attributed to changes in an "explanatory
variable," which is placed on the x-axis.
Evaluating Trends and Sales Estimates

• Linear regressions can be used in business to evaluate trends and make estimates or forecasts. For example, if a company's sales have increased steadily every month for the past few years, conducting a linear analysis on the sales data with monthly sales on the y-axis and time on the x-axis would produce a line that depicts the upward trend in sales. After creating the trend line, the company could use the slope of the line to forecast sales in future months.
Analyzing the Impact of Price Changes

• Linear regression can also be used to analyze the effect of pricing on consumer behavior. For instance, if a company changes the price on a certain product several times, it can record the quantity it sells for each price level and then perform a linear regression with quantity sold as the dependent variable and price as the explanatory variable. The result would be a line that depicts the extent to which consumers reduce their consumption of the product as prices increase, which could help guide future pricing decisions.
Assessing Risk

• Linear regression can be used to analyze risk. For example, a health insurance company might conduct a linear regression plotting number of claims per customer against age and discover that older customers tend to make more health insurance claims. The results of such an analysis might guide important business decisions made to account for risk.
Example
• A researcher believes that there is a linear relationship between the BMI (Kg/m2) of pregnant mothers and the birth-weight (BW in Kg) of their newborns.

• The following data set provides information on 15 pregnant mothers who were contacted for this study.
BMI (Kg/m2) Birth-weight (Kg)

20 2.7
30 2.9
50 3.4
45 3.0
10 2.2
30 3.1
40 3.3
25 2.3
50 3.5
20 2.5
10 1.5
55 3.8
60 3.7
50 3.1
35 2.8
Scatter Diagram
• A scatter diagram is a graphical method to display the relationship between two variables

• A scatter diagram plots pairs of bivariate observations (x, y) on the X-Y plane

• Y is called the dependent variable

• X is called an independent variable


Scatter diagram of BMI and Birthweight
[Scatter plot: BMI on the x-axis (0–70) vs. birth-weight in Kg on the y-axis (0–4); the points rise in a roughly linear pattern]
Is there a linear relationship
between BMI and BW?
• Scatter diagrams are important for initial
exploration of the relationship between two
quantitative variables

• In the above example, we may wish to summarize this relationship by a straight line drawn through the scatter of points
Simple Linear Regression
• Although we could fit a line "by eye" e.g. using a
transparent ruler, this would be a subjective
approach and therefore unsatisfactory.
• An objective, and therefore better, way of
determining the position of a straight line is to use
the method of least squares.
• Using this method, we choose a line such that the
sum of squares of vertical distances of all points
from the line is minimized.
Least-squares or regression line
• These vertical distances, i.e., the distance
between y values and their corresponding
estimated values on the line are called
residuals
• The line which fits the best is called the
regression line or, sometimes, the least-
squares line
• The line always passes through the point
defined by the mean of Y and the mean of X
Linear Regression Model

• The method of least-squares is available in most of the statistical packages (and also on some calculators) and is usually referred to as linear regression

• Y is also known as an outcome variable

• X is also called a predictor


Estimated Regression Line

yˆ = ˆ + ˆ x = 1.775351 + 0.0330187 x

ˆ . 1.775351  is.called . y  int ercept

ˆ 0.0330187  is.called .the.slope


Application of Regression Line

This equation allows you to estimate the BW of other newborns when the BMI is given.
e.g., for a mother who has BMI = 40, i.e. X = 40, we predict BW to be

ŷ = α̂ + β̂x = 1.775351 + 0.0330187(40) ≈ 3.096


Correlation Coefficient, R
• R is a measure of strength of the linear
association between two variables, x and y.

• Most statistical packages and some hand calculators can calculate R

• For the data in our Example R=0.94



• R has some unique characteristics
Correlation Coefficient, R
• R takes values between -1 and +1

• R=0 represents no linear relationship


between the two variables

• R>0 implies a direct linear relationship


• R<0 implies an inverse linear relationship
• The closer R comes to either +1 or -1, the
stronger is the linear relationship
Positive relationship
[Scatter plot: Age in Weeks on the x-axis (0–90) vs. Height in CM on the y-axis; the points rise with age]
Negative relationship

[Scatter plot: Age of Car on the x-axis vs. Reliability on the y-axis; the points fall as age increases]

No relation

[Scatter plot with no discernible pattern between the two variables]
Correlation Coefficient

Statistic showing the degree of relation between two variables
Simple Correlation coefficient (r)

 It is also called Pearson's correlation or product moment correlation coefficient.
 It measures the nature and strength of the relationship between two variables of the quantitative type.
The sign of r denotes the nature of association, while the value of r denotes the strength of association.

 If the sign is +ve this means the relation is direct (an increase in one variable is associated with an increase in the other variable, and a decrease in one variable is associated with a decrease in the other variable).

 While if the sign is -ve this means an inverse or indirect relationship (which means an increase in one variable is associated with a decrease in the other).

 The value of r ranges between (-1) and (+1).

 The value of r denotes the strength of the association as illustrated by the following diagram.

-1         -0.75          -0.25        0        0.25          0.75          +1
perfect    strong    intermediate   weak  |  weak   intermediate    strong    perfect
       indirect correlation       no relation       direct correlation
If r = 0 this means no association or correlation between the two variables.

If 0 < |r| < 0.25: weak correlation.

If 0.25 ≤ |r| < 0.75: intermediate correlation.

If 0.75 ≤ |r| < 1: strong correlation.

If |r| = 1: perfect correlation.
How to compute the simple correlation
coefficient (r)

 xy   x y
r n
 ( x) 2
 (  y) 2

x 
2 .  y 
2 
 n  n 
  
Example:

A sample of 6 children was selected; data about their age in years and weight in kilograms were recorded as shown in the following table. It is required to find the correlation between age and weight.

Serial No   Age (years)   Weight (Kg)
1           7             12
2           6             8
3           8             12
4           5             10
5           6             11
6           9             13
These two variables are of the quantitative type: one variable (age) is called the independent variable and denoted (X), and the other (weight) is called the dependent variable and denoted (Y). To find the relation between age and weight, compute the simple correlation coefficient using the following formula:

 xy   x y
r  n
 ( x) 2  ( y)2 
x 
2 .  y 
2 
 n  n 
  
Serial   Age (x)   Weight (y)   xy    x²    y²
1        7         12           84    49    144
2        6         8            48    36    64
3        8         12           96    64    144
4        5         10           50    25    100
5        6         11           66    36    121
6        9         13           117   81    169
Total    Σx = 41   Σy = 66      Σxy = 461   Σx² = 291   Σy² = 742
41 66
461 
r 6
 (41) 2   (66) 2 
 291   . 742  
 6  6 

r = 0.759
strong direct correlation
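A short Python sketch (an added illustration, not part of the slides) that verifies this worked example with the computational formula:

```python
# Sketch: Pearson's r for the age/weight example via the computational formula.
import math

x = [7, 6, 8, 5, 6, 9]       # age (years)
y = [12, 8, 12, 10, 11, 13]  # weight (Kg)
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(a * a for a in x)
syy = sum(b * b for b in y)

r = (sxy - sx * sy / n) / math.sqrt((sxx - sx ** 2 / n) * (syy - sy ** 2 / n))
print(round(r, 2))  # 0.76 -> strong direct correlation
```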
EXAMPLE: Relationship between Anxiety and Test Scores

Anxiety (X)   Test score (Y)   X²    Y²    XY
10            2                100   4     20
8             3                64    9     24
2             9                4     81    18
1             7                1     49    7
5             6                25    36    30
6             5                36    25    30
ΣX = 32       ΣY = 32          ΣX² = 230   ΣY² = 204   ΣXY = 129
Calculating Correlation Coefficient

r = [(6)(129) − (32)(32)] / √{[6(230) − 32²] · [6(204) − 32²]}
  = (774 − 1024) / √[(356)(200)]
  = −0.94

Indirect strong correlation
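The same kind of check can be done in one call with NumPy's correlation matrix; this snippet is an added illustration, not part of the slides:

```python
# Sketch: Pearson's r for the anxiety/test-score example via NumPy.
import numpy as np

anxiety = [10, 8, 2, 1, 5, 6]
score = [2, 3, 9, 7, 6, 5]

# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is r.
r = np.corrcoef(anxiety, score)[0, 1]
print(round(r, 2))  # -0.94 -> indirect strong correlation
```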


Spearman Rank Correlation Coefficient (rs)

It is a non-parametric measure of correlation. This procedure makes use of the two sets of ranks that may be assigned to the sample values of X and Y.
Spearman Rank correlation coefficient could be
computed in the following cases:
Both variables are quantitative.
Both variables are qualitative ordinal.
One variable is quantitative and the other is
qualitative ordinal.
Procedure:
1. Rank the values of X from 1 to n where n
is the numbers of pairs of values of X and
Y in the sample.
2. Rank the values of Y from 1 to n.
3. Compute the value of dᵢ for each pair of observations by subtracting the rank of Yᵢ from the rank of Xᵢ.
4. Square each dᵢ and compute Σdᵢ², the sum of the squared values.
5. Apply the following formula

rs = 1 − [6 Σdᵢ²] / [n(n² − 1)]

The value of rs denotes the magnitude and nature of association, giving the same interpretation as simple r.
Example
In a study of the relationship between level of education and income, the following data was obtained. Find the relationship between them and comment.

Sample   Level of education (X)   Income (Y)
A        Preparatory              25
B        Primary                  10
C        University               8
D        Secondary                10
E        Secondary                15
F        Illiterate               50
G        University               60
Answer:

Sample   X             Y    Rank (X)   Rank (Y)   di     di²
A        Preparatory   25   5          3          2      4
B        Primary       10   6          5.5        0.5    0.25
C        University    8    1.5        7          -5.5   30.25
D        Secondary     10   3.5        5.5        -2     4
E        Secondary     15   3.5        4          -0.5   0.25
F        Illiterate    50   7          2          5      25
G        University    60   1.5        1          0.5    0.25

Σdi² = 64
6 64
rs 1   0.1
7(48)

Comment:
There is an indirect weak correlation
between level of education and income.
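Below is an added Python sketch of the whole Spearman procedure for this example. The ordinal coding of education (illiterate = 1 up to university = 5) is an assumption chosen to be consistent with the ranks in the slide's table.

```python
# Sketch: Spearman's rank correlation for the education/income example.
from scipy.stats import rankdata

educ = [3, 2, 5, 4, 4, 1, 5]          # A..G, coded illiterate=1 .. university=5
income = [25, 10, 8, 10, 15, 50, 60]
n = len(educ)

# Rank from highest to lowest (rank 1 = highest), averaging ties,
# to match the slide's table.
rank_x = rankdata([-e for e in educ])
rank_y = rankdata([-i for i in income])

d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))  # sum of squared d_i
rs = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(d2, round(rs, 2))  # 64.0 -0.14 -> indirect weak correlation
```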
Regression Analyses
Regression: a technique concerned with predicting some variables by knowing others

The process of predicting variable Y using variable X
Regression
 Uses a variable (x) to predict some outcome
variable (y)
 Tells you how values in y change as a function
of changes in values of x
Correlation and Regression

 Correlation describes the strength of a linear relationship between two variables
 Linear means “straight line”
 Regression tells us how to draw the straight line described by the correlation
Regression
 Calculates the “best-fit” line for a certain set of data
The regression line makes the sum of the squares of
the residuals smaller than for any other line
Regression minimizes residuals

[Scatter plot: Wt (kg) on the x-axis (60–120) vs. SBP (mmHg) on the y-axis (80–220), with the fitted line; the residuals are the vertical distances from the points to the line]
By using the least squares method (a procedure
that minimizes the vertical deviations of plotted
points surrounding a straight line) we are
able to construct a best fitting straight line to the
scatter diagram points and then formulate a
regression equation in the form of:
ŷ a  bX
ŷ y  b(x  x)
Regression Equation
 Regression equation describes the regression line mathematically
   – Intercept
   – Slope

[Scatter plot: Wt (kg) on the x-axis (60–120) vs. SBP (mmHg) on the y-axis (80–220) with the fitted regression line]
Linear Equations
Ŷ = a + bX

b = slope = (change in Y) / (change in X)
a = Y-intercept

[Diagram: a straight line in the X-Y plane, crossing the Y-axis at a, with slope b]
Hours studying and grades
Regressing grades on hours


Linear Regression: Final grade in course = 59.95 + 3.17 × study hours (R-Square = 0.88)

[Scatter plot: number of hours spent studying on the x-axis (2.00–10.00) vs. final grade on the y-axis (70.00–90.00), with the fitted line]

Predicted final grade in class = 59.95 + 3.17 × (number of hours you study per week)

Predict the final grade of…

• Someone who studies for 12 hours


• Final grade = 59.95 + (3.17*12)
• Final grade = 97.99

• Someone who studies for 1 hour:


• Final grade = 59.95 + (3.17*1)
• Final grade = 63.12
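These predictions are easy to script; the following added sketch just wraps the fitted equation in a function:

```python
# Sketch: predictions from the fitted equation grade = 59.95 + 3.17 * hours.
def predicted_grade(hours: float) -> float:
    return 59.95 + 3.17 * hours

print(round(predicted_grade(12), 2))  # 97.99
print(round(predicted_grade(1), 2))   # 63.12
```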
Exercise
A sample of 6 persons was selected; the values of their age (x variable) and weight are demonstrated in the following table. Find the regression equation and the predicted weight when age is 8.5 years.
Serial no.   Age (x)   Weight (y)
1            7         12
2            6         8
3            8         12
4            5         10
5            6         11
6            9         13
Answer

Serial no.   Age (x)   Weight (y)   xy    x²    y²
1            7         12           84    49    144
2            6         8            48    36    64
3            8         12           96    64    144
4            5         10           50    25    100
5            6         11           66    36    121
6            9         13           117   81    169
Total        41        66           461   291   742


x̄ = 41/6 = 6.83        ȳ = 66/6 = 11

b = [461 − (41 × 66)/6] / [291 − 41²/6] = 0.92

Regression equation

ŷ(x) = 11 + 0.92(x − 6.83)
ŷ(x) = 4.675 + 0.92x

ŷ(8.5) = 4.675 + 0.92 × 8.5 = 12.50 Kg
ŷ(7.5) = 4.675 + 0.92 × 7.5 = 11.58 Kg


[Plot: Age in years on the x-axis (7–9) vs. predicted weight in Kg on the y-axis (11.4–12.6), showing the fitted line through the predicted points]

We create a regression line by plotting two estimated values for ŷ against their X components, then extending the line right and left.
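An added Python sketch of this exercise using the textbook formulas; the intercept printed (about 4.69) differs slightly from the slide's 4.675 because of rounding in the slide's intermediate steps.

```python
# Sketch: regression of weight on age with the computational formulas.
x = [7, 6, 8, 5, 6, 9]       # age (years)
y = [12, 8, 12, 10, 11, 13]  # weight (Kg)
n = len(x)

sx, sy = sum(x), sum(y)
b = (sum(a * c for a, c in zip(x, y)) - sx * sy / n) / \
    (sum(a * a for a in x) - sx ** 2 / n)
a = sy / n - b * sx / n      # the line passes through (x-bar, y-bar)

print(f"y-hat = {a:.3f} + {b:.3f} x")
print(f"Predicted weight at age 8.5: {a + b * 8.5:.2f} Kg")
```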
Exercise 2
The following are the age (in years) and systolic blood pressure of 20 apparently healthy adults.

Age (x)   B.P (y)     Age (x)   B.P (y)
20        120         46        128
43        128         53        136
63        141         60        146
26        126         20        124
53        134         63        143
31        128         43        130
58        136         26        124
46        132         19        121
58        140         31        126
70        144         23        123
Find the correlation between age
and blood pressure using simple
and Spearman's correlation
coefficients, and comment.
Find the regression equation.
What is the predicted blood pressure for a man aged 25 years?
Serial   x     y     xy       x²
1        20    120   2400     400
2        43    128   5504     1849
3        63    141   8883     3969
4        26    126   3276     676
5        53    134   7102     2809
6        31    128   3968     961
7        58    136   7888     3364
8        46    132   6072     2116
9        58    140   8120     3364
10       70    144   10080    4900
11       46    128   5888     2116
12       53    136   7208     2809
13       60    146   8760     3600
14       20    124   2480     400
15       63    143   9009     3969
16       43    130   5590     1849
17       26    124   3224     676
18       19    121   2299     361
19       31    126   3906     961
20       23    123   2829     529
Total    852   2630  114486   41678
 x y
 xy  n 114486 
852 2630
b1  = 20 0.4547
(  x) 2
852 2

x  n
2 41678 
20

ŷ =112.13 + 0.4547 x

for age 25
B.P = 112.13 + 0.4547 * 25=123.49 = 123.5 mm hg
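An added end-to-end sketch for Exercise 2 with NumPy/SciPy, computing both correlation coefficients and the regression line directly from the raw data:

```python
# Sketch: Exercise 2 -- correlations and regression of B.P on age.
import numpy as np
from scipy.stats import pearsonr, spearmanr

age = [20, 43, 63, 26, 53, 31, 58, 46, 58, 70,
       46, 53, 60, 20, 63, 43, 26, 19, 31, 23]
bp = [120, 128, 141, 126, 134, 128, 136, 132, 140, 144,
      128, 136, 146, 124, 143, 130, 124, 121, 126, 123]

print(round(pearsonr(age, bp)[0], 2))   # simple (Pearson) correlation
print(round(spearmanr(age, bp)[0], 2))  # Spearman rank correlation
b, a = np.polyfit(age, bp, 1)           # slope, intercept
print(f"y-hat = {a:.2f} + {b:.4f} x")   # about 112.13 + 0.4547 x
print(f"Predicted B.P at age 25: {a + b * 25:.1f} mm Hg")  # about 123.5
```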
Multiple Regression

Multiple regression analysis is a straightforward extension of simple regression analysis which allows more than one independent variable.
Multiple Regression Analysis (MRA)
• Method for studying the relationship between
a dependent variable and two or more
independent variables.
• Purposes:
– Prediction
– Explanation
– Theory building
Design Requirements

• One dependent variable (criterion)
• Two or more independent variables (predictor variables)
• Sample size: >= 50 (at least 10 times as many cases as independent variables)
Assumptions
• Independence: the scores of any particular subject are
independent of the scores of all other subjects
• Normality: in the population, the scores on the dependent
variable are normally distributed for each of the possible
combinations of the level of the X variables; each of the
variables is normally distributed
• Homoscedasticity: in the population, the variances of the
dependent variable for each of the possible combinations of the
levels of the X variables are equal.
• Linearity: In the population, the relation between the
dependent variable and the independent variable is linear when
all the other independent variables are held constant.
Simple vs. Multiple Regression

Simple regression:
• One dependent variable Y predicted from one independent variable X
• One regression coefficient
• r²: proportion of variation in dependent variable Y predictable from X

Multiple regression:
• One dependent variable Y predicted from a set of independent variables (X1, X2 …. Xk)
• One regression coefficient for each independent variable
• R²: proportion of variation in dependent variable Y predictable by set of independent variables (X’s)
Example: Self Concept and Academic
Achievement (N=103)
Example: The Model

• Y’ = a + b1X1 + b2X2 + … bkXk
• The b’s are called partial regression coefficients
• Our example, predicting AA:
  – Y’ = 36.83 + (3.52)XASC + (-.44)XGSC
• Predicted AA for a person with GSC of 4 and ASC of 6:
  – Y’ = 36.83 + (3.52)(6) + (-.44)(4) = 56.23
Multiple Correlation Coefficient (R) and
Coefficient of Multiple Determination (R2)

• R = the magnitude of the relationship between the dependent variable and the best linear combination of the predictor variables
• R² = the proportion of variation in Y accounted for by the set of independent variables (X’s)
Multiple regression
• Typically, we want to use more than a single
predictor (independent variable) to make
predictions

• Regression with more than one predictor is called “multiple regression”

yᵢ = β₀ + β₁x₁ᵢ + β₂x₂ᵢ + … + βₚxₚᵢ + εᵢ
Motivating example: Gender discrimination in
wages
• In the 1970s, Harris Trust and Savings Bank was sued for discrimination on the basis of sex.
• Analysis of salaries of employees of one type
(skilled, entry-level clerical) presented as
evidence by the defense.
• Did female employees tend to receive lower
starting salaries than similarly qualified and
experienced male employees?
Variables collected
• 93 employees on data file (61 female, 32 male).
– bsal: Annual salary at time of hire.
– sal77: Annual salary in 1977.
– educ: years of education.
– exper: months previous work prior to hire at bank.
– fsex: 1 if female, 0 if male
– senior: months worked at bank since hired
– age: months

• So we have six x’s and one y (bsal). However, in what follows we won’t use sal77.
Comparison for males and females

[Oneway analysis of bsal by fsex: plot of bsal (4000–8000) for Female vs. Male]

• This shows men started at higher salaries than women (t = 6.3, p < .0001).
• But it doesn’t control for other characteristics.
Relationships of bsal with other variables

• Senior and education predict bsal well. We want to control for them when judging the gender effect.

[Bivariate fits of bsal by senior, by age, by educ, and by exper: four scatter plots of bsal (4000–8000), each with a linear fit line]
Multiple regression model
• For any combination of values of the predictor
variables, the average value of the response
(bsal) lies on a straight line:
bsalᵢ = β₀ + β₁fsexᵢ + β₂seniorᵢ + β₃ageᵢ + β₄educᵢ + β₅experᵢ + εᵢ

• Just like in simple regression, assume that ε follows a normal curve within any combination of predictors.
Output from regression
(fsex = 1 for females, = 0 for males)
[Actual-by-predicted plot of bsal: actual (4000–8000) vs. predicted; P < .0001, RSq = 0.52, RMSE = 508.09]

Summary of Fit
RSquare                        0.515156
RSquare Adj                    0.487291
Root Mean Square Error         508.0906
Mean of Response               5420.323
Observations (or Sum Wgts)     93

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model       5         23863715       4772743   18.4878     <.0001
Error      87         22459575        258156
C. Total   92         46323290

Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   6277.8934   652.2713      9.62      <.0001
fsex        -767.9127   128.97       -5.95      <.0001
senior       -22.5823     5.295732   -4.26      <.0001
age         0.6309603   0.720654      0.88      0.3837
educ        92.306023   24.86354      3.71      0.0004
exper       0.5006397   1.055262      0.47      0.6364

Effect Tests
Source   Nparm   DF   Sum of Squares    F Ratio   Prob > F
fsex         1    1        9152264.3    35.4525     <.0001
senior       1    1        4694256.3    18.1838     <.0001
age          1    1         197894.0     0.7666     0.3837
educ         1    1        3558085.8    13.7827     0.0004
exper        1    1          58104.8     0.2251     0.6364
Residual by Predicted Plot

[Residuals of bsal (-1000 to 1500) vs. predicted bsal (4000–8000); no obvious pattern]
Predictions
• Example: Prediction of beginning wages for a woman
with 10 months seniority, that is 25 years old, with 12
years of education, and two years of experience:
bsalᵢ = β₀ + β₁fsexᵢ + β₂seniorᵢ + β₃ageᵢ + β₄educᵢ + β₅experᵢ + εᵢ

• Pred. bsal = 6277.9 − 767.9(1) − 22.6(10) + 0.63(300) + 92.3(12) + 0.50(24)
             = 6592.6
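The same prediction as an added Python sketch, with the units spelled out (age and experience enter the model in months, so 25 years = 300 and 2 years = 24):

```python
# Sketch: prediction from the fitted multiple regression coefficients.
coef = {"intercept": 6277.9, "fsex": -767.9, "senior": -22.6,
        "age": 0.63, "educ": 92.3, "exper": 0.50}

x = {"fsex": 1, "senior": 10, "age": 300, "educ": 12, "exper": 24}
pred = coef["intercept"] + sum(coef[k] * v for k, v in x.items())
print(round(pred, 1))  # 6592.6
```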
Interpretation of coefficients in multiple
regression
• Each estimated coefficient is the amount Y is expected to increase when the value of its corresponding predictor is increased by one, holding constant the values of the other predictors.

• Example: the estimated coefficient of education equals 92.3. For each additional year of education of an employee, we expect salary to increase by about 92 dollars, holding all other variables constant.

• The estimated coefficient of fsex equals -767. For employees who started at the same time, had the same education and experience, and were the same age, women earned $767 less on average than men.
MLR - Examples

• The selling price of a house can depend on the desirability of the location, the number of bedrooms, the number of bathrooms, the year the house was built, the square footage of the lot, and a number of other factors.

• The height of a child can depend on the height of the mother, the height of the father, nutrition, and environmental factors.
Logistic Regression
Logistic Regression
• Models relationship between set of variables Xi
– dichotomous (yes/no, smoker/nonsmoker,…)
– categorical (social class, race, ... )
– continuous (age, weight, gestational age, ...)
and
– dichotomous categorical response variable Y
e.g. Success/Failure, Remission/No Remission
Survived/Died, CHD/No CHD, Low Birth Weight/Normal Birth
Weight, etc…
Logistic Regression
Example: Coronary Heart Disease (CD) and Age. In this study, sampled individuals were examined for signs of CD (present = 1 / absent = 0) and the potential relationship between this outcome and their age (yrs.) was considered.

[A portion of the raw data for the 100 subjects who participated in the study was shown here]
Logistic Regression
• How can we analyze these data?

Non-pooled t-test: the mean age of the individuals with some signs of coronary heart disease is 51.28 years vs. 39.18 years for individuals without signs (t = 5.95, p < .0001).
Logistic Regression
Simple Linear Regression? Smooth Regression Estimate?

Ê(CD | Age) = −.54 + .02 Age
e.g., for an individual 50 years of age:
Ê(CD | Age = 50) = −.54 + .02(50) = .46 ??

The smooth regression estimate is “S-shaped,” but what does the estimated mean value represent?
Answer: P(CD | Age)!!!!
Logistic Regression
We can group individuals into age classes and look at the
percentage/proportion showing signs of coronary heart
disease.
Age group     # in group   # Diseased   Proportion
1) 20 - 29    10            1           .100
2) 30 - 34    15            2           .133
3) 35 - 39    12            3           .250
4) 40 - 44    15            5           .333
5) 45 - 49    13            6           .462
6) 50 - 54     8            5           .625
7) 55 - 59    17           13           .765
8) 60 - 64    10            8           .800

Notice the “S-shape” to the estimated proportions vs. age.
Logistic Function

P(“Success” | X) = e^(β₀ + β₁X) / (1 + e^(β₀ + β₁X))

[Plot: the logistic curve, P(“Success” | X) on the y-axis from 0.0 to 1.0, vs. X; the curve is S-shaped]
Logit Transformation
The logistic regression model is given by

P(Y | X) = e^(β₀ + β₁X) / (1 + e^(β₀ + β₁X))

which is equivalent to

ln[ P(Y | X) / (1 − P(Y | X)) ] = β₀ + β₁X

This is called the Logit Transformation.
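An added sketch of the logistic function and its inverse, the logit; the coefficients used here are illustrative only, not fitted to the CD data:

```python
# Sketch: the logistic function and the logit transformation.
import numpy as np

def logistic(x, b0, b1):
    """P(Y | X) = exp(b0 + b1*x) / (1 + exp(b0 + b1*x))"""
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))

def logit(p):
    """ln(p / (1 - p)): maps probabilities in (0, 1) to the whole real line."""
    return np.log(p / (1 - p))

p = logistic(50, b0=-5.3, b1=0.11)  # illustrative coefficients, not fitted
print(round(float(p), 3))           # 0.55
print(round(float(logit(p)), 3))    # 0.2 -- recovers b0 + b1*x
```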
Dichotomous Predictor
Consider a dichotomous predictor (X) which represents the presence of risk (1 = present)

                    Risk Factor (X)
Disease (Y)    Present (X = 1)        Absent (X = 0)
Yes (Y = 1)    P(Y=1 | X=1)           P(Y=1 | X=0)
No (Y = 0)     1 − P(Y=1 | X=1)       1 − P(Y=1 | X=0)

Odds for Disease with Risk Present = P(Y=1 | X=1) / [1 − P(Y=1 | X=1)] = e^(β₀ + β₁)

Odds for Disease with Risk Absent = P(Y=1 | X=0) / [1 − P(Y=1 | X=0)] = e^(β₀)

Therefore the odds ratio (OR) = [Odds with Risk Present] / [Odds with Risk Absent] = e^(β₀ + β₁) / e^(β₀) = e^(β₁)
Dichotomous Predictor

• Therefore, for the odds ratio associated with risk presence we have OR = e^(β₁)
• Taking the natural logarithm we have ln(OR) = β₁; thus the estimated regression coefficient associated with a 0-1 coded dichotomous predictor is the natural log of the OR associated with risk presence!!!
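An added numeric illustration: with a 0/1 risk factor, the odds ratio can be read off a 2×2 table, and its natural log equals the logistic regression slope β₁. The counts below are made up purely for illustration.

```python
# Sketch: odds ratio from a (hypothetical) 2x2 table and its log.
import math

a, b = 30, 70   # risk present: diseased, not diseased
c, d = 10, 90   # risk absent:  diseased, not diseased

odds_ratio = (a / b) / (c / d)  # odds with risk present / odds with risk absent
beta1 = math.log(odds_ratio)    # equals the fitted slope for the 0-1 predictor
print(round(odds_ratio, 2), round(beta1, 2))  # 3.86 1.35
```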
Why use logistic regression?

 There are many important research topics for which the dependent variable is "limited."
 For example: whether or not a person smokes, or drinks, or skips class, or takes advanced mathematics. For these the outcome is not continuous or distributed normally.
 Example: Are mothers who have high school education less likely to have children with IEPs (individualized plans, indicating cognitive or emotional disabilities)?
 Binary logistic regression is a type of regression analysis where the dependent variable is a dummy variable: coded 0 (did not smoke) or 1 (did smoke)
Logistic Regression
• Logistic Regression is used when the outcome
variable is categorical
• The independent variables could be either
categorical or continuous
• The slope coefficient in the Logistic Regression
Model has a relationship with the OR
• Multiple Logistic Regression model can be used to
adjust for the effect of other variables when
assessing the association between E & D
variables
A Problem with Linear Regression (slides 3-6 from Kim Maier)

However, transforming the independent variables does not remedy all of the potential problems. What if we have a non-normally distributed dependent variable? The following example depicts the problem of fitting a regular regression line to a non-normal dependent variable.

Suppose you have a binary outcome variable. The problem of having a non-continuous dependent variable becomes apparent when you create a scatterplot of the relationship. Here, we see that it is very difficult to decipher a relationship among these variables.
A Problem with Linear Regression

We could severely simplify the plot by drawing a line between the means for the two dependent variable levels, but this is problematic in two ways: (a) the line seems to oversimplify the relationship and (b) it gives predictions that cannot be observable values of Y for extreme values of X.

The reason this doesn’t work is because the approach is analogous to fitting a linear model to the probability of the event. As you know, probabilities can only take values between 0 and 1. Hence, we need a different approach to ensure that our model is appropriate for the data.
A Problem with Linear Regression

The mean of a binomial variable coded as (1,0) is a proportion. We could plot conditional probabilities as Y for each level of X. Of course, we could fit a linear model to these conditional probabilities, but (as shown) the linear model does not predict the maximum likelihood estimates for each group (the mean, shown by the circles) and it still produces unobservable predictions for extreme values of the dependent variable.

This plot gives us a better picture of the relationship between X and Y. It is clear that the relationship is non-linear. In fact, the shape of the curve is sigmoid.
The Linear Probability Model
In the OLS regression:
Y = β0 + β1X + e ; where Y = (0, 1)
 The error terms are heteroskedastic
 e is not normally distributed because Y takes on only two values
 The predicted probabilities can be greater than 1 or less than 0
A Problem with Linear Regression

If you think about the shape of this distribution, you may posit that the function is a cumulative probability distribution. As stated previously, we can model the nonlinear relationship between X and Y by transforming one of the variables. Two common transformations that result in sigmoid functions are probit and logit transformations. In short, a probit transformation imposes a cumulative normal function on the data. But probit functions are difficult to work with because they require integration. Logit transformations, on the other hand, give nearly identical values as a probit function, but they are much easier to work with because the function can be simplified to a linear equation.
P(y | x) = e^(α + βx) / (1 + e^(α + βx))
The Logistic Regression Model
The "logit" model solves these problems:

ln[p/(1-p)] = β₀ + β₁X

 p is the probability that the event Y occurs, p(Y=1)  [range = 0 to 1]
 p/(1-p) is the "odds"  [range = 0 to ∞]
 ln[p/(1-p)] is the log odds, or "logit"  [range = -∞ to +∞]
Odds & Odds Ratios
Recall the definition of the odds:

odds = p / (1 − p)

The odds has a range of 0 to ∞, with values greater than 1 associated with an event being more likely to occur than to not occur and values less than 1 associated with an event that is less likely to occur than not occur.

The logit is defined as the log of the odds:

ln(odds) = ln[ p / (1 − p) ] = ln(p) − ln(1 − p)

This transformation is useful because it creates a variable with a range from −∞ to +∞. Hence, this transformation solves the problem we encountered in fitting a linear model to probabilities. Because probabilities (the dependent variable) only range from 0 to 1, we can get linear predictions that are outside of this range. If we transform our probabilities to logits, then we do not have this problem because the range of the logit is not restricted. In addition, the interpretation of logits is simple: take the exponential of the logit and you have the odds for the two groups in question.
Guidelines for Choosing Between Linear and Nonlinear Regression

• The general guideline is to use linear regression first to determine whether it can fit the particular type of curve in your data. If you can’t obtain an adequate fit using linear regression, that’s when you might need to choose nonlinear regression.
• Linear regression is easier to use, simpler to interpret, and you obtain more statistics that help you assess the model.
• While linear regression can model curves, it is relatively restricted in the shapes of the curves that it can fit. Sometimes it can’t fit the specific curve in your data.
• Nonlinear regression can fit many more types of curves, but it can require more effort both to find the best fit and to interpret the role of the independent variables. Additionally, R-squared is not valid for nonlinear regression, and it is impossible to calculate p-values for the parameter estimates.
Nonlinear regression
• Nonlinear regression is a form of regression
analysis in which data is fit to a model and then
expressed as a mathematical function.
• Simple linear regression relates two variables (X
and Y) with a straight line (y = mx + b), while
nonlinear regression must generate a line
(typically a curve) as if every value of Y was a
random variable.
BREAKING DOWN 'Nonlinear Regression'

• Nonlinear regression modeling is similar to linear regression modeling in that both seek to graphically track a particular response from a set of variables.
• Nonlinear models are more complicated than linear models to develop because the function is created through a series of approximations (iterations) that may stem from trial-and-error.
• Mathematicians use several established methods, such as the Gauss-Newton method and the Levenberg-Marquardt method.
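As an added illustration, SciPy's curve_fit performs nonlinear least squares and, when no bounds are given, uses the Levenberg-Marquardt method mentioned above; the exponential model and synthetic data are assumptions for the demo only.

```python
# Sketch: nonlinear regression with scipy.optimize.curve_fit (default: LM).
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    return a * np.exp(b * x)

rng = np.random.default_rng(0)
x = np.linspace(0, 4, 40)
y = model(x, 2.5, 0.8) + rng.normal(scale=0.5, size=x.size)

# Iterative fit from an initial guess p0; curve_fit returns the parameters
# and their covariance matrix.
params, _ = curve_fit(model, x, y, p0=(1.0, 0.5))
print(params)  # should land close to (2.5, 0.8)
```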
Sources:

PowerPoint presentations from Dr. Moataza Mahmoud Abdel Wahab, Lecturer of Biostatistics, High Institute of Public Health, University of Alexandria.

Nonlinear Regression, Investopedia:
https://www.investopedia.com/terms/n/nonlinear-regression.asp