Correlation With Example
 x     y
20    2.7
30    2.9
50    3.4
45    3.0
10    2.2
30    3.1
40    3.3
25    2.3
50    3.5
20    2.5
10    1.5
55    3.8
60    3.7
50    3.1
35    2.8
Scatter Diagram
• Scatter diagram is a graphical method to
display the relationship between two variables
[Scatter plot of the (x, y) data above]
Is there a linear relationship between BMI and BW?
• Scatter diagrams are important for initial
exploration of the relationship between two
quantitative variables
ŷ = â + b̂x = 1.775351 + 0.0330187x
[Scatter plot: Height in cm vs. Age in weeks]
Negative relationship: [scatter plot of Reliability vs. Age of car]
No relation: [scatter plot with no discernible pattern]
Correlation Coefficient
If r = ±1, there is perfect correlation.
How to compute the simple correlation coefficient (r):

r = [Σxy − (Σx · Σy)/n] / √{[Σx² − (Σx)²/n] · [Σy² − (Σy)²/n]}
Example:

r = [Σxy − (Σx · Σy)/n] / √{[Σx² − (Σx)²/n] · [Σy² − (Σy)²/n]}
Serial   Age (years) x   Weight (kg) y    xy    x²    y²
  1            7              12          84    49   144
  2            6               8          48    36    64
  3            8              12          96    64   144
  4            5              10          50    25   100
  5            6              11          66    36   121
  6            9              13         117    81   169
Totals:   Σx = 41,  Σy = 66,  Σxy = 461,  Σx² = 291,  Σy² = 742
r = 0.759
strong direct correlation
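The hand computation above can be checked with a short Python sketch that plugs the table's columns into the formula for r:

```python
import math

# Age (x, years) and weight (y, kg) for the six children in the table
x = [7, 6, 8, 5, 6, 9]
y = [12, 8, 12, 10, 11, 13]
n = len(x)

sum_x, sum_y = sum(x), sum(y)                 # 41, 66
sum_xy = sum(a * b for a, b in zip(x, y))     # 461
sum_x2 = sum(a * a for a in x)                # 291
sum_y2 = sum(b * b for b in y)                # 742

# r = [Σxy − ΣxΣy/n] / √([Σx² − (Σx)²/n][Σy² − (Σy)²/n])
r = (sum_xy - sum_x * sum_y / n) / math.sqrt(
    (sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n)
)
print(round(r, 2))  # 0.76
```

The unrounded value is ≈ 0.7596, matching the slide's r = 0.759 (truncated).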
EXAMPLE: Relationship between Anxiety and Test Scores

Anxiety (X)   Test score (Y)    X²    Y²    XY
    10              2          100     4    20
     8              3           64     9    24
     2              9            4    81    18
     1              7            1    49     7
     5              6           25    36    30
     6              5           36    25    30
ΣX = 32,  ΣY = 32,  ΣX² = 230,  ΣY² = 204,  ΣXY = 129
Calculating Correlation Coefficient
r = - 0.94
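The same formula, applied in Python to the anxiety/test-score columns, reproduces this result:

```python
import math

# Anxiety (X) and test score (Y) from the table above
x = [10, 8, 2, 1, 5, 6]
y = [2, 3, 9, 7, 6, 5]
n = len(x)

sum_x, sum_y = sum(x), sum(y)                 # 32, 32
sum_xy = sum(a * b for a, b in zip(x, y))     # 129
sum_x2 = sum(a * a for a in x)                # 230
sum_y2 = sum(b * b for b in y)                # 204

r = (sum_xy - sum_x * sum_y / n) / math.sqrt(
    (sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n)
)
print(round(r, 2))  # -0.94
```

The negative sign reflects the inverse relationship: higher anxiety goes with lower test scores.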
Spearman's rank correlation coefficient:

rs = 1 − (6 Σdi²) / (n(n² − 1))

With Σdi² = 64 and n = 7:

rs = 1 − (6 × 64) / (7 × 48) = 1 − 384/336 = −0.14
Comment:
There is an indirect weak correlation
between level of education and income.
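A minimal sketch of the same calculation, using only the summary quantities the slide gives (Σdi² = 64, n = 7; the underlying ranks are not shown):

```python
# Spearman's rank correlation: rs = 1 - 6*Σdi² / (n(n² - 1))
n = 7          # number of paired observations
sum_d2 = 64    # sum of squared rank differences, as given on the slide

rs = 1 - (6 * sum_d2) / (n * (n ** 2 - 1))
print(round(rs, 2))  # -0.14
```

A value this close to zero on the negative side is read as a weak indirect (inverse) correlation.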
Regression Analyses
Regression: technique concerned with predicting
some variables by knowing others
[Scatter plot: SBP (mmHg) vs. Wt (kg)]
By using the least squares method (a procedure
that minimizes the vertical deviations of plotted
points surrounding a straight line) we are
able to construct a best fitting straight line to the
scatter diagram points and then formulate a
regression equation in the form of:
ŷ = a + bX

or equivalently  ŷ = ȳ + b(x − x̄)
Regression Equation
[Scatter plot of SBP (mmHg) vs. Wt (kg) with the fitted regression line, annotated with the intercept and the slope]
Linear Equations
ŷ = a + bX

b = Slope = Change in Y / Change in X
a = Y-intercept

[Diagram: the line ŷ = a + bX, with intercept a on the Y axis and slope b]
Hours studying and grades
Regressing grades on hours
Linear Regression
Final grade in course = 59.95 + 3.17 × (hours of study)
R-Square = 0.88

[Scatter plot of final grade vs. hours studied, with the fitted regression line]
Using the age/weight example data (Σx = 41, Σy = 66, Σxy = 461, Σx² = 291, n = 6):

b = [Σxy − (Σx · Σy)/n] / [Σx² − (Σx)²/n]
  = [461 − (41 × 66)/6] / [291 − (41)²/6]
  = 10 / 10.83 = 0.92
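The least-squares slope and intercept for the age/weight data can be checked directly:

```python
# Least-squares fit for the age (x, years) / weight (y, kg) data
x = [7, 6, 8, 5, 6, 9]
y = [12, 8, 12, 10, 11, 13]
n = len(x)

# b = [Σxy − ΣxΣy/n] / [Σx² − (Σx)²/n]
b = (sum(a * c for a, c in zip(x, y)) - sum(x) * sum(y) / n) / \
    (sum(a * a for a in x) - sum(x) ** 2 / n)
a = sum(y) / n - b * sum(x) / n   # a = ȳ − b·x̄

print(round(b, 2))  # 0.92
```

So each additional year of age predicts roughly 0.92 kg of additional weight in this sample.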
Regression equation (n = 20, Σx² = 41678):

ŷ = 112.13 + 0.4547x

For age 25: B.P = 112.13 + 0.4547 × 25 = 123.49 ≈ 123.5 mmHg
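Plugging an age into the fitted line is a one-liner; a small sketch of the prediction step:

```python
# Fitted regression line from the slide: ŷ = 112.13 + 0.4547·x (x = age)
a, b = 112.13, 0.4547

def predict_sbp(age):
    """Predicted systolic blood pressure (mmHg) at a given age."""
    return a + b * age

sbp = predict_sbp(25)
print(round(sbp, 1))  # 123.5
```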
Multiple Regression

Relationships of bsal (beginning salary) with other employee characteristics (e.g. fsex).

[Panel of scatter plots: bsal vs. each of the other variables, each with a linear fit]
Multiple regression model
• For any combination of values of the predictor
variables, the average value of the response
(bsal) lies on a straight line:
bsal_i = β0 + β1·fsex_i + β2·senior_i + β3·age_i + β4·educ_i + β5·exper_i + e_i
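A sketch of fitting such a model by ordinary least squares, on synthetic data (the study's actual salary data are not reproduced here; predictor ranges and the β values are hypothetical, and the noiseless construction lets OLS recover them exactly):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Hypothetical predictor values, one column per variable in the model
fsex   = rng.integers(0, 2, n).astype(float)  # 0/1 sex indicator
senior = rng.uniform(60, 100, n)
age    = rng.uniform(280, 600, n)
educ   = rng.uniform(8, 16, n)
exper  = rng.uniform(0, 380, n)

# Hypothetical coefficients [β0, β1, ..., β5]
beta = np.array([5000.0, 700.0, -20.0, 0.5, 80.0, 1.0])

# Design matrix with an intercept column; bsal built with no noise
X = np.column_stack([np.ones(n), fsex, senior, age, educ, exper])
bsal = X @ beta

# OLS estimate of the coefficients
beta_hat, *_ = np.linalg.lstsq(X, bsal, rcond=None)
print(np.round(beta_hat, 3))
```

With noiseless data the estimates equal the true coefficients; with real data they would differ by sampling error.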
Actual by Predicted Plot

[Plot: actual bsal vs. predicted bsal]

Summary of Fit: RMSE = 508.09

[Residual plot: bsal residuals vs. predicted values]
This is a portion of the raw data for the 100 subjects who
participated in the study.
Logistic Regression
• How can we analyze these data?
Non-pooled t-test
The mean age of the individuals with some signs of coronary heart
disease is 51.28 years vs. 39.18 years for individuals without signs
(t = 5.95, p < .0001).
Logistic Regression
Simple Linear Regression?

E(CD | Age) = −0.54 + 0.02·Age
e.g. for an individual 50 years of age:
E(CD | Age = 50) = −0.54 + 0.02 × 50 = 0.46 ??

Smooth Regression Estimate?

The smooth regression estimate is "S-shaped", but what does the estimated mean value represent?
Answer: P(CD | Age)!
Logistic Regression
We can group individuals into age classes and look at the
percentage/proportion showing signs of coronary heart
disease.
   Age group    n    Diseased   Proportion
1) 20–29       10        1        .100
2) 30–34       15        2        .133
3) 35–39       12        3        .250
4) 40–44       15        5        .333
5) 45–49       13        6        .462
6) 50–54        8        5        .625
7) 55–59       17       13        .765
8) 60–64       10        8        .800

Notice the "S-shape" to the estimated proportions vs. age.
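The proportions in the table come straight from the counts; a small sketch recomputes them and confirms the monotone, S-shaped rise:

```python
# Proportion with signs of disease in each age class (counts from the table)
groups   = ["20-29", "30-34", "35-39", "40-44", "45-49", "50-54", "55-59", "60-64"]
n_group  = [10, 15, 12, 15, 13,  8, 17, 10]
diseased = [ 1,  2,  3,  5,  6,  5, 13,  8]

props = [d / n for d, n in zip(diseased, n_group)]
for g, p in zip(groups, props):
    print(f"{g}: {p:.3f}")
```

The proportions climb steadily from .100 to .800 as age increases, which is exactly the S-shaped pattern a logistic curve is built to capture.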
Logistic Function
P("Success" | X) = e^(β0 + β1X) / (1 + e^(β0 + β1X))

[Plot: P("Success" | X) vs. X, an S-shaped curve rising from 0 to 1]
Logit Transformation
The logistic regression model is given by
P(Y | X) = e^(β0 + β1X) / (1 + e^(β0 + β1X))

which is equivalent to

ln[ P(Y | X) / (1 − P(Y | X)) ] = β0 + β1X
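The equivalence of the two forms is easy to verify numerically; the coefficient values below are hypothetical, chosen only for illustration:

```python
import math

b0, b1 = -5.31, 0.111   # hypothetical coefficients
x = 50.0

linear = b0 + b1 * x                              # β0 + β1·X
p = math.exp(linear) / (1 + math.exp(linear))     # logistic form: P(Y | X)
log_odds = math.log(p / (1 - p))                  # logit form: ln[P/(1−P)]

# The log-odds recovers the linear predictor exactly
print(round(p, 3), round(log_odds, 3))
```

Going from probability to log-odds (the logit) and back is what lets a bounded probability be modeled with an unbounded linear function of X.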
Odds for Disease with Risk Present = P(Y=1 | X=1) / [1 − P(Y=1 | X=1)] = e^(β0 + β1)

Odds for Disease with Risk Absent = P(Y=1 | X=0) / [1 − P(Y=1 | X=0)] = e^(β0)

Therefore the odds ratio (OR) = e^(β0 + β1) / e^(β0) = e^(β1)
Dichotomous Predictor
ln(OR) = β1

Thus the estimated regression coefficient associated with a 0–1 coded dichotomous predictor is the natural log of the OR associated with risk presence.
Why use logistic regression?
However, transforming the independent variables does not remedy all of the potential
problems. What if we have a non-normally distributed dependent variable? The following
example depicts the problem of fitting a regular regression line to a non-normal dependent
variable.
Suppose you have a binary outcome variable. The problem of having a non-continuous
dependent variable becomes apparent when you create a scatterplot of the relationship.
Here, we see that it is very difficult to decipher a relationship among these variables.
A Problem with Linear Regression
We could severely simplify the plot by drawing a line between the means for the two
dependent variable levels, but this is problematic in two ways: (a) the line seems to
oversimplify the relationship and (b) it gives predictions that cannot be observable values
of Y for extreme values of X.
A Problem with Linear Regression
The mean of a binomial variable coded as (1,0) is a proportion. We could plot conditional
probabilities as Y for each level of X. Of course, we could fit a linear model to these
conditional probabilities, but (as shown) the linear model does not predict the maximum
likelihood estimates for each group (the mean—shown by the circles) and it still produces
unobservable predictions for extreme values of the dependent variable.
The Linear Probability Model
In the OLS regression:

Y = β0 + β1X + e,  where Y ∈ {0, 1}

• The error terms are heteroskedastic
• e is not normally distributed because Y takes on only two values
• The predicted probabilities can be greater than 1 or less than 0
A Problem with Linear Regression
The Logistic Regression Model
The "logit" model solves these problems:

ln[ P(Y | X) / (1 − P(Y | X)) ] = β0 + β1X
Odds & Odds Ratios
Recall the definition of an odds:

odds = p / (1 − p)

The odds has a range of 0 to ∞, with values greater than 1 associated with an
event being more likely to occur than to not occur, and values less than 1 associated with
an event that is less likely to occur than not occur.
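A tiny sketch of how the odds behave at a few probabilities:

```python
# odds = p / (1 - p): 1 at p = 0.5, above 1 when the event is more
# likely than not, below 1 when it is less likely than not
for p in (0.2, 0.5, 0.8):
    print(p, p / (1 - p))
```

At p = 0.5 the odds are exactly 1; p = 0.8 gives odds of about 4 (four to one in favor), while p = 0.2 gives about 0.25 (four to one against).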
Guidelines for Choosing Between Linear and Nonlinear
Regression