Chapter 8 - PSYC 284
Introduction
There are situations in which both variables are continuous. When this is the case, the most appropriate
way to look at relationships between variables is by correlations and/or linear regression.
An overview of correlation
Correlation is an overused term among non-statistical folks, as it is often used to describe any type of
relationship between two things. Statistically speaking, though, the term correlation is reserved for
describing the relationship between two variables when at least one is continuous.
Correlations describe how two variables move together, and correlation coefficients can be either
positive or negative. A positive correlation means that as one variable increases, the other one does as
well. A negative correlation means that as one variable increases, the other decreases.
Scatter plots include a point for each observation, denoting where the values of the two variables meet,
and they let you judge whether the relationship between the x and y variables looks linear. Correlations
should only be used when the relationship between the two variables appears to be linear.
The magnitude of a correlation coefficient can range from 0, indicating no relationship between the
variables, to 1, indicating that one variable perfectly predicts the other. Combined with the sign,
correlation coefficients therefore range from −1 to +1, or, with r representing the correlation coefficient:
−1 ≤ r ≤ +1. In reality, it is very unlikely that you will ever see correlations of either +1 or −1.
The strength of correlations can be described qualitatively. Since the strength of the relationship is not
indicated by positive or negative, the descriptors are based on the absolute value of r.
|r|           Description of relationship
0.00-0.19     No relationship
0.20-0.39     Weak
0.40-0.59     Modest
0.60-0.79     Moderate
0.80-1.00     Strong
Caution: correlation ≠ causation
It should be noted that correlations, no matter how strong, are not suggestive of causation. To infer
causation three conditions must be met:
The independent variable (the cause) must precede the dependent variable (the effect) in time.
The two variables must be correlated with one another.
The correlation between the two variables cannot be due to the influence of one or more
additional variables.
It is very difficult— nearly impossible in fact— to discern or rule out the influence of all additional factors
in determining the actual nature of the relationship between two variables.
The biased formula is indicated for populations and the unbiased formula is indicated for samples.
Biased correlation (for populations):

$$\rho_{yx} = \frac{\dfrac{\Sigma XY}{N} - \mu_x \mu_y}{\sigma_x \sigma_y}$$

Unbiased correlation (for samples):

$$r_{yx} = \frac{\dfrac{1}{N-1}\left(\Sigma XY - N\bar{X}\bar{Y}\right)}{s_x s_y}$$
In both of these equations, the numerator is a measure of the covariance of the variables, and the
denominator is the product of the standard deviations.
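As a quick illustration of that structure, the following sketch uses made-up vectors (the names x and y and their values are placeholders, not data from the notes) to show that dividing the sample covariance by the product of the sample standard deviations reproduces R's built-in cor():
> x <- c(2, 4, 5, 7, 9)        # hypothetical data for the first variable
> y <- c(10, 14, 15, 19, 22)   # hypothetical data for the second variable
> cov(x, y) / (sd(x) * sd(y))  # covariance divided by the product of the standard deviations
> cor(x, y)                    # R's built-in correlation; should give the same value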
Calculation example
STEP 1: Determine how you are going to look at the relationship between the variables.
As in almost all cases, you will determine this by looking at the level of measurement of each of your
variables. Because both of these variables are continuous, you will know that examining the relationship
by calculating a correlation coefficient is appropriate.
Since you will be looking at the relationship between these variables by calculating a correlation
coefficient, you will want to first create a scatter plot to determine whether it even looks like the
relationship between the variables is linear. It does not matter on which axis you put each variable at
this point as the overall look should be the same regardless.
To do this in R, you could create a vector for each of x and y that holds the data. When entering the data,
the values must be entered for each variable in the exact order in which they were displayed, so that the
paired observations stay aligned.
> plot(variable, variable)
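A minimal sketch of this in the Console, using placeholder vector names and made-up values (not data from the chapter), might look like:
> x <- c(2, 4, 5, 7, 9)        # values of the first variable, in the order displayed
> y <- c(10, 14, 15, 19, 22)   # values of the second variable, in the same order
> plot(x, y)                   # scatter plot to check whether the relationship looks linear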
When you are obtaining information on a sample (and not the entire population), it is most appropriate
to use the unbiased formula.
You must assign one of your variables to be X and the other Y. It does not matter which is which.
On the face of it, the manually calculated result should match what the scatter plot suggests; looking back
at the plot gives you a sense of whether your hand-calculated correlation coefficient looks to be a
reasonable result or not.
Hypothesis testing utilizing Pearson’s r is appropriate when we can assume that both variables are
normally distributed and when the pairs of observations are independent of one another.
When looking at the significance of a correlation coefficient, the hypothesis being tested is:
H0: ρ = 0
H1: ρ ≠ 0
Testing for significance uses the t-distribution, with df = n − 2, where n = the number of pairs in the sample.
Statisticians have compiled critical values of r, above which H0 can be rejected and H1 accepted.
Refer to Table 4 in Appendix F for a table of critical r values. If |r_yx| > r_crit, we can reject the null hypothesis
and accept the alternate hypothesis.
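The table look-up can also be checked in R: base R's cor.test() carries out the equivalent t-test directly and reports a p-value rather than a critical r, but the decision is the same. A minimal sketch, with placeholder vectors x and y:
> x <- c(2, 4, 5, 7, 9)
> y <- c(10, 14, 15, 19, 22)
> cor.test(x, y)                    # Pearson's r with its t statistic, df = n - 2, and p-value
> r <- cor(x, y)
> n <- length(x)
> r * sqrt(n - 2) / sqrt(1 - r^2)   # the same t statistic computed from the formula by hand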
When we are unable to assume normality or we have few pairs of observations, Pearson’s r is not an
appropriate method for computing a correlation coefficient. In these cases, the Spearman rank-order
correlation coefficient, or Spearman's rho, is more appropriate.
When calculating Spearman’s rho, each observation for each variable is ranked in order from lowest to
highest. These ranks then replace the values initially assigned to each observation of each variable.
The formula for Spearman's rho is:

$$r_s = 1 - \frac{6\,\Sigma D^2}{N\left(N^2 - 1\right)}$$

where D is the difference between a pair of ranks and N is the number of pairs.
Very often you will have variables in which multiple observations share the same value, so ranks will be
tied. In that case, each tied observation receives the mean of the ranks it would otherwise occupy; refer
to the book for a worked example of how to handle ties.
To calculate D for each pair, we will simply subtract the rank for variable 1 from the rank for variable 2.
Some values of D will be positive and others negative, but all values of D² will, of course, be non-negative.
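A minimal sketch of this calculation in R, using placeholder data with no tied values (so the simple formula applies), could look like:
> x <- c(3, 7, 2, 9, 5)                    # hypothetical data for the first variable
> y <- c(20, 41, 30, 55, 18)               # hypothetical data for the second variable
> D <- rank(x) - rank(y)                   # difference between each pair of ranks
> N <- length(D)
> 1 - (6 * sum(D^2)) / (N * (N^2 - 1))     # Spearman's rho from the formula
> cor(x, y, method = "spearman")           # should agree when there are no ties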
The hypotheses we are testing with Spearman's rho are similar to those we would test using Pearson's r:
H0: ranks are independent in the population from which the sample was taken
H1: ranks are not independent in the population from which the sample was taken
Because this hypothesis test is not parametric, however, we cannot use the same table of critical values
when the sample size is small. If n > 30, we will use the same table that we used for Pearson's
r (Table 4 in Appendix F); if n ≤ 30, you will use Table 5 in Appendix F instead.
As stated earlier, you will want to start by displaying the data in a scatter plot to determine if there
appears to be a linear relationship between the variables.
To actually compute the correlation coefficient, we will use the rcorr() function from the Hmisc package
(loaded with library(Hmisc)):
> rcorr(data$variable, data$variable)
In the top portion of this output, you will see the correlation coefficient for the two variables. Since x has
a perfect correlation with x and y has a perfect correlation with y, we will ignore the correlation
coefficient of 1. We are, however, interested in the correlation coefficient looking at the relationship
between x and y. In the second portion of this output, we see how many observations we have for each
of our variables. Finally, the bottom section in this output is the obtained probability value.
If you want to see the p-value in standard notation instead of scientific notation (if the number is too
small), enter the following into the Console:
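The exact command is not preserved in these notes; one common option (an assumption here, not necessarily the call the text intends) is to raise R's penalty for scientific notation:
> options(scipen = 999)                 # discourage scientific notation in printed output
> format(3.2e-08, scientific = FALSE)   # or convert a single value to standard notation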
Now, suppose you wanted to analyze this same data using Spearman’s rho instead of Pearson’s r. In that
case, simply add an option to the rcorr() function:
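With the same placeholder data and variable names as above, that call would look something like:
> rcorr(data$variable, data$variable, type = "spearman")   # Spearman's rho instead of Pearson's r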
You should carefully select which correlation coefficient to use based on the characteristics of your data.
Linear regression
While it is often useful to examine the relationship between two variables, it is even more helpful to use
what we have learned so far to predict the value of one variable given a value for another variable.
Every line can be defined by two things: where it crosses the y-axis (the y-intercept) and its slope, which is
defined as rise/run, or Δy/Δx. In the case that r = 1, we see that the y-intercept is 14 and the slope of the line is
1. Therefore, this line, which best fits the data, is defined by the equation y = 1x + 14 or, simplified, y = x
+ 14. The idea with this regression line is that it can be used to predict future data points. For instance, if
we know the value for x, we can now predict a value for y, even though that particular pair of x and y
values was never actually observed.
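For example, if a (hypothetical) new observation had x = 10, the line above would predict y = 10 + 14 = 24.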
What defines the line of best fit is that the sum of the squared residuals is minimized: that is, the line of
best fit is the one for which the sum of the squared differences between predicted values (the y-values
that would be calculated from the slope and y-intercept) and observed values (the values of your
dependent variable for each observation) is as close to zero as possible.
Another term for this type of regression is ordinary least squares (OLS) regression. In defining the OLS regression line, we do not
use the variable y; rather, we use the term ŷ to indicate that this is a predicted value and not one that
was actually observed.
The regression line is written as ŷ = b_yx X + a_yx, where b_yx is the slope of the line defined by the
variables y and x and a_yx is the y-intercept of the line defined by the variables y and x.
Like other formulae, the formula for the OLS regression line is similar, but not the same, for the
population (the biased formula) and for samples (the unbiased formula).
When considering the slope of the line, the biased formula (for a population) is:

$$b_{yx} = \rho\,\frac{\sigma_y}{\sigma_x}$$

When considering the slope of the line for a sample, the unbiased formula is:

$$b_{yx} = r\,\frac{s_y}{s_x}$$
Regardless of whether you are calculating the biased or unbiased regression line, it is imperative that
you compute the slope first, as it is needed for the calculation of the y-intercept (for a sample,
a_yx = Ȳ − b_yx X̄).
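A minimal sketch of this sequence for a sample, using placeholder vectors x and y and checking the hand calculation against lm():
> x <- c(2, 4, 5, 7, 9)
> y <- c(10, 14, 15, 19, 22)
> b_yx <- cor(x, y) * sd(y) / sd(x)   # unbiased slope: r multiplied by (s_y / s_x)
> a_yx <- mean(y) - b_yx * mean(x)    # y-intercept, which requires the slope already computed
> c(intercept = a_yx, slope = b_yx)
> coef(lm(y ~ x))                     # R's regression coefficients; should match the hand calculation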
Calculation example
STEP 1: Determine how you are going to look at the relationship between the variables.
Once we choose the formula, we need to determine which is our dependent variable, y, and which is our
independent variable, x. We want to predict the dependent variable from the independent variable.
The greater the absolute value of the correlation coefficient, the better the fit of the regression line.
The coefficient of determination is a goodness-of-fit statistic that describes how well a regression
equation fits a set of data. The coefficient of determination is r² and describes the proportion of the
variance in the dependent variable that is explained by the regression model. In a simple OLS regression
with only one predictor (independent variable), you can calculate the coefficient of determination by
squaring the correlation coefficient. The coefficient of determination ranges between 0 and 1: 0 ≤ r² ≤ 1.
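For example, a correlation of r = 0.70 between two variables corresponds to r² = 0.49, meaning roughly 49% of the variance in the dependent variable is accounted for by the model.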
Another commonly used goodness-of-fit statistic in OLS regression is the standard error of the estimate.
This fit statistic measures the average deviation of predicted values from observed values.
For a sample, the unbiased formulae for this are:

$$s_{est} = \sqrt{\frac{\Sigma\,(Y - \hat{Y})^{2}}{N - 2}} \quad\text{or}\quad s_{est} = s_y\sqrt{\frac{N - 1}{N - 2}\left(1 - r_{yx}^{2}\right)}$$
Computing ordinary least squares regression lines and goodness of fit statistics using R
The first thing we will do is create an object to store the model produced by the regression; the lm()
function stands for linear model. In the second step, we will view a summary of the model we created:
> summary(r1)
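The lm() call itself is not preserved in these notes; a minimal sketch of the full sequence, with r1, data, y, and x as placeholder names, might be:
> r1 <- lm(y ~ x, data = data)   # fit the OLS regression of y on x and store the model object
> summary(r1)                    # coefficients with p-values, R-squared, and residual standard error
> summary(r1)$r.squared          # the coefficient of determination on its own
> summary(r1)$sigma              # the standard error of the estimate (residual standard error)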
There is another hypothesis test related to regression models that is associated with each independent
variable. The p-value associated with this test addresses the following hypotheses:
H0: byx = 0
H1: byx ≠ 0
When the slope of the regression line is 0, the line of best fit is a horizontal line, meaning the independent variable provides no help in predicting the dependent variable.
Diagnostic plots
We always recommend you assess a regression model more fully by evaluating diagnostic plots. After
you produce the model object with the regression, you can use the plot() function to produce these.
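A minimal sketch, assuming the fitted model from the previous section is stored in r1:
> plot(r1)               # press Return in the Console to step through the four diagnostic plots
> par(mfrow = c(2, 2))   # alternatively, show all four plots at once in a 2 x 2 grid
> plot(r1)
> par(mfrow = c(1, 1))   # reset the plotting layout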
You will receive prompts in the Console to enable you to scroll through four diagnostic plots, displayed
in the Plots pane. The first displays residuals versus fitted values. The dotted horizontal line at zero
denotes a perfect fit for each observation. Ideally, you would want to observe the dots scattered randomly
around this zero line, which indicates a linear relationship between the variables and homogeneous
variance. The Normal Q-Q plot is used to help determine whether the standardized residuals come from some
theoretical distribution, in this case the normal distribution. Ideally, you would like to see all
points lying along the dotted line. The third diagnostic plot, the Scale-Location plot, is used to see whether
residuals are dispersed evenly across the range of predicted values. Ideally, we would want the line (which will be
red when you produce it in R) to be fairly horizontal. Finally, the last plot illustrates residuals
versus leverage. This plot helps identify observations that strongly influence the regression model itself.
Observations falling outside of a dotted line (a Cook's distance contour) may be problematic in some way.