Lecture 7 Correlation
Instructor: Fatima Zafar
Correlation
Correlation refers to a process for establishing the relationship between two variables. You
have already learned that one way to get a general idea of whether two variables are related is
to plot them on a “scatter plot”.
Correlation analysis is used to measure the direction and strength of the relationship between
two variables. It's important to note that correlation does not equal causation. That means that
while a relationship may be observed, correlation alone cannot tell you that one variable caused
or affected the other. The relationship observed may be due to other variables not accounted for
in the model.
Correlation Coefficient
A correlation coefficient is a number between -1 and 1 that tells you the strength and direction
of a relationship between variables.
The correlation coefficient tells you how closely your data fit a line. If you have a linear
relationship, you can draw a straight line of best fit on a scatter plot that takes all of your
data points into account.
The closer your points are to this line, the higher the absolute value of the correlation coefficient
and the stronger your linear correlation.
If all points are perfectly on this line, you have a perfect correlation.
If all points are close to this line, the absolute value of your correlation coefficient is high.
If these points are spread far from this line, the absolute value of your correlation coefficient is low.
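The three cases above can be checked numerically. The following is a minimal sketch, assuming the standard product-moment formula for the correlation coefficient; the function name pearson_r and the three small datasets are made up for illustration and are not part of the lecture.

```python
import math

def pearson_r(xs, ys):
    """Sum of cross-products divided by the square root of the
    product of the two sums of squared deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Illustrative data only (hypothetical values)
x = [1, 2, 3, 4, 5, 6]
on_line = [2 * v + 1 for v in x]               # every point exactly on y = 2x + 1
near_line = [3.2, 4.7, 7.3, 8.8, 11.1, 12.9]   # small scatter around that line
far_from_line = [9, 2, 11, 3, 8, 6]            # widely scattered points

r_perfect = pearson_r(x, on_line)      # perfect correlation
r_high = pearson_r(x, near_line)       # high absolute value
r_low = pearson_r(x, far_from_line)    # low absolute value

print(r_perfect, r_high, r_low)
```

With these made-up points, the first value is 1, the second is close to 1, and the third is close to 0.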
Cont..
Note that the steepness or slope of the line isn’t related to the correlation coefficient value. The
correlation coefficient doesn’t help you predict how much one variable will change based on a
given change in the other, because two datasets with the same correlation coefficient value can
have lines with very different slopes.
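A small numerical sketch of this point, assuming the product-moment formula for the coefficient and an ordinary least-squares slope. The helper names and the data are hypothetical. The second dataset is the first with every y value multiplied by 10, which makes the fitted line ten times steeper while leaving the correlation coefficient unchanged.

```python
import math

def deviations_summary(xs, ys):
    """Return (sxy, sxx, syy): the cross-product sum and the two
    sums of squared deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy, sxx, syy

def pearson_r(xs, ys):
    sxy, sxx, syy = deviations_summary(xs, ys)
    return sxy / math.sqrt(sxx * syy)

def best_fit_slope(xs, ys):
    """Slope of the least-squares line of y on x."""
    sxy, sxx, _ = deviations_summary(xs, ys)
    return sxy / sxx

# Illustrative data only (hypothetical values)
x = [1, 2, 3, 4, 5, 6]
y_shallow = [3.2, 4.7, 7.3, 8.8, 11.1, 12.9]
y_steep = [10 * v for v in y_shallow]   # same pattern, stretched vertically

r_shallow = pearson_r(x, y_shallow)
r_steep = pearson_r(x, y_steep)
slope_shallow = best_fit_slope(x, y_shallow)
slope_steep = best_fit_slope(x, y_steep)

print(r_shallow, r_steep)          # the same coefficient, up to rounding
print(slope_shallow, slope_steep)  # slopes differ by a factor of 10
```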
Types of correlation coefficients
You can choose from many different correlation coefficients based on the linearity of the
relationship, the level of measurement of your variables, and the distribution of your data.
For high statistical power and accuracy, it’s best to use the correlation coefficient that’s most
appropriate for your data.
The most commonly used correlation coefficient is Pearson’s r because it allows for strong
inferences. It’s parametric and measures linear relationships. But if your data do not meet
all assumptions for this test, you’ll need to use a non-parametric test instead.
Non-parametric rank correlation coefficients summarize monotonic relationships between
variables, which do not have to be linear. Spearman’s rho and Kendall’s tau have the same
conditions for use, but Kendall’s tau is generally preferred for smaller samples, whereas
Spearman’s rho is more widely used.
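To make the difference between linear and rank correlation concrete, here is a minimal sketch that computes all three coefficients by hand. It assumes there are no tied values (real statistical libraries handle ties with extra corrections), uses the simplest form of Kendall's tau (concordant minus discordant pairs, divided by the number of pairs), and the function names and data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank 1 for the smallest value up to n for the largest.
    Assumption: no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

def kendall_tau(xs, ys):
    """Kendall's tau without tie correction."""
    n = len(xs)
    concordant = 0
    discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            direction = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if direction > 0:
                concordant += 1
            elif direction < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Illustrative data only: y always increases with x, but not along a line
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [v ** 4 for v in x]

r = pearson_r(x, y)       # below 1, because the points curve away from a line
rho = spearman_rho(x, y)  # 1, because the ordering is perfectly preserved
tau = kendall_tau(x, y)   # 1, because every pair of points is concordant

print(r, rho, tau)
```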
Cont..
[Table: correlation coefficients compared by type of relationship, levels of measurement, and data distribution]
Pearson’s product-moment correlation coefficient, also known as Pearson’s r, describes the
linear relationship between two quantitative variables.
These are the assumptions your data must meet if you want to use Pearson’s r:
I. Both variables are on an interval or ratio level of measurement
II. Data from both variables follow normal distributions
III. Your data have no outliers
IV. Your data are from a random or representative sample
V. You expect a linear relationship between the two variables
Pearson’s r is a parametric measure, so it has high power when its assumptions are met. But it’s
not a good measure of correlation if your variables have a nonlinear relationship, or if your
data have outliers, skewed distributions, or come from categorical variables. If any of these
assumptions are violated, you should consider a rank correlation measure.
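The effect of a single outlier can be sketched as follows. This is an illustration under stated assumptions, not a prescribed procedure: the data are made up, the helper names are hypothetical, Spearman's rho stands in for "a rank correlation measure", and the ranking helper assumes no tied values.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank 1 for the smallest value up to n for the largest (no ties assumed)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(xs, ys):
    return pearson_r(ranks(xs), ranks(ys))

# Illustrative data only (hypothetical values)
x = list(range(1, 11))
y_clean = list(range(1, 11))       # perfectly linear: y = x
y_outlier = y_clean[:-1] + [100]   # same data, but the last value is an outlier

r_clean = pearson_r(x, y_clean)         # 1 for the clean data
r_outlier = pearson_r(x, y_outlier)     # pulled well below 1 by one point
rho_outlier = spearman_rho(x, y_outlier)  # still 1: the ranks did not change

print(r_clean, r_outlier, rho_outlier)
```

Because the outlier keeps its place as the largest value, its rank is unchanged, which is why the rank-based coefficient is unaffected here while Pearson's r drops.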