Chapter 8 - PSYC 284
Introduction
There are situations in which both variables are continuous. When this is the case, the most appropriate
way to look at relationships between variables is by correlations and/or linear regression.
An overview of correlation
Correlation is an overused term among non-statistical folks, as it is often used to describe any type of
relationship between two things. Statistically speaking, though, the term correlation is reserved for
describing the relationship between two variables when at least one is continuous.
Correlations describe how two variables move together, and correlation coefficients can be either
positive or negative. A positive correlation means that as one variable increases, the other one does as
well. A negative correlation means that as one variable increases, the other decreases.
Scatter plots include a point for each observation, denoting where the values of the two variables meet,
and they let you judge whether the relationship between the x and y variables looks linear. Correlations
should only be used when the relationship between the two variables appears to be linear.
The magnitude of a correlation coefficient can range from 0, indicating no relationship between the
variables, to 1, indicating that one variable perfectly predicts the other. Combined with the sign,
correlation coefficients therefore range from −1 to +1, or, with r representing the correlation coefficient:
−1 ≤ r ≤ +1. In reality, it is very unlikely that you will ever see correlations of either +1 or −1.
The strength of correlations can be described qualitatively. Since the strength of the relationship is not
indicated by positive or negative, the descriptors are based on the absolute value of r.
|r|           Description of relationship
0.00-0.19     No relationship
0.20-0.39     Weak
0.40-0.59     Modest
0.60-0.79     Moderate
0.80-1.00     Strong
Caution: correlation ≠ causation
It should be noted that correlations, no matter how strong, are not suggestive of causation. To infer
causation three conditions must be met:
The independent variable (the cause) must precede the dependent variable (the effect) in time.
The two variables must be correlated with one another.
The correlation between the two variables cannot be due to the influence of one or more
additional variables.
It is very difficult— nearly impossible in fact— to discern or rule out the influence of all additional factors
in determining the actual nature of the relationship between two variables.
The biased formula is indicated for populations and the unbiased formula is indicated for samples.
Biased correlation (for populations):

$$\rho_{yx} = \frac{\dfrac{\Sigma XY}{N} - \mu_x \mu_y}{\sigma_x \sigma_y}$$

Unbiased correlation (for samples):

$$r_{yx} = \frac{\dfrac{1}{N-1}\left(\Sigma XY - N\bar{X}\bar{Y}\right)}{s_x s_y}$$
In both of these equations, the numerator is a measure of the covariance of the variables, and the
denominator is the product of the standard deviations.
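As a quick illustration of that structure, the following sketch uses made-up vectors (the names x and y and their values are placeholders, not data from the notes) to show that dividing the sample covariance by the product of the sample standard deviations reproduces R's built-in cor():
> x <- c(2, 4, 5, 7, 9)        # hypothetical data for the first variable
> y <- c(10, 14, 15, 19, 22)   # hypothetical data for the second variable
> cov(x, y) / (sd(x) * sd(y))  # covariance divided by the product of the standard deviations
> cor(x, y)                    # R's built-in correlation; should give the same value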
Calculation example
STEP 1: Determine how you are going to look at the relationship between the variables.
As in almost all cases, you will determine this by looking at the level of measurement of each of your
variables. Because both of these variables are continuous, you will know that examining the relationship
by calculating a correlation coefficient is appropriate.
Since you will be looking at the relationship between these variables by calculating a correlation
coefficient, you will want to first create a scatter plot to determine whether it even looks like the
relationship between the variables is linear. It does not matter on which axis you put each variable at
this point as the overall look should be the same regardless.
To do this in R, you could create a vector for each of x and y that holds the data. When entering the data,
the values must be entered for each variable in the exact order in which they were displayed, so that the
paired observations stay aligned.
> plot(variable, variable)
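A minimal sketch of this in the Console, using placeholder vector names and made-up values (not data from the chapter), might look like:
> x <- c(2, 4, 5, 7, 9)        # values of the first variable, in the order displayed
> y <- c(10, 14, 15, 19, 22)   # values of the second variable, in the same order
> plot(x, y)                   # scatter plot to check whether the relationship looks linear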
When you are obtaining information on a sample (and not the entire population), it is most appropriate
to use the unbiased formula.
You must assign one of your variables to be X and the other Y. It does not matter which is which.
On the face of it, the manually calculated result should match what the scatter plot suggests; looking back
at the plot gives you a sense of whether your hand-calculated correlation coefficient looks to be a
reasonable result or not.
Hypothesis testing utilizing Pearson’s r is appropriate when we can assume that both variables are
normally distributed and when the pairs of observations are independent of one another.
When looking at the significance of a correlation coefficient, the hypothesis being tested is:
H0: ρ = 0
H1: ρ ≠ 0
Testing for significance uses the t-distribution, with df = n − 2, where n = the number of pairs in the sample.
Statisticians have compiled critical values of r, above which H0 can be rejected and H1 accepted.
Refer to Table 4 in Appendix F for a table of critical r values. If |r_yx| > r_crit, we can reject the null hypothesis
and accept the alternate hypothesis.
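The table look-up can also be checked in R: base R's cor.test() carries out the equivalent t-test directly and reports a p-value rather than a critical r, but the decision is the same. A minimal sketch, with placeholder vectors x and y:
> x <- c(2, 4, 5, 7, 9)
> y <- c(10, 14, 15, 19, 22)
> cor.test(x, y)                    # Pearson's r with its t statistic, df = n - 2, and p-value
> r <- cor(x, y)
> n <- length(x)
> r * sqrt(n - 2) / sqrt(1 - r^2)   # the same t statistic computed from the formula by hand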
When we are unable to assume normality or we have few pairs of observations, Pearson’s r is not an
appropriate method for computing a correlation coefficient. In these cases, the Spearman rank-order
correlation coefficient, or Spearman's rho, is more appropriate.
When calculating Spearman’s rho, each observation for each variable is ranked in order from lowest to
highest. These ranks then replace the values initially assigned to each observation of each variable.
The formula for Spearman's rho is:

$$r_s = 1 - \frac{6\,\Sigma D^2}{N\left(N^2 - 1\right)}$$

where D is the difference between a pair of ranks and N is the number of pairs.
Very often you will have variables in which multiple observations share the same value, so ranks will be
tied. In that case, each tied observation receives the mean of the ranks it would otherwise occupy; refer
to the book for a worked example of how to handle ties.
To calculate D for each pair, we will simply subtract the rank for variable 1 from the rank for variable 2.
Some values of D will be positive and others negative, but all values of D² will, of course, be non-negative.
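A minimal sketch of this calculation in R, using placeholder data with no tied values (so the simple formula applies), could look like:
> x <- c(3, 7, 2, 9, 5)                    # hypothetical data for the first variable
> y <- c(20, 41, 30, 55, 18)               # hypothetical data for the second variable
> D <- rank(x) - rank(y)                   # difference between each pair of ranks
> N <- length(D)
> 1 - (6 * sum(D^2)) / (N * (N^2 - 1))     # Spearman's rho from the formula
> cor(x, y, method = "spearman")           # should agree when there are no ties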
The hypotheses we are testing with Spearman's rho are similar to those we would test using Pearson's r:
H0: ranks are independent in the population from which the sample was taken
H1: ranks are not independent in the population from which the sample was taken
Because this hypothesis test is not parametric, however, we cannot use the same table of critical values
when the sample size is small. If n > 30, we will use the same table that we used for Pearson's
r (Table 4 in Appendix F); if n ≤ 30, you will use Table 5 in Appendix F instead.
As stated earlier, you will want to start by displaying the data in a scatter plot to determine if there
appears to be a linear relationship between the variables.
To actually compute the correlation coefficient, we will use the rcorr() function from the Hmisc package
(loaded with library(Hmisc)):
> rcorr(data$variable, data$variable)
In the top portion of this output, you will see the correlation coefficient for the two variables. Since x has
a perfect correlation with x and y has a perfect correlation with y, we will ignore the correlation
coefficient of 1. We are, however, interested in the correlation coefficient looking at the relationship
between x and y. In the second portion of this output, we see how many observations we have for each
of our variables. Finally, the bottom section in this output is the obtained probability value.
If you want to see the p-value in standard notation instead of scientific notation (if the number is too
small), enter the following into the Console:
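The exact command is not preserved in these notes; one common option (an assumption here, not necessarily the call the text intends) is to raise R's penalty for scientific notation:
> options(scipen = 999)                 # discourage scientific notation in printed output
> format(3.2e-08, scientific = FALSE)   # or convert a single value to standard notation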
Now, suppose you wanted to analyze this same data using Spearman’s rho instead of Pearson’s r. In that
case, simply add an option to the rcorr() function:
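With the same placeholder data and variable names as above, that call would look something like:
> rcorr(data$variable, data$variable, type = "spearman")   # Spearman's rho instead of Pearson's r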
You should carefully select which correlation coefficient to use based on the characteristics of your data.
Linear regression
While it is often useful to examine the relationship between two variables, it is even more helpful to use
what we have learned so far to predict the value of one variable given a value for another variable.
Every line can be defined by two things: where it crosses the y-axis (the y-intercept) and its slope, which is
defined as rise/run, or Δy/Δx. In the case that r = 1, we see that the y-intercept is 14 and the slope of the line is
1. Therefore, this line, which best fits the data, is defined by the equation y = 1x + 14 or, simplified, y = x
+ 14. The idea with this regression line is that it can be used to predict future data points. For instance, if
we know the value for x, we can now predict a value for y, even though that particular pair of x and y
values was never actually observed.
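For example, if a (hypothetical) new observation had x = 10, the line above would predict y = 10 + 14 = 24.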
What defines the line of best fit is that the sum of the squared residuals is minimized: that is, the line of
best fit is the one for which the sum of the squared differences between predicted values (the y-values
that would be calculated from the slope and y-intercept) and observed values (the values of your
dependent variable for each observation) is as close to zero as possible.
Another term for this type of regression is ordinary least squares (OLS) regression. In defining the OLS regression line, we do not
use the variable y; rather, we use the term ŷ to indicate that this is a predicted value and not one that
was actually observed.
The regression line is written as ŷ = b_yx X + a_yx, where b_yx is the slope of the line defined by the
variables y and x and a_yx is the y-intercept of the line defined by the variables y and x.
Like other formulae, the formula for the OLS regression line is similar, but not the same, for the
population (the biased formula) and for samples (the unbiased formula).
When considering the slope of the line, the biased formula (for a population) is:

$$b_{yx} = \rho\,\frac{\sigma_y}{\sigma_x}$$

When considering the slope of the line for a sample, the unbiased formula is:

$$b_{yx} = r\,\frac{s_y}{s_x}$$
Regardless of whether you are calculating the biased or unbiased regression line, it is imperative that
you compute the slope first, as it is needed for the calculation of the y-intercept (for a sample,
a_yx = Ȳ − b_yx X̄).
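A minimal sketch of this sequence for a sample, using placeholder vectors x and y and checking the hand calculation against lm():
> x <- c(2, 4, 5, 7, 9)
> y <- c(10, 14, 15, 19, 22)
> b_yx <- cor(x, y) * sd(y) / sd(x)   # unbiased slope: r multiplied by (s_y / s_x)
> a_yx <- mean(y) - b_yx * mean(x)    # y-intercept, which requires the slope already computed
> c(intercept = a_yx, slope = b_yx)
> coef(lm(y ~ x))                     # R's regression coefficients; should match the hand calculation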
Calculation example
STEP 1: Determine how you are going to look at the relationship between the variables.
Once we choose the formula, we need to determine which is our dependent variable, y, and which is our
independent variable, x. We want to predict the dependent variable from the independent variable.
The greater the absolute value of the correlation coefficient, the better the fit of the regression line.
The coefficient of determination is a goodness-of-fit statistic that describes how well a regression
equation fits a set of data. The coefficient of determination is r² and describes the proportion of the
variance in the dependent variable that is explained by the regression model. In a simple OLS regression
with only one predictor (independent variable), you can calculate the coefficient of determination by
squaring the correlation coefficient. The coefficient of determination ranges between 0 and 1: 0 ≤ r² ≤ 1.
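For example, a correlation of r = 0.70 between two variables corresponds to r² = 0.49, meaning roughly 49% of the variance in the dependent variable is accounted for by the model.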
Another commonly used goodness-of-fit statistic in OLS regression is the standard error of the estimate.
This fit statistic measures the average deviation of predicted values from observed values.
For a sample, the unbiased formulae for this are:

$$s_{est} = \sqrt{\frac{\Sigma\,(Y - \hat{Y})^{2}}{N - 2}} \quad\text{or}\quad s_{est} = s_y\sqrt{\frac{N - 1}{N - 2}\left(1 - r_{yx}^{2}\right)}$$
Computing ordinary least squares regression lines and goodness of fit statistics using R
The first thing we will do is create an object to store the model produced by the regression; the lm()
function stands for linear model. In the second step, we will view a summary of the model we created:
> summary(r1)
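The lm() call itself is not preserved in these notes; a minimal sketch of the full sequence, with r1, data, y, and x as placeholder names, might be:
> r1 <- lm(y ~ x, data = data)   # fit the OLS regression of y on x and store the model object
> summary(r1)                    # coefficients with p-values, R-squared, and residual standard error
> summary(r1)$r.squared          # the coefficient of determination on its own
> summary(r1)$sigma              # the standard error of the estimate (residual standard error)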
There is another hypothesis test related to regression models that is associated with each independent
variable. The p-value associated with this test addresses the following hypotheses:
H0: byx = 0
H1: byx ≠ 0
When the slope of the regression line is 0, the line of best fit is a horizontal line, meaning the independent variable provides no help in predicting the dependent variable.
Diagnostic plots
We always recommend you assess a regression model more fully by evaluating diagnostic plots. After
you produce the model object with the regression, you can use the plot() function to produce these.
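A minimal sketch, assuming the fitted model from the previous section is stored in r1:
> plot(r1)               # press Return in the Console to step through the four diagnostic plots
> par(mfrow = c(2, 2))   # alternatively, show all four plots at once in a 2 x 2 grid
> plot(r1)
> par(mfrow = c(1, 1))   # reset the plotting layout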
You will receive prompts in the Console to enable you to scroll through four diagnostic plots, displayed
in the Plots pane. The first displays residuals versus fitted values. The dotted horizontal line at zero
denotes a perfect fit for each observation. Ideally, you would want to observe the dots scattered randomly
around this zero line, which indicates a linear relationship between the variables and homogeneous
variance. The Normal Q-Q plot is used to help determine whether the standardized residuals come from some
theoretical distribution, in this case the normal distribution. Ideally, you would like to see all
points lying along the dotted line. The third diagnostic plot, the Scale-Location plot, is used to see whether
residuals are dispersed evenly across the range of predicted values. Ideally, we would want the line (which will be
red when you produce it in R) to be fairly horizontal. Finally, the last plot illustrates residuals
versus leverage. This plot helps identify observations that strongly influence the regression model itself.
Observations falling outside of a dotted line (a Cook's distance contour) may be problematic in some way.