Two Quantitative Variables: Scatterplot, Correlation, and Linear Regression

This document discusses analyzing the relationship between two quantitative variables through scatterplots, correlation, and linear regression. It provides examples analyzing the relationship between presidential approval ratings and election margins, cricket chirp rates and temperature, life expectancy and fat consumption, and more. Key points made include that correlation measures the strength and direction of a linear relationship, but does not prove causation, and that outliers can influence correlation. Linear regression fits a line of best fit to predict a response variable based on an explanatory variable.

Uploaded by

brownka5

Example. When a US president runs for re-election, how strong is the relationship between
the president’s approval rating and the outcome of the election? The table below includes
all the presidential elections since 1940 in which an incumbent was running and shows the
presidential approval rating at the time of the election.

Year  Candidate      Approval  Margin  Result
1940  Roosevelt            62    10.0  Won
1948  Truman               50     4.5  Won
1956  Eisenhower           70    15.4  Won
1964  Johnson              67    22.6  Won
1972  Nixon                57    23.2  Won
1976  Ford                 48    -2.1  Lost
1980  Carter               31    -9.7  Lost
1984  Reagan               57    18.2  Won
1992  G. H. W. Bush        39    -5.5  Lost
1996  Clinton              55     8.5  Won
2004  G. W. Bush           49     2.4  Won

1. What was the highest approval rating for any of the losing presidents? What was the
lowest approval rating for any of the winning presidents? Make a conjecture about the
approval rating needed by a sitting president in order to win re-election.

2. Approval rating and margin of victory are both quantitative variables. Does there seem
to be an association between the two variables?
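
The first question can be checked directly from the table. The following is a minimal Python sketch; the tuple layout and the variable names are illustrative choices, not part of the original handout.

```python
# Incumbent elections from the table above:
# (year, candidate, approval rating, margin of victory, result)
elections = [
    (1940, "Roosevelt", 62, 10.0, "Won"),
    (1948, "Truman", 50, 4.5, "Won"),
    (1956, "Eisenhower", 70, 15.4, "Won"),
    (1964, "Johnson", 67, 22.6, "Won"),
    (1972, "Nixon", 57, 23.2, "Won"),
    (1976, "Ford", 48, -2.1, "Lost"),
    (1980, "Carter", 31, -9.7, "Lost"),
    (1984, "Reagan", 57, 18.2, "Won"),
    (1992, "G. H. W. Bush", 39, -5.5, "Lost"),
    (1996, "Clinton", 55, 8.5, "Won"),
    (2004, "G. W. Bush", 49, 2.4, "Won"),
]

losers = [e for e in elections if e[4] == "Lost"]
winners = [e for e in elections if e[4] == "Won"]

# Highest approval among losing incumbents, lowest among winning incumbents
highest_losing = max(losers, key=lambda e: e[2])
lowest_winning = min(winners, key=lambda e: e[2])

print("Highest approval, lost:", highest_losing[1], highest_losing[2])
print("Lowest approval, won: ", lowest_winning[1], lowest_winning[2])
```

In this table the two groups do not overlap: the highest losing approval rating is 48 (Ford) and the lowest winning rating is 49 (G. W. Bush), which suggests a threshold a little under 50 as a starting conjecture.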
Scatterplot
A scatterplot is a graph of the relationship between two quantitative variables.
A scatterplot includes a pair of axes with appropriate numerical scales, one for each variable.
The paired data for each case are plotted as a point on the scatterplot. If there are explanatory
and response variables, we put the explanatory variable on the x-axis and the response
variable on the y-axis.

Example. Draw a scatterplot for the data on approval rating and margin of victory in the
table above.

Interpreting a Scatterplot
When looking at a scatterplot we often address the following questions:

• Do the points form a clear trend with a particular direction, are they more scattered
about a general trend, or is there no obvious pattern?

• If there is a trend, is it generally upward or generally downward as we look from left to
right? A general upward trend is called a positive association while a general downward
trend is called a negative association.

• If there is a trend, does it seem to follow a straight line, which we call a linear association,
or some other curve or pattern?

• Are there any outlier points that are clearly distinct from a general pattern in the data?

Example. Four scatterplots are shown in the figure below. For each pair of variables, discuss
the information contained in the scatterplot. If there appears to be a positive or negative
association, discuss what that means in the specific context.

Summarizing a Relationship between Two Quantitative Variables: Correlation
Just as the mean or median summarizes the center and the standard deviation or IQR measures
the spread of the distribution for a single quantitative variable, we need a numerical
statistic to measure the strength and direction of association between two quantitative
variables. One such statistic is the correlation.

Correlation
The correlation is a measure of the strength and direction of linear association between two
quantitative variables.

Notation for the Correlation


The correlation between two quantitative variables of a sample is denoted r.
The correlation between two quantitative variables of a population is denoted ρ.

The correlations for each of the pairs of variables that have been displayed in scatterplots
earlier in this section are displayed below.

Variable 1         Variable 2            Correlation
Margin of victory  Approval rating              0.86
Average mercury    Acidity                     -0.58
Average mercury    Alkalinity                  -0.59
Alkalinity         Acidity                      0.72
Average mercury    Standardized mercury         0.96

Properties of the Correlation


The sample correlation r has the following properties:

• Correlation is always between -1 and 1, inclusive: −1 ≤ r ≤ 1.

• The sign of r (positive or negative) indicates the direction of association.

• Values of r close to ±1 show a strong linear relationship, while values of r close to 0
show little or no linear relationship.

• The correlation r has no units and is independent of the scale of either variable.

• The correlation is symmetric: The correlation between variables x and y is the same
as the correlation between y and x.

The population correlation ρ also satisfies these properties.
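
Several of these properties can be checked numerically on the approval and margin data from the election table. This is a minimal sketch: the helper name `pearson_r` is hypothetical, and it uses an algebraically equivalent form of the usual correlation formula.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation r between two equal-length lists of numbers."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Approval rating and margin of victory from the election table
approval = [62, 50, 70, 67, 57, 48, 31, 57, 39, 55, 49]
margin = [10.0, 4.5, 15.4, 22.6, 23.2, -2.1, -9.7, 18.2, -5.5, 8.5, 2.4]

r = pearson_r(approval, margin)

# Symmetry: swapping the roles of the two variables leaves r unchanged
r_swapped = pearson_r(margin, approval)

# No units: expressing approval as a proportion (0.62 instead of 62)
# changes the scale of the variable but not the correlation
r_rescaled = pearson_r([a / 100 for a in approval], margin)

print(round(r, 2), round(r_swapped, 2), round(r_rescaled, 2))
```

All three values round to 0.86, the correlation listed in the table above for margin of victory and approval rating.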

Example. Common folk wisdom claims that one can determine the temperature on a summer
evening by counting how fast the crickets are chirping. Is there really an association
between chirp rate and temperature? The data below were collected by E.A. Bessey and
C.A. Bessey, who measured chirp rates for crickets and temperature during the summer of
1898.

Temperature (°F)     54.5  59.5  63.5  67.5  72.0  78.5  83.8
Chirps (per minute)    81    97   103   123   150   182   195

1. Use the scatterplot to estimate the correlation between chirp rate and temperature.
Explain your reasoning.

2. Use technology to find the correlation and use correlation notation.

3. Are chirp rate and temperature associated?

Example. The figure below shows the estimated average life expectancy (in years) for a
sample of 40 countries against the average amount of fat (measured in grams per capita per
day) in the food supply for each country. The scatterplot shows a clear positive association
(r = 0.70) between these two variables. The countries with short life expectancies all have
below-average fat consumption, while the countries consuming more than 100 grams of fat
on average all have life expectancies over 70 years. Does this mean that we should eat more
fat to live longer?

Correlation Caution #1
A strong positive or negative correlation does not (necessarily) imply a cause and effect
relationship between the two variables.

Example. Core body temperature for an individual person tends to fluctuate during the
day according to a regular circadian rhythm. Suppose that the body temperature of an
adult woman is recorded every hour of the day, starting at 6 am. The results are shown in
the figure below. Does there appear to be an association between the time of day and body
temperature? Estimate the correlation between the hour of the day and the woman’s body
temperature.

Correlation Caution #2
A correlation near zero does not (necessarily) mean that the two variables are not associated,
since the correlation measures only the strength of a linear relationship.
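
A small synthetic illustration of this caution (the data below are invented for the purpose and do not come from the handout): when y is exactly x squared over a range of x values symmetric about zero, the two variables are perfectly related, yet the correlation is zero. The helper name `pearson_r` is hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation r between two equal-length lists of numbers."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Synthetic data: y is completely determined by x, but not linearly
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(r)
```

The positive and negative products cancel exactly, so r is zero even though knowing x tells us y perfectly. Only a plot reveals the curved pattern.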

Example. The figure below shows the alcohol consumption (drinks per week) and average
daily caloric intake for 91 subjects who are at least 60 years old, from the data in
NutritionStudy. Notice the distinct outlier who claims to imbibe 203 drinks per week as part of
a 6662 calorie diet! This is almost certainly an incorrect observation. The second plot shows
these same data with the outlier removed. How do you think the correlation between calories
and alcohol consumption changes when the outlier is deleted?

Correlation Caution #3
Correlation can be heavily influenced by outliers. Always plot your data.

A Formula for Correlation

$$ r = \frac{1}{n-1} \sum \left( \frac{x - \bar{x}}{s_x} \right) \left( \frac{y - \bar{y}}{s_y} \right) $$

This formula essentially involves converting all values for both variables to z-scores, which
puts the correlation on a fixed ±1 scale and makes it independent of the scale of measurement.
For a positive association, large values of x tend to occur with large values of y (both z-scores
are positive) and small values (with negative z-scores) tend to occur together. In either case,
the products are positive, which leads to a positive sum. For a negative association, the
z-scores tend to have opposite signs (small x with large y and vice versa) so the products
tend to be negative.
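
The z-score description above can be followed step by step in code. The sketch below applies it to the cricket data from the earlier example, with the values copied as printed in the table. The function name `z_scores` is an illustrative choice, and `statistics.stdev` is used because it computes the sample standard deviation with an n − 1 denominator.

```python
from statistics import mean, stdev

# Cricket data from the earlier example
temperature = [54.5, 59.5, 63.5, 67.5, 72.0, 78.5, 83.8]  # degrees F
chirps = [81, 97, 103, 123, 150, 182, 195]                # per minute

def z_scores(values):
    """Convert each value to a z-score using the sample mean and sample SD."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]

zx = z_scores(temperature)
zy = z_scores(chirps)

# Multiply the paired z-scores, add them up, and divide by n - 1
products = [a * b for a, b in zip(zx, zy)]
n = len(temperature)
r = sum(products) / (n - 1)

for t, c, p in zip(temperature, chirps, products):
    print(f"{t:5.1f}  {c:4d}  product of z-scores = {p:6.3f}")
print(f"r = {r:.3f}")
```

Every product is positive for these data, because below-average temperatures always pair with below-average chirp rates and above-average with above-average. That is why the correlation comes out very close to 1.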

The Regression Line
The process of fitting a line to a set of data is called linear regression, and the line of best
fit is called the regression line. The regression line provides a model of a linear association
between two variables, and we can use the regression line on a scatterplot to give a predicted
value of the response variable, based on a given value of the explanatory variable.

Example. Use the regression line in the figure below to estimate the predicted tip amount
on a $60 bill.

Explanatory and Response Variables
The regression line to predict y from x is NOT the same as the regression line to predict x
from y. Be sure to always pay attention to which is the explanatory variable and which is
the response variable.
A regression line is always in the form

$$ \widehat{\text{Response}} = a + b \cdot \text{Explanatory} $$

The equation of the regression line is often called a prediction equation because we can use
it to make predictions. We substitute the value of the explanatory variable into the prediction
equation to calculate the predicted response.
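
To see that swapping the explanatory and response variables gives a different line, the sketch below fits both directions on the election data. It assumes the standard least-squares formulas for the slope and intercept (the least squares line is defined later in these notes); the function name `least_squares` is hypothetical.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

approval = [62, 50, 70, 67, 57, 48, 31, 57, 39, 55, 49]
margin = [10.0, 4.5, 15.4, 22.6, 23.2, -2.1, -9.7, 18.2, -5.5, 8.5, 2.4]

# Predict Margin from Approval
a_m, b_m = least_squares(approval, margin)
# Predict Approval from Margin
a_a, b_a = least_squares(margin, approval)

print(f"Margin-hat   = {a_m:.1f} + {b_m:.3f} * Approval")
print(f"Approval-hat = {a_a:.1f} + {b_a:.3f} * Margin")

# If the two lines were the same line, b_a would equal 1 / b_m
print(f"1 / b_m = {1 / b_m:.3f}, but b_a = {b_a:.3f}")
```

Solving the first equation for Approval would give a slope of 1 / b_m, which differs from the slope obtained by actually regressing Approval on Margin.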

Example. Three different bill amounts from the RestaurantTips dataset are given. In
each case, use the regression line $\widehat{\text{Tip}} = -0.292 + 0.182 \cdot \text{Bill}$ to predict the tip.

1. A bill of $59.33

2. A bill of $9.52

3. A bill of $23.70
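
The three predictions amount to substituting each bill into the prediction equation. A minimal sketch, with `predicted_tip` as an illustrative function name:

```python
def predicted_tip(bill):
    """Prediction equation from the example: Tip-hat = -0.292 + 0.182 * Bill."""
    return -0.292 + 0.182 * bill

for bill in (59.33, 9.52, 23.70):
    print(f"Bill ${bill:.2f} -> predicted tip ${predicted_tip(bill):.2f}")
```

Rounded to the nearest cent, the predicted tips are $10.51, $1.44, and $4.02.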

Residuals
The residual at a data value is the difference between the observed and predicted values of
the response variable:

Residual = Observed − Predicted = y − ŷ

On a scatterplot, the residual represents the vertical deviation from the line to a data point.
Points above the line will have positive residuals and points below the line will have negative residuals.
If the predicted values closely match the observed data values, the residuals will be small.

Example. In the previous example, we found the predicted tip amount for three different
bills in the RestaurantTips dataset. The actual tips left by each of these customers are
shown below. Use the information to calculate the residuals for each of these sample points.

1. The tip left on a bill of $59.33 was $10.00

2. The tip left on a bill of $9.52 was $1.00

3. The tip left on a bill of $23.70 was $10.00
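
Each residual is the observed tip minus the tip predicted by the RestaurantTips regression line. A short sketch of the arithmetic, with illustrative names:

```python
def predicted_tip(bill):
    """RestaurantTips prediction equation: Tip-hat = -0.292 + 0.182 * Bill."""
    return -0.292 + 0.182 * bill

# (bill, observed tip) for the three customers
observations = [(59.33, 10.00), (9.52, 1.00), (23.70, 10.00)]

# Residual = Observed - Predicted
residuals = [tip - predicted_tip(bill) for bill, tip in observations]

for (bill, tip), res in zip(observations, residuals):
    print(f"Bill ${bill:.2f}: observed ${tip:.2f}, residual {res:+.2f}")
```

The first two customers tipped slightly less than predicted (residuals of about −0.51 and −0.44), while the third tipped far more than predicted (residual of about +5.98).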

Example. The data from ElectionMargin are given below.

Year  Candidate      Approval  Margin  Result
1940  Roosevelt            62    10.0  Won
1948  Truman               50     4.5  Won
1956  Eisenhower           70    15.4  Won
1964  Johnson              67    22.6  Won
1972  Nixon                57    23.2  Won
1976  Ford                 48    -2.1  Lost
1980  Carter               31    -9.7  Lost
1984  Reagan               57    18.2  Won
1992  G. H. W. Bush        39    -5.5  Lost
1996  Clinton              55     8.5  Won
2004  G. W. Bush           49     2.4  Won

1. The regression line for these 11 data points is

$$ \widehat{\text{Margin}} = -36.5 + 0.836 \cdot \text{Approval} $$

Calculate the predicted values and the residuals for all the data points.

2. Which residual is the largest? For this largest residual, is the observed margin higher
or lower than the margin predicted by the regression line? To which president and year
does this residual correspond?
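
Both parts of this example can be worked through with a short loop. The sketch below uses the regression line given in part 1. The variable names are illustrative, and "largest" is taken to mean largest in absolute value.

```python
# (year, candidate, approval, margin) from the ElectionMargin table
elections = [
    (1940, "Roosevelt", 62, 10.0),
    (1948, "Truman", 50, 4.5),
    (1956, "Eisenhower", 70, 15.4),
    (1964, "Johnson", 67, 22.6),
    (1972, "Nixon", 57, 23.2),
    (1976, "Ford", 48, -2.1),
    (1980, "Carter", 31, -9.7),
    (1984, "Reagan", 57, 18.2),
    (1992, "G. H. W. Bush", 39, -5.5),
    (1996, "Clinton", 55, 8.5),
    (2004, "G. W. Bush", 49, 2.4),
]

def predicted_margin(approval):
    """Regression line from part 1: Margin-hat = -36.5 + 0.836 * Approval."""
    return -36.5 + 0.836 * approval

rows = []
for year, name, approval, margin in elections:
    pred = predicted_margin(approval)
    rows.append((year, name, pred, margin - pred))  # residual = observed - predicted
    print(f"{year}  {name:<14} predicted {pred:7.2f}  residual {margin - pred:7.2f}")

# Residual with the largest absolute value
largest = max(rows, key=lambda row: abs(row[3]))
print("Largest residual:", largest[0], largest[1], round(largest[3], 2))
```

The largest residual belongs to Nixon in 1972: the observed margin of 23.2 is about 12 points higher than the 11.15 predicted from an approval rating of 57.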

Least Squares Line


The least squares line, also called the line of best fit, is the line which minimizes the
sum of the squared residuals, $\sum (y - \hat{y})^2$.
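
The minimizing property can be checked numerically. The sketch below fits the election data with the standard least-squares formulas (an assumption, since the handout leaves the computation to technology) and compares the sum of squared residuals with that of nearby lines. The names `sse` and `least_squares` are hypothetical.

```python
def sse(a, b, xs, ys):
    """Sum of squared residuals for the line y-hat = a + b * x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def least_squares(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

approval = [62, 50, 70, 67, 57, 48, 31, 57, 39, 55, 49]
margin = [10.0, 4.5, 15.4, 22.6, 23.2, -2.1, -9.7, 18.2, -5.5, 8.5, 2.4]

a, b = least_squares(approval, margin)
best = sse(a, b, approval, margin)

# Nudge the intercept and slope: every nearby line has a larger SSE
nudged = [
    sse(a + da, b + db, approval, margin)
    for da in (-1.0, 0.0, 1.0)
    for db in (-0.05, 0.0, 0.05)
    if (da, db) != (0.0, 0.0)
]

print(f"Fitted line: {a:.1f} + {b:.3f} * Approval, SSE = {best:.1f}")
print(f"Smallest SSE among nudged lines = {min(nudged):.1f}")
```

The fitted intercept and slope agree, after rounding, with the line −36.5 + 0.836 · Approval used in the election example, and no nudged line does better.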

Interpreting the Slope and Intercept of the Regression Line


For the regression line ŷ = a + bx,

• The slope b represents the predicted change in the response variable y given a one unit
increase in the explanatory variable x.

• The intercept a represents the predicted value of the response variable y if the explanatory
variable x is zero. The interpretation may be nonsensical since it is often not
reasonable for the explanatory variable to be zero.

Example. For the RestaurantTips regression line $\widehat{\text{Tip}} = -0.292 + 0.182 \cdot \text{Bill}$, interpret
the slope and the intercept in context.

Example. In an earlier example, we considered some scatterplots from the dataset
FloridaLakes showing relationships between acidity, alkalinity, and fish mercury levels in n = 53
Florida lakes. We wish to predict a quantity that is difficult to measure (mercury level of
fish) using a value that is more easily obtained from a water sample (acidity). We saw that
there appears to be a negative linear association between these two variables, so a regression
line is appropriate.

1. Use technology to find the regression line to predict Mercury from pH, and plot it on
a scatterplot of the data.

2. Interpret the slope of the regression line in the context of Florida lakes.

Regression Caution #1
Avoid trying to apply a regression line to predict values far from those that were used to
create it.

Example. In the previous example, we used the acidity (pH) of Florida lakes to predict
mercury levels in fish. Suppose that, instead of mercury, we use acidity to predict the calcium
concentration (mg/l) in Florida lakes. The figure below shows a scatterplot of these data
with the regression line $\widehat{\text{Calcium}} = -51.4 + 11.17 \cdot \text{pH}$ for the 53 lakes in our sample. Give an
interpretation for the slope in this situation. Does the intercept make sense? Comment on
how well the linear prediction equation describes the relationship between these two variables.
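
Evaluating the prediction equation at a few pH values shows why the intercept cannot be taken literally. This is a minimal sketch: the pH values are chosen only for illustration, and `predicted_calcium` is a hypothetical name.

```python
def predicted_calcium(ph):
    """Prediction equation from the example: Calcium-hat = -51.4 + 11.17 * pH."""
    return -51.4 + 11.17 * ph

for ph in (0, 3, 5, 7, 9):
    print(f"pH {ph}: predicted calcium {predicted_calcium(ph):7.2f} mg/l")

# pH at which the predicted concentration crosses zero
zero_crossing = 51.4 / 11.17
print(f"Predicted calcium is negative below pH {zero_crossing:.2f}")
```

The slope says that each additional unit of pH is associated with a predicted increase of 11.17 mg/l in calcium. The intercept, −51.4 mg/l at a pH of 0, is a negative concentration and makes no physical sense. The line in fact predicts negative calcium for any pH below about 4.6, which is a reminder not to use it far from the data that produced it.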

Regression Caution #2
Always plot the data. Although the regression line can be calculated for any set of paired
quantitative variables, it is only appropriate to use a regression line when there is a linear
trend in the data.

Regression Caution #3
Outliers can have a strong influence on the regression line, just as we saw for correlation. In
particular, data points for which the explanatory value is an outlier are often called influential
points because they exert an overly strong effect on the regression line.
