
Regression Analysis
AN INTUITIVE GUIDE FOR USING
AND INTERPRETING LINEAR MODELS

Jim Frost
Copyright © 2019 by Jim Frost.

All rights reserved. No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the publisher, except in the case of brief quotations embodied in critical reviews and certain other noncommercial uses permitted by copyright law.

To contact the author, please email: statisticsbyjim@gmail.com.

Visit the author’s website at statisticsbyjim.com.

Ordering Information:

Quantity sales. Special discounts are available on quantity purchases by educators. For details, contact the email address above.

Regression Analysis / Jim Frost. —1st ed.

Contents

My Approach to Teaching Regression and Statistics
Correlation and an Introduction to Regression
Graph Your Data to Find Correlations
Interpret the Pearson’s Correlation Coefficient
Graphs for Different Correlations
Discussion about the Correlation Scatterplots
Pearson’s Correlation Coefficient Measures Linear Relationships
Hypothesis Test for Correlations
Interpreting our Height and Weight Correlation Example
Correlation Does Not Imply Causation
How Strong of a Correlation is Considered Good?
Common Themes with Regression
Taking Correlation to the Next Level with Regression
Fundamental Terms and Goals of Regression
Regression Analyzes a Wide Variety of Relationships
Using Regression to Control Independent Variables
An Introduction to Regression Output
Review and Next Steps
Regression Basics and How it Works
Data Considerations for OLS
How OLS Fits the Best Line
Implications of Minimizing SSE
Other Types of Sums of Squares
Displaying a Regression Model on a Fitted Line Plot
Importance of Staying Close to Your Data
Review and Next Steps

The Chapters/Sections below are only in the full ebook.


To Carmen and Morgan who made this book possible through
their encouragement and support.
The best thing about being a statistician is that you get to play
in everyone’s backyard.

―John Tukey
INTRODUCTION

My Approach to
Teaching Regression and
Statistics

NOTE: This sample contains only the introduction and first two chap-
ters. Please buy the full ebook for all the content listed in the Table of
Contents. You can buy it in My Store.

I love statistics and analyzing data! I also love talking and writing
about it. I was a researcher at a major university. Then, I spent over a
decade working at a major statistical software company. During my
time at the statistical software company, I learned how to present sta-
tistics in a manner that makes it more intuitive. I want you to under-
stand the essential concepts, practices, and knowledge for regression
analysis so you can analyze your data confidently. That’s the goal of
my book.

In this book, you’ll learn many facets of regression analysis, including the following:

• How regression works and when to use it.


• Selecting the correct type of regression analysis.
• Specifying the best model.
• Interpreting the results.
• Assessing the fit of the model.
• Generating predictions and evaluating their precision.
• Checking the assumptions.
• Examples of different types of regression analyses.

I’ll help you intuitively understand regression analysis by focusing on concepts and graphs rather than equations and formulas. I use regular,
everyday language so you can grasp the fundamentals of regression
analysis at a deeper level. I’ll provide practical tips for performing
your analysis. You will learn how to interpret the results while being
confident that you’re conducting the analysis correctly. You’ll be able
to trust your results because you’ll know that you’re performing re-
gression properly and know how to detect and correct problems.

Regardless of your background, I will take you through how to perform regression analysis. Students, career changers, and even current
analysts looking to take your skills to the next level, this book has ab-
solutely everything you need to know for regression analysis.

I've literally received thousands of requests from aspiring data scientists for guidance in performing regression analysis. This book is my answer: years of knowledge and thousands of hours of hard work dis-
tilled into a thorough, practical guide for performing regression anal-
ysis.

You’ll notice that there are not many equations in this book. After all, you should let your statistical software handle the calculations so you don’t get bogged down in the math and can focus on understanding your results. Instead, I emphasize the concepts and practices that you’ll need to know to perform the analysis and interpret the results correctly. I’ll use more graphs than equations!


Don’t get me wrong. Equations are important. Equations are the framework that makes the magic, but the truly fascinating aspects are
what it all means. I want you to learn the true essence of regression
analysis. If you need the equations, you’ll find them in most textbooks.

Please note that throughout this book I use Minitab statistical soft-
ware. However, this book is not about teaching particular software but
rather how to perform regression analysis. All common statistical
software packages should be able to perform the analyses that I show.
There is nothing in here that is unique to Minitab.

CHAPTER 1

Correlation and an
Introduction to
Regression

Before we tackle regression analysis, we need to understand correlation. In fact, I’ve described regression analysis as taking correlation to
the next level! Many of the practices and concepts surrounding corre-
lation also apply to regression analysis. It’s also a simpler analysis that
is a more familiar subject for many. Bear with me because the corre-
lation topics in this section apply to regression analysis as well. It’s a
great place to start!

A correlation between variables indicates that as one variable changes in value, the other variable tends to change in a specific direction. Un-
derstanding that relationship is useful because we can use the value of
one variable to predict the value of the other variable. For example,
height and weight are correlated—as height increases, weight also
tends to increase. Consequently, if we observe an individual who is
unusually tall, we can predict that his weight is also above the average.
In statistics, correlation is a quantitative assessment that measures
both the direction and the strength of this tendency to vary together.


There are different types of correlation that you can use for different
kinds of data. In this chapter, I cover the most common type of corre-
lation—Pearson’s correlation coefficient.

Before we get into the numbers, let’s graph some data first so we can
understand the concept behind what we are measuring.

Graph Your Data to Find Correlations


Scatterplots are a great way to check quickly for relationships be-
tween pairs of continuous data. The scatterplot below displays the
height and weight of pre-teenage girls. Each dot on the graph repre-
sents an individual girl and her combination of height and weight.
These data are real data that I collected during an experiment. We’ll
return to this dataset multiple times throughout this book. Here is the
CSV dataset if you want to try it yourself: HeightWeight.

At a glance, you can see that there is a relationship between height and
weight. As height increases, weight also tends to increase. However,
it’s not a perfect relationship. If you look at a specific height, say 1.5
meters, you can see that there is a range of weights associated with it.
You can also find short people who weigh more than taller people.


However, the general tendency that height and weight increase to-
gether is unquestionably present.

Pearson’s correlation takes all of the data points on this graph and rep-
resents them with a single summary statistic. In this case, the statisti-
cal output below indicates that the correlation is 0.705.

What do the correlation and p-value mean? We’ll interpret the output
soon. First, let’s look at a range of possible correlation values so we
can understand how our height and weight example fits in.

Interpret the Pearson’s Correlation Coefficient


Pearson’s correlation coefficient is represented by the Greek letter
rho (ρ) for the population parameter and r for a sample statistic. This
coefficient is a single number that measures both the strength and di-
rection of the linear relationship between two continuous variables.
Values can range from -1 to +1.

• Strength: The greater the absolute value of the coefficient, the stronger the relationship.
o The extreme values of -1 and 1 indicate a perfectly linear
relationship where a change in one variable is accompa-
nied by a perfectly consistent change in the other. For
these relationships, all of the data points fall on a line. In
practice, you won’t see either type of perfect relationship.
o A coefficient of zero represents no linear relationship. As
one variable increases, there is no tendency in the other
variable to either increase or decrease.
o When the value is between 0 and +1 or -1, there is a relationship, but the points don’t all fall on a line. As r approaches -1 or +1, the strength of the relationship increases and the data points tend to fall closer to a line.
• Direction: The coefficient sign represents the direction of the re-
lationship.
o Positive coefficients indicate that when the value of one
variable increases, the value of the other variable also
tends to increase. Positive relationships produce an up-
ward slope on a scatterplot.
o Negative coefficients indicate that when the value of one variable increases, the value of the other variable tends to decrease. Negative relationships produce a downward slope.

Examples of Positive and Negative Correlations


An example of a positive correlation is the relationship between the
speed of a wind turbine and the amount of energy it produces. As the
turbine speed increases, electricity production also increases.

An example of a negative correlation is the relationship between outdoor temperature and heating costs. As the temperature increases,
heating costs decrease.

Graphs for Different Correlations


Graphs always help bring concepts to life. The scatterplots below rep-
resent a spectrum of different relationships. I’ve held the horizontal
and vertical scales of the scatterplots constant to allow for valid com-
parisons between them.


Correlation = +1: A perfect positive relationship.

Correlation = 0.8: A fairly strong positive relationship.

Correlation = 0.6: A moderate positive relationship.


Correlation = 0: No relationship. As one value increases, there is no tendency for the other value to change in a specific direction.

Correlation = -1: A perfect negative relationship.

Correlation = -0.8: A fairly strong negative relationship.


Correlation = -0.6: A moderate negative relationship.

Discussion about the Correlation Scatterplots


For the scatterplots above, I created one positive relationship between
the variables and one negative relationship between the variables.
Then, I varied only the amount of dispersion between the data points
and the line that defines the relationship. That process illustrates how
correlation measures the strength of the relationship. The stronger
the relationship, the closer the data points fall to the line. I didn’t in-
clude plots for weaker correlations that are closer to zero than 0.6 and
-0.6 because they start to look like blobs of dots and it’s hard to see
the relationship.

A common misinterpretation is that a negative correlation coefficient indicates there is no relationship between a pair of variables. After all,
a negative correlation sounds suspiciously like no relationship. How-
ever, the scatterplots for the negative correlations display real rela-
tionships. For negative relationships, high values of one variable are
associated with low values of another variable. For example, there is
a negative correlation between school absences and grades. As the
number of absences increases, the grades decrease.

Earlier I mentioned how crucial it is to graph your data to understand them better. However, a quantitative assessment of the relationship does have an advantage. Graphs are a great way to visualize the data, but the scaling can exaggerate or weaken the appearance of a relationship. Additionally, the automatic scaling in most statistical software tends to make all data look similar.

Fortunately, Pearson’s correlation coefficient is unaffected by scaling issues. Consequently, a statistical assessment is better for determining
the precise strength of the relationship.

Graphs and the relevant statistical measures often work better in tan-
dem.

Pearson’s Correlation Coefficient Measures Linear Relationships
Pearson’s correlation measures only linear relationships. Conse-
quently, if your data contain a curvilinear relationship, the correlation
coefficient will not detect it. For example, the correlation for the data
in the scatterplot below is zero. However, there is a relationship be-
tween the two variables—it’s just not linear.

This example illustrates another reason to graph your data! Just be-
cause the coefficient is near zero, it doesn’t necessarily indicate that
there is no relationship.
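If you’d like to see this numerically, here’s a minimal sketch in Python (the book itself uses Minitab; the data here are synthetic, not the dataset behind the scatterplot) showing that Pearson’s correlation is essentially zero for a perfectly curved relationship:

```python
import numpy as np
from scipy import stats

# Synthetic curvilinear data: y depends on x exactly, but not linearly.
x = np.linspace(-3, 3, 100)
y = x ** 2  # a perfect U-shaped relationship

r, p = stats.pearsonr(x, y)
print(f"Pearson's r = {r:.3f}")  # approximately 0 despite a perfect relationship
```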


Hypothesis Test for Correlations


Correlations have a hypothesis test. As with any hypothesis test, this
test takes sample data and evaluates two mutually exclusive state-
ments about the population from which the sample was drawn. For
Pearson correlations, the two hypotheses are the following:

• Null hypothesis: There is no linear relationship between the two variables. ρ = 0.
• Alternative hypothesis: There is a linear relationship be-
tween the two variables. ρ ≠ 0.

A correlation of zero indicates that no linear relationship exists. If your p-value is less than your significance level, the sample contains
sufficient evidence to reject the null hypothesis and conclude that the
correlation does not equal zero. In other words, the sample data sup-
port the notion that the relationship exists in the population.
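Here’s what that test looks like in practice. This is a minimal Python sketch with made-up height and weight values, not the book’s actual dataset:

```python
import numpy as np
from scipy import stats

# Hypothetical height (m) and weight (kg) measurements, illustrative only.
height = np.array([1.40, 1.45, 1.50, 1.52, 1.55, 1.60, 1.63, 1.68, 1.70, 1.75])
weight = np.array([38.0, 42.5, 44.0, 47.0, 46.5, 52.0, 55.5, 57.0, 61.0, 64.5])

r, p_value = stats.pearsonr(height, weight)
print(f"r = {r:.3f}, p-value = {p_value:.4f}")

alpha = 0.05  # significance level
if p_value < alpha:
    print("Reject the null hypothesis: a linear relationship likely exists.")
else:
    print("Insufficient evidence of a linear relationship.")
```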

Interpreting our Height and Weight Correlation Example
Now that we have seen a range of positive and negative relationships,
let’s see how our correlation of 0.705 fits in. We know that it’s a pos-
itive relationship. As height increases, weight tends to increase. Re-
garding the strength of the relationship, the graph shows that it’s not
a very strong relationship where the data points tightly hug a line.
However, it’s not an entirely amorphous blob with a very low corre-
lation. It’s somewhere in between. That description matches our mod-
erate correlation of 0.705.

For the hypothesis test, our p-value equals 0.000. This p-value is less
than any reasonable significance level. Consequently, we can reject
the null hypothesis and conclude that the relationship is statistically
significant. The sample data provide sufficient evidence to conclude
that the relationship between height and weight exists in the popula-
tion of preteen girls.


Correlation Does Not Imply Causation


I’m sure you’ve heard this expression before, and it is a crucial warn-
ing. Correlation between two variables indicates that changes in one
variable are associated with changes in the other variable. However,
correlation does not mean that the changes in one variable actually
cause the changes in the other variable.

Sometimes it is clear that there is a causal relationship. For the height and weight data, it makes sense that adding more vertical structure to
a body causes the total mass to increase. Or, increasing the wattage of
lightbulbs causes the light output to increase.

However, in other cases, a causal relationship is not possible. For example, ice cream sales and shark attacks are positively correlated.
Clearly, selling more ice cream does not cause shark attacks (or vice
versa). Instead, a third variable, outdoor temperatures, causes changes
in the other two variables. Higher temperatures increase both sales of
ice cream and the number of swimmers in the ocean, which creates
the apparent relationship between ice cream sales and shark attacks.

In statistics, you typically need to perform a randomized, controlled experiment to determine that a relationship is causal rather than merely correlational.

How Strong of a Correlation is Considered Good?


What is a good correlation? How high should it be? These are com-
monly asked questions. I have seen several schemes that attempt to
classify correlations as strong, medium, and weak.

However, there is only one correct answer. The correlation coefficient should accurately reflect the strength of the relationship. Take a
look at the correlation between the height and weight data, 0.705. It’s
not a very strong relationship, but it accurately represents our data.


An accurate representation is the best-case scenario for using a statistic to describe an entire dataset.

The strength of any relationship naturally depends on the specific pair of variables. Some research questions involve weaker relationships
than other subject areas. Case in point, humans are hard to predict.
Studies that assess relationships involving human behavior tend to
have correlations weaker than +/- 0.6.

However, if you analyze two variables in a physical process and have very precise measurements, you might expect correlations near +1 or -1. There is no one-size-fits-all answer for how strong a relationship should be. The correct correlation value depends on your study area. We run into this same issue in regression analysis.

Common Themes with Regression


Understanding correlation is a good place to start learning regression.
In fact, there are several themes that I touch upon in this section that
show up throughout this book.

For instance, analysts naturally want to fit models that explain more
and more of the variability in the data. And, they come up with classi-
fication schemes for how well the model fits the data. However, there
is a natural amount of variability that the model can’t explain just as
there was in the height and weight correlation example. Regression
models can be forced to go past this natural boundary, but bad things
happen. Throughout this book, be aware of the tension between trying
to explain as much variability as possible and ensuring that you don’t
go too far. This issue pops up multiple times!

Additionally, for regression analysis, you’ll need to use statistical measures in conjunction with graphs just like we did with correlation.
This combination provides you the best understanding of your data
and the analytical results.


Taking Correlation to the Next Level with Regression
Wouldn’t it be nice if instead of just describing the strength of the
relationship between height and weight, we could define the relation-
ship itself using an equation? Regression analysis does just that by
finding the line and corresponding equation that provides the best fit
to our dataset. We can use that equation to understand how much
weight increases with each additional unit of height and to make pre-
dictions for specific heights.

Regression analysis allows us to expand on correlation in other ways. If we have more variables that explain changes in weight, we can in-
clude them in the model and potentially improve our predictions.
And, if the relationship is curved, we can still fit a regression model to
the data.

Additionally, a form of the Pearson correlation coefficient shows up in regression analysis. R-squared is a primary measure of how well a
regression model fits the data. This statistic represents the percentage
of variation in one variable that other variables explain. For a pair of
variables, R-squared is simply the square of the Pearson’s correlation
coefficient. For example, squaring the height-weight correlation coef-
ficient of 0.705 produces an R-squared of 0.497, or 49.7%. In other
words, height explains about half the variability of weight in preteen
girls.

But we’re getting ahead of ourselves. I’ll cover R-squared in much more detail in both chapters 2 and 4.

Fundamental Terms and Goals of Regression


The first questions you have are probably: When should I use regres-
sion analysis? And, why? Let’s dig right into these questions! In this
section, I explain the capabilities of regression analysis, the types of
relationships it can assess, how it controls the variables, and generally why I love it! You’ll learn when you should consider using regression analysis.

As a statistician, I should probably tell you that I love all statistical analyses equally—like parents with their kids. But, shhh, I have a secret!
Regression analysis is my favorite because it provides tremendous
flexibility and it is useful in so many different circumstances.

You might run across unfamiliar terms. Don’t worry. I’ll cover all of
them throughout this book! The upcoming section provides a preview
of things you’ll learn later in the book. For now, let’s define several
basics—the fundamental types of variables that you’ll include in your
regression analysis and your primary goals for using regression analy-
sis.

Dependent Variables
The dependent variable is a variable that you want to explain or pre-
dict using the model. The values of this variable depend on other vari-
ables. It’s also known as the response variable, outcome variable, and
it is commonly denoted using a Y. Traditionally, analysts graph dependent variables on the vertical, or Y, axis.

Independent Variables
Independent variables are the variables that you include in the model
to explain or predict changes in the dependent variable. In controlled
experiments, independent variables are systematically set and
changed by the researchers. However, in observational studies, values
of the independent variables are not set by researchers but rather ob-
served. These variables are also known as predictor variables, input
variables, and are commonly denoted using Xs. On graphs, analysts
place independent variables on the horizontal, or X, axis.

Simple versus Multiple Regression


When you include one independent variable in the model, you are
performing simple regression. For more than one independent variable, it is multiple regression. Despite the different names, it’s really the same analysis with the same interpretations and assumptions.

Goals of Regression Analysis


Regression analysis mathematically describes the relationships be-
tween independent variables and a dependent variable. Use regres-
sion for two primary goals:

• To understand the relationships between these variables. How do changes in the independent variables relate to changes in the dependent variable?
• To predict the dependent variable by entering values for the
independent variables into the regression equation.

Example of a Regression Analysis


Suppose a researcher studies the relationship between wattage and
the output from a light bulb. In this study, light output is the depend-
ent variable because it depends on the wattage. Wattage is the inde-
pendent variable.

After performing the regression analysis, the researcher will understand the nature of the relationship between these two variables. Is
this relationship statistically significant? What effect does wattage
have on light output? For a given wattage, how much light output does
the model predict?

Specifically, the regression equation describes the mean change in light output for every increase of one watt. P-values indicate whether
the relationship is statistically significant. And, the researcher can en-
ter wattage values into the equation to predict light output.
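Once you have a fitted equation, prediction is simple arithmetic. In the sketch below, the intercept and slope are invented for illustration because the study is hypothetical:

```python
# Hypothetical fitted equation: light output (lumens) = b0 + b1 * wattage.
b0, b1 = 50.0, 11.5  # made-up intercept and slope, for illustration only

def predicted_light_output(watts: float) -> float:
    """Return the model's predicted mean light output for a given wattage."""
    return b0 + b1 * watts

for watts in (40, 60, 100):
    print(f"{watts} W -> {predicted_light_output(watts):.0f} lumens (predicted mean)")
```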

Regression Analyzes a Wide Variety of Relationships
Use regression analysis to describe the relationships between a set of independent variables and the dependent variable. Regression analysis produces a regression equation where the coefficients represent the relationship between each independent variable and the dependent variable. You can also use the equation to make predictions.

Regression analysis can handle many things. For example, you can use
regression analysis to do the following:

• Model multiple independent variables
• Include continuous and categorical variables
• Model linear and curvilinear relationships
• Assess interaction terms to determine whether the effect of
one independent variable depends on the value of another
variable

These capabilities are all cool, but they don’t include an almost magi-
cal ability. Regression analysis can unscramble very intricate prob-
lems where the variables are entangled like spaghetti. For example,
imagine you’re a researcher studying any of the following:

• Do socio-economic status and race affect educational achievement?
• Do education and IQ affect earnings?
• Do exercise habits and diet affect weight?
• Are drinking coffee and smoking cigarettes related to mortal-
ity risk?
• Does a particular exercise intervention have an impact on
bone density that is a distinct effect from other physical ac-
tivities?

More on the last two examples later!

All these research questions have entwined independent variables that can influence the dependent variables. How do you untangle a
web of related variables? Which variables are statistically significant
and what role does each one play? Regression comes to the rescue be-
cause you can use it for all of these scenarios!


Using Regression to Control Independent Variables


As I mentioned, regression analysis describes how the changes in each
independent variable are related to changes in the dependent variable.
Crucially, regression also statistically controls every variable in your
model.

What does controlling for a variable mean?


Typically, research studies need to isolate the role of each variable
they are assessing. For example, I participated in an exercise interven-
tion study where our goal was to determine whether the exercise in-
tervention increased the subjects’ bone mineral density. We needed
to isolate the role of the exercise intervention from everything else
that can impact bone mineral density, which ranges from diet to other
physical activity.

Regression analysis does this by estimating the effect that changing one independent variable has on the dependent variable while holding
all the other independent variables constant. This process allows you
to understand the role of each independent variable without worrying
about the other variables in the model. Again, you want to isolate the
effect of each variable.

How do you control the other variables in regression?


A beautiful aspect of regression analysis is that you hold the other in-
dependent variables constant by merely including them in your
model! Let’s look at this in action with an example.

A recent study analyzed the effect of coffee consumption on mortality. The first results indicated that higher coffee intake is related to a
higher risk of death. However, coffee drinkers frequently smoke, and
the researchers did not include smoking in their initial model. After
they included smoking in the model, the regression results indicated
that coffee intake lowers the risk of mortality while smoking increases
it. This model isolates the role of each variable while holding the other variable constant. You can assess the effect of coffee intake while controlling for smoking. Conveniently, you’re also controlling for coffee intake when looking at the effect of smoking.

Note that the study also illustrates how excluding a relevant variable
can produce misleading results. Omitting an important variable causes
it to be uncontrolled, and it can bias the results for the variables that
you do include in the model. In the example above, the first model
without smoking could not control for this important variable, which
forced the model to include the effect of smoking in another variable
(coffee consumption).
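You can watch this bias appear with simulated data. In the sketch below (Python with statsmodels), all the numbers are fabricated to mimic the coffee-and-smoking story; nothing here comes from the actual study. Omitting smoking flips the apparent sign of the coffee effect:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 1000

# Simulated predictors: coffee drinkers are more likely to smoke.
coffee = rng.normal(2, 1, n)                  # cups per day (fabricated)
smoking = 0.8 * coffee + rng.normal(0, 1, n)  # correlated with coffee

# True model: smoking is harmful, coffee is mildly protective.
mortality_risk = 5 + 2.0 * smoking - 0.5 * coffee + rng.normal(0, 1, n)

# Model 1: omit smoking. Coffee wrongly looks harmful (positive coefficient).
m1 = sm.OLS(mortality_risk, sm.add_constant(coffee)).fit()
print("Coffee coefficient without smoking in the model:", round(m1.params[1], 2))

# Model 2: include smoking. Coffee's true (negative) effect emerges.
X = sm.add_constant(np.column_stack([coffee, smoking]))
m2 = sm.OLS(mortality_risk, X).fit()
print("Coffee coefficient with smoking in the model:  ", round(m2.params[1], 2))
```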

This warning is particularly applicable for observational studies where the effects of omitted variables might be unbalanced. On the other
hand, the randomization process in a true experiment tends to distrib-
ute the effects of these variables equally, which lessens omitted vari-
able bias. You’ll learn about this form of bias in detail in chapter 7.

An Introduction to Regression Output


It’s time to get our feet wet and interpret regression output. The best
way to understand the value of regression analysis is to see an exam-
ple. In Chapter 3, I cover all of these statistics in much greater detail.
For now, you just need to understand the type of information that re-
gression analysis provides.

P-values and coefficients are the key regression output. Collectively, these statistics indicate whether the variables are statistically signifi-
cant and describe the relationships between the independent varia-
bles and the dependent variable.

Low p-values (typically < 0.05) indicate that the independent variable
is statistically significant. Regression analysis is a form of inferential
statistics. Consequently, the p-values help determine whether the re-
lationships that you observe in your sample also exist in the larger
population.


The coefficients for the independent variables represent the average change in the dependent variable given a one-unit change in the inde-
pendent variable (IV) while controlling the other IVs.

For instance, if your dependent variable is income and your independent variables include IQ and education (among other relevant variables), you might see output like this:

The low p-values indicate that both education and IQ are statistically
significant. The coefficient for IQ (4.796) indicates that each addi-
tional IQ point increases your income by an average of approximately
$4.80 while controlling everything else in the model. Furthermore,
the education coefficient (24.215) indicates that an additional year of
education increases average earnings by $24.22 while holding the
other variables constant.
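If you want to produce output like this yourself, here’s a hedged sketch using Python’s statsmodels. The data are fabricated, so the numbers won’t match the example above; the point is the structure of the output:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500

# Fabricated data for illustration only.
df = pd.DataFrame({
    "iq": rng.normal(100, 15, n),
    "education": rng.normal(14, 2, n),  # years of schooling
})
df["income"] = 100 + 4.8 * df["iq"] + 24.2 * df["education"] + rng.normal(0, 50, n)

model = smf.ols("income ~ iq + education", data=df).fit()
print(model.summary())  # coefficients, p-values, R-squared, and more
```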

Using regression analysis gives you the ability to separate the effects
of complicated research questions. You can disentangle the spaghetti
noodles by modeling and controlling all relevant variables, and then
assess the role that each one plays.

We’ll cover how to interpret regression analysis in much more detail in later chapters!

Review and Next Steps


In this chapter, we covered correlation between variables because it’s such a good lead-in for regression. Correlation provides you with a look at some of the fundamental issues we’ll address in regression analysis itself—different types of trends in the data and the variability around those trends.

Then, you learned about regression’s fundamental goals, its capabilities, and why you’d use it for your study. You can use regression mod-
els to describe the relationship between each independent variable
and the dependent variable. You can also enter values into the regres-
sion equation to predict the mean of the dependent variable. We even
took a quick peek at some example regression output and interpreted
it.

Finally, we saw how regression analysis controls, or holds constant, all the variables you include in the model. This feature allows you to
isolate the role of each independent variable.

This chapter serves as an introduction to all the above. We’ll revisit all these concepts throughout this book. Next, you’ll learn how least
squares regression fits the best line through a dataset.

CHAPTER 2

Regression Basics and How it Works

There are many different types of regression analysis procedures. This book focuses on linear regression analysis, specifically ordinary
least squares (OLS). Analysts use this type most frequently. Typically,
they’ll look towards least squares regression first, and then use other
types only when there are issues that prevent them from using OLS.

Even when you need to use a different variety of regression, understanding linear regression is crucial. Much of the knowledge about fit-
ting models, interpreting the results, and checking assumptions for
linear models that you will learn throughout this book also apply in
some fashion to other types of regression analysis. In short, this book
provides a broad foundation on the core type of regression, and it’s
also informative about using more specialized types of regression.

In later chapters, we’ll cover possible reasons for using other kinds of
regression analysis. I’ll ensure that you know when you should con-
sider a specialized type of analysis, and give you pointers about which
alternatives to consider for various issues.


We’ll start by covering some basic data requirements. Don’t confuse these with the analysis assumptions. I discuss those in chapter 9.
These data requirements help ensure that you are putting good data
into the analysis. You know that old expression, “garbage in, garbage
out?” Let’s avoid that!

Data Considerations for OLS


To help ensure that your results are valid for OLS linear regression,
consider the following principles while collecting data, performing
the analysis, and interpreting the results.

The independent variables can be either continuous or categorical.

• Continuous variables can take on almost any numeric value and can be meaningfully divided into smaller increments, in-
cluding fractional and decimal values. You often measure a
continuous variable on a scale. For example, when you meas-
ure height, weight, and temperature, you have continuous
data.
• Categorical variables have values that you can put into a
countable number of distinct groups based on a characteristic.
Categorical variables are also called qualitative variables or at-
tribute variables. For example, college major is a categorical
variable that can have values such as psychology, political sci-
ence, engineering, biology, etc.

The dependent variable should be continuous. If it’s not continuous, you will most likely need to use a different type of regression analysis
(chapter 12) because your model is unlikely to satisfy the OLS as-
sumptions and can produce results that you can’t trust.

Use best practices while collecting your data. The following are some
points to consider:


• Confirm that the data represent your population of interest.
• Collect a sufficient amount of data that allows you to fit a
model which is appropriately complex for the subject area
(chapter 8) and provides the necessary precision for the co-
efficients and predictions (chapters 3 and 10).
• Measure all variables with the highest accuracy and precision
possible.
• Record data in the order you collect it. This process helps you
assess an assumption about correlations between adjacent re-
siduals (chapter 9).

Now, let’s see how OLS regression goes beyond correlation and pro-
duces an equation for the line that best fits a dataset.

How OLS Fits the Best Line


Regression explains the variation in the dependent variable using var-
iation in the independent variables. In other words, it predicts the de-
pendent variable for a given set of independent variables.

Let’s start with some basic terms that I’ll use throughout this book.
While I strive to explain regression analysis in an intuitive manner
using everyday English, I do use proper statistical terminology. Doing
so will help you if you’re following along with a college statistics
course or need to communicate with professionals about your model.

Observed and Fitted Values


Observed values of the dependent variable are the values of the de-
pendent variable that you record during your study or experiment
along with the values of the independent variables. These values are
denoted using Y.

Fitted values are the values that the model predicts for the dependent
variable using the independent variables. If you input values for the
independent variables into the regression equation, you obtain the fit-
ted value. Predicted values and fitted values are synonyms.


An observed value is one that exists in the real world while your
model generates the fitted/predicted value for that observation.

Standard notation uses ŷ to denote fitted values, which you pronounce as Y-hat. In general, hatted values indicate they are a model’s estimate for the corresponding non-hatted values.

Residuals: Difference between Observed and Fitted Values


Regression analysis predicts the dependent variable. For every ob-
served value of the dependent variable, the regression model calcu-
lates a corresponding fitted value. To understand how well your
model fits the data, you need to assess the differences between the
observed values and the fitted values. These differences represent the
error in the model. No model is perfect. The observed and fitted val-
ues will never exactly match. However, models can be good enough
to be useful.

This difference is known as a residual, and you’ll be learning a lot about them in this book. A residual is the distance between an ob-
served value and the corresponding fitted value. To calculate the dif-
ference mathematically, it’s simple subtraction:

Residual = Observed value – Fitted value.
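In code, that’s a single subtraction. A quick sketch with made-up observed and fitted values:

```python
import numpy as np

observed = np.array([52.0, 48.5, 61.0])  # hypothetical observed weights (kg)
fitted = np.array([50.0, 49.5, 58.0])    # hypothetical model predictions

residuals = observed - fitted
print(residuals)  # residuals: 2.0, -1.0, 3.0
```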

Graphically, residuals are the vertical distances between the observed values and the fitted values. On the graph, the line represents the fit-
ted values from the regression model. We call this line . . . the fitted
line! The lines that connect the data points to the fitted line represent
the residuals.


The length of the line is the value of the residual. The equation below shows how to calculate the residual, or error, for the ith observation:

eᵢ = yᵢ − ŷᵢ

It makes sense, right? You want to minimize the distance between the
observed values and the fitted values. For a good model, the residuals
should be relatively small and unbiased. In statistics, bias indicates
that estimates are systematically too high or too low.

If the residuals become too large or biased, the model is no longer use-
ful. Consequently, these differences play a vital role during both the
model estimation process and later when you assess the quality of the
model.

Using the Sum of the Squared Errors (SSE) to Find the Best
Line
Let’s go back to the height and weight dataset for which we calculated
the correlation.


The goal of regression analysis is to draw a line through these data points that minimizes the overall distance of the points from the line.
How would you draw the best fitting straight line through this cloud
of points?

You could draw many different potential lines. Some observations will
fit the model better or worse than other points, and that will vary
based on the line that you draw. Which measure would you use to
quantify how well the line fits all of the data points? Using what you
learned above, you know that you want to minimize the residuals.
And, it should be a measure that factors in the difference for all of the
points. We need a summary statistic for the entire dataset.

Perhaps the average distance or residual value? If your model has many residuals with values near +10 and -10, that averages to approx-
imately zero distance. However, another model with many residuals
near +1 and -1 also averages out to be nearly zero. Obviously, you’d
prefer the model with smaller distances. Unfortunately, using the av-
erage residual doesn’t distinguish between these models.


You can’t merely sum the residuals because the positive and negative
values will cancel each other out even when they tend to be relatively
large. Instead, OLS regression squares those residuals so they’re al-
ways positive. In this manner, the process can add them up without
canceling each other out.

This process produces squared errors (residuals). First, we obtain the residuals between the observed and fitted values using simple subtrac-
tion, and then we just square them. Simple! A data point with a resid-
ual of 3 will have a squared error of 9. A residual of -4 produces a
squared error of 16.

Then, the ordinary least squares procedure sums these squared errors, as shown in the equation below:

SSE = Σ (yᵢ − ŷᵢ)²

OLS draws the line that minimizes the sum of squared errors (SSE).
Hopefully, you’re gaining an appreciation for why the procedure is
named ordinary least squares!
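Here’s a minimal Python sketch (synthetic data, standing in for the height-weight dataset) that makes this concrete. It fits the least squares line and then shows that perturbing the slope always increases the SSE:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1.3, 1.8, 30)              # synthetic heights (m)
y = 60 * x - 40 + rng.normal(0, 4, 30)     # synthetic weights (kg) with noise

def sse(intercept, slope):
    """Sum of squared errors for the line y = intercept + slope * x."""
    residuals = y - (intercept + slope * x)
    return np.sum(residuals ** 2)

# np.polyfit solves the least squares problem directly.
slope_ols, intercept_ols = np.polyfit(x, y, deg=1)
print(f"OLS line SSE: {sse(intercept_ols, slope_ols):.1f}")

# Any perturbed line produces a larger SSE.
for d in (0.5, 2.0, 5.0):
    print(f"Slope shifted by {d}: SSE = {sse(intercept_ols, slope_ols + d):.1f}")
```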

SSE is a measure of variability. As the points spread out further from the fitted line, SSE increases. Because the calculations use squared differences, the variance is in squared units rather than the original units of
the data. While higher values indicate greater variability, there is no
intuitive interpretation of specific values. However, for a given data
set, smaller SSE values signal that the observations fall closer to the
fitted values. OLS minimizes this value, which means you’re getting
the best possible line.

In textbooks, you’ll find equations for how OLS derives the line that
minimizes SSE. Statistical software packages use these equations to
solve for the solution directly. However, I’m not going to cover those
equations. Instead, it’s crucial for you to understand the concepts of residuals and how the procedure minimizes the SSE. If you were to
draw any line other than the one that OLS produces, the SSE would
increase—which indicates that the distances between the observed
and fitted values are growing, and the model is not as good.

Implications of Minimizing SSE


OLS minimizes the SSE. This fact has several important implications.

First, because OLS calculates squared errors using residuals, the model
fitting process ultimately ties back to the residuals very strongly. Re-
siduals are the underlying foundation for how least squares regression
fits the model. Consequently, understanding the properties of the re-
siduals for your model is vital. They play an enormous role in deter-
mining whether your model is good or not. You’ll hear so much about
them throughout this book. In fact, chapter 9 focuses on them. So, I
won’t say much more here. For now, just know that you want rela-
tively small and unbiased residuals (positive and negative are equally
likely) that don’t display patterns when you graph them.

Second, the fact that the OLS procedure squares the residuals has sig-
nificant ramifications. It makes the model susceptible to outliers and
unusual observations. To understand why, consider the following set
of residuals: {1 2 3}. Imagine most of your residuals are in this range.
These residuals produce the following squared errors: {1 4 9}. Now,
imagine that one observation has a residual of 6, which yields a
squared error of 36. Compare the magnitude of most squared errors
(1 – 9) to that of the unusual observation (36).

To minimize the squared errors, OLS factors in that unusual observation much more heavily than the other data points. The result is that
an individual outlier can exert a strong influence over the entire
model and, by itself, dramatically change the results. Chapter 9 dis-
cusses this problem in greater detail and how to detect and resolve it.
For now, be aware that OLS is susceptible to outliers!
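A short simulation illustrates this sensitivity. In the sketch below (synthetic data), adding a single extreme point noticeably changes the fitted slope:

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 10, 20)
y = 2.0 * x + 1.0 + rng.normal(0, 1, 20)  # clean, roughly linear data

slope_clean, _ = np.polyfit(x, y, deg=1)

# Add one extreme outlier and refit.
x_out = np.append(x, 9.5)
y_out = np.append(y, 60.0)                # far above the trend
slope_outlier, _ = np.polyfit(x_out, y_out, deg=1)

print(f"Slope without outlier: {slope_clean:.2f}")    # close to 2.0
print(f"Slope with outlier:    {slope_outlier:.2f}")  # pulled upward
```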


Other Types of Sums of Squares


You learned about the error sum of squares above, but there are sev-
eral different types of sums of squares in OLS. We won’t focus on the
others as much as the SSE, but you should understand what they meas-
ure and how they’re related:

Sum of Squared Errors (SSE)
Measures: Overall variability of the distance between the data points and fitted values.
Calculation: Sum of squared residuals.

Regression Sum of Squares (RSS)
Measures: The amount of additional variability your model explains compared to a model that contains no variables and uses only the mean to predict the dependent variable.
Calculation: Sum of the squared distances between the fitted values and the mean of the dependent variable (y-bar).

Total Sum of Squares (TSS)
Measures: Overall variability of the dependent variable around its mean.
Calculation: Sum of the squared distances between the observed values and the mean of the dependent variable.


These three sums of squares have the following mathematical relationship:

RSS + SSE = TSS

Understanding this relationship is fairly straightforward.

• RSS represents the variability that your model explains. Higher is usually good.
• SSE represents the variability that your model does not ex-
plain. Smaller is usually good.
• TSS represents the variability inherent in your dependent
variable.

Or, Explained Variability + Unexplained Variability = Total Variability

For the same dataset, as you fit better models, RSS increases and SSE
decreases by an exactly corresponding amount. RSS cannot be greater
than TSS while SSE cannot be less than zero.

Additionally, if you take RSS / TSS, you’ll obtain the percentage of the
variability of the dependent variable around its mean that your model
explains. This statistic is R-squared!
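Here’s a minimal sketch (Python, synthetic data) that computes all three sums of squares from a fitted line and verifies both the identity and the R-squared formula:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 50)
y = 3.0 * x + 5.0 + rng.normal(0, 4, 50)  # synthetic data

slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x

sse = np.sum((y - fitted) ** 2)          # unexplained variability
rss = np.sum((fitted - y.mean()) ** 2)   # explained variability
tss = np.sum((y - y.mean()) ** 2)        # total variability

print(f"RSS + SSE = {rss + sse:.2f}, TSS = {tss:.2f}")  # equal, up to rounding
print(f"R-squared = RSS / TSS = {rss / tss:.3f}")
```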

Based on the mathematical relationship shown above, you know that R-squared can range from 0 – 100%. Zero indicates that the model ac-
counts for none of the variability in the dependent variable around its
mean. 100% signifies that the model explains all of that variability.

Keep in mind that these sums of squares all measure variability. You
might hear about models and variables accounting for variability, and
that harkens back to these measures of variability.

We’ll talk about R-squared in much greater detail in chapter 4, which helps you determine how well your model fits the data. However, in that chapter, I discuss it more from the conceptual standpoint and what it means for your model. I also focus on various problems with R-squared and alternative measures that address these problems. For now, my goal is for you to understand the mathematical derivation of this useful statistic.

Note: Some texts use RSS to refer to residual sums of squares (which
we’re calling SSE) rather than regression sums of squares. Be aware of
this potentially confusing use of terminology!

Displaying a Regression Model on a Fitted Line Plot


Let’s again return to our height and weight data. I’ll fit the ordinary
least squares model and display it in a fitted line plot. You can use this
model to estimate the effect of height on weight. You can also enter
height values to predict the corresponding weight. Here is the CSV
dataset: HeightWeight.

This graph shows all the observations together with a line that repre-
sents the fitted relationship. As is traditional, the Y-axis displays the
dependent variable, which is weight. The X-axis shows the independ-
ent variable, which is height. The line is the fitted line. If you enter the full range of height values that are on the X-axis into the regression
equation that the chart displays, you will obtain the line shown on the
graph. This line produces a smaller SSE than any other line you can
draw through these observations.

Visually, we see that the fitted line has a positive slope that cor-
responds to the positive correlation we obtained earlier. The line fol-
lows the data points, which indicates that the model fits the data. The
slope of the line equals the coefficient that I circled. This coefficient
indicates how much mean weight tends to increase as we increase
height. We can also enter a height value into the equation and obtain
a prediction for the mean weight.

Each point on the fitted line represents the mean weight for a given
height. However, like any mean, there is variability around the mean.
Notice how there is a spread of data points around the line. You can
assess this variability by picking a spot on the line and observing the
range of data points above and below that point. Finally, the vertical
distance between each data point and the line is the residual for that
observation.
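If your software doesn’t produce fitted line plots directly, they’re easy to build. Here’s one way to do it with Python and matplotlib, using synthetic height and weight data as a stand-in for the real dataset:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
height = rng.uniform(1.3, 1.8, 40)                # synthetic heights (m)
weight = 60 * height - 40 + rng.normal(0, 4, 40)  # synthetic weights (kg)

slope, intercept = np.polyfit(height, weight, deg=1)
line_x = np.linspace(height.min(), height.max(), 100)

plt.scatter(height, weight, label="Observations")
plt.plot(line_x, intercept + slope * line_x, color="red",
         label=f"Fitted line: weight = {intercept:.1f} + {slope:.1f} * height")
plt.xlabel("Height (m)")
plt.ylabel("Weight (kg)")
plt.legend()
plt.show()
```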

Importance of Staying Close to Your Data


It’s easy to get lost in the large volume of statistical output that regres-
sion produces. All of the numerical statistical measures can cause you
to lose touch with your data. However, ensuring that your model ad-
equately represents the data, and determining what the results mean,
requires that you stay close to the data. Graphs can help you meet this
challenge!

I love using fitted line plots to illustrate regression concepts. In my mission to make regression analysis ideas more intuitive, fitted line
plots are one of my primary tools. I’ll summarize the concepts that
fitted line plots illustrate below, but I’ll come back to each one later in
the book to explore them in more detail.


Fitted line plots are great for showing the following:

• The regression coefficient in the equation corresponds to the slope of the line. What does it mean?
• For different models, the data points vary around the line to a
greater or lesser extent, which reflects the precision of the
predictions and goodness-of-fit statistics, like R-squared.
We’ll explore this in more detail because the implications of
this precision are often forgotten. How precise are your
model’s predictions?
• Does the fitted line fit curvature that is present in the data?
For now, we’re fitting a straight line, but that might not always
be the case! Fitted line plots make curvature unmistakable.

As fantastic as fitted line plots are, they can only show simple regres-
sion models, which contain only one independent variable. Fitted line
plots use two axes—one for the dependent variable and the other for
the independent variable. Consequently, fitted line plots are great for
displaying simple regression models on a screen or printed on paper.
However, each additional independent variable requires another axis
or physical dimension. With two independent variables, we could use a 3D representation, although that’s beyond my abilities for this book. With three independent variables, we’d need a four-dimensional plot. That’s not going to happen!

If you have a simple regression model, I highly recommend creating a fitted line plot for it and assessing the bullet points above. You’ll ob-
tain an excellent overview of how your model fits the data because
they’re graphed together. However, for multiple regression, we can’t
use fitted line plots to obtain that overview. For those cases, I’ll show
you other methods throughout this book for answering those ques-
tions. Sometimes these methods will be statistical measures, but
whenever possible I’ll show you special types of graphs because they
bring it to life. These graphical tools include main effects plots, inter-
action plots, and various residual plots.


Review and Next Steps


In this chapter, I explained how learning about ordinary least squares
linear regression provides an excellent foundation for learning about
regression analysis. Not only is it the most frequently used type of re-
gression, but your knowledge of OLS will help inform your usage of
other types of regression. I showed you some foundational data con-
siderations to keep in mind so you can avoid the problem of “garbage
in, garbage out!”

You learned the basics of how OLS minimizes the sums of squared
errors (SSE) to produce the best fitting line for your dataset. And, how
SSE fits in with two other sums of squares, regression sums of
squares (RSS) and total sums of squares (TSS). In the process, you
even got a sneak peek at R-squared (RSS / TSS)!

Then, we explored the height-weight regression model using a fitted line plot.

From here, we’ll move on to learning how to interpret the different types of effects for continuous and categorical independent variables,
the constant, what statistical significance indicates in this context, and
determining significance.

END OF FREE SAMPLE

NOTE: This sample contains only the introduction and first two chap-
ters. Please buy the full ebook for all the content listed in the Table of
Contents. You can buy it in My Store.
